onerss: An RSS Feed Merger and a Practice of the Unix Philosophy
The onerss program is a Unix program merging multiple RSS 2.0 feeds into one. It’s also a practice of the Unix philosophy. The onerss program concentrates on its core task and is designed to be integrated with the Unix ecosystem. The result is a program providing as many functions as possible with the fewest options. This report contains a user’s guide and a discussion of the design and implementation. The code is hosted at GitHub <https://github.com/dongyx/onerss>.
The onerss program takes filenames of the source feeds and prints the merged feed to the standard output. If there are no filename arguments, onerss reads source feeds from the standard input. The following examples show how to use onerss to perform a series of merging tasks, from the simplest one to the most complex.
Merge local feeds
onerss feed1.xml feed2.xml
cat feed1.xml feed2.xml | onerss
Merge remote feeds
curl example.com/feed1.xml example.com/feed2.xml | onerss
Merge mixed feeds
( curl example.com/feed1.xml; cat feed2.xml ) | onerss
Specify the title of the merged feed
The merged feed has a default channel title. The -t option changes it.
onerss -t 'Merged News' feed1.xml feed2.xml feed3.xml
Prepend source channel titles to item titles
It’s natural that users want to distinguish between different source feeds while reading the merged feed. The -p option prepends the source channel title to item titles in the merged feed. The -c option sets the category field of items in the merged feed to the source channel title. These help users and RSS agent programs distinguish between different sources.
onerss -p feed1.xml feed2.xml
Rename source feeds before merging
It’s a common task that users would like to change source channel titles. For example, some source feeds may have very long channel titles. It becomes unreadable if we prepend them to item titles. Users may want a short alias for such a feed.
The onerss program adds no option for the renaming task. The -t option renames the merged feed. If we call onerss on a single feed with the -t option, we rename that feed. Then we could pipe the renamed feed to another onerss process for merging. Figure 1 demonstrates this approach.
( onerss -t Name1 feed1.xml; curl example.com/feed2.xml | onerss -t Name2 ) | onerss -pt 'Merged News'
Rename source feeds by group before merging
( onerss -t GroupName1 feed1a.xml feed1b.xml; onerss -t GroupName2 feed2a.xml feed2b.xml ) | onerss -pt 'Merged News'
Merge Atom feeds and RSS feeds
The onerss program doesn’t support Atom natively. However, we could pipe an Atom-to-RSS converter to onerss.
( curl example.com/atom-feed.xml | atom2rss; cat rss-feed.xml ) | onerss
Usually, feeds are hosted online instead of on the local disk. It would violate the “Do One Thing Well” rule if we made onerss support HTTP, HTTPS, FTP, etc., because we already have curl. However, it would be cumbersome for users if we required them to download feeds to temporary files first, especially because that requires naming the files.
Recent shells like Bash and Zsh provide the process substitution feature. Process substitution allows users to create inline temporary files conveniently. The following call uses onerss and process substitution to merge remote feeds.
onerss <(curl example.com/a.xml) <(curl example.com/b.xml)
However, not all shells contain this feature, and the syntax may differ between shells. In addition, files created by some process substitution implementations are not seek-able. This would limit our implementation. Most programs won’t check the type of files specified by command-line arguments. Even if we considered the difference, the programs called by our program may not. If we forward these arguments to the called program, it may fail.
Our solution is making onerss able to read feeds from the standard input. Although the standard input is usually not seek-able either, that limitation is explicit. We can’t accidentally fail a called program by forwarding arguments. Additionally, this makes onerss work on shells without process substitution.
Then we could pipe any other program to onerss to support any protocol. Besides, onerss prints to the standard output. This makes it a filter. The Unix environment is friendly to filters. That’s the essential reason why onerss could be so powerful and so simple.
However, this feature depends on the nature of the input format. It’s not hard to split multiple RSS feeds out of a single stream. For an input format whose sources can’t be split out of a single stream, this approach won’t work.
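As a rough sketch of why the splitting is feasible (a simplification; the real parsing is done properly, and real feeds may split tags across lines): every RSS document ends with a closing </rss> tag, so even a naive line filter can find the document boundaries in a concatenated stream.

```shell
# Two RSS documents concatenated into one stream; the closing
# </rss> tag marks each document boundary.
printf '%s\n' \
  '<rss version="2.0"><channel><title>A</title></channel></rss>' \
  '<rss version="2.0"><channel><title>B</title></channel></rss>' |
awk '/<\/rss>/ { n++ } END { print n, "feeds" }'
# prints: 2 feeds
```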
For example, the diff program compares two arbitrary text files. It must take its two files from the command line. If it read both files from the standard input, it couldn’t know where the boundary is. Still, diff implementations require no seek-ability of input files. We could use process substitution to compare two temporarily generated files. The following call compares two remote files in Bash.
diff <(curl example.com/a) <(curl example.com/b)
Not all programs, however, require no seek-ability, and not all shells support process substitution. These programs and shells couldn’t make use of this technique.
This problem can’t be solved by application design. It should be solved by shells. If a shell provided a more powerful process substitution feature which allowed users to decide whether the generated files are seek-able, and standardized the feature, many programs would become much more powerful without modification.
Although the Unix philosophy encourages small and simple programs, some useful functions are better provided by a central service. A collection of small programs is powerful precisely because the environment provides useful common facilities.
The onerss program provides no option to rename source feeds. Instead, it encourages users to call it recursively with the -t option. This design significantly decreases the complexity of the interface.
This approach works because onerss is a function mapping a set to its subset. Specifically, onerss reads a feed set and prints a single feed, which is itself a kind of feed set: a feed set containing only one feed. This approach could be used in other programs with the same property.
However, this approach is a kind of hack. The way it exploits the -t option violates the original purpose of the option. That’s why we explained it at length in the Introduction chapter.
A more general way suggested by the Unix philosophy is writing a new program to rename a feed. Let’s call this program rsscname. Not only would we no longer need to hack the -t option, but the -t option could even be removed from onerss. The rsscname program takes two arguments. One is the new channel title. The other is the filename. If the filename is omitted, rsscname reads the feed from the standard input. Figure 2 shows the command using rsscname to replace figure 1.
( rsscname 'NewName1' feed1.xml; curl example.com/feed2.xml | rsscname 'NewName2' ) | onerss -p | rsscname 'Merged Name'
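The rsscname program doesn’t exist; the name, behavior, and implementation below are a hypothetical sketch. It renames only the first <title> element, assuming the channel title appears before any item titles, as RSS 2.0 requires; a robust version would use a real XML parser.

```shell
# Hypothetical sketch of rsscname: replace the first <title> element
# (the channel title) with a new name.  Reads the named file, or the
# standard input if no file is given -- awk provides that for free.
rsscname() {
    new=$1; shift
    awk -v t="$new" '
        !done && sub(/<title>[^<]*<\/title>/, "<title>" t "</title>") { done = 1 }
        { print }
    ' "$@"
}
```

Usage would look like `rsscname 'Short Name' feed.xml` or `curl example.com/feed.xml | rsscname 'Short Name'`.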
Figure 2 doesn’t complicate the command much more than figure 1, but it’s more self-explanatory. In addition, this approach works on a wider range of programs.
Other options of onerss may also be separated into independent programs. However, that would produce too many special-purpose programs. The Unix philosophy encourages making simple but expressive programs. These small programs, including rsscname, are too special-purpose, not expressive.
A possible solution is developing a general RSS or XML manipulator to work with or even replace onerss. However, it’s hard to design a concise interface for such a program. Programs of this type often ship with a domain-specific language like CSS selectors or XPath.
The Unix philosophy encourages process separation: using small programs, especially fundamental ones like awk, glued together to form a larger program. The shell was born for this type of gluing, and onerss is written in shell.
However, to make this easy, the small programs and the file formats must conform to some conventions, which have been called “Good Files and Good Filters”. We only discuss the “Good Files” part, because we are able to choose what filters we use.
“Good Files” are line-based text files. Each line is an object of interest. When more information is present for each object, the file is still line-by-line, but columnated into fields separated by blanks or tabs.
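As a hypothetical illustration (the feed names and counts below are made up), a feed index kept as a “Good File” has one feed per line with tab-separated fields, and standard filters operate on it directly.

```shell
# A "Good File": one object per line, tab-separated fields.
# Hypothetical feed index: title, item count, URL.
printf 'Daily News\t42\thttp://example.com/a.xml\nTech Blog\t7\thttp://example.com/b.xml\n' |
awk -F '\t' '$2 > 10 { print $1 }'    # titles of feeds with more than 10 items
# prints: Daily News
```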
XML files are clearly not “Good Files”. Fortunately, W3C developed the html-xml-utils package, which contains programs to convert between XML files and “Good Files”. The onerss program is basically a wrapper of the following pipeline:
hxpipe | awk | hxunpipe
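To sketch why this works: hxpipe flattens XML into a line-oriented format — roughly, `(tag` opens an element, `-text` carries character data, and `)tag` closes the element — which is exactly the kind of “Good File” awk handles well. The sample input below is hand-written in that style so the sketch runs without the hx tools installed.

```shell
# Hand-written hxpipe-style rendering of
# <channel><title>News</title><item><title>First post</title></item></channel>
printf '%s\n' '(channel' '(title' '-News' ')title' \
              '(item' '(title' '-First post' ')title' ')item' ')channel' |
awk '
    $0 == "(title" { intitle = 1; next }      # entering a <title> element
    $0 == ")title" { intitle = 0; next }      # leaving it
    intitle && /^-/ { print substr($0, 2) }   # data lines start with "-"
'
# prints the two titles: News, then First post
```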
This makes the implementation very concise. At the time this report was written, it contained no more than 160 effective lines of shell, including tedious work like parsing arguments and printing the help and version information.
The onerss program, like any shell script piping multiple processes together, has poor performance. No matter how low the cost of process creation is, each process requires application-level initialization. For example, the grep program needs to compile its regular expression. Even if we saved compiled regular expressions, loading them would cost time. In our practice, though, the performance bottleneck is always curl, for it accesses remote files. The onerss program could become the bottleneck when merging numerous local feeds.
We’ve introduced the onerss program and discussed several advantages and flaws of its design and implementation. The onerss program is a simple, expressive, concisely implemented, but hacky and slow program.
Basically, most advantages are from the great facilities of the Unix system. Some of the flaws are due to the author’s mistakes. The others are due to the essential imperfection of the Unix philosophy.