onerss - An RSS Feed Merger, and a Practice of the Unix Philosophy

The onerss program is a Unix program that merges multiple RSS 2.0 feeds into one. It is also a practice of the Unix philosophy. The onerss program concentrates on its core task and is designed to be integrated with the Unix ecosystem. The result is a program providing as much function as possible with the fewest options. This report contains a user's guide and a discussion of the design and implementation of onerss.
The code is hosted at GitHub <https://github.com/dongyx/onerss>.
The onerss program takes the filenames of the source feeds and prints the merged feed to the standard output. If no filenames are given, onerss reads source feeds from the standard input. The following examples show how to use onerss to perform a series of merging tasks, from the simplest to the most complicated.
Merge local feeds
onerss feed1.xml feed2.xml
or
cat feed1.xml feed2.xml | onerss
Merge remote feeds
curl example.com/feed1.xml example.com/feed2.xml | onerss
Merge mixed feeds
( curl example.com/feed1.xml
cat feed2.xml
) | onerss
Specify the title of the merged feed
The merged feed has a default channel title. The -t option changes it.
onerss -t 'Merged News' feed1.xml feed2.xml feed3.xml
Prepend source channel titles to item titles
It’s natural that users want to distinguish between different source feeds while reading the merged feed. The -p option prepends the source channel title to item titles in the merged feed. The -c option sets the category field of items in the merged feed to the source channel title. These help users and RSS agent programs distinguish between different sources.
onerss -p feed1.xml feed2.xml
Rename source feeds before merging
It is common for users to want to change source channel titles. For example, some source feeds may have very long channel titles, which become unreadable when prepended to item titles. Users may want a short alias for such a feed.
The onerss program adds no option for the renaming task. The -t option renames the merged feed. If we call onerss on a single feed with the -t option, we rename that feed. We can then pipe the renamed feed to another onerss process for merging. Figure 1 demonstrates the approach.
( onerss -t Name1 feed1.xml
curl example.com/feed2.xml | onerss -t Name2
) | onerss -pt 'Merged News'
Figure 1.
Rename source feeds by group before merging
( onerss -t GroupName1 feed1a.xml feed1b.xml
onerss -t GroupName2 feed2a.xml feed2b.xml
) | onerss -pt 'Merged News'
Merge Atom feeds and RSS feeds
The onerss program doesn’t support Atom natively. However, we can pipe an Atom-to-RSS program to onerss.
( curl example.com/atom-feed.xml | atom2rss
cat rss-feed.xml
) | onerss
Usually, feeds are hosted online rather than on the local disk. It would violate the “Do One Thing Well” rule [6] to make onerss support HTTP, HTTPS, FTP, etc., because we already have curl [9]. However, it would be cumbersome to require users to download feeds to temporary files first, especially because that requires file naming.
Recent shells like Bash and Zsh provide the process substitution feature [1] [2]. Process substitution allows users to create inline temporary files conveniently. The following call uses onerss and process substitution to merge remote feeds.
onerss <(curl example.com/a.xml) <(curl example.com/b.xml)
However, not all shells have this feature, and the syntax may differ between shells. In addition, files created by some process substitution implementations are not seekable. This would limit our implementation. Most programs don’t check the type of the files specified by command-line arguments. Even if we handled the difference, the programs called by our program may not. If we forward these arguments to the called program, it may fail.
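The non-seekability claim can be checked directly. The following sketch assumes Bash on Linux, where process substitution is implemented as an anonymous pipe exposed under /dev/fd; behavior may differ elsewhere.

```shell
# Check what kind of file Bash process substitution produces.
# On Linux it is a /dev/fd/* pipe, so it cannot be seeked.
# (Assumes Bash is installed.)
bash -c '[ -p <(echo hi) ] && echo "pipe: not seekable"'
```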
Our solution is to make onerss able to read feeds from the standard input. Although the standard input is usually not seekable either, that fact is explicit, so we can’t accidentally fail a called program by forwarding arguments. Additionally, this makes onerss work in shells without process substitution.
We can then pipe any other program into onerss to support any protocol. Besides, onerss prints to the standard output, which makes it a filter [5] [8]. The Unix environment is friendly to filters. That is the essential reason why onerss can be so powerful yet so simple.
However, this feature depends on the nature of the input format. It’s not hard to split multiple RSS feeds out of a single stream. For an input format whose sources can’t be split apart within a single stream, this approach won’t work.
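To make the splitting claim concrete, here is a minimal sketch that uses the closing </rss> tag as the boundary between concatenated feeds. It is an illustration, not onerss’s actual parser, and the two one-line “feeds” stand in for real ones.

```shell
# Split a stream of concatenated RSS feeds at each closing
# </rss> tag.  Illustrative sketch; the inline input below
# stands in for real feeds.
printf '%s\n' \
  '<rss><channel><title>A</title></channel></rss>' \
  '<rss><channel><title>B</title></channel></rss>' |
awk '
  { buf = buf $0 "\n" }
  /<\/rss>/ {                 # boundary: one complete feed buffered
    printf "--- feed %d ---\n%s", ++n, buf
    buf = ""
  }
'
```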
For example, the diff program [4] compares two arbitrary text files. It must take its two operands from the command line; if it read both files from the standard input, it couldn’t know where the boundary is.
Some diff implementations don’t require input files to be seekable. We can use process substitution to compare two temporarily generated files. The following call compares two remote files in Bash.
diff <(curl example.com/a) <(curl example.com/b)
Not all programs, however, can do without seekability, and not all shells support process substitution. These programs and shells can’t make use of process substitution.
This problem can’t be solved by application design; it should be solved by shells. If a shell provided a more powerful process substitution feature that let users decide whether the generated file is seekable, and the feature were standardized, many programs would become much more powerful without modification.
Although the Unix philosophy encourages small and simple programs, there are useful functions which are better to be available in a central service [7]. A collection of small programs is powerful, precisely because the environment provides useful common facilities.
The onerss program provides no option to rename source feeds. Instead, it encourages users to call it recursively with the -t option. This design significantly decreases the complexity of the interface.
This approach works because onerss is a function mapping a set to its subset. Specifically, onerss reads a feed set and prints a feed, which is also a kind of feed set: one containing only a single feed. This approach could be used in other programs with this property.
However, this approach is something of a hack. The way it exploits the -t option violates the original purpose of the option. That’s why we explained it at length in the Introduction chapter.
A more general way suggested by the Unix philosophy [6] is to write a new program to rename a feed. Let’s call this program rsscname. With rsscname, not only would we not need to hack the -t option, but the -t option could even be removed from onerss.
Suppose rsscname takes two arguments: the channel title and the filename. If the filename is omitted, rsscname reads the feed from the standard input. Figure 2 shows the command using rsscname that replaces figure 1.
( rsscname 'NewName1' feed1.xml
curl example.com/feed2.xml | rsscname 'NewName2'
) | onerss -p | rsscname 'Merged Name'
Figure 2.
Figure 2 doesn’t complicate the command much more than figure 1, but it’s more self-explanatory. In addition, this approach works for a wider range of programs.
Other options of onerss could also be separated into independent programs. However, that would produce too many special-purpose programs. The Unix philosophy encourages making simple but expressive programs [8]. These small programs, including rsscname, are too special-purpose, not expressive.
A possible solution is to develop a general RSS or XML manipulator to work with, or even replace, onerss. However, it’s hard to design a concise interface for such a program. Programs of this type often ship with a domain-specific language [8] like CSS selectors [10] or XPath [3].
The Unix philosophy encourages process separation [8]: using small programs, especially fundamental ones like grep, seq, and awk, glued together to form a larger program. The shell was born for this kind of gluing, and onerss is written in shell.
However, to make this easy, the small programs and the file formats must conform to some conventions. These conventions are called “Good Files and Good Filters” by [5]. We only discuss the “Good Files” part, because we are able to choose what filters we use.
“Good Files” are line-based text files. Each line is an object of interest. When more information is present for each object, the file is still line-by-line, but columnated into fields separated by blanks or tabs.
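As a concrete illustration of the convention (with made-up data), such a file composes directly with standard filters and needs no parsing layer:

```shell
# A "Good File": one object per line, blank-separated fields
# (feed name, item count, category).  The data is made up.
# Standard filters like awk operate on it directly; here we
# total the item counts of the "tech" feeds.
printf '%s\n' \
  'feed1 12 tech' \
  'feed2 3  news' \
  'feed3 47 tech' |
awk '$3 == "tech" { sum += $2 } END { print sum }'
```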
XML files are clearly not “Good Files”. Fortunately, the W3C developed HTML-XML-utils [11], which contains the hxpipe and hxunpipe programs to convert between XML files and “Good Files”. The onerss program is basically a wrapper around the following pipeline.
hxpipe | awk | hxunpipe
This makes the implementation very concise. At the time this report was written, it contained no more than 160 effective lines of shell, including tedious work like parsing arguments and printing the help and version information.
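To illustrate what the awk stage consumes: hxpipe emits an ESIS-like line format in which “(tag” opens an element, “)tag” closes it, and “-text” is character data. The sketch below fakes such output with printf and prepends the channel title to an item title, roughly what the -p option does. It is an illustration under that format assumption, not onerss’s actual code.

```shell
# Fake hxpipe-style output for a tiny feed, then prepend the
# channel title to the item title.  Sketch only; the real
# onerss script handles far more cases.
printf '%s\n' '(channel' '(title' '-My Feed' ')title' \
              '(item' '(title' '-Hello' ')title' ')item' ')channel' |
awk '
  /^\(title/ { intitle = 1; next }
  /^\)title/ { intitle = 0; next }
  intitle && /^-/ {
    text = substr($0, 2)
    if (chan == "") chan = text      # first title: the channel
    else print chan ": " text        # later titles: items
  }
'
```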
The onerss program, like any shell script piping multiple processes together, has poor performance. No matter how low the cost of process creation is, each process requires application-level initialization. For example, the grep program needs to compile its regular expression. Even if we cached compiled regular expressions, loading them would cost time.
In our practice, though, the performance bottleneck is always curl, since it accesses remote files. The onerss program could become the bottleneck when merging numerous feeds with concurrent curl calls.
We’ve introduced the onerss program and discussed several advantages and flaws of its design and implementation. The onerss program is a simple, expressive, concisely implemented, but hacky and slow program.
Basically, most of the advantages come from the great facilities of the Unix system. Some of the flaws are due to the author’s mistakes; the others are due to essential imperfections of the Unix philosophy.