Shsub is a template engine for the shell language, implemented in C. This paper explains the implementation of Shsub 2.0.1 to help potential contributors understand the code quickly.
The code of Shsub is hosted on GitHub at https://github.com/dongyx/shsub.
The following file, notes.html.tpl, demonstrates a simple template:
<ul>
<%for i in notes/*.md; do-%>
<% title="$(grep '^# ' "$i" | head -n1)"-%>
<li><%="$title"%></li>
<%done-%>
</ul>
Calling shsub notes.html.tpl prints an HTML list of the titles of your Markdown notes.
The syntax of templates is:
- Shell commands are surrounded by <% and %> and are compiled to the commands themselves;
- Shell expressions are surrounded by <%= and %>, and each <%=expr%> is compiled to printf %s expr;
- Ordinary text is compiled to the command printing that text;
- A template can include other templates by <%+filename%>;
- If -%> is used instead of %>, the following newline character will be ignored;
- <%% and %%> are compiled to literal <% and %>.
While splitting the input stream into tokens, up to 3 characters must be read to determine whether a keyword is present. In some cases those characters must be pushed back to the stream. However, the standard function ungetc() guarantees only one pushed-back character. Thus a custom stack is defined, called the character stack, or cstack.
int cstack[3], *csp = cstack, lineno = 1;

int cpush(int c)
{
    if (c == '\n')
        --lineno;
    return *csp++ = c;
}

int cpop(FILE *fp)
{
    int c;

    if (csp == cstack)
        c = fgetc(fp);
    else
        c = *--csp;
    if (c == '\n') {
        if (lineno == INT_MAX)
            err("Too many lines");
        ++lineno;
    }
    return c;
}
The csp variable points just past the top element of the character stack. The cpush() function pushes a character onto the character stack. The cpop() function pops a character from the character stack if the stack is not empty; otherwise, it reads a character from the stream. These functions also maintain the lineno variable, which is used in syntax error reporting: popping a newline increments it, and pushing a newline back decrements it.
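The push-back idea can be tried in isolation. The following is a minimal sketch, not Shsub's code: the names pb_stack, pb_push, pb_pop and pb_accept3 are hypothetical, and line counting is left out. It reads three characters, and if they do not form the wanted keyword it pushes them back in reverse order so that they are read again in the original order.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names (pb_*); Shsub itself uses cstack, cpush and cpop. */
int pb_stack[3];
int *pb_top = pb_stack;

/* Push one character back; at most three may be pending at a time. */
void pb_push(int c)
{
    assert(pb_top < pb_stack + 3);
    *pb_top++ = c;
}

/* Take a pushed-back character if there is one; otherwise read the stream. */
int pb_pop(FILE *fp)
{
    return pb_top > pb_stack ? *--pb_top : fgetc(fp);
}

/*
 * Return 1 and consume three characters if they equal kw.
 * Otherwise push all three back in reverse order and return 0,
 * so the next pb_pop() calls see them in their original order.
 */
int pb_accept3(FILE *fp, const char *kw)
{
    int c[3], i;

    for (i = 0; i < 3; i++)
        c[i] = pb_pop(fp);
    if (c[0] == kw[0] && c[1] == kw[1] && c[2] == kw[2])
        return 1;
    for (i = 2; i >= 0; i--)
        pb_push(c[i]);
    return 0;
}
```

The stack stores int rather than char so that EOF can be pushed back like any other value.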
There are 8 types of token in Shsub, defined by the token enumeration. The following table lists the lexeme of each token type.
Token | Lexeme |
---|---|
INCL | <%+ |
CMDOPEN | <% |
EXPOPEN | <%= |
CLOSE | %> or -%> |
ESCOPEN | <%% |
ESCCLOSE | %%> |
LITERAL | non-keyword text |
END | the sentinel for the end of the input stream |
%> and -%> are combined into a single token type, CLOSE. This simplifies the grammar: only one token of lookahead is needed to determine which grammar rule applies. The grammar is discussed later.
Most lexical analyzers can be implemented as a state machine. Shsub doesn't build the state machine explicitly; it expresses the same procedure in control flow for brevity and readability. The gettoken() function returns the type of the next token in the stream; if the type is LITERAL, the content is stored in the literal buffer.
#define MAXLITR 4096

char literal[MAXLITR];

enum token gettoken(FILE *fp)
{
    char *p;
    int c, h, r, trim = 0;
    enum token kw = END;

    p = literal;
    while (kw == END) {
        while (p - literal < MAXLITR - 1) {
            if ((c = cpop(fp)) == EOF || strchr("<%-", c))
                break;
            *p++ = c;
        }
        *p = '\0';
        if (c == EOF)
            return p > literal ? LITERAL : END;
        if (p == literal + MAXLITR - 1)
            return LITERAL;
        h = cpop(fp);
        r = cpop(fp);
        if (c == '<' && h == '%') {
            kw = CMDOPEN;
            if (r == '=')
                kw = EXPOPEN;
            else if (r == '+')
                kw = INCL;
            else if (r == '%')
                kw = ESCOPEN;
        } else if (c == '%' && h == '>')
            kw = CLOSE;
        else if (c == '%' && h == '%' && r == '>')
            kw = ESCCLOSE;
        else if (c == '-' && h == '%' && r == '>') {
            trim = 1;
            kw = CLOSE;
        } else {
            cpush(r);
            cpush(h);
            *p++ = c;
            *p = '\0';
        }
    }
    if (p > literal) {
        cpush(r);
        cpush(h);
        cpush(c);
        return LITERAL;
    }
    if (kw == CMDOPEN || (kw == CLOSE && !trim))
        cpush(r);
    if (trim && (c = cpop(fp)) != '\n')
        cpush(c);
    return kw;
}
The literal buffer has a fixed size. If a piece of literal text is too long, it is split into multiple LITERAL tokens. This approach avoids dynamic resizing, and it doesn't limit the input length as long as the grammar accepts consecutive LITERAL tokens.
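The effect of a fixed-size buffer can be illustrated with a small sketch. It is not taken from Shsub: next_chunk and copy_in_chunks are hypothetical helpers, and the buffer is deliberately tiny. Text of any length passes through, because each call hands over one bounded piece and the caller simply asks again.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read at most cap - 1 characters from fp into buf and NUL-terminate it.
 * Returns the number of characters stored; 0 means end of input.
 * next_chunk is a hypothetical helper, not a Shsub function.
 */
size_t next_chunk(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    assert(buf != NULL && cap > 0);
    while (n + 1 < cap && (c = fgetc(fp)) != EOF)
        buf[n++] = (char)c;
    buf[n] = '\0';
    return n;
}

/*
 * Copy the whole stream piece by piece, using pieces of at most
 * cap - 1 characters. Returns the number of pieces that were needed.
 */
int copy_in_chunks(FILE *in, FILE *out, size_t cap)
{
    char buf[64];
    int pieces = 0;

    if (cap > sizeof buf)
        cap = sizeof buf;
    while (next_chunk(in, buf, cap) > 0) {
        fputs(buf, out);
        pieces++;
    }
    return pieces;
}
```

With cap set to 4, a ten-character input is delivered as four pieces of 3, 3, 3 and 1 characters, and the concatenated output equals the input.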
The semantics of -%> is to trim the following newline character. The trimming is performed directly in gettoken() rather than in a later stage: because it concerns characters rather than tokens, it is more natural to handle it in lexical analysis.
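The keyword decision made inside gettoken() can also be written as a pure function, which makes the rules easy to check one by one. This is only a sketch: the enum kw, its KW_ names and the function classify are hypothetical and do not appear in Shsub. The len result says how many of the three characters belong to the keyword, which corresponds to the cases where gettoken() pushes the third character back.

```c
#include <assert.h>

/* Hypothetical token names for this sketch; Shsub's enum is called token. */
enum kw {
    KW_NONE, KW_CMDOPEN, KW_EXPOPEN, KW_INCL,
    KW_ESCOPEN, KW_CLOSE, KW_ESCCLOSE
};

/*
 * Classify the three characters c, h, r.
 * *len receives how many of them belong to the keyword (0, 2 or 3).
 * *trim is set to 1 only for "-%>", whose following newline is dropped.
 */
enum kw classify(int c, int h, int r, int *len, int *trim)
{
    assert(len != 0 && trim != 0);
    *trim = 0;
    *len = 3;
    if (c == '<' && h == '%') {
        if (r == '=')
            return KW_EXPOPEN;
        if (r == '+')
            return KW_INCL;
        if (r == '%')
            return KW_ESCOPEN;
        *len = 2;           /* plain "<%": r is not part of it */
        return KW_CMDOPEN;
    }
    if (c == '%' && h == '>') {
        *len = 2;           /* plain "%>": r is not part of it */
        return KW_CLOSE;
    }
    if (c == '%' && h == '%' && r == '>')
        return KW_ESCCLOSE;
    if (c == '-' && h == '%' && r == '>') {
        *trim = 1;
        return KW_CLOSE;
    }
    *len = 0;               /* no keyword starts at c */
    return KW_NONE;
}
```

Both %> and -%> yield the same token type here, and only the trim flag tells them apart, which mirrors how the CLOSE type is shared.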
The grammar of Shsub templates is very simple and is described by the following BNF.
<tmpl> := <END>
        | <text><tmpl>
        | <CMDOPEN><text><CLOSE><tmpl>
        | <EXPOPEN><text><CLOSE><tmpl>
        | <INCL><LITERAL><CLOSE><tmpl>
<text> := <END>
        | <LITERAL><text>
        | <ESCOPEN><text>
        | <ESCCLOSE><text>
There are two non-terminals: tmpl and text. The tmpl non-terminal represents a template. The text non-terminal represents a sequence of tokens that are either literal text or escape keywords to be converted to literal text.
The filename in an including directive can only be a single LITERAL. This limits the length of the included filename and forbids the filename from containing Shsub keywords. The limit reduces the complexity of the implementation at an acceptable cost: it's rare that a file has a path exceeding 1000 characters or a name containing <%, %>, etc.
The global variable lookahead stores the current token. The function void tmpl(FILE *in, FILE *ou) parses tmpl from in and prints the compiled script to ou. The function void text(int esc, FILE *in, FILE *ou) parses text from in, optionally escaping the text for use inside a single-quoted shell string, and prints the result to ou. The function void match(enum token tok, FILE *fp) checks that lookahead matches the expected token and advances lookahead to the next token.
enum token lookahead;

void tmpl(FILE *in, FILE *ou)
{
    char *p;

    while (lookahead != END)
        switch (lookahead) {
        case INCL:
            /* Handle template including */
        case CMDOPEN:
            match(CMDOPEN, in);
            text(0, in, ou);
            fputc('\n', ou);
            match(CLOSE, in);
            break;
        case EXPOPEN:
            fputs("printf %s ", ou);
            match(EXPOPEN, in);
            text(0, in, ou);
            fputc('\n', ou);
            match(CLOSE, in);
            break;
        case LITERAL: case ESCOPEN: case ESCCLOSE:
            fputs("printf %s '", ou);
            text(1, in, ou);
            fputs("'\n", ou);
            break;
        default:
            parserr("Unexpected token");
        }
}

void text(int esc, FILE *in, FILE *ou)
{
    char *s;

    for (;;)
        switch (lookahead) {
        case LITERAL:
            if (!esc)
                fputs(literal, ou);
            else
                for (s = literal; *s; ++s)
                    if (*s == '\'')
                        fputs("'\\''", ou);
                    else
                        fputc(*s, ou);
            match(LITERAL, in);
            break;
        case ESCOPEN:
            fputs("<%", ou);
            match(ESCOPEN, in);
            break;
        case ESCCLOSE:
            fputs("%>", ou);
            match(ESCCLOSE, in);
            break;
        default:
            return;
        }
}

void match(enum token tok, FILE *fp)
{
    if (lookahead != tok)
        parserr("Lack of expected token");
    lookahead = gettoken(fp);
}
The tmpl() and text() functions constitute a recursive-descent parser in which the recursive calls are transformed into iterations. The code that handles including in tmpl() is omitted here; that part does use recursive calls.
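The escaping performed by text() when esc is non-zero relies on a common shell idiom: inside single quotes nothing is special, so a single quote in the text is written as '\'' (close the quote, emit an escaped quote, reopen the quote). The sketch below isolates that step; emit_quoted and emit_print are hypothetical helper names, not functions of Shsub.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write s to ou as one single-quoted shell word.
 * Each ' in s becomes '\'' so the shell reads back exactly s.
 * emit_quoted is a hypothetical helper name.
 */
void emit_quoted(const char *s, FILE *ou)
{
    assert(s != NULL && ou != NULL);
    fputc('\'', ou);
    for (; *s; s++)
        if (*s == '\'')
            fputs("'\\''", ou);
        else
            fputc(*s, ou);
    fputc('\'', ou);
}

/* Emit a shell command that prints s verbatim, one command per line. */
void emit_print(const char *s, FILE *ou)
{
    fputs("printf %s ", ou);
    emit_quoted(s, ou);
    fputc('\n', ou);
}
```

For the input it's, emit_print() writes the line printf %s 'it'\''s', which a POSIX shell prints back as it's.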
When an including directive is scanned, Shsub saves the related global variables to a stack called the including stack, or istack, and recursively parses the included file. Each frame of the including stack is called an including frame, or iframe.
#define MAXINCL 64

struct iframe {
    FILE *in;
    int lookahead, cstack[3], *csp, lineno;
    char *tmplname;
} istack[MAXINCL], *isp = istack;

void ipush(FILE *in)
{
    if (isp == istack + MAXINCL)
        err("Including too deep");
    isp->lookahead = lookahead;
    memcpy(isp->cstack, cstack, sizeof cstack);
    isp->csp = csp;
    isp->in = in;
    isp->tmplname = tmplname;
    isp->lineno = lineno;
    ++isp;
}

FILE *ipop(void)
{
    --isp;
    lookahead = isp->lookahead;
    memcpy(cstack, isp->cstack, sizeof cstack);
    csp = isp->csp;
    tmplname = isp->tmplname;
    lineno = isp->lineno;
    return isp->in;
}
Most elements of iframe have been explained earlier, except:
- in: the file pointer of the current input file;
- tmplname: the filename of the current input file.

The isp variable points just past the top frame of the including stack. The ipush() function pushes the current context onto the including stack. The ipop() function restores the context from the including stack.
The omitted part of the tmpl() function can be completed now.
void tmpl(FILE *in, FILE *ou)
{
    char *p;

    while (lookahead != END)
        switch (lookahead) {
        case INCL:
            match(INCL, in);
            if (lookahead != LITERAL)
                parserr("Expect included filename");
            if (!(p = strdup(literal)))
                syserr();
            match(LITERAL, in);
            match(CLOSE, in);
            ipush(in);
            tmplname = p;
            if (!(in = fopen(tmplname, "r")))
                err("%s: %s",
                    tmplname, strerror(errno));
            lineno = 1;
            csp = cstack;
            lookahead = gettoken(in);
            tmpl(in, ou);
            fclose(in);
            free(tmplname);
            in = ipop();
            break;
        /* Handling other tokens
         * ......
         * ......
         */
        }
}
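The save-and-restore pattern behind the including stack can be shown with a reduced context. The sketch below is an illustration under simplifying assumptions: struct ctx, cur, saved, ctx_push, ctx_pop and enter are hypothetical names, the context holds only a line number and a file name, and overflow is reported through a return value instead of terminating the program.

```c
#include <assert.h>

#define MAXDEPTH 4  /* small on purpose; Shsub's MAXINCL is 64 */

/* Hypothetical, reduced context: only a line number and a file name. */
struct ctx {
    int lineno;
    const char *name;
};

struct ctx cur = { 1, "main.tpl" };  /* stands in for the global variables */
struct ctx saved[MAXDEPTH];
int depth;

/* Save the current context; returns -1 if the nesting is too deep. */
int ctx_push(void)
{
    if (depth == MAXDEPTH)
        return -1;
    saved[depth++] = cur;
    return 0;
}

/* Restore the most recently saved context. */
void ctx_pop(void)
{
    assert(depth > 0);
    cur = saved[--depth];
}

/* Enter an included file: save the context, then restart at line 1. */
int enter(const char *name)
{
    if (ctx_push() == -1)
        return -1;
    cur.lineno = 1;
    cur.name = name;
    return 0;
}
```

After enter() the parser works on a fresh context, and ctx_pop() brings back the line number and name of the including file, so error messages after the directive still point at the right place.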
We can’t pipe the script to the shell, because the script itself may read the standard input. Thus Shsub creates a temporary file to save the script. Should it be a named pipe (FIFO) or a regular file?
Shsub chooses a regular file. If we used a named pipe, we would need two parallel processes, and some subtle questions would arise. If a signal is delivered to Shsub, should it be forwarded to the shell process? If the shell process terminates, should Shsub abort parsing and return the exit status of the shell process? These issues complicate the code.
Shsub avoids these process and signal issues by calling execv() or execl() without fork(). However, this raises a new issue: how do we clean up the temporary file after the process image is replaced? Shsub's solution is to prepend a command to the script that deletes the script file itself.
char script[] = "/tmp/shsub.XXXXXX";

int main(int argc, char **argv)
{
    FILE *in, *out;
    char *sh = "/bin/sh";
    int fd;

    /*
     * Misc tasks
     * ......
     * ......
     */
    if ((fd = mkstemp(script)) == -1 || atexit(rmscr))
        syserr();
    if (!(out = fdopen(fd, "w")))
        syserr();
    fprintf(out, "#!%s\nrm %s\n", sh, script);
    tmpl(in, out);
    if (fchmod(fd, S_IRWXU) == -1)
        syserr();
    if (fclose(out))
        syserr();
    if (argc > 0)
        execv(script, argv);
    else
        execl(script, "-", NULL);
    syserr();
    return 0;
}
The main() function also registers a clean-up function rmscr() with atexit(). If a parsing error happens after the temporary file is created and before the script is executed, rmscr() will delete the temporary file.
One interesting thing about Shsub is that the installation script of Shsub uses Shsub itself.
Shsub 2.0.0 is incompatible with earlier versions. To make parallel installations possible, the makefile of Shsub supports the name variable for the install pseudo target. The following command installs Shsub with the name shsub2.
$ sudo make install name=shsub2
The name variable affects not only the names of the installed files but also the wording of the man page. Thus the man page of Shsub is a Shsub template, which is used to generate the real man page with the specified name during installation.
all: shsub

install: all
	name='$(name)' ./shsub shsub.1.tpl >shsub.1
	mkdir -p $(bindir) $(mandir)/man1
	$(INSTALL) shsub $(bindir)/$(name)
	$(INSTALL) -m644 shsub.1 $(mandir)/man1/$(name).1

shsub: shsub.c
	$(CC) $(CFLAGS) -o $@ $<
Notice that there is no rule for shsub.1, because a makefile rule can't depend on a variable. The real man page is always regenerated during installation.