A Tutorial on Apache HTTPd

DONG Yuxuan https://www.dyx.name

15 Jan 2020 (+0800)

The Apache HTTP Server (httpd) was once the most popular HTTP server in the world. Nginx has since taken that position. However, in some situations httpd still works better, especially for small, internal web sites. This text discusses how to configure httpd 2.4+ and provides solutions for several common situations.

The path of the httpd configuration file depends on how httpd was installed. In the httpd that ships with macOS, it’s /etc/apache2/httpd.conf. In apt-installed httpd, it’s /etc/apache2/apache2.conf. Because the Include directive is supported, the configuration can be split across multiple files. The apt-installed version organizes it best, in my opinion. /etc/apache2/apache2.conf is the entry point of the configuration. The other parts live in separate files that the entry point pulls in with Include directives. The following is a typical structure of the /etc/apache2 directory of an apt-installed httpd.

# In Ubuntu with apt-installed `httpd`

root@localhost:/etc/apache2# ll
total 88
drwxr-xr-x  8 root root  4096 Dec 28 05:19 ./
drwxr-xr-x 94 root root  4096 Dec 24 14:26 ../
-rw-r--r--  1 root root  7224 Jul 10 08:27 apache2.conf
drwxr-xr-x  2 root root  4096 Jul  6 08:26 conf-available/
drwxr-xr-x  2 root root  4096 Jul  6 08:26 conf-enabled/
-rw-r--r--  1 root root  1781 Jul  6 09:30 envvars
-rw-r--r--  1 root root 31063 Oct 10  2018 magic
drwxr-xr-x  2 root root 12288 Jul  6 09:29 mods-available/
drwxr-xr-x  2 root root  4096 Aug 16 08:18 mods-enabled/
-rw-r--r--  1 root root   451 Jul 10 04:34 ports.conf
drwxr-xr-x  2 root root  4096 Dec 23 06:28 sites-available/
drwxr-xr-x  2 root root  4096 Dec 23 06:28 sites-enabled/

The basic rules of httpd config grammar are the following.

Most Apache installations come with a default config, and we usually just make some modifications to it. However, to understand it well, we need to learn how to write a config from scratch. Thus, after introducing some basic rules, I will start by explaining a minimal config, then discuss virtual hosts and some frequently-used directives, and end with deploying Python WSGI applications.


1. A Minimal Configuration

Let’s start with a minimal configuration supporting static files, directory listing, and CGI scripts. The environment is macOS with default-installed httpd. httpd modules are installed in /usr/libexec/apache2.

# A minimal configuration
# Supporting static files, directory listing, and CGI scripts

ServerRoot /usr

LoadModule unixd_module libexec/apache2/mod_unixd.so
LoadModule authz_core_module libexec/apache2/mod_authz_core.so
LoadModule autoindex_module libexec/apache2/mod_autoindex.so
LoadModule cgi_module libexec/apache2/mod_cgi.so

User _www
Group _www
Listen 80

ServerName default
DocumentRoot /var/www

ErrorLog /var/log/apache2/error_log

<Directory />
        Require all denied
        Options None
        AllowOverride None
</Directory>

<Directory /var/www>
        Require all granted
</Directory>

<Directory /var/www/cgi-bin>
        Options ExecCGI
        SetHandler cgi-script
</Directory>

<Directory /var/www/files>
        Options +Indexes
</Directory>

The ServerRoot directive sets the directory in which the server lives. We rarely modify the value after installation. Most directives resolve relative paths against this value. Be careful: <Directory> resolves paths against the root of the file system, not the server root.

Then we load the modules we need. LoadModule modname modfile loads the module modname from the file modfile. If modfile is a relative path, it is resolved against the server root, as mentioned above.

The next two lines specify which user and group the httpd daemon runs as.

Listen 80 tells httpd to listen on port 80.

ServerName sets the name of the server. It’s not important in this example and you could use an arbitrary value. But it becomes important when virtual hosts come into play. Virtual hosts will be discussed later.

DocumentRoot /var/www maps the root of the network URI to the local path /var/www. Visiting http://yourdomain/* will access the file /var/www/*.

The ErrorLog directive, as its name implies, specifies where to put error logs.

Directives in the configuration apply to the entire server by default. If you want some directives to apply to only a part of the server, scope them by placing them in <Directory>, <DirectoryMatch>, <Location>, <LocationMatch>, <Files>, <FilesMatch>, and <VirtualHost> blocks.

<Directory> and <Files> blocks mean what their names imply. <Location> means a network URI. Each <*Match> block is the regex version of its counterpart.

We apply Require all denied to the directory /, the root of the file system. Require all denied means that the server rejects every request that accesses the directory. Subdirectories inherit the configuration, so no one can access any file on the host. This is what people often do: protect the whole file system first to avoid security issues, then open specific subdirectories to the web. httpd can let users put a file named .htaccess in a directory to override the configuration of that directory and its subdirectories. We place AllowOverride None to forbid all such overriding (since httpd 2.3.9 this is also the default, but stating it keeps the intent explicit). The Options directive controls which features are available in the directory and its subdirectories, such as executing CGI scripts or listing files. Thus we put Options None to forbid all of them.

As we specify /var/www as the document root, we need to grant access to that directory. That’s why we put Require all granted in the <Directory /var/www> block.

Visiting http://yourdomain/* returns the content of the file /var/www/* by default. However, we want to put CGI scripts in /var/www/cgi-bin so that, when someone visits http://yourdomain/cgi-bin/*, the server executes the script and returns its output. We must give the directory permission to do so. That’s what Options ExecCGI does. Aside from the permission, we need to tell httpd to handle files in the directory as CGI programs instead of static files. So we put SetHandler cgi-script.
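For illustration, here is a minimal sketch of a script that could live in /var/www/cgi-bin. The file name hello.py and the greeting text are hypothetical. What CGI itself requires is that the program prints header lines, then a blank line, then the body; the file must also be executable by the user httpd runs as.

```python
#!/usr/bin/env python3
# Hypothetical CGI script, e.g. saved as /var/www/cgi-bin/hello.py (the name is an assumption).
import os
import sys


def build_response(environ):
    """Return the full CGI response: header lines, a blank line, then the body."""
    # httpd hands request details to CGI programs through environment variables.
    method = environ.get("REQUEST_METHOD", "GET")
    query = environ.get("QUERY_STRING", "")
    body = f"Hello from CGI. method={method} query={query}\n"
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(build_response(os.environ))
```

Keeping the response in a function makes the script easy to check from a Python shell before httpd is involved.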

The final part of this example, Options +Indexes in the block <Directory /var/www/files>, is to permit users to see the file list of the directory. In this directory, we provide downloadable files for users. If a user visits http://yourdomain/files/ he or she will get the file list of /var/www/files.

You may have noticed a subtle difference between Options ExecCGI and Options +Indexes: the + sign. Without +, the options replace the inherited ones. With +, they are added to the inherited ones. You may guess that there is a - sign too. You’re right.
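The replace-versus-merge rule can be sketched as a small Python model. This is only an illustration of the rule, not how httpd is implemented; the function name apply_options is my own, and special values other than None (such as All) are left out.

```python
def apply_options(inherited, directive_args):
    """Return the set of options in force after one Options directive.

    Simplified model: only the replace-versus-merge rule is covered.
    """
    signed = [arg[0] in "+-" for arg in directive_args]
    if all(signed):
        # Every option carries a sign: merge with what was inherited.
        result = set(inherited)
        for arg in directive_args:
            if arg[0] == "+":
                result.add(arg[1:])
            else:
                result.discard(arg[1:])
        return result
    if any(signed):
        # httpd 2.4 rejects a mix of signed and unsigned options at startup.
        raise ValueError("cannot mix +/- options with plain ones")
    # Plain options replace whatever was inherited; "None" clears everything.
    return set(directive_args) - {"None"}


root = apply_options(set(), ["None"])        # <Directory />
cgi_bin = apply_options(root, ["ExecCGI"])   # <Directory /var/www/cgi-bin>
files = apply_options(root, ["+Indexes"])    # <Directory /var/www/files>
```

In the minimal configuration both forms give the same result because the inherited set is empty. They differ as soon as a parent directory enables something.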

2. Virtual Hosts

httpd allows us to host multiple sites on one host. This is implemented by the <VirtualHost> block.

For example, suppose you want to build two sites, www1.example.com and www2.example.com, each with a structure similar to the minimal configuration above. Their document roots are /var/www/www1 and /var/www/www2.

ServerRoot /usr

LoadModule unixd_module libexec/apache2/mod_unixd.so
LoadModule authz_core_module libexec/apache2/mod_authz_core.so
LoadModule autoindex_module libexec/apache2/mod_autoindex.so
LoadModule cgi_module libexec/apache2/mod_cgi.so

User _www
Group _www
Listen 80
ErrorLog /var/log/apache2/error_log

<Directory />
        Require all denied
        AllowOverride None
</Directory>

<VirtualHost *:80>
        ServerName www1.example.com
        DocumentRoot /var/www/www1

        ErrorLog /var/log/apache2/www1_error_log

        <Directory /var/www/www1>
                Require all granted
        </Directory>

        <Directory /var/www/www1/cgi-bin>
                Options ExecCGI
                SetHandler cgi-script
        </Directory>

        <Directory /var/www/www1/files>
                Options +Indexes
        </Directory>
</VirtualHost>

<VirtualHost *:80>
        ServerName www2.example.com
        DocumentRoot /var/www/www2

        ErrorLog /var/log/apache2/www2_error_log

        <Directory /var/www/www2>
                Require all granted
        </Directory>

        <Directory /var/www/www2/cgi-bin>
                Options ExecCGI
                SetHandler cgi-script
        </Directory>

        <Directory /var/www/www2/files>
                Options +Indexes
        </Directory>
</VirtualHost>

After pointing the domains www1.example.com and www2.example.com to the IP of your host in your DNS (for testing, modifying /etc/hosts works too), visit the two domains. You will get two sites, one served from /var/www/www1 and the other from /var/www/www2.

Why do we put an ErrorLog in the global scope? Because some errors concern the entire server. For example, httpd can’t launch for some reason.

We can visit the host by two domains now. What if someone visits the host by IP? Which site will be served? The answer is the first virtual host. To forbid visiting by IP, we can create an empty virtual host before all other virtual hosts.

<VirtualHost *:80>
        ServerName default
        DocumentRoot /var/www

        <Directory /var/www>
                Require all denied
        </Directory>
</VirtualHost>

If we visit the host by IP, httpd finds that no virtual host matches, so it falls back to the first one, and the first one forbids access to everything.
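The selection logic can be sketched in Python. This is a simplified model under stated assumptions: all virtual hosts share the same *:80 address, ServerAlias is ignored, and the IP 203.0.113.10 is a made-up documentation address standing in for your host.

```python
# Virtual hosts in the order they appear in the configuration.
VHOSTS = [
    {"ServerName": "default", "DocumentRoot": "/var/www"},
    {"ServerName": "www1.example.com", "DocumentRoot": "/var/www/www1"},
    {"ServerName": "www2.example.com", "DocumentRoot": "/var/www/www2"},
]


def select_virtual_host(vhosts, host_header):
    """Pick the virtual host for a request from its Host header."""
    # The Host header may carry a port ("www1.example.com:80"); names are
    # case-insensitive. IPv6 literals are not handled in this sketch.
    requested = host_header.split(":")[0].lower()
    for vhost in vhosts:
        if vhost["ServerName"].lower() == requested:
            return vhost
    # No name matched: fall back to the first virtual host.
    return vhosts[0]


by_name = select_virtual_host(VHOSTS, "www2.example.com")
by_ip = select_virtual_host(VHOSTS, "203.0.113.10")
```

Here by_name is the www2 site, while by_ip is the empty default site that denies everything.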

Finally, we discuss the VirtualHost directive itself.

<VirtualHost addr[:port] [addr[:port]] ...> ... </VirtualHost>

addr is an IP of the host; port is what the name implies and is optional. A host can have multiple IPs, and we can build a separate virtual host for each IP. In the above example, we use <VirtualHost *:80>. It means a request to port 80 on any IP will be sent to this virtual host if the Host header of the request matches the server name specified by ServerName.

Be careful: setting addr and port in VirtualHost doesn’t replace the Listen directive. All addresses and ports must still be specified by Listen directives in the global scope.

3. Organizing Configuration Files

We have a basic understanding of httpd configuration now. As the configuration grows more complicated, it becomes unmaintainable if all directives live in one file. We can split the configuration into multiple files and use the Include directive in the entry-point file to include the other parts.

Include file-path|directory-path|wildcard

A good example is the apt-installed httpd. Its configuration files, including those that load modules, are all in /etc/apache2.

root@localhost:/etc/apache2# ll
total 88
drwxr-xr-x  8 root root  4096 Dec 31 09:43 ./
drwxr-xr-x 94 root root  4096 Dec 24 14:26 ../
-rw-r--r--  1 root root  7224 Jul 10 08:27 apache2.conf
drwxr-xr-x  2 root root  4096 Jul  6 08:26 conf-available/
drwxr-xr-x  2 root root  4096 Jul  6 08:26 conf-enabled/
-rw-r--r--  1 root root  1781 Jul  6 09:30 envvars
-rw-r--r--  1 root root 31063 Oct 10  2018 magic
drwxr-xr-x  2 root root 12288 Jan  2 10:02 mods-available/
drwxr-xr-x  2 root root  4096 Aug 16 08:18 mods-enabled/
-rw-r--r--  1 root root   451 Jul 10 04:34 ports.conf
drwxr-xr-x  2 root root  4096 Jan  1 13:53 sites-available/
drwxr-xr-x  2 root root  4096 Dec 23 06:28 sites-enabled/

apache2.conf is the entry point. All virtual hosts are in sites-available, one site per file. Not all sites (virtual hosts) are enabled. For each enabled site, a symbolic link is created in sites-enabled.

Just like virtual hosts, modules are organized with the mods-available and mods-enabled directories. In this version more modules, such as unixd_module, are compiled directly into httpd, so we don’t have to load them explicitly.

To enable a site or a module, we don’t need to create symbolic links ourselves. The apt-installed httpd provides two commands, a2ensite and a2enmod. If you have a site in /etc/apache2/sites-available/example.conf, you can use a2ensite example to enable it. Likewise, a2enmod proxy enables mod_proxy (a2enmod takes the module name without the mod_ prefix).

All Listen directives are placed in ports.conf.

Let’s see how apache2.conf includes virtual hosts.

# Include the virtual host configurations:
IncludeOptional sites-enabled/*.conf

Notice that it uses IncludeOptional instead of Include. The difference is that IncludeOptional is silently ignored (instead of causing an error) if wildcards are used and they do not match any file or directory, or if a file path does not exist on the file system.

Even if you’re not using the apt-installed httpd, I recommend this structure.

This concludes the basic concepts of httpd configuration. The next part introduces some frequently-used directives and common solutions. You can also stop reading this tutorial here and go to the official documentation.

From the next chapter on, I assume we are using Ubuntu with apt-installed httpd.

4. Frequently-used Directives

Logging

We saw the ErrorLog directive when we discussed the minimal configuration and virtual hosts. Besides recording errors, we also want to record accesses. For that we need mod_log_config. The apt-installed version already has the module built in. If your installation does not, load it explicitly.

LoadModule log_config_module path-to-file

Two directives matter here: CustomLog and LogFormat. CustomLog sets the path of the log file and the format. The format can be a C-style format string or a nickname. A nickname refers to a format predefined by LogFormat.

# CustomLog with format nickname
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "logs/access_log" common

# CustomLog with explicit format string
CustomLog "logs/access_log" "%h %l %u %t \"%r\" %>s %b"

The complete list of format specifiers is in the documentation of mod_log_config.
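To show what the common format produces, the following Python sketch parses one access-log line back into its fields. The group names and the sample line are my own choices for illustration, and the regex is a simplification: it does not handle escaped quotes inside the request line.

```python
import re

# Regex mirroring "%h %l %u %t \"%r\" %>s %b"; the group names are my own labels.
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_common_log(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = COMMON_LOG.match(line)
    return match.groupdict() if match else None


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_common_log(sample)
```

Each field in entry corresponds to one specifier: %h is the client host, %u the authenticated user, %>s the final status, and %b the response size in bytes (or - when nothing was sent).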

Redirection

mod_alias provides Redirect and RedirectMatch directives which redirect one URL to another.

Redirect [status] [URL-path] URL

status sets how the redirect happens at the HTTP level. Is it a 301 permanent redirection or a 302 temporary redirection?

Redirect permanent /imgs http://your.cdn.com/imgs

With the above directive, any request to /imgs/logo.gif will return a 301 redirection to http://your.cdn.com/imgs/logo.gif.

RedirectMatch is the regex version of Redirect. To achieve the same as the above directive, use

RedirectMatch permanent /imgs/(.*) http://your.cdn.com/imgs/$1

As we see, regex captures are available through $1, $2, ..., like in Perl.
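The capture mechanism can be tried out in Python. The sketch below is a rough model of RedirectMatch, not its implementation: the function name redirect_match is hypothetical, and the example uses the pattern /imgs/(.*) with a target that keeps the /imgs/ prefix so that the path is preserved on the CDN.

```python
import re


def redirect_match(pattern, target, url_path):
    """Return the redirect target for url_path, or None when the pattern does not match."""
    match = re.search(pattern, url_path)
    if match is None:
        return None
    result = target
    # Substitute $1, $2, ... with the captured groups (fine for up to nine groups).
    for index, group in enumerate(match.groups(), start=1):
        result = result.replace(f"${index}", group or "")
    return result


hit = redirect_match(r"/imgs/(.*)", "http://your.cdn.com/imgs/$1", "/imgs/logo.gif")
miss = redirect_match(r"/imgs/(.*)", "http://your.cdn.com/imgs/$1", "/css/site.css")
```

Here hit is http://your.cdn.com/imgs/logo.gif and miss is None, meaning the request is not redirected.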

Let’s see a real example here. A site has the following domains:

What we want is to redirect all requests to www.example.com permanently. Have a look at the configuration of example.net.

<VirtualHost *:80>
        ServerName example.net
        ErrorLog ${APACHE_LOG_DIR}/example.net.log
        CustomLog ${APACHE_LOG_DIR}/example.net.log combined

        Redirect permanent / http://www.example.com/
</VirtualHost>

Be careful here: I use Redirect permanent / http://www.example.com/. The target URL ends with a trailing slash. If it doesn’t, visiting http://example.net/abc will redirect to http://www.example.comabc instead of http://www.example.com/abc. This only happens when redirecting the root URI.
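The reason is that Redirect appends whatever follows the matched URL-path to the target. A tiny Python model makes this visible; redirect_prefix is a hypothetical name, and the model ignores that httpd only matches on whole path segments.

```python
def redirect_prefix(url_path_prefix, target, request_path):
    """Model of Redirect: the part of the path after the matched prefix is appended to the target."""
    if not request_path.startswith(url_path_prefix):
        return None
    return target + request_path[len(url_path_prefix):]


with_slash = redirect_prefix("/", "http://www.example.com/", "/abc")
without_slash = redirect_prefix("/", "http://www.example.com", "/abc")
```

With the prefix / the remainder of /abc is abc, so the target has to supply the slash itself. With a prefix such as /imgs the remainder already starts with a slash, which is why the problem only shows up for the root.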

You may have noticed that I used ${APACHE_LOG_DIR}. This also demonstrates how to use environment variables in an httpd configuration. Any environment variable can be referenced as ${envvar_name}. In apt-installed httpd, httpd-specific environment variables are defined in /etc/apache2/envvars.

root@localhost:/etc/apache2# grep export envvars
export APACHE_RUN_USER=www-data
export APACHE_RUN_GROUP=www-data
export APACHE_PID_FILE=/var/run/apache2$SUFFIX/apache2.pid
export APACHE_RUN_DIR=/var/run/apache2$SUFFIX
export APACHE_LOCK_DIR=/var/lock/apache2$SUFFIX
export APACHE_LOG_DIR=/var/log/apache2$SUFFIX
export LANG=C
export LANG
#export APACHE_LYNX='www-browser -dump'
#export APACHE_ARGUMENTS=''
#export APACHE2_MAINTSCRIPT_DEBUG=1

Rewriting

mod_alias provides simple redirection functions. However, sometimes we need more complicated behavior. For example, suppose we have an SPA (Single Page Application) and our requirements are the following.

This is different from redirection. We don’t send a 30x response to the browser but handle the request within the server. This more complicated behavior is called rewriting, and it needs mod_rewrite.

<VirtualHost *:80>
        ...

        RewriteEngine On
        <Location />
                RewriteBase /
                RewriteRule ^index\.html$ - [L]
                RewriteCond %{REQUEST_FILENAME} !-f
                RewriteCond %{REQUEST_FILENAME} !-d
                RewriteRule . /index.html [L]
        </Location>
</VirtualHost>

Let’s explain them line by line.
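As a rough reading of the rules shown above, the decision they encode can be modelled in a few lines of Python. This is a sketch of the logic only: resolve_spa_request is a hypothetical name, and the exists callback stands in for the -f and -d file-system tests that mod_rewrite performs.

```python
def resolve_spa_request(url_path, exists):
    """Return the URL path that is finally served for url_path.

    exists(path) should tell whether the path maps to an existing file or directory.
    """
    # RewriteRule ^index\.html$ - [L]: index.html itself is left untouched.
    if url_path == "/index.html":
        return url_path
    # RewriteCond ... !-f and !-d: existing files and directories are served as they are.
    if exists(url_path):
        return url_path
    # RewriteRule . /index.html [L]: everything else is answered by the SPA entry page.
    return "/index.html"


# Hypothetical site content used only for this illustration.
site = {"/", "/index.html", "/app.js"}
served_asset = resolve_spa_request("/app.js", site.__contains__)
served_route = resolve_spa_request("/users/42", site.__contains__)
```

A real asset such as /app.js is served directly, while a client-side route such as /users/42 is answered with /index.html and the browser URL stays unchanged.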

Gateway

Suppose we’re developing a web application that needs to fetch data from an external site. We can’t fetch the data in JavaScript because of the same-origin policy. Thus we need to create a service on our own server as a proxy. Requests to this service are sent to the external site through our server, and responses from the external site are sent back to the client through our server. This behavior is called a gateway or reverse proxy. httpd creates gateways with mod_proxy.

ProxyPass /data http://www.external.com/data

Because we proxy to an HTTP server, we must ensure mod_proxy_http is enabled too.

Now http://yourdomain.com/data/* will be forwarded to http://www.external.com/data/*. However, if the external site sends a redirection, the browser will visit the external site directly, which is not what we want. Thus we need the ProxyPassReverse directive.

ProxyPass /data http://www.external.com/data
ProxyPassReverse /data http://www.external.com/data

ProxyPassReverse lets httpd adjust the URL in the Location, Content-Location and URI headers on HTTP redirect responses. This is essential when httpd is used as a reverse proxy (or gateway) to avoid bypassing the reverse proxy because of HTTP redirects on the backend servers which stay behind the reverse proxy.
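What ProxyPassReverse does to a redirect can be sketched as follows. This is a simplified model with a hypothetical function name; public_base stands for the address under which clients reach our own server, which is an assumption of the example.

```python
def proxy_pass_reverse(public_base, local_prefix, backend_url, location_header):
    """Rewrite a backend redirect so that it points back at the proxy."""
    if location_header.startswith(backend_url):
        return public_base + local_prefix + location_header[len(backend_url):]
    # Redirects to anywhere else are passed through unchanged.
    return location_header


rewritten = proxy_pass_reverse(
    "http://yourdomain.com", "/data",
    "http://www.external.com/data",
    "http://www.external.com/data/login",
)
```

The browser then follows the redirect to http://yourdomain.com/data/login and keeps talking to the proxy instead of the external site.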

5. Common Solutions

Authentication

Suppose you’re responsible for setting up a blog or an online document system for your department. What is the simplest way? Actually, we don’t need to write any “real code”. Make a directory on the server and serve it by httpd with mod_autoindex. Documents are simply placed in the directory, and a read-only document system is built. Read the documentation of mod_autoindex and you will find you can beautify the UI in many ways. OK, how do we allow users to write? Set up an FTP server serving the directory. This way you even get an authentication system for writing.

The only remaining problem is to set up an authentication system for reading. httpd can do this with its authn/authz modules.

Enter the /etc/apache2 directory. Execute htpasswd -c authusers smith and type a password.

root@localhost:/etc/apache2# htpasswd -c authusers smith
New password:
Re-type new password:
Adding password for user smith

After you enter and confirm the password, the file authusers is created in /etc/apache2. It records a user named smith and his hashed password.

root@localhost:/etc/apache2# cat authusers
smith:$apr1$5316jbpB$7gD0bbTHrpG6ydUsMM.2l.

Then we create a new virtual host in /etc/apache2/sites-available/papers.your.com.conf and write the following.

<VirtualHost *:80>
        ServerName papers.your.com

        DocumentRoot /var/www/papers.your.com

        <Directory /var/www/papers.your.com>
                Options +Indexes
                AuthType Basic
                AuthName YourCompany
                AuthUserFile authusers
                Require valid-user
        </Directory>
</VirtualHost>

Directives are so intuitive that I don’t think there is a need for more explanation.

Use a2ensite papers.your.com to enable the site. After DNS is set, you can visit the site, and the browser will ask you for a username and a password.

Let users put their papers in /var/www/papers.your.com via FTP and your document system is online.

How to add more users?

root@localhost:/etc/apache2# htpasswd authusers username

This command adds the user if it does not exist. If the user does exist, it updates the password. Do not use -c here, or your authusers file will be overwritten.

Deleting a user is also simple.

root@localhost:/etc/apache2# htpasswd -D authusers username

More usages of htpasswd are in its documentation.

The authentication/authorization modules of httpd can do more than I demonstrated; you can even set up user groups. Combining them with <Directory> and <Location> blocks gives us a very flexible authentication and authorization system. Check the official documentation to see more.

HTTPS

When you set up the authentication above, you may have noticed the browser warning you: “Your password will be sent unencrypted.” So, let’s add HTTPS.
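The warning is justified: with AuthType Basic the browser sends the username and password merely base64-encoded in the Authorization header. The Python sketch below shows how easily the header is reversed; the password secret is a made-up example value.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value a browser sends for Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def decode_basic_auth(header_value):
    """Recover (username, password) from the header; no key or secret is needed."""
    token = header_value.split(" ", 1)[1]
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("smith", "secret")   # hypothetical password
recovered = decode_basic_auth(header)
```

Anyone who can observe the plain HTTP traffic can run the second function, which is why Basic authentication should only be used over HTTPS.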

Make sure mod_ssl is loaded in httpd. Place your certificate and key on the server, for example, at /etc/ca-certificates/papers.your.com.cert and /etc/ca-certificates/papers.your.com.key.

Copy papers.your.com.conf to papers.your.com-ssl.conf and modify it to the following.

<VirtualHost *:443>
        ServerName papers.your.com

        DocumentRoot /var/www/papers.your.com

        SSLEngine on
        SSLCertificateFile /etc/ca-certificates/papers.your.com.cert
        SSLCertificateKeyFile /etc/ca-certificates/papers.your.com.key

        <Directory /var/www/papers.your.com>
                Options +Indexes
                AuthType Basic
                AuthName YourCompany
                AuthUserFile authusers
                Require valid-user
        </Directory>
</VirtualHost>

The HTTPS site is online now but we’re not finished. When someone visits the HTTP site, we want them to be redirected to the HTTPS site. To do so, we can modify papers.your.com.conf to use a Redirect directive. This is acceptable, but here we use RewriteRule, and the reason will be explained shortly.

<VirtualHost *:80>
        ServerName papers.your.com

        RewriteEngine on
        RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent]
</VirtualHost>

We rewrite all requests to the HTTPS site. Because RewriteRule can reference %{SERVER_NAME}, we write the domain only once. We choose RewriteRule over Redirect for maintainability.

The R=permanent flag indicates the rewriting is actually a 301 redirection. Other flags are not that important. Check the documentation of mod_rewrite to see details.

WSGI

Your document system satisfies you and me. However, your boss may think it’s not cool and require a real web application. So you write one in Python. Now you need to deploy this WSGI application.

Let’s write a simple Flask application as an example.

from flask import Flask

app = Flask(__name__)

@app.route('/hi/<name>')
def hi(name):
        return f'Hi, {name}.'

The easiest method is to deploy WSGI as CGI. Write the following script yourapp in your cgi-bin directory.

#!/usr/bin/env python3

from wsgiref.handlers import CGIHandler
from yourwsgiapp import app

CGIHandler().run(app)

Visiting http://yourdomain/cgi-bin/yourapp/hi/there, you will see “Hi, there.”.
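The URL works because CGI splits it in two: /cgi-bin/yourapp becomes SCRIPT_NAME and the rest, /hi/there, becomes PATH_INFO, which is the part the application routes on. The sketch below calls a WSGI callable directly with a chosen PATH_INFO. To stay dependency-free it uses a hand-written stand-in for the Flask app; both app and call_app are hypothetical helpers for illustration.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Dependency-free stand-in for the Flask app: answers /hi/<name>."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/hi/") and len(path) > len("/hi/"):
        status, text = "200 OK", f"Hi, {path[len('/hi/'):]}."
    else:
        status, text = "404 Not Found", "Not found."
    body = text.encode("utf-8")
    start_response(status, [("Content-Type", "text/plain; charset=utf-8"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call_app(wsgi_app, path):
    """Invoke a WSGI app for a GET on path and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the mandatory WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], body.decode("utf-8")


result = call_app(app, "/hi/there")
```

The same call_app helper should also accept a Flask application object, since Flask apps are WSGI callables.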

Our WSGI application may be installed in a virtual environment. To make the CGI script run in the virtual environment, we just modify the shebang line.

#!/path-to-your-venv/bin/python

from wsgiref.handlers import CGIHandler
from yourwsgiapp import app

CGIHandler().run(app)

Besides CGI, you can use mod_wsgi. See its documentation.

Both CGI and mod_wsgi are very old technologies. Nowadays many people choose standalone WSGI containers like Gunicorn. A standalone WSGI container often runs on a port other than 80. We can use mod_proxy to forward requests for dynamic resources to the standalone WSGI container and serve static resources by httpd itself.

Suppose our standalone WSGI container is running on port 8000. We could write the following in the configuration.

<VirtualHost *:80>
        ...

        ProxyPass /dynamic http://localhost:8000
        ProxyPassReverse /dynamic http://localhost:8000

        ...
        # Directives serving static files
        ...
</VirtualHost>
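Note that the /dynamic prefix is stripped before the request reaches the container. A small Python model of the mapping, with the hypothetical name proxy_pass, shows which backend URL a request ends up at:

```python
def proxy_pass(local_prefix, backend_url, request_path):
    """Return the backend URL a request is forwarded to, or None if httpd serves it itself."""
    if request_path == local_prefix or request_path.startswith(local_prefix + "/"):
        return backend_url + request_path[len(local_prefix):]
    return None


forwarded = proxy_pass("/dynamic", "http://localhost:8000", "/dynamic/hi/there")
static = proxy_pass("/dynamic", "http://localhost:8000", "/static/site.css")
```

So a request for /dynamic/hi/there arrives at the WSGI container as /hi/there, which matches the /hi/<name> route of the Flask example, while /static/site.css is not proxied and is left to httpd.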