The Number of Nodes in a Trie of English Words

dyx
2022-01-06 +0800

This document records an experiment on how the number of nodes in a trie increases with the number of strings which are English words.

The Implementation of the Trie

The trie used in this experiment is a simple C implementation. A node is represented by a struct trienode. The next field of struct trienode is an array of pointers to children and the stop field denotes where there is a string stops at the node. The APLSIZE macro represents the size of the alphabet. I defined APLSIZE to be 0x100 (256). This is too large for English words but it won’t effect the number of nodes and brings brevity to the code. The global variable root points to the root of the trie and nnodes stores the number of the nodes.

#define APLSIZE 0x100

struct trienode {
	struct trienode *next[APLSIZE];
	int stop;
} *root;

int nnodes = 1;

The only operation of the trie I need is the triepush function, which inserts a string into the trie and returns the new number of the nodes.

int triepush(unsigned char *key)
{
	struct trienode *p;

	for (p = root; *key != '\0'; p = p->next[*key++])
		if (NULL == p->next[*key]) {
			p->next[*key] = malloc(sizeof(struct trienode));
			memset(p->next[*key], 0, sizeof(struct trienode));
			++nnodes;
		}
	p->stop = 1;
	return nnodes;
}

The Data Set of Words

The file /usr/share/dict/words shipped with most UNIX-compatible systems contains English words one per line. In macOS 12.1, it’s a symbolic link to /usr/share/dict/web2, which according to the README file, contains all 234,936 words1 from Webster’s Second International. This file is the dataset I used in this experiment.

The /usr/share/dict/web2 file is nearly lexicographically sorted2 in a manner of case insensibility. Although the order of words doesn’t affect the final number of the nodes, it does affect the growth curve. Thus the file was shuffled in the experiment. I named the shuffled file web2, without the leading path.

% head -n3 web2
Teleut
biliously
palaeofauna

% tail -n3 web2
bass
unsatirized
sulphobismuthite

Conclusions

The result is shown in the following graph. The X-axis is the number of words. The Y-axis is the number of non-root nodes. The solid curve represents the raw data. The dashed curve represents the liner-regressed result.

nnodes.png

Denoting the number of words as x, the number of non-root nodes as y, the liner regression shows that:

y ~= 3.64 * x

Appendix I: Downloadable Source Code and Data Sets

A gzipped tarball, which contains the test code, the shuffled web2, and a Makefile, is available. Executing make will generate:

To make make work, the following dependencies are required:

To make sure the same results can be generated, the versions of the dependencies used in this experiment are listed in Appendix II.

Appendix II: The Versions of the Utilities

% cc --version
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: x86_64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

% awk --version
awk version 20200816

% graph --version
graph (GNU plotutils) 2.6
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Robert S. Maier.

% convert --version
Version: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules 
Delegates (built-in): bzlib djvu fftw fontconfig freetype gslib heic jbig jng jp2 jpeg lcms ltdl lzma openexr png ps raw tiff webp x xml zlib

Footnotes

  1. There are actually 235886 lines of the file, according to the output of the wc command. 

  2. The only exception is that pliers is put after plies, according to the report of diff and sort