The Number of Nodes in a Trie of English Words

DYX

This document records an experiment on how the number of nodes in a trie increases with the number of strings which are English words.

The Implementation of the Trie

The trie used in this experiment is a simple C implementation. A node is represented by a struct trienode. The next field of struct trienode is an array of pointers to children and the stop field denotes where there is a string stops at the node. The APLSIZE macro represents the size of the alphabet. I defined APLSIZE to be 0x100 (256). This is too large for English words but it won’t effect the number of nodes and brings brevity to the code. The global variable root points to the root of the trie and nnodes stores the number of the nodes.

#define APLSIZE 0x100

struct trienode {
	struct trienode *next[APLSIZE];
	int stop;
} *root;

int nnodes = 1;

The only operation of the trie I need is the triepush function, which inserts a string into the trie and returns the new number of the nodes.

int triepush(unsigned char *key)
{
	struct trienode *p;

	for (p = root; *key != '\0'; p = p->next[*key++])
		if (NULL == p->next[*key]) {
			p->next[*key] = malloc(sizeof(struct trienode));
			memset(p->next[*key], 0, sizeof(struct trienode));
			++nnodes;
		}
	p->stop = 1;
	return nnodes;
}

The main function reads words from the standard input until meets the end of the file. For every read word, the program inserts it to the trie and prints three numbers.

#define MAXWORDSZ 0x200

int main(void)
{
	char word[MAXWORDSZ];
	int nwords;

	root = malloc(sizeof *root);
	memset(root, 0, sizeof *root);
	nwords = 0;
	while (fgets(word, sizeof word, stdin)) {
		++nwords;
		word[strcspn(word, "\n")] = '\0';
		triepush((unsigned char *)word);
		printf("%d %d %f\n", nwords, nnodes, (double)nwords / nnodes);
	}
	return 0;
}

The Data Set of Words

The file /usr/share/dict/words shipped with most UNIX-based systems contains English words one per line. In macOS, it’s a symbolic link to /usr/share/dict/web2, which according to the README file, contains all 234,936 words1 from Webster’s Second International. It’s a good data set for this experiment.

The /usr/share/dict/web2 file is nearly lexicographically sorted2 in a manner of case insensibility. Although the order of words doesn’t affect the final number of the nodes, it does affect the growth curve. Thus the file was shuffled in the experiment. I named the shuffled file web2, without the leading path.

% head -n3 web2
Teleut
biliously
palaeofauna

% tail -n3 web2
bass
unsatirized
sulphobismuthite

Feeding web2 to the test program generated the result table restab.

% head -n3 restab
1 7 0.142857
2 16 0.125000
3 27 0.111111

% tail -n3 restab
235884 792767 0.297545
235885 792768 0.297546
235886 792777 0.297544

Experiment Results

The restab file was turned into several charts.

nnodes.png

loadfac.png

loadfac-log.png

Conclusions

The 235886 words generates 792777 nodes. The load factor ranges from about 0.1 to 0.3. In another word, the number of the nodes ranges from about 3.3 to 10 times of the number of the words. The number of nodes increases at most linearly and the load factor increases nearly monotonically. The load factor increases very fast at the beginning, and it slows down to a approximately linear curve after tens of thousands of words are inserted.

Appendix I: Downloadable Source Code and Data Sets

The gzipped tarball, which contains the test code ntrienodes.c, the data set web2, and Makefile, is available. Executing make will generate restab and all the charts. A C compiler, the graph utility in plotutils, and the convert utility in ImageMagick are required. To make sure the same results can be generated, the versions of the utilities used in this experiment are listed in Appendix II.

Appendix II: The Versions of the Utilities

% cc --version
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: x86_64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

% graph --version
graph (GNU plotutils) 2.6
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Robert S. Maier.

% convert --version
Version: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules 
Delegates (built-in): bzlib djvu fftw fontconfig freetype gslib heic jbig jng jp2 jpeg lcms ltdl lzma openexr png ps raw tiff webp x xml zlib

Footnotes

  1. There are actually 235886 lines of the file, according to the output of the wc command. 

  2. The only exception is that pliers is put after plies, according to the report of diff and sort