dyx
2022-01-06 +0800
This document records an experiment on how the number of nodes in a trie increases with the number of strings which are English words.
The trie used in this experiment is a simple C implementation. A node is represented by a struct trienode
. The next
field of struct trienode
is an array of pointers to children and the stop
field denotes where there is a string stops at the node. The APLSIZE
macro represents the size of the alphabet. I defined APLSIZE
to be 0x100
(256). This is too large for English words but it won’t effect the number of nodes and brings brevity to the code. The global variable root
points to the root of the trie and nnodes
stores the number of the nodes.
#define APLSIZE 0x100
struct trienode {
struct trienode *next[APLSIZE];
int stop;
} *root;
int nnodes = 1;
The only operation of the trie I need is the triepush
function, which inserts a string into the trie and returns the new number of the nodes.
int triepush(unsigned char *key)
{
struct trienode *p;
for (p = root; *key != '\0'; p = p->next[*key++])
if (NULL == p->next[*key]) {
p->next[*key] = malloc(sizeof(struct trienode));
memset(p->next[*key], 0, sizeof(struct trienode));
++nnodes;
}
p->stop = 1;
return nnodes;
}
The file /usr/share/dict/words
shipped with most UNIX-compatible systems contains English words one per line. In macOS 12.1, it’s a symbolic link to /usr/share/dict/web2
, which according to the README
file, contains all 234,936 words1 from Webster’s Second International. This file is the dataset I used in this experiment.
The /usr/share/dict/web2
file is nearly lexicographically sorted2 in a manner of case insensibility. Although the order of words doesn’t affect the final number of the nodes, it does affect the growth curve. Thus the file was shuffled in the experiment. I named the shuffled file web2
, without the leading path.
% head -n3 web2
Teleut
biliously
palaeofauna
% tail -n3 web2
bass
unsatirized
sulphobismuthite
The result is shown in the following graph. The X-axis is the number of words. The Y-axis is the number of non-root nodes. The solid curve represents the raw data. The dashed curve represents the liner-regressed result.
Denoting the number of words as x, the number of non-root nodes as y, the liner regression shows that:
y ~= 3.64 * x
A gzipped tarball, which contains the test code, the shuffled web2
, and a Makefile
, is available. Executing make
will generate:
restab
: the raw result datannodes.png
: the chartk
: the value of k for the regression to y = kxTo make make
work, the following dependencies are required:
To make sure the same results can be generated, the versions of the dependencies used in this experiment are listed in Appendix II.
% cc --version
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: x86_64-apple-darwin21.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
% awk --version
awk version 20200816
% graph --version
graph (GNU plotutils) 2.6
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Robert S. Maier.
% convert --version
Version: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): bzlib djvu fftw fontconfig freetype gslib heic jbig jng jp2 jpeg lcms ltdl lzma openexr png ps raw tiff webp x xml zlib