This paper will appear in Information Processing Letters. This website is designed to support the paper, by providing additional information concerning the hashing and splay tree code used in the paper.
In the paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient.
We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show in the paper, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently-accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%--40% slower than a table with around one string per slot---while a table without move-to-front is perhaps 40% slower again---and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
Justin Zobel is the author of the code, and permission to use this code is freely granted, provided that the permission statement at the top of each file is kept.
To experiment with publically-available sources of words, we suggest downloading books and other documents from:
Hugh Williams, 15 May 2001.