In-memory Hash Tables for Accumulating Text Vocabularies


Justin Zobel, Steffen Heinz and Hugh E. Williams
Department of Computer Science, RMIT University
GPO Box 2476V, Melbourne 3001, Australia
jz@cs.rmit.edu.au, steffen@mds.rmit.edu.au, hugh@cs.rmit.edu.au

This paper will appear in Information Processing Letters. This website is designed to support the paper, by providing additional information concerning the hashing and splay tree code used in the paper.

A Brief Abstract

In the paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient.

We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show in the paper, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently-accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%--40% slower than a table with around one string per slot---while a table without move-to-front is perhaps 40% slower again---and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.

Code Used in the Paper

The code used for the experiments described in the paper is written in C, and experiments were carried out on an Intel-based machine running the Linux operating system. We provide here segments of that code, including:

Justin Zobel is the author of the code, and permission to use this code is freely granted, provided that the permission statement at the top of each file is kept.

Experimental Data

The experiments we performed involved the search and insertion of words extracted from large text collections. We used the Text REtrieval Conference (TREC)data in our experiments, which can only be obtained through participation in TREC or partially obtained by becoming a member of the Linguistic Data Consortium.

To experiment with publically-available sources of words, we suggest downloading books and other documents from:

Hugh Williams, 15 May 2001.