Zettair Index Build

Zettair builds inverted indexes by parsing several types of source collection. Read the format descriptions under Index Types to understand how an index is constructed from the given data. Currently, the following index types are supported:

Usage: zet -i file1 ... fileN

Index construction options

Sample Command Line:

zet -i -f disk45 -c /research/zettair/config/parser_settings.trec -t TREC /research/TREC/disk45/fbis /research/TREC/disk45/fr /research/TREC/disk45/ft /research/TREC/disk45/latimes

This command will use the TREC parser to create an inverted index from the four listed files. You should then find the following index files:

Index Types

HTML Format

The HTML parser treats each file as one document in HTML format. Text is extracted from HTML documents according to the parser settings file, documented above.

TREC Format

It is often advantageous to combine several, or even several thousand, documents into one file, so that a single file is indexed and searched rather than thousands of small ones. To do this, write the contents of the original files into one file, formatted so that the parser can detect the original document boundaries. The parser extracts words from the file in much the same way as in HTML mode. In addition, the TREC parser looks for the tags <DOC> and </DOC>, which mark the beginning and end of a document, and identifies each document by its TREC document number, which is found between the <DOCNO> and </DOCNO> tags. The format is named after the Text Retrieval Conference (TREC), which uses it for experimental data.
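As a rough illustration of combining several documents into one TREC-style file, the following Python sketch wraps each document in the tags described above. The function name to_trec, the document numbers, and the one-tag-per-line layout are assumptions made for this example; they are not part of Zettair, and the sketch does no escaping, so document text that itself contains </DOC> would break the boundaries.

```python
def to_trec(docs):
    """Wrap (docno, text) pairs in <DOC> and <DOCNO> tags, one document after another.

    Hypothetical helper for illustration only; not part of Zettair.
    """
    parts = []
    for docno, text in docs:
        parts.append(
            "<DOC>\n"
            f"<DOCNO> {docno} </DOCNO>\n"
            f"{text.strip()}\n"
            "</DOC>\n"
        )
    return "".join(parts)


# Two made-up documents combined into a single string that could be
# written to one file and passed to the indexer.
sample = to_trec([("doc-1", "first text"), ("doc-2", "second text")])
print(sample)
```

The resulting string can be written to a single file, which then stands in for the original collection of small files.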

For instance, the following excerpt from the Bible represents 8 documents, 4 of which contain only one word.

<DOC> And the sons of Noah, that went forth of the ark, were Shem, and Ham,
and Japheth: and Ham is the father of Canaan. </DOC>
<DOC> genesis </DOC>
<DOC> These are the three sons of Noah: and of them was the whole earth overspread.</DOC>
<DOC> genesis </DOC>
<DOC> And Noah began to be an husbandman, and he planted a vineyard:</DOC>
<DOC> genesis </DOC>
<DOC> And he drank of the wine, and was drunken; and he was uncovered within his tent.</DOC>
<DOC> genesis </DOC>
TABLE 1: An excerpt of a TREC file
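To make the document boundaries concrete, here is a minimal Python sketch of how a file like the excerpt above could be split into documents. It is not Zettair's parser: the function name split_trec, the case-sensitive regular expressions, and the sample document number example-1 are assumptions made for this illustration.

```python
import re

# Non-greedy matches so that each <DOC> ... </DOC> pair is one document.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)


def split_trec(text):
    """Return one dict per document, with its number (if any) and its text.

    Hypothetical helper for illustration only; not part of Zettair.
    """
    docs = []
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        docs.append({
            "docno": docno.group(1).strip() if docno else None,
            # Remove the <DOCNO> element so only the document text remains.
            "text": DOCNO_RE.sub(" ", body).strip(),
        })
    return docs


excerpt = (
    "<DOC> <DOCNO> example-1 </DOCNO> And Noah began to be an husbandman, "
    "and he planted a vineyard:</DOC>\n"
    "<DOC> genesis </DOC>\n"
)

docs = split_trec(excerpt)
for doc in docs:
    print(doc["docno"], "|", doc["text"])
```

Applied to the full excerpt in Table 1, the same function would return 8 entries, each with docno set to None because the excerpt carries no <DOCNO> tags.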