Chapter 2: The term vocabulary and postings lists

pp. 18-44

Authors

Stanford University, California; Google, Inc.; Universität Stuttgart
Summary

Recall the major steps in inverted index construction:

  • Collect the documents to be indexed.

  • Tokenize the text.

  • Do linguistic preprocessing of tokens.

  • Index the documents that each term occurs in.

In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.
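The four construction steps recalled above can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the function names (`tokenize`, `normalize`, `build_index`), the regular-expression tokenizer, and the use of lowercasing as the only form of linguistic preprocessing are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Chop a character stream into tokens (assumed here: runs of word characters)."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Map a token to its equivalence class (assumed here: lowercasing only)."""
    return token.lower()


def build_index(docs):
    """Build a term -> sorted list of document IDs mapping (a simple postings list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():          # step 1: the collected documents
        for token in tokenize(text):           # step 2: tokenize the text
            term = normalize(token)            # step 3: linguistic preprocessing
            postings[term].add(doc_id)         # step 4: record the document for the term
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical three-document collection used only for illustration.
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar",
    3: "romans and friends",
}
index = build_index(docs)
```

Because "Romans" and "romans" fall into the same equivalence class after lowercasing, `index["romans"]` lists both documents 1 and 3. A real system would make more careful choices at each step, which is what the rest of the chapter discusses.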

Document delineation and character sequence decoding

Obtaining the character sequence in a document

Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters.
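As a small illustration of this first step, the sketch below decodes a byte sequence into a string of characters in Python. The helper name `decode_bytes`, the default of UTF-8, and the fallback to Latin-1 are assumptions made for the example; they are not a recommendation from the text, and a real system would need a more principled way to determine the encoding.

```python
def decode_bytes(raw, encoding="utf-8"):
    """Convert a byte sequence into a character sequence.

    Tries the given encoding first; if the bytes are not valid in it,
    falls back to Latin-1 (an assumption for this sketch: Latin-1 maps
    every byte to some character, so it never raises, but it may
    produce the wrong characters).
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# The same word stored under two different encodings.
utf8_bytes = "Universität".encode("utf-8")      # 'ä' takes two bytes
latin1_bytes = "Universität".encode("latin-1")  # 'ä' takes one byte

text_a = decode_bytes(utf8_bytes)
text_b = decode_bytes(latin1_bytes)
```

The two byte sequences differ in length, yet both decode to the same eleven-character string, which shows why the encoding has to be known or guessed before any tokenization can happen.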
