Skip to main content Accessibility help
Internet Explorer 11 is being discontinued by Microsoft in August 2021. If you have difficulties viewing the site on Internet Explorer 11 we recommend using a different browser such as Microsoft Edge, Google Chrome, Apple Safari or Mozilla Firefox.

13th August 2024: Online ordering is currently unavailable due to technical issues, however alternative purchasing options are available.
As we resolve the issues resulting from this, we are also experiencing some delays to publication. We are working hard to restore services as soon as possible and apologise for the inconvenience. For further updates please visit our website .

Chapter 5: Index compression

Chapter 5: Index compression

pp. 78-99

Authors

, Stanford University, California, , Google, Inc., , Universität Stuttgart
Resources available Unlock the full potential of this textbook with additional resources. There are Instructor restricted resources available for this textbook. Explore resources
  • Add bookmark
  • Cite
  • Share

Summary

Chapter 1 introduced the dictionary and the inverted index as the central data structures in information retrieval (IR). In this chapter, we employ a number of compression techniques for dictionary and inverted index that are essential for efficient IR systems.

One benefit of compression is immediately clear. We need less disk space. As we will see, compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.

There are two more subtle benefits of compression. The first is increased use of caching. Search systems use some parts of the dictionary and the index much more than others. For example, if we cache the postings list of a frequently used query term t, then the computations necessary for responding to the one-term query t can be entirely done in memory. With compression, we can fit a lot more information into main memory. Instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it. As we will see below, there are simple and efficient decompression methods, so that the penalty of having to decompress the postings list is small. As a result, we are able to decrease the response time of the IR system substantially. Because memory is a more expensive resource than disk space, increased speed owing to caching – rather than decreased space requirements – is often the prime motivator for compression.

About the book

Access options

Review the options below to login to check your access.

Purchase options

eTextbook
US$71.99
Hardback
US$71.99

Have an access code?

To redeem an access code, please log in with your personal login.

If you believe you should have access to this content, please contact your institutional librarian or consult our FAQ page for further information about accessing our content.

Also available to purchase from these educational ebook suppliers