Introduction to Information Retrieval

In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.

Document delineation and character sequence decoding

Obtaining the character sequence in a document

Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters.

About the book

Chapter DOI https://doi.org/10.1017/CBO9780511809071.003
Book DOI https://doi.org/10.1017/CBO9780511809071
Subjects Computer Science,Data Science, Databases, Data Mining, and Information Retrieval
Format: Hardback
- Publication date: 07 July 2008
- ISBN: 9780521865715
Format: Digital
- Publication date: 05 June 2012
- ISBN: 9780511809071
Find out more details about this book

Access options

Review the options below to login to check your access.

Purchase options

eTextbook

US$71.99

Hardback

US$71.99

Have an access code?

To redeem an access code, please log in with your personal login.

If you believe you should have access to this content, please contact your institutional librarian or consult our FAQ page for further information about accessing our content.

Also available to purchase from these educational ebook suppliers

Introduction to Information Retrieval

Chapter 2: The term vocabulary and postings lists

Authors

Summary

About the book

Access options

Personal login

Purchase options

Have an access code?