Chapter 20: Web crawling and indexes

pp. 405–420

Authors

Christopher D. Manning, Stanford University, California; Prabhakar Raghavan, Google, Inc.; Hinrich Schütze, Universität Stuttgart

Summary

Overview

Web crawling is the process by which we gather pages from the Web to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19, we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter, we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.
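
To make this process concrete, the following is a minimal single-machine crawler sketch in Python. It maintains a frontier of URLs waiting to be fetched, downloads each page, and extracts outgoing links to grow the frontier. This is a toy illustration under simplifying assumptions of our own (standard library only; no politeness delays, robots.txt checks, or duplicate detection), not the distributed architecture this chapter develops.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already enqueued, to avoid refetching
    pages = {}                    # url -> raw HTML, the input to indexing

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue              # skip unreachable or malformed pages
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)       # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

Even this sketch exhibits the core loop of every crawler: pop a URL from the frontier, fetch the page, record it for indexing, and enqueue the newly discovered links.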

The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling, from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing the desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of the chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these desiderata. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
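
As a foretaste of Section 20.3, one simple way to distribute an index is document partitioning: each document is assigned to one of k index servers, for instance by hashing its URL. The sketch below is a toy illustration of this general idea, not the book's particular scheme.

import hashlib

def assign_server(url: str, num_servers: int) -> int:
    """Map a URL deterministically to an index-server id in [0, num_servers)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

# Example: route three documents across four index servers.
for u in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
    print(u, "-> server", assign_server(u, 4))

Because the hash is deterministic, every machine in the crawler can agree on where a document's postings belong without any central coordination.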

Features a crawler must provide

We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
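
For the first category, the chapter names robustness (coping with spider traps and malformed pages) and politeness (respecting a site's implicit and explicit policies on crawl rate and on which pages may be fetched). As one concrete illustration of politeness, the sketch below checks a site's robots.txt using Python's standard urllib.robotparser; the user-agent string is a placeholder of our own.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "StudentCrawler") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()                 # fetch and parse the host's robots.txt
    except Exception:
        return False              # be conservative if robots.txt is unreachable
    return rp.can_fetch(user_agent, url)

A polite crawler would call such a check before every fetch, and additionally rate-limit its requests to any single host.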
