Book contents
- Frontmatter
- Contents
- Introduction
- Part 1 Modeling Web Data
- Part 2 Web Data Semantics and Integration
- Part 3 Building Web Scale Applications
- 13 Web Search
- 14 An Introduction to Distributed Systems
- 15 Distributed Access Structures
- 16 Distributed Computing with MapReduce and Pig
- 17 Putting into Practice: Full-Text Indexing with Lucene
- 18 Putting into Practice: Recommendation Methodologies
- 19 Putting into Practice: Large-Scale Data Management with Hadoop
- 20 Putting into Practice: CouchDB, a JSON Semistructured Database
- Bibliography
- Index
15 - Distributed Access Structures
from Part 3 - Building Web Scale Applications
Published online by Cambridge University Press: 05 June 2012
Summary
In the large-scale file systems presented in Chapter 14, search operations are based on a sequential scan that accesses the whole data set. When it comes to finding a specific object, typically a tiny part of the data volume, direct access is much more efficient than a linear scan. The object is obtained directly using its physical address, which may simply be the offset of the object's location with respect to the beginning of the file, or may rely on a more sophisticated addressing mechanism.
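The contrast between a linear scan and direct access by offset can be sketched as follows. This is only an illustration under stated assumptions: the length-prefixed record layout, the in-memory `io.BytesIO` stand-in for a file, and the helper names `write_objects`, `read_at`, and `scan_for` are all hypothetical and do not come from the chapter.

```python
import io

def write_objects(f, objects):
    """Append each object to f; return the byte offset where each one starts."""
    offsets = []
    for obj in objects:
        offsets.append(f.tell())
        f.write(len(obj).to_bytes(4, "big"))  # assumed layout: 4-byte length prefix
        f.write(obj)
    return offsets

def read_at(f, offset):
    """Direct access: jump to the physical address and read exactly one object."""
    f.seek(offset)
    size = int.from_bytes(f.read(4), "big")
    return f.read(size)

def scan_for(f, wanted):
    """Linear scan from the start; returns (offset or None, objects read)."""
    f.seek(0)
    objects_read = 0
    while True:
        offset = f.tell()
        header = f.read(4)
        if len(header) < 4:          # end of file: object not found
            return None, objects_read
        size = int.from_bytes(header, "big")
        obj = f.read(size)
        objects_read += 1
        if obj == wanted:
            return offset, objects_read

# An in-memory buffer stands in for a (much larger) data file.
data_file = io.BytesIO()
offsets = write_objects(data_file, [b"alpha", b"bravo", b"charlie"])
```

With the offset in hand, `read_at(data_file, offsets[2])` touches a single object, whereas `scan_for(data_file, b"charlie")` has to read every object that precedes it.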
An index on a collection C is a structure that maps the key of each object in C to its (physical) address. At an abstract level, it can be viewed as a set of pairs (k,a), called entries, where k is a key and a is the address of an object. For the purposes of this chapter, an object is seen as raw (unstructured) data, its structure being of concern to the Client application only. You may think of it, for instance, as a relational tuple, an XML document, a picture, or a video file. The key may uniquely determine the object, as keys do in the relational model.
The indexes we consider here support at least the following operations, which we hereafter call the dictionary operations:
- Insertion: insert(k,a)
- Deletion: delete(k)
- Key search: search(k), which returns the address a
If the keys can be linearly ordered, an index may also support range queries of the form range(k1,k2), which retrieve all the keys (and their addresses) in that range.
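As a rough illustration of the dictionary operations and the range query, here is a minimal single-machine sketch. It is not one of the distributed structures the chapter goes on to describe: the class name `SortedIndex`, the assumption that each key is unique, and the choice of inclusive range bounds are all assumptions made for the example.

```python
import bisect

class SortedIndex:
    """Hypothetical in-memory index: entries (k, a) kept ordered by key."""

    def __init__(self):
        self._keys = []        # keys in sorted order, for range queries
        self._addresses = {}   # key -> address, for direct key search

    def insert(self, k, a):
        """insert(k,a): add an entry, or overwrite the address of an existing key."""
        if k not in self._addresses:
            bisect.insort(self._keys, k)
        self._addresses[k] = a

    def delete(self, k):
        """delete(k): remove the entry for k if it exists."""
        if k in self._addresses:
            del self._addresses[k]
            del self._keys[bisect.bisect_left(self._keys, k)]

    def search(self, k):
        """search(k): return the address a, or None if k is absent."""
        return self._addresses.get(k)

    def range(self, k1, k2):
        """range(k1,k2): entries with k1 <= k <= k2 (inclusive bounds assumed)."""
        lo = bisect.bisect_left(self._keys, k1)
        hi = bisect.bisect_right(self._keys, k2)
        return [(k, self._addresses[k]) for k in self._keys[lo:hi]]

index = SortedIndex()
index.insert("k10", 0)
index.insert("k30", 18)
index.insert("k20", 9)
index.delete("k30")
```

Keeping the keys sorted is what makes `range` cheap here; a plain hash table would support the three dictionary operations but would have to examine every key to answer a range query.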
- Type: Chapter
- Information: Web Data Management, pp. 310-338
- Publisher: Cambridge University Press
- Print publication year: 2011