Book contents
- Frontmatter
- Contents
- Introduction
- Part 1 Modeling Web Data
- Part 2 Web Data Semantics and Integration
- Part 3 Building Web Scale Applications
- 13 Web Search
- 14 An Introduction to Distributed Systems
- 15 Distributed Access Structures
- 16 Distributed Computing with MapReduce and Pig
- 17 Putting into Practice: Full-Text Indexing with Lucene
- 18 Putting into Practice: Recommendation Methodologies
- 19 Putting into Practice: Large-Scale Data Management with Hadoop
- 20 Putting into Practice: CouchDB, a JSON Semistructured Database
- Bibliography
- Index
17 - Putting into Practice: Full-Text Indexing with Lucene
from Part 3 - Building Web Scale Applications
Published online by Cambridge University Press: 05 June 2012
Summary
Lucene is an open-source tunable indexing platform often used for full-text indexing of Web sites. It implements an inverted index, creating posting lists for each term of the vocabulary. This chapter proposes some exercises to discover the Lucene platform and test its functionalities through its Java API.
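To make the notion of posting lists concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation: the class name `TinyInvertedIndex`, its methods, and the naive tokenization (lowercasing and splitting on non-alphanumeric characters) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use or mirror the Lucene API.
public class TinyInvertedIndex {
    // term -> posting list (ids of the documents containing the term, ascending)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes one document and returns the id assigned to it. */
    public int add(String text) {
        int docId = nextDocId++;
        // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a separator
            }
            List<Integer> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Ids only grow, so comparing with the last entry is enough to skip duplicates.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    /** Returns the posting list of a term, or an empty list if the term is unknown. */
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene builds an inverted index");
        index.add("An inverted index maps terms to documents");
        index.add("Posting lists are sorted");
        System.out.println(index.lookup("inverted")); // [0, 1]
        System.out.println(index.lookup("lists"));    // [2]
    }
}
```

A real engine such as Lucene stores much more in each posting (for instance term frequencies and positions) and applies configurable text analysis; the sketch keeps only the term-to-documents mapping.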
PRELIMINARY: A LUCENE SANDBOX
We provide a simple graphical interface that lets you capture a collection of Web documents (from a given Web site), index it, and search for documents matching a keyword query. The tool is implemented with Lucene (surprise!) and helps to assess the impact of the search parameters, including ranking factors.
You can download the program from our Web site. It consists of a Java archive that can be executed right away (provided you have a decent Java installation on your computer). Figure 17.1 shows a screenshot of the main page. It allows you to
- Download a set of documents collected from a given URL (including local addresses),
- Index and query those documents,
- Consult the information used by Lucene to present ranked results.
Use this tool as a first contact with full-text search and information retrieval. The projects proposed at the end of the chapter suggest ways to build a similar application.
INDEXING PLAIN TEXT WITH LUCENE – A FULL EXAMPLE
We now embark on a practical experiment with Lucene. First, download the Java packages from the Web site http://lucene.apache.org/java/docs/.
Type: Chapter
Information: Web Data Management, pp. 364-373. Publisher: Cambridge University Press. Print publication year: 2011.