Introduction to Information Retrieval

Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze

doi:10.1017/CBO9780511809071

Chapter 16: Flat clustering

pp. 321-345

Christopher D. Manning

, Stanford University, California,

Prabhakar Raghavan

, Google, Inc.,

Hinrich Schütze

, Universität Stuttgart

Get access

Add bookmark
Cite
Share

Summary

Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.

Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.

The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 237): Our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.

The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the two-dimensional (2D) plane.

About the book

Chapter DOI https://doi.org/10.1017/CBO9780511809071.017
Book DOI https://doi.org/10.1017/CBO9780511809071
Subjects Computer Science,Data Science, Databases, Data Mining, and Information Retrieval
Format: Hardback
- Publication date: 07 July 2008
- ISBN: 9780521865715
Format: Digital
- Publication date: 05 June 2012
- ISBN: 9780511809071
Find out more details about this book

Access options

Review the options below to login to check your access.

Purchase options

eTextbook

US$76.00

Hardback

US$76.00

Have an access code?

To redeem an access code, please log in with your personal login.

If you believe you should have access to this content, please contact your institutional librarian or consult our FAQ page for further information about accessing our content.

Also available to purchase from these educational ebook suppliers