Book contents
- Frontmatter
- Contents
- Preface
- 1 Data Mining
- 2 MapReduce and the New Software Stack
- 3 Finding Similar Items
- 4 Mining Data Streams
- 5 Link Analysis
- 6 Frequent Itemsets
- 7 Clustering
- 8 Advertising on the Web
- 9 Recommendation Systems
- 10 Mining Social-Network Graphs
- 11 Dimensionality Reduction
- 12 Large-Scale Machine Learning
- Index
- References
1 - Data Mining
Published online by Cambridge University Press: 05 December 2014
Summary
In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but that help in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, the behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.
1.1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.
1.1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
Example 1.1 Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. □
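To make Example 1.1 concrete, the following is a minimal sketch of how the most likely Gaussian parameters might be computed for a set of numbers. The function name `fit_gaussian` and the sample values are hypothetical and chosen only for illustration. The sketch assumes the maximum-likelihood estimates, in which the variance divides by n rather than n - 1; the text does not say which formula the statistician would use.

```python
import math


def fit_gaussian(data):
    """Return the maximum-likelihood (mean, standard deviation) of data.

    Hypothetical helper for illustration; assumes the data is modeled
    as draws from a single Gaussian distribution.
    """
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum(data) / n
    # The maximum-likelihood variance divides by n, not n - 1.
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)


# Illustrative data only; not taken from the book.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(sample)
print(mu, sigma)  # prints: 5.0 2.0
```

The pair `(mu, sigma)` is then the entire model: the eight numbers are summarized by two parameters.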
- Type: Chapter
- Information: Mining of Massive Datasets, pp. 1-18. Publisher: Cambridge University Press. Print publication year: 2014