Data Mining

Jure Leskovec; Anand Rajaraman; Jeffrey David Ullman

doi:10.1017/CBO9781139924801.002

1 - Data Mining

Published online by Cambridge University Press: 05 December 2014


Jure Leskovec: Affiliation:
Stanford University, California
Anand Rajaraman: Affiliation:
Milliways Laboratories, California
Jeffrey David Ullman: Affiliation:
Stanford University, California


Summary

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Example 1.1 Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. □
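
The fitting step in Example 1.1 can be sketched in a few lines of Python. The text does not name the formula, so this sketch assumes the usual maximum-likelihood estimates: the sample mean, and the standard deviation computed with a divisor of n rather than n - 1. The function name `fit_gaussian` and the sample numbers are illustrative and do not come from the chapter.

```python
import math


def fit_gaussian(values):
    """Return the maximum-likelihood (mean, standard deviation) of a Gaussian.

    Assumption: "most likely parameters" means the maximum-likelihood
    estimates, so the variance is divided by n, not n - 1.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value to fit a Gaussian")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)


# Illustrative data, not taken from the book.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_gaussian(data)
print(mean, std)  # prints: 5.0 2.0
```

The two returned numbers are the whole model here: under the Gaussian assumption, nothing else about the data needs to be stored to describe the distribution it was drawn from.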

Type: Chapter
Information: Mining of Massive Datasets , pp. 1 - 18

DOI: https://doi.org/10.1017/CBO9781139924801.002

Publisher: Cambridge University Press

Print publication year: 2014


