
9 - The Transform Regression Algorithm

from Part Two - Supervised and Unsupervised Learning Algorithms

Published online by Cambridge University Press: 05 February 2012

Ramesh Natarajan (IBM Research, Yorktown Heights, NY, USA)
Edwin Pednault (IBM Research, Yorktown Heights, NY, USA)

Edited by Ron Bekkerman (LinkedIn Corporation, Mountain View, California), Mikhail Bilenko (Microsoft Research, Redmond, Washington), and John Langford (Yahoo! Research, New York)

Summary

Massive training datasets, ranging in size from tens of gigabytes to several terabytes, arise in diverse machine learning applications in areas such as text mining of web corpora, multimedia analysis of image and video data, retail modeling of customer transaction data, bioinformatic analysis of genomic and microarray data, medical analysis of clinical diagnostic data such as functional magnetic resonance imaging (fMRI) scans, and environmental modeling using sensor and streaming data. Provost and Kolluri (1999), in their overview of machine learning with massive datasets, emphasize the need to develop parallel algorithms and implementations for these applications.

In this chapter, we describe the Transform Regression (TReg) algorithm (Pednault, 2006), which is a general-purpose, non-parametric methodology suitable for a wide variety of regression applications. TReg was originally created for the data mining component of the IBM InfoSphere Warehouse product, guided by a challenging set of requirements:

  1. The modeling time should be comparable to linear regression.

  2. The resulting models should be compact and efficient to apply.

  3. The model quality should be reliable without any further tuning.

  4. The model training and scoring should be parallelized for large datasets stored as partitioned tables in IBM's DB2 database systems.

Requirements 1 and 2 were deemed necessary for a successful commercial algorithm, although this ruled out certain ensemble-based methods that produce high-quality models but have high computation and storage requirements. Requirement 3 ensured that meeting requirements 1 and 2 did not unduly compromise model quality.
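This excerpt stops short of the algorithm's mechanics, but the cited papers (Friedman, 2001; Pednault, 2006) place TReg in the family of stagewise, gradient-boosting-style procedures: each stage fits one-dimensional transforms of the input features to the current residuals and combines them by linear regression. The sketch below is a minimal illustration of that structure only, not the IBM InfoSphere Warehouse implementation; the piecewise-constant transforms, the bin and stage counts, and all function names are simplifying assumptions (TReg itself uses piecewise-linear segmented regressions, and also feeds each stage's output back in as an input feature to later stages).

```python
# Illustrative sketch of a transform-regression-style stagewise fit.
# NOT the product implementation: transforms here are piecewise-constant
# bin means, a crude stand-in for TReg's segmented piecewise-linear fits.
import numpy as np

def fit_transform_1d(x, r, n_bins=8):
    """Fit a piecewise-constant transform of feature x to residuals r."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    edges[-1] += 1e-9  # make the last bin right-inclusive
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return edges, means

def apply_transform_1d(x, edges, means):
    """Evaluate a fitted transform on (possibly new) feature values."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(means) - 1)
    return means[idx]

def treg_sketch(X, y, n_stages=5):
    """Stagewise additive fit: per-feature transforms, then a linear combine."""
    pred = np.full(len(y), y.mean())  # stage 0: constant model
    stages = []
    for _ in range(n_stages):
        r = y - pred                  # residuals the next stage must explain
        Z, params = [], []
        for j in range(X.shape[1]):   # one 1-D transform per input feature
            edges, means = fit_transform_1d(X[:, j], r)
            Z.append(apply_transform_1d(X[:, j], edges, means))
            params.append((edges, means))
        Z = np.column_stack(Z + [np.ones(len(y))])
        w, *_ = np.linalg.lstsq(Z, r, rcond=None)  # linear combination
        pred += Z @ w
        stages.append((params, w))
    return pred, stages

# Toy usage: a nonlinear target that a single linear regression would miss.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(2000)
pred, _ = treg_sketch(X, y)
print("RMSE:", np.sqrt(np.mean((y - pred) ** 2)))
```

Because each stage reduces to per-feature 1-D fits followed by one linear regression, the work parallelizes naturally over data partitions, which is consistent with requirement 4's database-partitioned setting.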

Type: Chapter
Book: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 170–189
Publisher: Cambridge University Press
Print publication year: 2011


References

Adult Census Data Set. 2009. http://archive.ics.uci.edu/ml/datasets/Adult.
Apte, C., Natarajan, R., Pednault, E. P. D., and Tipu, F. 2002. A Probabilistic Estimation Framework for Predictive Modeling Analytics. IBM Systems Journal, 41(3), 438–448.
California Housing Data Set. 2009. http://lib.stat.cmu.edu/datasets/houses.zip.
Dorneich, A., Natarajan, R., Pednault, E., and Tipu, F. 2006. Embedded Predictive Modeling in a Parallel Relational Database. Pages 569–574 of: SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing. New York: ACM.
Friedman, J. H. 1999. Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29, 1189–1232.
Hand, D. 1997. Construction and Assessment of Classification Rules. New York: Wiley.
Hastie, T., Tibshirani, R., and Friedman, J. H. 2001. The Elements of Statistical Learning. New York: Springer.
Hastie, T. J., and Tibshirani, R. J. 1990. Generalized Additive Models. London: Chapman & Hall.
Hecht-Nielsen, R. 1987. Kolmogorov Mapping Neural Network Existence Theorem. Pages 11–14 of: Proceedings of the IEEE International Conference on Neural Networks, vol. 3.
IBM Blue Gene Team. 2008. Overview of the IBM Blue Gene/P Project. IBM Journal of Research and Development, 52, 199–220.
Li, B., and Goel, P. K. 2007. Additive Regression Trees and Smoothing Splines: Predictive Modeling and Interpretation in Data Mining. Contemporary Mathematics, 443, 83–101.
Natarajan, R., and Pednault, E. P. D. 2002. Segmented Regression Estimators for Massive Data Sets. In: Proceedings of the Second SIAM International Conference on Data Mining.
Pednault, E. P. D. 2006. Transform Regression and the Kolmogorov Superposition Theorem. In: Proceedings of the Sixth SIAM International Conference on Data Mining.
Provost, F. J., and Kolluri, V. 1999. A Survey of Methods for Scaling Up Inductive Learning Algorithms. Data Mining and Knowledge Discovery, 3, 131–169.
Ridgeway, G. 2007. Generalized Boosted Models: A Guide to the GBM Package. http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf.
Spambase Data Set. 2009. http://archive.ics.uci.edu/ml/datasets/Spambase.
