
9 - The Transform Regression Algorithm

from Part Two - Supervised and Unsupervised Learning Algorithms

Published online by Cambridge University Press: 05 February 2012

Ramesh Natarajan (IBM Research, Yorktown Heights, NY, USA)
Edwin Pednault (IBM Research, Yorktown Heights, NY, USA)

Edited by Ron Bekkerman (LinkedIn Corporation, Mountain View, California), Mikhail Bilenko (Microsoft Research, Redmond, Washington), and John Langford (Yahoo! Research, New York)

Summary

Massive training datasets, ranging in size from tens of gigabytes to several terabytes, arise in diverse machine learning applications in areas such as text mining of web corpora, multimedia analysis of image and video data, retail modeling of customer transaction data, bioinformatic analysis of genomic and microarray data, medical analysis of clinical diagnostic data such as functional magnetic resonance imaging (fMRI) scans, and environmental modeling using sensor and streaming data. Provost and Kolluri (1999), in their overview of machine learning with massive datasets, emphasize the need to develop parallel algorithms and implementations for these applications.

In this chapter, we describe the Transform Regression (TReg) algorithm (Pednault, 2006), which is a general-purpose, non-parametric methodology suitable for a wide variety of regression applications. TReg was originally created for the data mining component of the IBM InfoSphere Warehouse product, guided by a challenging set of requirements:

  1. The modeling time should be comparable to linear regression.

  2. The resulting models should be compact and efficient to apply.

  3. The model quality should be reliable without any further tuning.

  4. The model training and scoring should be parallelized for large datasets stored as partitioned tables in IBM's DB2 database systems.

Requirements 1 and 2 were deemed necessary for a successful commercial algorithm, although this ruled out certain ensemble-based methods that produce high-quality models but have high computation and storage requirements. Requirement 3 ensured that meeting requirements 1 and 2 did not unduly compromise model quality.
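This excerpt stops short of the algorithm's mechanics, but the cited papers (Friedman, 2001; Pednault, 2006) place TReg in the family of stagewise, gradient-boosting-style procedures: each stage fits one-dimensional transforms of the input features to the current residuals and combines them by linear regression. The sketch below is a minimal illustration of that structure only, not the IBM InfoSphere Warehouse implementation; the piecewise-constant transforms, the bin and stage counts, and all function names are simplifying assumptions (TReg itself uses piecewise-linear segmented regressions, and also feeds each stage's output back in as an input feature to later stages).

```python
# Illustrative sketch of a transform-regression-style stagewise fit.
# NOT the product implementation: transforms here are piecewise-constant
# bin means, a crude stand-in for TReg's segmented piecewise-linear fits.
import numpy as np

def fit_transform_1d(x, r, n_bins=8):
    """Fit a piecewise-constant transform of feature x to residuals r."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    edges[-1] += 1e-9  # make the last bin right-inclusive
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return edges, means

def apply_transform_1d(x, edges, means):
    """Evaluate a fitted transform on (possibly new) feature values."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(means) - 1)
    return means[idx]

def treg_sketch(X, y, n_stages=5):
    """Stagewise additive fit: per-feature transforms, then a linear combine."""
    pred = np.full(len(y), y.mean())  # stage 0: constant model
    stages = []
    for _ in range(n_stages):
        r = y - pred                  # residuals the next stage must explain
        Z, params = [], []
        for j in range(X.shape[1]):   # one 1-D transform per input feature
            edges, means = fit_transform_1d(X[:, j], r)
            Z.append(apply_transform_1d(X[:, j], edges, means))
            params.append((edges, means))
        Z = np.column_stack(Z + [np.ones(len(y))])
        w, *_ = np.linalg.lstsq(Z, r, rcond=None)  # linear combination
        pred += Z @ w
        stages.append((params, w))
    return pred, stages

# Toy usage: a nonlinear target that a single linear regression would miss.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(2000)
pred, _ = treg_sketch(X, y)
print("RMSE:", np.sqrt(np.mean((y - pred) ** 2)))
```

Because each stage reduces to per-feature 1-D fits followed by one linear regression, the work parallelizes naturally over data partitions, which is consistent with requirement 4's database-partitioned setting.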

Type: Chapter
Book: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 170–189
Publisher: Cambridge University Press
Print publication year: 2011


References

Adult Census Data Set. 2009. http://archive.ics.uci.edu/ml/datasets/Adult.
Apte, C., Natarajan, R., Pednault, E. P. D., and Tipu, F. 2002. A Probabilistic Estimation Framework for Predictive Modeling Analytics. IBM Systems Journal, 41(3), 438–448.
California Housing Data Set. 2009. http://lib.stat.cmu.edu/datasets/houses.zip.
Dorneich, A., Natarajan, R., Pednault, E., and Tipu, F. 2006. Embedded Predictive Modeling in a Parallel Relational Database. Pages 569–574 of: SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing. New York: ACM.
Friedman, J. H. 1999. Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 29, 1189–1232.
Hand, D. 1997. Construction and Assessment of Classification Rules. New York: Wiley.
Hastie, T., Tibshirani, R., and Friedman, J. H. 2001. The Elements of Statistical Learning. New York: Springer.
Hastie, T. J., and Tibshirani, R. J. 1990. Generalized Additive Models. London: Chapman & Hall.
Hecht-Nielsen, R. 1987. Kolmogorov Mapping Neural Network Existence Theorem. Pages 11–14 of: Proceedings of the IEEE International Conference on Neural Networks, vol. 3.
IBM Blue Gene Team. 2008. Overview of the IBM Blue Gene/P Project. IBM Journal of Research and Development, 52, 199–220.
Li, B., and Goel, P. K. 2007. Additive Regression Trees and Smoothing Splines: Predictive Modeling and Interpretation in Data Mining. Contemporary Mathematics, 443, 83–101.
Natarajan, R., and Pednault, E. P. D. 2002. Segmented Regression Estimators for Massive Data Sets. In: Proceedings of the Second SIAM International Conference on Data Mining.
Pednault, E. P. D. 2006. Transform Regression and the Kolmogorov Superposition Theorem. In: Proceedings of the Sixth SIAM International Conference on Data Mining.
Provost, F. J., and Kolluri, V. 1999. A Survey of Methods for Scaling Up Inductive Learning Algorithms. Data Mining and Knowledge Discovery, 3, 131–169.
Ridgeway, G. 2007. Generalized Boosted Models: A Guide to the GBM Package. http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf.
Spambase Data Set. 2009. http://archive.ics.uci.edu/ml/datasets/Spambase.
