
17 - Parallel Large-Scale Feature Selection

from Part Three - Alternative Learning Settings

Published online by Cambridge University Press: 05 February 2012

Jeremy Kubica (Google Inc., Pittsburgh, PA, USA)
Sameer Singh (University of Massachusetts)
Daria Sorokina (Yandex Labs, Palo Alto, CA, USA)
Ron Bekkerman (LinkedIn Corporation, Mountain View, California)
Mikhail Bilenko (Microsoft Research, Redmond, Washington)
John Langford (Yahoo! Research, New York)

Summary

The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.
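Stated slightly more formally (a common formulation; the notation here is our assumption rather than the chapter's): given d candidate features, a learner L, and a quality measure perf (for example, held-out accuracy, possibly penalized for model size), feature selection seeks

    S^{\star} \;=\; \operatorname*{arg\,max}_{S \subseteq \{1,\dots,d\}} \operatorname{perf}\bigl(L(S)\bigr) \quad \text{subject to } |S| \le k,

where L(S) is the model trained on feature subset S and k caps the subset size. Because the number of subsets grows exponentially in d, practical methods search this space greedily or rely on per-feature scores.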

The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem even for simple approaches. For example, common feature selection approaches that evaluate each candidate feature by training a new model containing that feature require learning a number of models linear in the number of candidates each time a single feature is added. This computational cost adds up quickly when many new features are added iteratively. Even techniques that use relatively inexpensive tests of a feature's value, such as mutual information, require time at least linear in the number of features being evaluated.
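To make that cost concrete, here is a minimal single-machine sketch of greedy forward selection. It is an illustration under our own assumptions (the function name forward_select and the use of scikit-learn's LogisticRegression and cross_val_score as the evaluation model are ours), not the parallel algorithm this chapter develops. Each round fits one model per remaining candidate, so selecting k features costs on the order of k times the candidate count in model fits.

    # Illustrative sketch: greedy forward feature selection on a single machine.
    # Each round trains one candidate model per remaining feature, which is the
    # linear-in-features cost discussed above.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, max_features=5):
        """Greedily grow a feature subset, adding one feature per round."""
        selected, remaining = [], set(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            # Score every candidate by training a model that includes it.
            scores = {
                j: cross_val_score(LogisticRegression(max_iter=1000),
                                   X[:, selected + [j]], y, cv=3).mean()
                for j in remaining
            }
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy usage on synthetic data where only features 3 and 7 carry signal.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = (X[:, 3] + X[:, 7] > 0).astype(int)
    print(forward_select(X, y, max_features=2))

A mutual-information filter replaces the inner model fits with a per-feature score (for example, scikit-learn's mutual_info_classif), which is cheaper per feature but still linear in the number of features evaluated.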

As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.
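As a hedged illustration of that blow-up (the helper name page_features and the exact feature encoding are our assumptions), the sketch below maps a single page to a sparse bag of unigram and HTML-tag features; the feature space of a real crawl is the union of all such keys across millions of pages.

    # Illustrative sketch only: turn one page into a sparse bag of unigram and
    # HTML-tag features. Across a large crawl, the union of all such keys forms
    # the very high-dimensional feature space discussed above.
    import re
    from collections import Counter

    def page_features(html: str) -> Counter:
        """Map one HTML page to sparse feature counts."""
        tags = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)  # opening-tag names
        text = re.sub(r"<[^>]+>", " ", html)                     # strip markup
        words = re.findall(r"[a-z']+", text.lower())             # text unigrams
        feats = Counter(f"tag={t.lower()}" for t in tags)
        feats.update(f"word={w}" for w in words)
        return feats

    example = "<html><body><h1>Cheap flights</h1><p>Book cheap flights today!</p></body></html>"
    print(page_features(example))
    # e.g. Counter({'word=cheap': 2, 'word=flights': 2, 'tag=html': 1, ...})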

Type: Chapter
In: Scaling Up Machine Learning: Parallel and Distributed Approaches, pp. 352–370
Publisher: Cambridge University Press
Print publication year: 2011


