Machine Learning Predictions as Regression Covariates

Christian Fong; Matthew Tyler

doi:10.1017/pan.2020.38

Published online by Cambridge University Press: 11 November 2020

Christian Fong: Assistant Professor, Department of Political Science, University of Michigan, Ann Arbor, MI, USA. Email: cjfong@umich.edu
Matthew Tyler*: Ph.D. Candidate, Department of Political Science, Stanford University, Stanford, CA, USA. Email: mdtyler@stanford.edu
*Corresponding author: Matthew Tyler

Abstract

In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
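The bias from plugging in predictions can be illustrated with a small simulation. The sketch below is not the authors' estimator or software. It assumes a hypothetical classifier that flips a balanced binary covariate with a fixed probability, independently of the outcome, and all names and parameter values (`simulate`, `error_rate`, `beta`, the sample sizes) are chosen only for illustration. It compares the regression slope obtained from the true covariate, from a small hand-labeled subset, and from the plugged-in predictions.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


def simulate(n=20000, n_labeled=1000, beta=2.0, error_rate=0.2, seed=0):
    """Compare slopes using the true covariate, a labeled subset, and predictions."""
    rng = random.Random(seed)
    # True binary covariate, balanced between 0 and 1.
    x_true = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    # Outcome depends linearly on the true covariate plus standard normal noise.
    y = [1.0 + beta * xi + rng.gauss(0.0, 1.0) for xi in x_true]
    # Hypothetical classifier (an assumption): flips the true label with
    # probability error_rate, independently of the outcome.
    x_pred = [1 - xi if rng.random() < error_rate else xi for xi in x_true]
    return {
        "oracle": ols_slope(x_true, y),  # infeasible: true covariate everywhere
        "labeled_only": ols_slope(x_true[:n_labeled], y[:n_labeled]),
        "plug_in": ols_slope(x_pred, y),  # predictions treated as if they were truth
    }


results = simulate()
for name, slope in results.items():
    print(f"{name}: {slope:.3f}")
```

Under these assumptions (balanced classes and symmetric, outcome-independent flips), the plug-in slope concentrates near `beta * (1 - 2 * error_rate)`, which is 1.2 rather than 2.0 for the defaults above, and collecting more unlabeled observations does not remove the gap. The labeled-only slope is centered on the true coefficient but is noisier because it uses far fewer observations.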

Keywords

machine learning, classification, inference, instrumental variables

Type: Article
Information: Political Analysis, Volume 29, Issue 4, October 2021, pp. 467–484

DOI: https://doi.org/10.1017/pan.2020.38
Copyright: © The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology

Footnotes

Edited by Jeff Gill

Fong and Tyler Dataset

Dataset

https://doi.org/10.7910/DVN/QQHBHY

Fong and Tyler supplementary material

Online Appendix
