Machine Learning for Experiments in the Social Sciences

Jon Green; Mark H. White, II

doi:10.1017/9781009168236

Series: Elements in Experimental Political Science

Published online by Cambridge University Press: 21 March 2023

Affiliations:
Jon Green: Northeastern University
Mark H. White, II: Etsy, Inc.

Summary

Causal inference and machine learning are typically introduced in the social sciences separately as theoretically distinct methodological traditions. However, applications of machine learning in causal inference are increasingly prevalent. This Element provides theoretical and practical introductions to machine learning for social scientists interested in applying such methods to experimental data. We show how machine learning can be useful for conducting robust causal inference and provide a theoretical foundation researchers can use to understand and apply new methods in this rapidly developing field. We then demonstrate two specific methods – the prediction rule ensemble and the causal random forest – for characterizing treatment effect heterogeneity in survey experiments and testing the extent to which such heterogeneity is robust to out-of-sample prediction. We conclude by discussing limitations and tradeoffs of such methods, while directing readers to additional related methods available on the Comprehensive R Archive Network (CRAN).

Keywords

experiments; machine learning; causal inference; social science; treatment effects

Type: Element
DOI: https://doi.org/10.1017/9781009168236

Online ISBN: 9781009168236

Publisher: Cambridge University Press

Print publication: 13 April 2023
