
Text Analysis in Python for Social Scientists

Discovery and Exploration

Published online by Cambridge University Press:  14 December 2020

Dirk Hovy
Affiliation:
Bocconi University

Summary

Text is everywhere, and it is a fantastic resource for social scientists. However, because it is so abundant, and because language is so variable, it is often difficult to extract the information we want. There is a whole subfield of AI concerned with text analysis (natural language processing). Many of the basic analysis methods developed are now readily available as Python implementations. This Element will teach you when to use which method, the mathematical background of how it works, and the Python code to implement it.
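As an illustration of the kind of basic analysis method the Element covers, here is a minimal word-frequency sketch using only the Python standard library. The sample sentence and the simple regex tokenizer are invented for demonstration; real analyses would typically use a dedicated library such as NLTK or spaCy.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = word_frequencies("Text is everywhere, and text is a fantastic resource.")
print(freqs.most_common(2))  # → [('text', 2), ('is', 2)]
```

Even this tiny sketch already illustrates two recurring themes of text analysis: preprocessing decisions (here, lowercasing and the tokenization pattern) directly shape the counts, and word frequencies follow a highly skewed distribution.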
Type: Element
Information
Online ISBN: 9781108873352
Publisher: Cambridge University Press
Print publication: 21 January 2021


