
Exploring the effectiveness of linguistic knowledge for biographical relation extraction

Published online by Cambridge University Press:  18 October 2013

MARCOS GARCIA and PABLO GAMALLO
Affiliation:
Centro Singular de Investigación en Tecnoloxías da Información (CITIUS), University of Santiago de Compostela, Coruña, Spain. E-mail: marcos.garcia.gonzalez@usc.es, pablo.gamallo@usc.es

Abstract

Machine learning techniques have been used to extract instances of semantic relations using diverse features based on linguistic knowledge, such as tokens, lemmas, PoS-tags, or dependency paths. However, little work has examined which of these features work best for relation extraction, and even less has addressed languages other than English. In this paper, various features representing different levels of linguistic knowledge are systematically evaluated for biographical relation extraction. The effectiveness of these features is measured by training several supervised classifiers that differ only in the type of linguistic knowledge used to define their features. The experiments show that some basic linguistic knowledge (provided by lemmas and their combination in bigrams) performs better than more complex features, such as those based on syntactic analysis. Furthermore, some feature combinations using different levels of analysis are proposed in order (i) to avoid feature overlap and (ii) to evaluate the use of computationally inexpensive and widespread tools such as tokenization and lemmatization. This paper also describes two new freely available corpora for biographical relation extraction in Portuguese and Spanish, built by means of a distant-supervision strategy. Experiments were performed with five semantic relations and two languages, using these corpora.
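
As an illustration only, the kind of lemma-based feature mentioned above (lemmas and their combination in bigrams) can be sketched as follows. The abstract does not give the paper's actual feature templates, so the function name `lemma_ngram_features`, the underscore-joined feature keys, and the example context are all assumptions made for this sketch. The input is assumed to be already lemmatized.

```python
from collections import Counter


def lemma_ngram_features(lemmas, n_values=(1, 2)):
    """Count lemma n-grams (unigrams and bigrams by default) in one context.

    Hypothetical helper for illustration; not the feature extractor
    used in the paper.
    """
    features = Counter()
    for n in n_values:
        # Slide a window of size n over the lemma sequence.
        for i in range(len(lemmas) - n + 1):
            features["_".join(lemmas[i:i + n])] += 1
    return features


# Hypothetical lemmatized context between a person and a location mention,
# e.g. the Spanish "nació en" lemmatized to ["nacer", "en"].
context = ["nacer", "en"]
feats = lemma_ngram_features(context)
print(dict(feats))
```

Feature dictionaries of this kind could then be vectorized and passed to any supervised classifier. Only tokenization and lemmatization are needed to produce them, which is the sort of inexpensive processing the abstract contrasts with syntactic analysis.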

Type
Articles
Copyright
Copyright © Cambridge University Press 2013 
