Supervised approach to recognise Polish temporal expressions and rule-based interpretation of timexes

Published online by Cambridge University Press: 27 September 2016

JAN KOCOŃ
Affiliation:
Department of Computational Intelligence, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland; e-mail: jan.kocon@pwr.edu.pl
MICHAŁ MARCIŃCZUK
Affiliation:
Department of Computational Intelligence, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland; e-mail: michal.marcinczuk@pwr.edu.pl

Abstract

A key challenge of Information Extraction in Natural Language Processing is the recognition and classification of temporal expressions (timexes). Timexes are a crucial source of information about when something happens, how often it occurs or how long it lasts. Timexes extracted automatically from text play a major role in many Information Extraction systems, such as question answering or event recognition. We prepared a broad specification of Polish timexes – PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML – Markup Language for Temporal and Event Expressions). We extended the specification with a description of the local meaning of timexes, based on the LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends the event description model, focussing on anchoring events in time, ordering events and reasoning about the persistence of events. The specification is designed to address these issues, and we annotated all documents in the Polish Corpus of Wrocław University of Technology (KPWr) according to our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes, and we propose a two-phase method to select a subset of features for the Conditional Random Fields sequence labelling method. This article presents the whole process: corpus annotation, evaluation of inter-annotator agreement, extension of the Liner2 system with new features, and evaluation of the recognition models before and after feature selection, with an analysis of the statistical significance of the differences. Liner2 with the presented models is available as open source software under the GNU General Public License.
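
To illustrate how timex recognition can be framed as a sequence labelling task of the kind a Conditional Random Fields model is trained on, the following is a minimal sketch that converts annotated spans into per-token labels. The BIO encoding, the function name `bio_encode`, the label names `date` and `time`, and the example sentence are all assumptions made for illustration; the abstract does not state which encoding or label set Liner2 uses.

```python
from typing import List, Tuple


def bio_encode(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn token-level timex spans into per-token BIO labels.

    Each span is (start, end, label) with `end` exclusive.
    Hypothetical helper, not part of Liner2.
    """
    labels = ["O"] * len(tokens)  # every token starts outside any timex
    for start, end, label in spans:
        labels[start] = f"B-{label}"  # first token of the timex
        for i in range(start + 1, end):
            labels[i] = f"I-{label}"  # remaining tokens of the timex
    return labels


# Illustrative Polish sentence (assumed, not taken from KPWr):
# "The meeting will take place on 27 September 2016 at 10:00."
tokens = ["Spotkanie", "odbędzie", "się", "27", "września", "2016",
          "o", "godzinie", "10:00", "."]
# Assumed annotation: one date timex and one time timex.
spans = [(3, 6, "date"), (8, 9, "time")]

labels = bio_encode(tokens, spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A sequence labeller would then learn to predict one such label per token from token-level features, which is where a feature selection step such as the one proposed in the article would apply.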

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.
