Exploiting extra-textual and linguistic information in keyphrase extraction

GÁBOR BEREND

doi:10.1017/S1351324914000126

Published online by Cambridge University Press: 30 September 2014

Affiliation: University of Szeged, Department of Informatics, Árpád tér 2, Szeged, H6720, Hungary
Email: berendg@inf.u-szeged.hu

Abstract

Keyphrases are the most important phrases of a document, which makes them useful for improving natural language processing tasks such as information retrieval, document classification, document visualization, summarization and categorization. Here, we propose a supervised framework augmented by novel extra-textual information derived primarily from Wikipedia. Unlike most other methods that rely on Wikipedia, our approach does not require a full textual index of all Wikipedia articles: we exploit only the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces comparable or superior results compared to previous similar works. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, matching or even outperforming the results obtained by other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features that are derived from the POS tagging of the keyphrase candidates.
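
As a rough illustration of the kind of candidate-level features the abstract mentions, the sketch below builds a sparse feature dictionary for one keyphrase candidate. It is not the paper's implementation: the function name `candidate_features`, the feature names, the toy multiword-expression list, and the particular POS encoding (whole tag sequence plus first and last tag) are all assumptions made for this example, and the category-hierarchy features are left out entirely.

```python
from typing import Dict, List, Set, Tuple


def candidate_features(
    tokens: List[str],
    pos_tags: List[str],
    mwe_list: Set[Tuple[str, ...]],
) -> Dict[str, float]:
    """Build a sparse feature dict for one keyphrase candidate (illustrative only)."""
    if not tokens:
        raise ValueError("candidate must contain at least one token")
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")

    normalized = tuple(t.lower() for t in tokens)
    features: Dict[str, float] = {}

    # Extra-textual feature (assumed encoding): does the candidate appear in a
    # list of multiword expressions, e.g. one derived from Wikipedia?
    features["in_mwe_list"] = 1.0 if normalized in mwe_list else 0.0

    # POS-based features (assumed encoding): one indicator for the full tag
    # sequence, plus indicators for the first and last tag of the candidate.
    features["pos_seq=" + "_".join(pos_tags)] = 1.0
    features["pos_first=" + pos_tags[0]] = 1.0
    features["pos_last=" + pos_tags[-1]] = 1.0
    return features


# Hypothetical toy data, not taken from the paper.
mwe_list = {("information", "retrieval")}
feats = candidate_features(["Information", "retrieval"], ["NN", "NN"], mwe_list)
```

Feature dictionaries of this shape could then be passed to any supervised classifier that accepts sparse indicator features; which learner and which exact features the paper uses is not stated in the abstract.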

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 1, January 2016, pp. 73-95

DOI: https://doi.org/10.1017/S1351324914000126
Copyright: © Cambridge University Press 2014

