A classification approach for detecting cross-lingual biomedical term translations

Published online by Cambridge University Press: 14 December 2015

H. HAKAMI
Affiliation:
Computer Science Department, Taif University, Saudi Arabia e-mail: hoda.h@tu.edu.sa
D. BOLLEGALA
Affiliation:
Department of Computer Science, The University of Liverpool, UK e-mail: danushka.bollegala@liverpool.ac.uk

Abstract

Finding translations for technical terms is an important problem in machine translation. In particular, in highly specialized domains such as biology or medicine, it is difficult to find bilingual experts to annotate sufficient cross-lingual texts in order to train machine translation systems. Moreover, new terms are constantly being generated in the biomedical community, which makes it difficult to keep the translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (source language), we propose a method for detecting its translations in a different language (target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations of each other. Training such a classifier is often complicated by the lack of common features between the source and target languages. We propose several feature space concatenation methods to overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted using a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.
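
To make the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes character n-grams are counted over a boundary-padded term, and it illustrates one possible way of concatenating the source and target feature spaces by prefixing each feature with its side. The abstract does not define the proposed concatenation methods, so the function names, the `src:`/`tgt:` prefixes, and the n-gram range are all hypothetical. The two evaluation helpers follow the standard definitions of top-k accuracy and mean average precision, under the simplifying assumption that each source term has exactly one correct translation.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of a term, padded with boundary markers."""
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def pair_features(source_term, target_term):
    """Build one feature dict for a (source, target) term pair.

    Assumption: prefixing features with "src:" / "tgt:" is an illustrative
    concatenation scheme only; it is not taken from the paper.
    """
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features[f"src:{gram}"] = count
    for gram, count in char_ngrams(target_term).items():
        features[f"tgt:{gram}"] = count
    return features


def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of source terms whose correct translation is in the top k."""
    hits = sum(1 for src, cands in ranked_lists.items() if gold[src] in cands[:k])
    return hits / len(ranked_lists)


def mean_average_precision(ranked_lists, gold):
    """MAP assuming a single correct translation per source term.

    With one relevant item, average precision is 1 / rank of that item
    (0 if the item is missing from the ranked list).
    """
    total = 0.0
    for src, cands in ranked_lists.items():
        if gold[src] in cands:
            total += 1.0 / (cands.index(gold[src]) + 1)
    return total / len(ranked_lists)
```

In such a setup, the dictionaries returned by `pair_features` would be vectorized and passed to any binary classifier, and the classifier's scores would be used to rank candidate target terms for each source term before computing the two metrics.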

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 
