
Leveraging bilingual terminology to improve machine translation in a CAT environment*

Published online by Cambridge University Press: 30 May 2017

MIHAEL ARCAN
Affiliation:
Insight Centre for Data Analytics, National University of Ireland, Galway e-mail: mihael.arcan@insight-centre.org
MARCO TURCHI
Affiliation:
FBK - Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy e-mail: turchi@fbk.eu
SARA TONELLI
Affiliation:
FBK - Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy e-mail: satonelli@fbk.eu
PAUL BUITELAAR
Affiliation:
Insight Centre for Data Analytics, National University of Ireland, Galway e-mail: paul.buitelaar@deri.org

Abstract

This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer-Aided Translation (CAT) scenario. We evaluate a framework that takes a small set of parallel documents as input, gathers domain-specific bilingual terms and injects them into an SMT system to enhance translation quality. To this end, we investigate several strategies for extracting and aligning terminology across languages and for integrating it into an SMT system. We compare two terminology injection methods that can easily be used at run-time without altering the normal activity of an SMT system: XML markup and a cache-based model. We test the cache-based model on two domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 BLEU points over the widely used XML markup approach.
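
To make the XML markup injection method named in the abstract more concrete, the following is a minimal sketch of how a bilingual term list could be applied to a source sentence before decoding. It assumes Moses-style inline markup, in which a source span is wrapped in a tag whose `translation` attribute carries the suggested target phrase. The function name `annotate_terms`, the tag name `term`, and the example term entries are hypothetical and chosen for illustration only; this is not the authors' implementation, and it does not cover the cache-based model.

```python
from xml.sax.saxutils import escape, quoteattr


def annotate_terms(sentence: str, terms: dict[str, str], tag: str = "term") -> str:
    """Wrap known source terms in XML markup carrying a suggested translation.

    `terms` maps a source-language term to its target-language translation
    (a hypothetical, hand-made term list for this sketch).
    """
    tokens = sentence.split()
    # Index term entries by their lower-cased token sequence.
    entries = {tuple(src.lower().split()): tgt for src, tgt in terms.items()}
    max_len = max((len(key) for key in entries), default=0)

    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest term that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in entries:
                match = (n, entries[key])
                break
        if match:
            n, target = match
            span = escape(" ".join(tokens[i:i + n]))
            out.append(f"<{tag} translation={quoteattr(target)}>{span}</{tag}>")
            i += n
        else:
            # Escape plain tokens so they cannot be mistaken for markup.
            out.append(escape(tokens[i]))
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    term_list = {
        "hard disk": "disco rigido",
        "hard disk drive": "unità disco rigido",
    }
    print(annotate_terms("Format the hard disk drive now", term_list))
```

The sketch matches the longest term first, so overlapping entries such as "hard disk" and "hard disk drive" do not produce nested markup. How the decoder treats the annotated span (for example, whether the suggested translation competes with or overrides the phrase table) depends on the decoder configuration and is not modeled here.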

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

Footnotes

*

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight).
