Measuring bilingual corpus comparability

Published online by Cambridge University Press:  15 January 2018

BO LI
Affiliation:
Department of Computer Science, Central China Normal University, Wuhan, China e-mail: libo@mail.ccnu.edu.cn
ERIC GAUSSIER
Affiliation:
CNRS-LIG/AMA, Université Grenoble Alpes, Grenoble, France e-mail: eric.gaussier@imag.fr
DAN YANG
Affiliation:
China Electric Power Research Institute, Wuhan, China e-mail: yangdan3@epri.sgcc.com.cn

Abstract

Comparable corpora serve as an important substitute for parallel resources for under-resourced language pairs. Previous work mostly aims to find better strategies for exploiting existing comparable corpora, while ignoring the variation in corpus quality. The quality of a comparable corpus strongly affects its usability in practice, a fact that several studies have confirmed. However, researchers have not yet established a widely accepted and fully validated framework for measuring corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy for testing them. Our experiments show that the proposed comparability measure captures gold-standard comparability levels very well and is robust to the choice of bilingual dictionary. Moreover, we show on the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of a real-world application.

Type
Article
Copyright
Copyright © Cambridge University Press 2018 
