Parallel Sentence Mining by Constrained Decoding

Nikolay Bogoychev
Faheem Kirefu

ACL, pp. 1672-1678, 2020.


Abstract:

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity...

Introduction
  • Having large and high-quality parallel corpora is critical for neural machine translation (NMT).
  • One way to create such a resource is to mine the web (Resnik and Smith, 2003).
  • Once texts are crawled from the web, they form large collections of data in different languages.
  • A natural approach is to score sentence similarity between all possible sentence pairs and extract the top-scoring ones.
  • This poses two major challenges, the first of which is determining the semantic similarity of a sentence pair in two languages
Highlights
  • Having large and high-quality parallel corpora is critical for neural machine translation (NMT)
  • The high precision reflects the effectiveness of using neural machine translation as a sentence similarity scorer
  • Maximising machine translation scores is biased towards finding machine-translated text produced by a similar model
  • More research is needed on this problem given the prevalent usage of neural machine translation
  • We hypothesise that part of the success of dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018) is its check that scores in both directions are approximately equal, whereas machine-translated text would be characterised by a high score in one direction
Methods
  • NMT systems can assign a conditional translation probability to an arbitrary sentence pair.
  • The authors could score every pair of source and target sentences with a translation system in quadratic time, and return pairs that score highly for further filtering.
  • The authors approximate this with beam search.
  • NMT naturally generates translations one token at a time from left to right, so it can follow the trie of target language sentences as it translates
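The trie-following idea can be sketched as a plain prefix trie over tokenized target-language sentences: at each decoding step, only tokens that keep the hypothesis inside the trie may be emitted. A minimal Python sketch; the dict-based trie and function names are illustrative assumptions, not the authors' implementation:

```python
END = None  # sentinel marking that a complete target sentence ends here


def build_trie(sentences):
    """Build a prefix trie over tokenized target-language sentences."""
    root = {}
    for tokens in sentences:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # sentence boundary
    return root


def allowed_tokens(node):
    """Tokens the decoder may emit next without leaving the trie."""
    return [tok for tok in node if tok is not END]


trie = build_trie([["the", "cat", "sat"], ["the", "dog", "ran"]])
# After emitting "the", only continuations present in the corpus are allowed.
assert sorted(allowed_tokens(trie["the"])) == ["cat", "dog"]
```

During beam search, each hypothesis carries its current trie node; expansions outside `allowed_tokens` are never considered, so every finished hypothesis is an exact sentence from the target corpus.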
Results
  • Experiments on the sample data in Table 1 show that pre-expansion pruning outperforms post-expansion pruning by about 10 F1 points
  • This can be explained by the fact that the decoder has a better chance to generate the correct target sentence if the available vocabulary is constrained.
  • The authors notice that Bicleaner achieves a more balanced precision and recall, while filtering by per-word cross-entropy leads to very high precision but lower recall
  • The latter does better in terms of F1.
  • This implies that the two filtering techniques keep different sentence pairs
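The difference between the two pruning orders can be shown with a toy decoder step (the scores, vocabulary, and beam width below are made up for illustration): post-expansion pruning can leave the beam empty when the highest-scoring tokens fall outside the trie, while pre-expansion pruning always keeps viable continuations.

```python
# Hypothetical next-token scores over a 6-token vocabulary; only tokens
# 2 and 5 would keep the hypothesis inside the trie.
scores = [0.40, 0.30, 0.10, 0.15, 0.04, 0.01]
allowed = {2, 5}
k = 2  # beam width

# Post-expansion pruning: take the top-k tokens first, then drop those
# that leave the trie. Here both survivors are disallowed.
top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
post = [i for i in top_k if i in allowed]   # -> []

# Pre-expansion pruning: restrict to trie-allowed tokens before taking
# the top-k, so the beam always keeps viable continuations.
pre = sorted(allowed, key=lambda i: -scores[i])[:k]  # -> [2, 5]
```

This is consistent with the reported gap: constraining the candidate set before expansion gives the decoder a better chance of reaching the correct target sentence.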
Conclusion
  • The authors bring a new insight into using NMT as a similarity scorer for sentences in different languages.
  • By constraining decoding with a target-side trie, beam search can approximate pairwise comparison between source and target sentences.
  • Overall, the authors present an interesting way of finding parallel sentences through trie-constrained decoding.
  • Maximising machine translation scores is biased towards finding machine-translated text produced by a similar model.
  • More research is needed on this problem given the prevalent usage of NMT.
  • The authors hypothesise that part of the success of dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018) is its check that scores in both directions are approximately equal, whereas machine-translated text would be characterised by a high score in one direction
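The agreement check behind this hypothesis can be sketched as follows. The form follows dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018), which penalises both disagreement between the two translation directions and the average per-word cross-entropy; the exact exponential form and weighting here should be treated as assumptions:

```python
import math


def dual_xent_score(h_fwd, h_bwd):
    """Adequacy score from forward and backward per-word cross-entropies.
    Penalises disagreement between directions plus the average entropy.
    Higher is better, with a maximum of 1.0 when both entropies are zero."""
    return math.exp(-(abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)))


# A genuine pair tends to have low, balanced entropies in both directions...
genuine = dual_xent_score(2.0, 2.1)
# ...while machine-translated text scores suspiciously well in one direction
# only, so the disagreement term drags its score down.
suspect = dual_xent_score(0.3, 4.0)
assert genuine > suspect
```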
Tables
  • Table1: Precision, recall and F1 of our methods on BUCC sample set
  • Table2: F1 scores of our method and other methods on BUCC De-En train and test sets
Related work
  • A typical parallel corpus mining workflow first aligns parallel documents to limit the search space for sentence alignment. Early methods rely on webpage structure (Resnik and Smith, 2003; Shi et al., 2006). Later, Uszkoreit et al. (2010) translate all documents into a single language, and shortlist candidate document pairs based on TF-IDF-weighted n-grams. Recently, Guo et al. (2019) suggest a neural method to compare document embeddings obtained from sentence embeddings.

    With the assumption that matched documents are parallel (no cross-alignment), sentence alignment can be done by comparing sentence length in words (Brown et al., 1991) or characters (Gale and Church, 1993), which is then improved by adding lexical features (Varga et al., 2005). After translating texts into the same language, BLEU can also be used to determine parallel texts, by anchoring the most reliable alignments first (Sennrich and Volk, 2011). Most recently, Thompson and Koehn (2019) propose to compare bilingual sentence embeddings with dynamic programming in linear runtime.
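As an illustration of the length-based approach, a Gale-Church-style statistic compares character lengths under the assumption that target length is roughly proportional to source length. The constants below are the commonly quoted defaults from the literature, used here as assumptions:

```python
import math


def length_match(len_src, len_tgt, c=1.0, s2=6.8):
    """Gale-Church-style length statistic: under the model, target length
    is roughly c times source length, with per-character variance s2.
    Smaller values indicate a more plausible sentence pair."""
    return abs(len_tgt - c * len_src) / math.sqrt(s2 * len_src)


# A near-equal-length pair looks far more plausible than a lopsided one.
assert length_match(100, 102) < length_match(100, 180)
```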
Funding
  • This work has received funding from the European Union under grant agreement INEA/CEF/ICT/A2017/1565602 through the Connecting Europe Facility
References
  • Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided open vocabulary image captioning with constrained beam search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, Copenhagen, Denmark. Association for Computational Linguistics.
  • Mikel Artetxe and Holger Schwenk. 2019. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3197–3203, Florence, Italy. Association for Computational Linguistics.
  • Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2018. Extracting parallel sentences from comparable corpora with STACC variants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  • Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  • Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
  • Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
  • Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91), pages 169–176, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.
  • Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176, Brussels, Belgium. Association for Computational Linguistics.
  • Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Hierarchical document encoder for parallel corpus mining. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 64–72, Florence, Italy. Association for Computational Linguistics.
  • Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics.
  • Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Brussels, Belgium. Association for Computational Linguistics.
  • Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.
  • Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Brussels, Belgium. Association for Computational Linguistics.
  • Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver, Canada. Association for Computational Linguistics.
  • Chongman Leong, Derek F. Wong, and Lidia S. Chao. 2018. UM-pAligner: Neural network-based parallel sentence identification model. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
  • Yijiang Lian, Zhijie Chen, Jinlong Hu, Kefeng Zhang, Chunwei Yan, Muchenxuan Tong, Wenying Han, Hanju Guan, Ying Li, Ying Cao, Yang Yu, Zhigang Li, Xiaochun Liu, and Yue Wang. 2019. An end-to-end generative retrieval method for sponsored search engine – decoding efficiently into a closed target domain. arXiv preprint arXiv:1902.00592.
  • Dragoș Ștefan Munteanu and Daniel Marcu. 2002. Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 289–295. Association for Computational Linguistics.
  • Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.
  • Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit's submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962, Brussels, Belgium. Association for Computational Linguistics.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Rico Sennrich and Martin Volk. 2011. Iterative, MT-based sentence alignment of parallel texts. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), pages 175–182, Riga, Latvia. Northern European Association for Language Technology (NEALT).
  • Lei Shi, Cheng Niu, Ming Zhou, and Jianfeng Gao. 2006. A DOM tree alignment model for mining parallel data from the web. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44), pages 489–496, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Brian Thompson and Philipp Koehn. 2019. Vecalign: Improved sentence alignment in linear time and space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1342–1348, Hong Kong, China. Association for Computational Linguistics.
  • Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1101–1109, Beijing, China. Coling 2010 Organizing Committee.
  • Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel corpora for medium density languages. In Proceedings of the RANLP 2005 Conference, pages 590–596.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • John Wieting, Kevin Gimpel, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. Simple and effective paraphrastic similarity from parallel translations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4602–4608, Florence, Italy. Association for Computational Linguistics.
  • Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60–67, Vancouver, Canada. Association for Computational Linguistics.
  • Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).