CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

Ahmed El-Kishky
Francisco Guzmán

EMNLP 2020.

Other Links: arxiv.org

Abstract:

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs, such as English-German, or on limited comparable collections, such as Wikipedia.

Introduction
  • Document alignment is the task of pairing documents that are translations or near-translations of each other.
  • There are a variety of tasks in natural language processing that require or benefit from parallel cross-lingual data.
  • Other tasks include cross-lingual information retrieval and cross-lingual document classification.
  • Cross-lingual data facilitates cross-lingual representations, as in the work of Lample and Conneau (2019), which has direct applications to zero-shot NLP tasks and the internationalization of models.
  • The availability of high-quality datasets is necessary to both train and evaluate models across these many tasks.
Highlights
  • Document alignment is the task of pairing documents that are translations or near-translations of each other
  • In the remainder of this section, we introduce document pair scoring functions that attempt to capture the notion of cross-lingual document similarity
  • We evaluate the baseline scoring functions by aligning documents from a subset of the 12 Common Crawl snapshots
  • Comparing Direct Document Embedding (DDE), which applies LASER directly to the entire document content, we see that performance is significantly lower than with Sentence Average Embedding (SAE), which averages the individual sentence embeddings
  • Our dataset contains document pairs from 92 different languages aligned with English
  • We further evaluate the URL-aligned documents in a downstream machine translation task by decomposing the aligned documents into aligned sentences and training machine translation models across all 92 directions
Methods
  • The authors show the alignment results in Table 4, broken down by language-resource level (low, mid, high) and overall.
  • Comparing DDE, which applies LASER directly to the entire document content, the authors see that performance is significantly lower than with SAE, which averages the individual sentence embeddings (a sketch of the two strategies follows this list).
  • Higher-resource directions appear to align better than lower-resource directions.
  • The authors believe this is a byproduct of the data LASER was trained on, which is predominantly high-resource.
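
A minimal sketch may help make the DDE/SAE comparison concrete. The encoder below is a hypothetical placeholder standing in for LASER (it deterministically hashes sentences into random vectors purely for illustration, and is not the real LASER API); only the DDE/SAE distinction mirrors the paper.

    import numpy as np

    def embed_sentences(sentences):
        # Placeholder for a multilingual sentence encoder such as LASER:
        # deterministically maps each input string to a 1024-dim vector.
        vectors = []
        for s in sentences:
            rng = np.random.default_rng(abs(hash(s)) % (2 ** 32))
            vectors.append(rng.standard_normal(1024))
        return np.stack(vectors)

    def dde(document):
        # Direct Document Embedding: encode the whole document as one unit.
        return embed_sentences([document])[0]

    def sae(document):
        # Sentence Average Embedding: encode sentences, then average them.
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        return embed_sentences(sentences).mean(axis=0)

    def cosine(u, v):
        # Cosine similarity between two document vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

One plausible reading of the gap is that SAE keeps each input within the sentence-length distribution the encoder was trained on, whereas DDE feeds the encoder entire documents it never saw at training time.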
Results
  • The authors evaluate the baseline scoring functions by aligning documents from a subset of the 12 Common Crawl snapshots.
  • The authors score document pairs from the source and target languages within the same web domain and apply the greedy document alignment procedure of Algorithm 1, whose one-to-one matching ensures the technique cannot simply align all pairs (a sketch follows this list).
  • Recall is computed on a test set consisting of pairs from the URL-aligned documents, which the authors verified to have high precision and which they treat as the ground truth.
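
As referenced above, here is a minimal sketch of greedy one-to-one matching in the spirit of Algorithm 1; the tuple format and the example documents and scores are illustrative assumptions, not the paper's data.

    def greedy_align(scored_pairs):
        # scored_pairs: iterable of (source_doc, target_doc, score) tuples
        # from one web domain. Pairs are taken in order of descending score,
        # and each document may participate in at most one alignment, so
        # low-scoring documents can remain unaligned.
        used_src, used_tgt, alignments = set(), set(), []
        for src, tgt, score in sorted(scored_pairs, key=lambda p: -p[2]):
            if src not in used_src and tgt not in used_tgt:
                alignments.append((src, tgt, score))
                used_src.add(src)
                used_tgt.add(tgt)
        return alignments

    pairs = [("en/a", "fr/a", 0.92), ("en/a", "fr/b", 0.40),
             ("en/b", "fr/b", 0.81), ("en/c", "fr/a", 0.77)]
    print(greedy_align(pairs))
    # [('en/a', 'fr/a', 0.92), ('en/b', 'fr/b', 0.81)]; 'en/c' stays unmatched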
Conclusion
  • The authors apply URL-matching rules to curate a high-quality cross-lingual document dataset from the Common Crawl corpus (a sketch of the URL-matching idea follows this list).
  • The dataset contains document pairs from 92 different languages aligned with English.
  • The authors first directly evaluate the quality of the URL-aligned pairs using human annotators.
  • The authors further evaluate the URL-aligned documents in a downstream machine translation task by decomposing the aligned documents into aligned sentences and training machine translation models across all 92 directions.
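
Table 1 refers to URL matching via language identifiers. The sketch below illustrates the general idea only: the single regular expression, the short language list, and the example URLs are assumptions for illustration, not the authors' actual rules, which cover many more URL patterns.

    import re
    from collections import defaultdict

    # Assumed simplification: only a /xx/ path segment marks the language.
    LANG_SEGMENT = re.compile(r"/(en|fr|de|es|ar|zh)(?=/|$)", re.IGNORECASE)

    def normalize(url):
        # Strip the language segment: example.com/fr/page -> example.com/page.
        stripped, n = LANG_SEGMENT.subn("", url)
        return stripped if n else None  # None: no language identifier found

    def pair_urls(urls):
        # URLs that normalize to the same key become candidate document pairs.
        buckets = defaultdict(list)
        for url in urls:
            key = normalize(url)
            if key:
                buckets[key].append(url)
        return {key: group for key, group in buckets.items() if len(group) > 1}

    urls = ["http://example.com/en/news", "http://example.com/fr/news",
            "http://example.com/about"]
    print(pair_urls(urls))
    # {'http://example.com/news': ['http://example.com/en/news',
    #                              'http://example.com/fr/news']}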
Tables
  • Table 1: URL matching via language identifiers
  • Table 2: Human evaluation of documents in different languages aligned to English. Languages are classified as high-, medium-, or low-resource based on the number of mined documents. We report the precision of the matching rules based on human annotation, as well as the inter-rater agreement
  • Table 3: BLEU scores of NMT models trained on bitext data mined from the aligned documents, on TED Talk test sets. Volume is given as the number of distinct aligned sentence pairs
  • Table 4: Content-based document alignment recall
Related work
The concept of crawling and mining the web to identify sources of parallel data has been previously explored (Resnik, 1999). A large body of this work has focused on identifying parallel text from multilingual data obtained from a single source: for example, the United Nations General Assembly Resolutions (Rafalovitch et al., 2009; Ziemski et al., 2016) or the European Parliament parallel corpus (Koehn, 2005). These parallel corpora were curated from specific, homogeneous sources by examining the content and deriving domain-specific rules for aligning documents.

Other approaches have identified parallel documents in unstructured web corpora by relying on metadata. Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents (Munteanu and Marcu, 2005, 2006; Udupa et al., 2009; Do et al., 2009; Abdul-Rauf and Schwenk, 2009). However, temporal features can be sparse, noisy, and unreliable. A different class of alignment methods relies on document structure (Resnik and Smith, 2003; Chen and Nie, 2000).
Study subjects and analysis
We first select 6 languages from various language families, scripts, and levels of resource availability. For each language, we identify 30 pairs of URLs, for a total of 180 pairs from the aligned dataset. To gather pairs from a diverse set of websites, each URL pair is selected from a distinct web domain.
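
A small sketch of this sampling procedure follows; the tuple layout (language, English URL, foreign URL) is an assumed data format, since the page does not specify one.

    import random
    from urllib.parse import urlparse

    def sample_pairs(aligned_pairs, per_language=30, seed=0):
        # aligned_pairs: list of (language, english_url, foreign_url) tuples.
        # Draws up to `per_language` pairs per language, at most one pair per
        # web domain, so the sample spans a diverse set of websites.
        pool = list(aligned_pairs)
        random.Random(seed).shuffle(pool)
        seen_domains, chosen = set(), {}
        for lang, en_url, foreign_url in pool:
            domain = urlparse(en_url).netloc
            if domain in seen_domains:
                continue  # enforce distinct web domains across the sample
            if len(chosen.setdefault(lang, [])) < per_language:
                chosen[lang].append((en_url, foreign_url))
                seen_domains.add(domain)
        return chosen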

References
  • Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.
  • Mikel Artetxe and Holger Schwenk. 2018. Margin-based parallel corpus mining with multilingual sentence embeddings. arXiv preprint arXiv:1811.01136.
  • Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the Common Crawl. In Proceedings of the Language Resources and Evaluation Conference, Reykjavík, Iceland.
  • Christian Buck and Philipp Koehn. 2016a. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 554–563.
  • Christian Buck and Philipp Koehn. 2016b. Quick and reliable document alignment via TF/IDF-weighted cosine distance. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 672–678.
  • Jiang Chen and Jian-Yun Nie. 2000. Parallel web text mining for cross-language IR. In Content-Based Multimedia Information Access, Volume 1, pages 62–77. Le Centre de Hautes Études Internationales d'Informatique Documentaire.
  • Aswarth Abhilash Dara and Yiu-Chang Lin. 2016. YODA system for WMT16 shared task: Bilingual document alignment. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 679–684.
  • Thi-Ngoc-Diep Do, Viet-Bac Le, Brigitte Bigi, Laurent Besacier, and Eric Castelli. 2009. Mining a comparable text corpus for a Vietnamese–French statistical machine translation system. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 165–172. Association for Computational Linguistics.
  • Luís Gomes and Gabriel Pereira Lopes. 2016. First steps towards coverage-based document alignment. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 697–702.
  • Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Hierarchical document encoder for parallel corpus mining. In Proceedings of the Fourth Conference on Machine Translation, pages 64–72, Florence, Italy. Association for Computational Linguistics.
  • Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113, Hong Kong, China. Association for Computational Linguistics.
  • Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of EACL 2017, pages 427–431.
  • Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.
  • Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability.
  • Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
  • Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  • James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38.
  • Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.
  • Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 81–88. Association for Computational Linguistics.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
  • Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535.
  • Alexandre Rafalovitch and Robert Dale. 2009. United Nations General Assembly resolutions: A six-language parallel corpus. In Proceedings of MT Summit XII.
  • Philip Resnik. 1999. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534. Association for Computational Linguistics.
  • Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.
  • Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. arXiv preprint arXiv:1805.09822.
  • Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.
  • Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374–1383.
  • Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2009. MINT: A method for effective and scalable mining of named entity transliterations from large comparable corpora. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 799–807. Association for Computational Linguistics.
  • Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3530–3534.