TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies

Kak Soky,Masato Mimura,Tatsuya Kawahara,Chenhui Chu,Sheng Li,Chenchen Ding,Sethserey Sam

International Journal of Asian Language Processing（2022）

引用 1|浏览5

暂无评分

摘要

This paper presents an extended work on the trilingual spoken language translation corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC), namely TriECCC. TriECCC is a simultaneously spoken language translation corpus with parallel resources of speech and text in three languages: Khmer, English, and French. This corpus has approximately [Formula: see text] thousand utterances, approximately [Formula: see text], [Formula: see text], and [Formula: see text] h in length of speech, and [Formula: see text], [Formula: see text] and [Formula: see text] million words in text, in Khmer, English, and French, respectively. We first report the baseline results of machine translation (MT), and speech translation (ST) systems, which show reasonable performance. We then investigate the use of the ROVER method to combine multiple MT outputs and fine-tune the pre-trained English–French MT models to enhance the Khmer MT systems. Experimental results show that the ROVER is effective for combining English-to-Khmer and French-to-Khmer systems. Fine-tuning from both single and multiple parents shows the effective improvement on the BLEU scores for Khmer-to-English/French and English/French-to-Khmer MT systems.

查看译文

关键词

trilingual corpus,cambodia,speech recognition,translation

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要