TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies

International Journal of Asian Language Processing (2022)

Abstract
This paper presents extended work on TriECCC, the trilingual spoken language translation corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC). TriECCC is a simultaneous spoken language translation corpus with parallel speech and text resources in three languages: Khmer, English, and French. The corpus has approximately [Formula: see text] thousand utterances, approximately [Formula: see text], [Formula: see text], and [Formula: see text] h of speech, and [Formula: see text], [Formula: see text], and [Formula: see text] million words of text, in Khmer, English, and French, respectively. We first report baseline results for machine translation (MT) and speech translation (ST) systems, which show reasonable performance. We then investigate using the ROVER method to combine multiple MT outputs, and fine-tuning pre-trained English–French MT models to enhance the Khmer MT systems. Experimental results show that ROVER is effective for combining English-to-Khmer and French-to-Khmer systems. Fine-tuning from both single and multiple parents improves BLEU scores for the Khmer-to-English/French and English/French-to-Khmer MT systems.
Keywords
trilingual corpus, Cambodia, speech recognition, translation