Enhancing Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
arXiv (2024)
Abstract
Machine translation focuses mainly on high-resource languages (HRLs), while
low-resource languages (LRLs) like Taiwanese Hokkien are relatively
under-explored. This study aims to address this gap by developing a dual
translation model between Taiwanese Hokkien and both Traditional Mandarin
Chinese and English. We employ a pre-trained LLaMA2-7B model specialized in
Traditional Mandarin Chinese to leverage the orthographic similarities between
Taiwanese Hokkien Han and Traditional Mandarin Chinese. Our comprehensive
experiments involve translation tasks across various writing systems of
Taiwanese Hokkien and between Taiwanese Hokkien and other HRLs. We find that
the use of a limited monolingual corpus further improves the model's
Taiwanese Hokkien capabilities. We then utilize our translation model to
standardize all Taiwanese Hokkien writing systems into Hokkien Han, resulting
in further performance improvements. Additionally, we introduce an evaluation
method incorporating back-translation and GPT-4 to ensure reliable translation
quality assessment even for LRLs. The study contributes to narrowing the
resource gap for Taiwanese Hokkien and empirically investigates the advantages
and limitations of pre-training and fine-tuning based on LLaMA 2.
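
The back-translation part of the evaluation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy translators, and the character-overlap F1 score (used here as a simple stand-in for the GPT-4 judgment) are all assumptions made for illustration. The idea shown is only the round trip: translate the source into the target language, translate the result back, and compare the round-trip text with the original source, so that quality can be estimated without a reference translation in the low-resource language.

```python
from collections import Counter
from typing import Callable

Translator = Callable[[str], str]


def char_f1(reference: str, hypothesis: str) -> float:
    """Character-overlap F1. A hypothetical stand-in for a stronger judge."""
    ref, hyp = Counter(reference), Counter(hypothesis)
    # Multiset intersection counts characters shared by both strings.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def back_translation_score(
    source: str,
    forward: Translator,
    backward: Translator,
    similarity: Callable[[str, str], float] = char_f1,
) -> float:
    """Score a forward translation by how well the round trip recovers the source."""
    candidate = forward(source)        # e.g. HRL -> Hokkien (assumed direction)
    round_trip = backward(candidate)   # e.g. Hokkien -> HRL
    return similarity(source, round_trip)


if __name__ == "__main__":
    # Toy translators (assumptions): a lossless pair and a lossy forward step.
    lossless = back_translation_score("hello", str.upper, str.lower)
    lossy = back_translation_score("abcdef", lambda s: s[:3], lambda s: s)
    print(lossless, lossy)
```

In practice the `forward` and `backward` callables would wrap the translation model, and `similarity` could be replaced by any scorer, including an LLM-based one; the sketch keeps them pluggable for that reason.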