
Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Annual Meeting of the Association for Computational Linguistics (2024)

Abstract
Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across languages carries over into the vocabulary, degrading translations that involve low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualization scope based on the input. Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found at https://github.com/ictnlp/Multiscale-Contextualization.
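The core mechanism the abstract describes is to contextualize different slices of the hidden dimension at different local scales and let attention mix the scales per input. Below is a minimal PyTorch sketch of that idea; the class name, kernel sizes, and the use of depthwise convolutions are illustrative assumptions, not taken from the paper's released code (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn

class MultiScaleContextualization(nn.Module):
    """Sketch of the MSC idea: each slice of the hidden dimension is
    contextualized with a different local window size, so downstream
    attention layers can weight whichever scale suits the input."""

    def __init__(self, d_model: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert d_model % len(kernel_sizes) == 0
        self.d_slice = d_model // len(kernel_sizes)
        # One depthwise 1D convolution per scale; the kernel size controls
        # how many neighboring byte states are mixed into each position.
        self.convs = nn.ModuleList(
            nn.Conv1d(
                self.d_slice, self.d_slice,
                kernel_size=k, padding=k // 2, groups=self.d_slice,
            )
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) byte-level hidden states
        slices = x.split(self.d_slice, dim=-1)
        out = [
            conv(s.transpose(1, 2)).transpose(1, 2)  # convolve over sequence axis
            for conv, s in zip(self.convs, slices)
        ]
        # Concatenate the scales back along the hidden dimension; the
        # attention module downstream integrates them dynamically.
        return torch.cat(out, dim=-1)


if __name__ == "__main__":
    msc = MultiScaleContextualization(d_model=512)
    h = torch.randn(2, 100, 512)  # e.g., 100 UTF-8 bytes per sentence
    print(msc(h).shape)  # torch.Size([2, 100, 512])
```

In this sketch each quarter of the 512-dimensional state sees a different window of neighboring bytes (1, 3, 5, or 7 positions), so attention can select the contextualization scope per position rather than committing to a single fixed scale.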

Key points: The paper proposes a method called Multi-Scale Contextualization (MSC) to improve byte-based neural machine translation. MSC learns contextual information at different scales across different hidden state dimensions and uses the attention module to integrate this information dynamically, addressing the problems that subword tokenization causes in neural machine translation.

Method: The proposed Multi-Scale Contextualization (MSC) learns multi-scale contextual information across different hidden state dimensions and dynamically integrates it through the attention module.

Experiments: Results show that MSC significantly outperforms subword-based methods and other byte-based models in both multilingual and out-of-domain scenarios. The code is available on GitHub.