Low-resource neural character-based noisy text normalization.
Journal of Intelligent & Fuzzy Systems (2019)
Abstract
User-generated text in social networks is often not written in its standard form. Such text introduces large lexical dispersion into datasets and yields inconsistent data, so normalizing it is a crucial preprocessing step for common Natural Language Processing tools. In this paper we explore the state of the art in machine-translation-based text normalization under low-resource conditions. We also propose an auxiliary task, novel to text normalization, for the sequence-to-sequence (seq2seq) neural architecture; it improves the base seq2seq model by up to 5%. This performance gain closes the gap between statistical machine translation approaches and neural ones for low-resource text normalization.
Keywords
Noisy text, normalization, recurrent neural networks, low-resource, autoencoding
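The abstract's core idea can be sketched as a character-level seq2seq model with a shared encoder and two decoders: one that normalizes the noisy input, and one that reconstructs it (the autoencoding auxiliary task named in the keywords). The sketch below is a minimal illustration of that multi-task setup in PyTorch; every name, hyperparameter, and the auxiliary-loss weight are assumptions for illustration, not the authors' actual implementation, and teacher-forcing offsets are omitted for brevity.

```python
# Illustrative sketch (not the paper's implementation): char-level
# seq2seq normalization with an autoencoding auxiliary task.
import torch
import torch.nn as nn

PAD = 0
chars = "abcdefghijklmnopqrstuvwxyz "
stoi = {c: i + 1 for i, c in enumerate(chars)}  # index 0 reserved for padding
V = len(stoi) + 1

def encode(s, width):
    """Map a string to padded character ids."""
    ids = [stoi[c] for c in s]
    return ids + [PAD] * (width - len(ids))

class CharSeq2Seq(nn.Module):
    """Shared encoder; one decoder normalizes, one reconstructs the input."""
    def __init__(self, vocab, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden, padding_idx=PAD)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_norm = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_auto = nn.GRU(hidden, hidden, batch_first=True)
        self.out_norm = nn.Linear(hidden, vocab)
        self.out_auto = nn.Linear(hidden, vocab)

    def forward(self, src, tgt_norm, tgt_auto):
        _, h = self.encoder(self.emb(src))            # shared representation
        y_n, _ = self.dec_norm(self.emb(tgt_norm), h)  # normalization decoder
        y_a, _ = self.dec_auto(self.emb(tgt_auto), h)  # autoencoding decoder
        return self.out_norm(y_n), self.out_auto(y_a)

# Toy batch: normalize "u" -> "you"; the auxiliary task copies "u" -> "u".
W = 4
src = torch.tensor([encode("u", W)])
tgt_n = torch.tensor([encode("you", W)])

model = CharSeq2Seq(V)
logits_n, logits_a = model(src, tgt_n, src)
ce = nn.CrossEntropyLoss(ignore_index=PAD)
lam = 0.5  # auxiliary-task weight: an assumption, not a value from the paper
loss = ce(logits_n.transpose(1, 2), tgt_n) + lam * ce(logits_a.transpose(1, 2), src)
loss.backward()  # both tasks update the shared encoder
```

Because both losses backpropagate through the same encoder, the autoencoding task acts as a regularizer that exposes the encoder to extra character sequences, which is one plausible reading of why an auxiliary task helps under low-resource conditions.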