Learning Mutually Informed Representations for Characters and Subwords
CoRR (2023)
Abstract
Most pretrained language models rely on subword tokenization, which processes
text as a sequence of subword tokens. However, different granularities of text,
such as characters, subwords, and words, carry different kinds of information.
Previous studies have shown that incorporating multiple input granularities
improves model generalization, yet very few of them output useful
representations for each granularity. In this paper, we introduce the
entanglement model, which combines character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as
separate modalities and generates mutually informed representations for both
granularities as output. We evaluate our model on text classification, named
entity recognition, and POS tagging tasks. Notably, the entanglement model
outperforms its backbone language models, particularly in the presence of noisy
text and low-resource languages. Furthermore, the entanglement model even
outperforms larger pretrained models on all English sequence labeling and
classification tasks. Our anonymized code is available at
https://anonymous.4open.science/r/noisy-IE-A673
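The abstract does not spell out the architecture, but the core idea (two granularities treated as separate modalities that inform each other) is commonly realized with bidirectional cross-attention in vision-language models. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, assuming cross-attention between a character stream and a subword stream; `EntanglementBlock` and all dimensions are illustrative names, not the paper's actual implementation.

```python
# Hypothetical sketch: each granularity attends over the other, so both output
# streams are "mutually informed". Not the paper's actual architecture.
import torch
import torch.nn as nn

class EntanglementBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Characters query subwords, and subwords query characters.
        self.char_attends_subword = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.subword_attends_char = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.char_norm = nn.LayerNorm(d_model)
        self.subword_norm = nn.LayerNorm(d_model)

    def forward(self, char_repr: torch.Tensor, subword_repr: torch.Tensor):
        # char_repr: (batch, n_chars, d_model), e.g. from a character LM backbone
        # subword_repr: (batch, n_subwords, d_model), e.g. from a subword LM backbone
        char_out, _ = self.char_attends_subword(char_repr, subword_repr, subword_repr)
        subword_out, _ = self.subword_attends_char(subword_repr, char_repr, char_repr)
        # Residual + norm: each stream keeps its own sequence length but now
        # carries information from the other granularity.
        char_repr = self.char_norm(char_repr + char_out)
        subword_repr = self.subword_norm(subword_repr + subword_out)
        return char_repr, subword_repr

# Usage: both outputs remain per-granularity representations, so either can
# feed a downstream head (e.g. NER or POS tagging over the subword stream).
chars = torch.randn(2, 40, 256)     # character-level backbone states
subwords = torch.randn(2, 12, 256)  # subword-level backbone states
char_informed, subword_informed = EntanglementBlock()(chars, subwords)
```

Because each stream retains its own sequence length, this arrangement would produce usable representations for both granularities at the output, which is the property the abstract says most prior multi-granularity models lack.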