Multi-Word Tokenization for Sequence Compression
EMNLP 2023(2024)
Abstract
Large Language Models have proven highly successful at modelling a variety of
tasks. However, this comes at a steep computational cost that hinders wider
industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer
that goes beyond word boundaries by representing frequent multi-word
expressions as single tokens. MWTs produce a more compact and efficient
tokenization that yields two benefits: (1) Increase in performance due to a
greater coverage of input data given a fixed sequence length and budget; (2)
Faster and lighter inference due to the ability to reduce the sequence length
with negligible drops in performance. Our results show that MWT is more robust
across shorter sequence lengths, thus allowing for major speedups via early
sequence truncation.
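
The core idea, treating frequent multi-word expressions as single tokens, can be illustrated with a minimal sketch. The abstract does not say how the expressions are selected or merged, so everything below is an assumption made for illustration: whitespace word splitting, frequency-ranked bigrams, greedy left-to-right merging, and the hypothetical helper names top_ngrams and merge_ngrams. It is not the authors' implementation.

```python
from collections import Counter


def top_ngrams(corpus, n=2, k=2):
    """Return the k most frequent word n-grams in the corpus (hypothetical helper)."""
    counts = Counter()
    for sentence in corpus:
        # Assumption: plain lowercase whitespace splitting stands in for a real pre-tokenizer.
        words = sentence.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram for gram, _ in counts.most_common(k)}


def merge_ngrams(words, ngrams, n=2):
    """Greedily merge known n-grams, left to right, into single tokens (hypothetical helper)."""
    out, i = [], 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if len(gram) == n and gram in ngrams:
            # Assumption: the merged token is written as the words joined by "_".
            out.append("_".join(gram))
            i += n
        else:
            out.append(words[i])
            i += 1
    return out


corpus = ["new york is big", "i love new york", "new york is far"]
frequent = top_ngrams(corpus, n=2, k=1)
words = "i love new york".split()
merged = merge_ngrams(words, frequent)
print(len(words), "->", len(merged), merged)
```

In this toy setup, each merged expression shortens the sequence by one token, which is the mechanism behind both claimed benefits: more input fits within a fixed sequence length, or the same input can be processed with a shorter one.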