TopGuNN: Fast NLP Training Data Augmentation using Large Corpora

Rebecca Iglesias-Flores, Megha Mishra, Ajay Patel, Akanksha Malhotra, Reno Kriz, Martha Palmer, Chris Callison-Burch

AAAI 2021 (2021)

Abstract

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
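
The retrieval step the abstract describes (given a query embedding, find the k most similar embeddings in an index) can be sketched as follows. This is a minimal illustration only, not the TopGuNN implementation: it uses exact brute-force cosine similarity over a tiny in-memory list, whereas the abstract states that TopGuNN uses approximate k-NN over roughly 1.5B embeddings. The function names, the toy labels, and the 3-dimensional vectors are all hypothetical and stand in for contextual embeddings produced by a language model.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_neighbors(query, index, k):
    """Return the k (score, label) pairs most similar to `query`, best first.

    `index` is a list of (label, vector) pairs. This scans every entry, so it
    is exact but costs O(N) per query; an approximate index trades some
    accuracy for much faster lookups at corpus scale.
    """
    scored = ((cosine_similarity(query, vec), label) for label, vec in index)
    return heapq.nlargest(k, scored)


# Hypothetical toy index: labels and 3-d vectors are made up for illustration.
toy_index = [
    ("bank (river)", [0.9, 0.1, 0.0]),
    ("bank (finance)", [0.1, 0.9, 0.2]),
    ("shore", [0.8, 0.2, 0.1]),
    ("loan", [0.0, 1.0, 0.3]),
]

query_vec = [1.0, 0.0, 0.0]
neighbors = top_k_neighbors(query_vec, toy_index, k=2)
for score, label in neighbors:
    print(f"{label}: {score:.3f}")
```

In the data augmentation use case, each retrieved neighbor would correspond to a sentence from the corpus whose context is close to that of an expert-written seed example, making it a candidate new training example.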
