Data Distribution Bottlenecks in Grounding Language Models to Knowledge Bases
CoRR (2023)
Abstract
Language models (LMs) have already demonstrated remarkable abilities in
understanding and generating both natural and formal language. Despite these
advances, their integration with real-world environments such as large-scale
knowledge bases (KBs) remains an underdeveloped area, affecting applications
such as semantic parsing and leading to "hallucinated" information. This
paper is an experimental investigation aimed at uncovering the robustness
challenges that LMs encounter when tasked with knowledge base question
answering (KBQA). The investigation covers scenarios with inconsistent data
distribution between training and inference, such as generalization to unseen
domains, adaptation to various language variations, and transferability across
different datasets. Our comprehensive experiments reveal that even when
employed with our proposed data augmentation techniques, advanced small and
large language models exhibit poor performance in various dimensions. Although
LMs are a promising technology, their robustness in complex environments
remains fragile in their current form, and their practicality is limited by the
data distribution issue. This calls for future research on data collection and
LM learning paradigms.