All-in-One Image-Grounded Conversational Agents

arXiv (2019)

Abstract
As single-task accuracy on individual language and image tasks has improved substantially in the last few years, the long-term goal of a generally skilled agent that can both see and talk becomes more feasible to explore. In this work, we focus on leveraging existing individual language and image tasks, along with resources that incorporate both vision and language, towards that objective. We explore architectures that combine state-of-the-art Transformer and ResNeXt modules fed into a multimodal module to produce a combined model trained on many tasks. We provide a thorough analysis of the components of the model, and of transfer performance when training on one, some, or all of the tasks. Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state of the art in image-grounded conversational applications.