Transfer Learning for Multimodal Dialog

Computer Speech & Language (2020)

Abstract
Audio-Visual Scene-Aware Dialog (AVSD) is best understood as an extension of Visual Question Answering, the task of generating a textual answer in response to a textual question about multimedia content. In AVSD, the answer-relevant "context" is expanded to include past dialog turns, which we view as a specialized form of extra textual knowledge (in addition to the standard video features). We have developed a framework that uses hierarchical attention to fuse contributions from different modalities, and have shown how it can be used to generate textual summaries from multi-modal sources, specifically videos with accompanying commentary. In this paper, we transfer the algorithmic approach, models, and data from this background corpus of 2000 hours of how-to videos to the AVSD task, and report our findings. Our approach uses dialog context, but makes no assumption about the ordering of the history. Our system achieves the best performance in both automatic and human evaluations in the 7th Dialog State Tracking Challenge (AVSD).
Keywords
Multimodal dialog,Video question answering,Transfer learning
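The hierarchical attention fusion mentioned in the abstract can be sketched in miniature: attend over the features within each modality to get a per-modality summary, then attend over those summaries to get one fused context vector. This is a minimal NumPy sketch under assumed shapes and dot-product scoring; the function names and two-level structure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, features):
    """Dot-product attention: weighted sum of rows of `features`,
    weighted by each row's similarity to `query`."""
    scores = features @ query          # (T,)
    weights = softmax(scores)          # (T,) sums to 1
    return weights @ features          # (d,)

def hierarchical_fusion(query, modality_features):
    """Level 1: attend within each modality (e.g. video, audio,
    dialog history). Level 2: attend across the per-modality
    summaries to produce one fused context vector."""
    summaries = np.stack([attend(query, f) for f in modality_features])  # (M, d)
    return attend(query, summaries)    # (d,)

# toy example: 3 modalities with different sequence lengths, d = 4
rng = np.random.default_rng(0)
query = rng.normal(size=4)
mods = [rng.normal(size=(t, 4)) for t in (5, 7, 3)]
fused = hierarchical_fusion(query, mods)
```

The second attention level lets the model reweight whole modalities per question, which matches the abstract's framing of dialog history as one more knowledge source alongside video features.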