DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset
arXiv (2022)
Abstract
Because sharing images is a crucial part of instant messaging, there has been
active research on learning image-text multi-modal dialogue models. However,
training a well-generalized multi-modal dialogue model remains challenging due
to the low quality and limited diversity of images per dialogue in existing
multi-modal dialogue datasets. In this paper, we propose an automated pipeline
to construct a multi-modal dialogue dataset, ensuring both dialogue quality and
image diversity while requiring minimal human effort. In our pipeline, to
guarantee the coherence between images and dialogue, we prompt GPT-4 to infer
potential image-sharing moments, specifically the utterance, speaker,
rationale, and image description. Furthermore, we leverage CLIP similarity to
maintain consistency among the multiple images aligned to each utterance. Through
this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal
dialogue dataset that surpasses existing datasets in terms of quality and
diversity in human evaluation. Our comprehensive experiments highlight that
when multi-modal dialogue models are trained using our dataset, their
generalization performance on unseen dialogue datasets is significantly
enhanced. We make our source code and dataset publicly available.
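
The similarity-based filtering step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the image description and the candidate images have already been embedded (for example by CLIP's text and image encoders), and the function names, the threshold value, and the keep-if-above-threshold rule are all hypothetical, since the abstract does not specify the exact criterion.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_images_by_similarity(
    text_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[int]:
    """Return indices of candidate images whose similarity to the
    text embedding is at least `threshold` (assumed filtering rule)."""
    return [
        index
        for index, image_embedding in enumerate(image_embeddings)
        if cosine_similarity(text_embedding, image_embedding) >= threshold
    ]


# Toy 2-D embeddings standing in for real CLIP vectors.
description = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kept = filter_images_by_similarity(description, candidates, threshold=0.5)
```

In this toy example the second candidate is orthogonal to the description and is dropped, while the first and third are kept. A real pipeline would substitute actual CLIP embeddings and tune the threshold empirically.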