Image Chat: Engaging Grounded Conversations

arXiv, 2020.


Abstract:

To achieve the long-term goal of machines being able to engage humans in conversation, our models should captivate the interest of their speaking partners. Communication grounded in images, whereby a dialogue is conducted based on a given photo, is a setup naturally appealing to humans (Hu et al., 2014). In this work we study large-scale...

Introduction
  • A key way for machines to exhibit intelligence is for them to be able to perceive the world around them – and to be able to communicate with humans in natural language about that world.
  • To speak naturally with humans it is necessary to understand the natural things that humans say about the world they live in, and to respond in kind.
  • This involves understanding what they perceive, e.g. the images they see, what those images mean semantically for humans, and how mood and style shape the language and conversations derived from these observations.
  • The authors propose ways to fuse these modalities together and perform a detailed study including automatic evaluations, ablations, and human evaluations of the models using crowdworkers (a sketch of the fusion approach follows this list).
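To make the fusion concrete, below is a minimal sketch of a sum-based ("MM-Sum"-style) context encoder whose output is scored against candidate responses by dot product. The encoder settings, dimensions, vocabulary size, and number of style traits are illustrative assumptions for this sketch, not the authors' exact TRANSRESNET implementation.

```python
import torch
import torch.nn as nn

class MMSumFusion(nn.Module):
    """Sketch: sum pre-computed image features, a style-trait embedding,
    and an encoded dialogue history into one context vector, then score
    candidate responses by dot product. All sizes are placeholders."""

    def __init__(self, vocab_size=30000, img_feat_dim=2048, n_styles=215, dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_feat_dim, dim)   # project frozen image features
        self.style_emb = nn.Embedding(n_styles, dim)   # one vector per style trait
        self.history_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.response_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)

    def _encode_text(self, encoder, tokens):
        # Mean-pool token representations into one vector per sequence.
        return encoder(self.tok_emb(tokens)).mean(dim=1)

    def forward(self, img_feats, style_ids, history_tokens, candidate_tokens):
        context = (self.img_proj(img_feats)
                   + self.style_emb(style_ids)
                   + self._encode_text(self.history_enc, history_tokens))
        candidates = self._encode_text(self.response_enc, candidate_tokens)
        return context @ candidates.t()                # (batch, n_candidates) scores
```

In a setup like this, the retrieval variant is typically trained with the gold response mixed into a batch of negative candidates, while the generative variant would instead condition a decoder on the fused context vector.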
Highlights
  • A key way for machines to exhibit intelligence is for them to be able to perceive the world around them – and to be able to communicate with humans in natural language about that world
  • Focusing on the case of chit-chatting about a given image, a naturally useful application for end-users of social dialogue agents, this work shows that our best proposed model can generate grounded dialogues that humans prefer over dialogues with fellow humans almost half of the time (47.7%).
  • Our work shows that we are close to having models that humans can relate to in chit-chat conversations, which could set new ground for social dialogue agents.
  • Our retrieval models outperformed their generative versions; closing that gap is an important challenge for the community.
  • While our human evaluations were on short conversations, initial investigations indicate that the model, as is, can extend to longer chats (see Appendix G), which should be studied in future work.
Methods
  • The authors test the models on the IMAGE-CHAT and IGC datasets using automatic metrics and human evaluations.
  • Module Choices: The authors first compare various module configurations of the TRANSRESNETRET model and show results for a simple information retrieval baseline, in which candidates are ranked according to their weighted word overlap with the input message (a sketch of such a baseline follows this list).
  • The average metrics indicate that using the ResNeXt-IG-3.5B image encoder features improves performance significantly across the whole task, as the authors obtain 50.3% R@1 for the best ResNeXt-IG-3.5B model but only 40.6% for the best ResNet-152 model.
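As a rough illustration of the information-retrieval baseline mentioned above, candidates can be ranked by IDF-weighted word overlap with the input message. The exact weighting the authors use is not spelled out here, so the IDF-based scoring and function names below are assumptions for illustration.

```python
import math
from collections import Counter

def idf_weights(candidates):
    """Inverse-document-frequency weights over the candidate set,
    a stand-in for the baseline's unspecified word weighting."""
    df = Counter()
    for cand in candidates:
        df.update(set(cand.lower().split()))
    n = len(candidates)
    return {w: math.log(n / df[w]) for w in df}

def rank_by_weighted_overlap(message, candidates):
    """Rank candidate utterances by IDF-weighted word overlap with the
    input message; higher-scoring candidates come first."""
    idf = idf_weights(candidates)
    msg_words = set(message.lower().split())
    def score(cand):
        return sum(idf.get(w, 0.0) for w in set(cand.lower().split()) & msg_words)
    return sorted(candidates, key=score, reverse=True)

# Example: rank_by_weighted_overlap("what a beautiful sunset", train_utterances)[:1]
```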
Results
  • Evaluation Setup

    The authors use a set of 500 images from YFCC-100M that are not present in IMAGE-CHAT to build a set of three-round dialogues pairing humans with models in conversation.
  • TRANSRESNETGEN generates a response, whereas TRANSRESNETRET retrieves candidate utterances from the IMAGE-CHAT training set.
  • The latter is given a separate set of candidates corresponding to the round of dialogue – e.g. when producing a response to turn 1, the model retrieves from all possible round 1 utterances in the training set (a minimal sketch of this setup follows this list).
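The round-specific candidate setup can be sketched as follows; the data layout, function names, and scoring hook are assumptions for illustration rather than the authors' code.

```python
from collections import defaultdict

def build_round_candidates(train_dialogues):
    """Group training utterances by turn index, so that a retrieval model
    answering turn t only scores utterances that occurred at turn t in the
    training data. `train_dialogues` is assumed to be a list of lists of
    utterance strings, one inner list per dialogue."""
    by_round = defaultdict(list)
    for dialogue in train_dialogues:
        for turn_idx, utterance in enumerate(dialogue):
            by_round[turn_idx].append(utterance)
    return by_round

def retrieve_response(score_fn, context, turn_idx, round_candidates):
    """Return the highest-scoring candidate for the current round.
    `score_fn(context, candidate)` stands in for the trained retrieval
    model's scoring function."""
    return max(round_candidates[turn_idx], key=lambda c: score_fn(context, c))
```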
Conclusion
  • This paper presents an approach for improving the way machines can generate grounded conversations that humans find engaging.
  • Focusing on the case of chit-chatting about a given image, a naturally useful application for end-users of social dialogue agents, this work shows that the best proposed model can generate grounded dialogues that humans prefer over dialogues with fellow humans almost half of the time (47.7%).
  • This result is made possible by the creation of a new dataset, IMAGE-CHAT (http://parl.ai/projects/image_chat).
  • The challenge will be to combine this engagingness with other skills, such as world knowledge (Antol et al., 2015), relation to personal interests (Zhang et al., 2018), and task proficiency.
Tables
  • Table1: IMAGE-CHAT dataset statistics
  • Table2: Module choices on IMAGE-CHAT. We compare different module variations for TRANSRESNETRET
  • Table3: Ablations on IMAGE-CHAT. We compare variants of our best TRANSRESNET generative and retrieval models (ResNeXt-IG-3.5B image encoder, and MM-Sum + separate text encoders for retrieval) where we remove modalities: image, dialogue history and style conditioning, reporting R@1/100 for retrieval and ROUGE-L for generation for dialogue turns 1, 2 and 3 independently, as well as the average over all turns
  • Table4: IGC Human Evaluation on responses from our TRANSRESNET MM-SUM model conditioned on various personalities. Responses were rated on a quality scale from 1 to 3, where 3 is the highest
  • Table5: Highly rated examples from the IGC dataset test split where TRANSRESNETRET MM-Sum responses were rated the highest (score of 3) by human evaluators
  • Table6: Low rated examples from the IGC dataset test split where TRANSRESNETRET MM-Sum responses were rated the lowest (score of 1) by human evaluators
  • Table7: Ablations on IMAGE-CHAT. We compare variants of our best TRANSRESNET generative model (ResNeXt-IG-3.5B image encoder) where we remove modalities: image, dialogue history and style conditioning, reporting F1 and BLEU-4 for generation for dialogue turns 1, 2 and 3 independently, as well as the average over all turns
Related work
  • The majority of work in dialogue is not grounded in perception, e.g. much recent work explores sequence-to-sequence models or retrieval models for goal-directed (Henderson et al., 2014) or chit-chat tasks (Vinyals and Le, 2015; Zhang et al., 2018). While these tasks are text-based only, many of the techniques developed can likely be transferred for use in multimodal systems, for example using state-of-the-art Transformer representations for text (Mazare et al., 2018) as a sub-component.

    In the area of language and vision, one of the most widely studied areas is image captioning, whereby a single utterance is output given an input image. This typically involves producing a factual, descriptive sentence describing the image, in contrast to producing a conversational utterance as in dialogue. Popular datasets include COCO (Chen et al., 2015) and Flickr30k (Young et al., 2014). Again, a variety of sequence-to-sequence (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018) and retrieval models (Gu et al., 2018; Faghri et al., 2018; Nam et al., 2016) have been applied. These tasks measure the ability of models to understand the content of an image, but not to carry out an engaging conversation grounded in perception. Some works have extended image captioning from being purely factual towards more engaging captions by incorporating style while still being single turn, e.g. (Mathews et al., 2016, 2018; Gan et al., 2017; Guo et al., 2019; Shuster et al., 2019). Our work also applies a style component, but concentrates on image-grounded dialogue rather than image captioning.
Reference
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and vqa. CVPR.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Dan Bohus and Eric Horvitz. 2009. Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics.
  • Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2020. The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pages 187–208. Springer.
  • Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives.
  • Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. 2017. Stylenet: Generating attractive visual captions with styles. In Proc IEEE Conf on Computer Vision and Pattern Recognition, pages 3137–3146.
  • J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang. 2018. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7181–7189.
  • Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, and Hanqing Lu. 2019. Mscap: Multi-style image captioning with unpaired stylized text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4204–4213.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.
  • Yuheng Hu, Lydia Manikonda, and Subbarao Kambhampati. 2014. What we instagram: A first analysis of instagram photo content and user types. In Eighth International AAAI Conference on Weblogs and Social Media.
  • Bernd Huber, Daniel McDuff, Chris Brockett, Michel Galley, and Bill Dolan. 2018. Emotional dialogue generation using image-grounded language models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 277. ACM.
  • Margaret Li, Jason Weston, and Stephen Roller. 2019. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.
  • Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In Computer Vision – ECCV 2018, pages 185–201, Cham. Springer International Publishing.
  • Sebastien Marcel and Yann Rodriguez. 2010. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, pages 1485–1488. ACM.
  • Alexander Mathews, Lexing Xie, and Xuming He. 2018. Semstyle: Learning to generate stylised image captions using unaligned text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8591–8600.
  • Alexander Patrick Mathews, Lexing Xie, and Xuming He. 2016. Senticap: Generating image descriptions with sentiments. In AAAI, pages 3574–3580.
  • Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium. Association for Computational Linguistics.
  • A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. Parlai: A dialog research software platform. In Empirical Methods in Natural Language Processing (EMNLP), pages 79–84.
  • Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2016. Dual attention networks for multimodal reasoning and matching. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2156–2164.
  • Ramakanth Pasunuru and Mohit Bansal. 2018. Gamebased video-context dialogue. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 125–136, Brussels, Belgium. Association for Computational Linguistics.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
  • Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2019. Engaging image captioning via personality. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. Yfcc100m: The new data in multimedia research. Commun. ACM, 59(2):64–73.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML).
  • Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.
  • Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
  • Zhou Yu, Leah Nicolich-Henkin, Alan W Black, and Alexander Rudnicky. 2016. A wizard-of-oz study on a non-task-oriented dialog systems that reacts to user engagement. In Proceedings of the 17th annual meeting of the Special Interest Group on Discourse and Dialogue, pages 55–63.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of the 31st International Conference on Machine Learning, Deep Learning Workshop, Lille, France.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. 2017. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995.
  • Multiple Traits: In the IGC human evaluation setup from Mostafazadeh et al. (2017), human annotators were shown eight choices when rating the quality of responses to questions: seven responses from various models, and one human response. To mirror this setup as closely as possible, we chose seven of our highest performing style traits to condition on to display in addition to the human response. We show the results of each trait in Table 4.
  • Automatic Evaluation: In Mostafazadeh et al. (2017), the authors provide BLEU scores for their models in an attempt to evaluate their effectiveness via automated metrics. The authors note that the scores are very low, "as is characteristic for tasks with intrinsically diverse outputs." Additionally, it has been shown in Shuster et al. (2019) that BLEU scores for image captioning retrieval models are generally far lower than those of generative models (as retrieval models do not optimize for such a metric), and yet human evaluations can show the complete opposite results. In fact, in that work retrieval models were shown to be superior to generative models in human evaluations, which is why we adopted them here. For these reasons we omit BLEU scores of our retrieval models on the IGC test set as uninteresting. We do, however, compare BLEU scores with our generative model in the main paper (a short illustration of this gap follows).
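For context, the gap discussed above is easy to reproduce with standard tooling such as NLTK's sentence-level BLEU; the sentences below are invented examples and the smoothing choice is an arbitrary assumption.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# An engaging retrieved response may share almost no n-grams with the single
# gold reference, so its BLEU-4 score is near zero, while a bland generated
# response that copies reference words scores much higher.
reference = ["what", "a", "lovely", "beach", "to", "relax", "on"]
retrieved = ["i", "would", "love", "to", "spend", "all", "day", "there"]
generated = ["what", "a", "lovely", "beach"]

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], retrieved, smoothing_function=smooth))  # near zero
print(sentence_bleu([reference], generated, smoothing_function=smooth))  # noticeably higher
```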