Reading Between the Lines: Exploring Infilling in Visual Narratives

EMNLP 2020, pp. 1220-1229.


Abstract

Generating long form narratives such as stories and procedures from multiple modalities has been a long-standing dream for artificial intelligence. In this regard, there is often crucial subtext that is derived from the surrounding contexts. The general seq2seq training methods render the models shorthanded while attempting to bridge the ...

Introduction
  • Humans process information from their surrounding contexts from multiple modalities.
  • Recent advances have seen a surge of interest in vision and language as source and target modalities respectively.
  • An example recipe step from the dataset: "First cream the butter and the vanilla extract together with a hand mixer. It only takes a few minutes."
Highlights
  • Humans process information from their surrounding contexts from multiple modalities
  • We present a Visual Procedure Telling (ViPT) dataset similar to the Visual Storytelling (ViST) dataset with 46k procedures on various domains
  • Data Collection Process: We manually examined around 10 blogging websites with various user-written text on several how-to activities. Among these, we found that Snapguide and Instructables consistently provide pairs of textual descriptions along with their images.
  • INet: We re-implemented the model that achieves state-of-the-art results (Hu et al., 2020) on the visual storytelling dataset.
  • We introduce a new large scale ViPT dataset of 46k procedures and 340k image-text pairs comprising 10 categories
  • We conclusively show the higher significance of infilling based techniques in visual procedures compared to visual stories
Methods
  • Visual Procedure Telling (ViPT): While there are several types of narratives such as literary, factual and persuasive, this paper looks into stories and procedures.
  • This section describes the new ViPT dataset and highlights its differences from ViST; a minimal sketch of one possible record layout follows this list.
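To make the dataset layout concrete, below is a minimal Python sketch of what one ViPT procedure record could look like when loaded. The class and field names (Procedure, Step, category, image_path, text) are illustrative assumptions rather than the released schema; the first step reuses the recipe sentence quoted in the Introduction, and the remaining contents are invented purely for illustration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Step:
        image_path: str   # path to the image paired with this step
        text: str         # textual description tethered to that image

    @dataclass
    class Procedure:
        category: str     # one of the 10 domains, e.g. "recipes"
        title: str
        steps: List[Step]

    # Toy record in the spirit of the dataset (contents are illustrative only).
    example = Procedure(
        category="recipes",
        title="Buttercream frosting",
        steps=[
            Step("images/step_01.jpg",
                 "First cream the butter and the vanilla extract together "
                 "with a hand mixer. It only takes a few minutes."),
            Step("images/step_02.jpg",
                 "Gradually add the sugar while continuing to mix."),
        ],
    )

Each procedure is thus a sequence of image-text pairs, which is what makes every step's description "tethered" to an image in the visual narrative telling setup.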
Results
  • Results and Discussion: The authors present the effects of infilling during both training and inference on the ViST and ViPT datasets.
  • Infilling during training: The overall performance of the models is presented in Table 3.
  • Both infilling model variants achieve higher scores on the recipes while not decreasing their performance on stories.
  • The authors perform infilling at train time and at inference time to evaluate the model's ability to bridge contexts when the corresponding image is absent and to handle real-world data imputation scenarios; a rough sketch of such inference-time infilling follows this list.
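As a rough illustration of inference-time infilling, the sketch below fills a missing image representation by averaging the features of the neighbouring steps. This mirrors the interpolation idea described above, but the exact interpolation used by the authors may differ, and the feature dimensionality (2048) is only a placeholder.

    import numpy as np

    def infill_missing_feature(step_features, missing_idx):
        """Return a copy of step_features with the absent entry at missing_idx
        replaced by the mean of its available immediate neighbours, falling
        back to the mean of all available steps if both neighbours are absent."""
        neighbours = [f for i, f in enumerate(step_features)
                      if f is not None and abs(i - missing_idx) == 1]
        if not neighbours:
            neighbours = [f for f in step_features if f is not None]
        filled = list(step_features)
        filled[missing_idx] = np.mean(neighbours, axis=0)
        return filled

    # Toy usage: a 5-step procedure whose third image is unavailable.
    features = [np.random.rand(2048) for _ in range(5)]
    features[2] = None
    completed = infill_missing_feature(features, missing_idx=2)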
Conclusion
  • The input to the model is provided with masked contexts, and the model is optimized with the objective of masked span prediction; a minimal sketch of such an objective appears after this list.
  • The authors hypothesize that this technique provides gains in narratives with a higher extent of overlapping contexts, since it offers an opportunity to reconstruct the missing local context from the overall global context.
  • (2) addressing the underspecification problem by controlling the content in the infilled image with explicit guidance; this is as opposed to the implicit content filling that the authors perform through interpolation.
  • These infilling techniques are immensely useful when dealing with data imputation with missing contexts and with collaborative authoring in real-world scenarios.
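A minimal sketch of a masked span prediction objective of the kind described above, written with PyTorch. The model interface (token ids in, per-position vocabulary logits out), the MASK_ID constant, and the toy stand-in model are assumptions for illustration, not the authors' released code.

    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # hypothetical id of the [MASK] token

    def masked_span_loss(model, token_ids, span_start, span_end):
        """Hide the span [span_start, span_end) in the input and train the
        model to reconstruct the original tokens at those positions only."""
        inputs = token_ids.clone()
        inputs[:, span_start:span_end] = MASK_ID        # mask the local context
        logits = model(inputs)                          # (batch, seq_len, vocab)
        span_logits = logits[:, span_start:span_end, :]
        span_targets = token_ids[:, span_start:span_end]
        return F.cross_entropy(
            span_logits.reshape(-1, span_logits.size(-1)),
            span_targets.reshape(-1),
        )

    # Toy usage with a stand-in "model": an embedding followed by a linear layer.
    vocab_size, dim = 100, 32
    toy_model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, dim),
                                    torch.nn.Linear(dim, vocab_size))
    tokens = torch.randint(1, vocab_size, (2, 10))      # batch of 2, length 10
    loss = masked_span_loss(toy_model, tokens, span_start=3, span_end=6)

Only the masked positions contribute to the loss, which is what pushes the model to reconstruct missing local context from the surrounding global context.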
Tables
  • Table 1: Details of the ViST and Visual Procedure Telling datasets broken down into 10 categories. This paper studies the effects of infilling techniques for visual narrative generation. An alternate stream of work to improve the context in stories explicitly provides supporting information such as entities (Clark et al., 2018; Xu et al., 2018), latent templates (Wiseman et al., 2018), knowledge graphs (Yang et al., 2019), etc. In contrast, infilling provides an opportune platform to implicitly learn the contextual information. Our work is positioned at the intersection of infilling and multimodal language generation.
  • Table 2: Regrouping the categories in the ViPT dataset. Chandu et al. (2019) showed that cooking recipes paired with images form narratives. We extend this work to gather procedures or 'how-to' articles that have step-by-step instructions along with an associated pairwise image for each step in several domains. To facilitate multi-domain research with stronger interleaved contexts between surrounding steps, we present a large scale visual procedure telling dataset with 46k procedures comprising 340k pairwise images and textual descriptions. It is carefully curated from a number of how-to blogging websites. Our dataset comprises pairwise images and textual descriptions of the corresponding images, typically describing a step in a procedure. This means that each description of a step is tethered to an image, which makes it a visual narrative telling task. We categorized the dataset into 10 distinct domains: recipes, crafts, outdoors, lifestyle, technology, styling, fitness, hobbies, pets and miscellaneous. The category-wise details of the dataset are presented in Table 1. As we can observe, the dataset is dominated by cooking recipes, which are of a relatively similar size to ViST compared to the rest of the domains.
  • Table 3: Performance of different models on stories (from ViST) and recipes (from ViPT).
  • Table 4: Performance of infilling during inference for recipes in Visual Procedure Telling. The stories in ViST consist of 5 steps, and the cooking recipes are truncated to 5 steps to perform a fair comparison of the effect of the index being infilled. An overview of infilling based training is depicted in Figure 1. The underlying encoding and decoding stages are described here.
  • Table 5: Performance of infilling during inference for Visual Story Telling. The global features are used in each of the above strategies. As we can see, the contribution of the global features to reconstruct the local missing context is intuitively expected to work well in the case of narratives with overlapping contexts. Hence, we hypothesize that narratives whose steps contain words or phrases similar to those of the surrounding steps benefit from the infilling technique that interpolates between steps. A 'how-to' style of narrative explaining a procedure is more in-domain in this respect compared to stories, and hence we hypothesize that our infilling based encoding approaches perform relatively better on procedures. We then use the encoded representation to decode each step of the procedure or story. The decoding strategy, which is the same in all three of the aforementioned models, is explained next; a simplified sketch of this step-wise decoding loop follows this list.
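The shared decoding strategy described in the table notes can be pictured as a simple loop: encode all (possibly infilled) step representations into a global context, then decode each step's text conditioned on that context. The encode_steps and decode_step callables below are hypothetical placeholders for the actual encoder and decoder, shown only to make the control flow explicit.

    def generate_narrative(step_features, encode_steps, decode_step):
        """Decode a textual description for every step of a procedure or story.

        step_features : per-step (possibly infilled) image representations
        encode_steps  : callable mapping the full sequence to a global context
        decode_step   : callable generating one step's text from that context
                        and the step index
        """
        global_context = encode_steps(step_features)
        return [decode_step(global_context, idx)
                for idx in range(len(step_features))]

    # Toy usage with trivial stand-ins for the encoder and decoder.
    texts = generate_narrative(
        step_features=[[0.1], [0.2], [0.3]],
        encode_steps=lambda feats: sum(sum(f) for f in feats),
        decode_step=lambda ctx, idx: f"step {idx} (context={ctx:.1f})",
    )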
Related work
  • Multimodal Language: Language generation from the visual modality has seen a steep rise in interest with the introduction of several large scale tasks such as image captioning (Hossain et al., 2019), visual question answering (Antol et al., 2015) and visual dialog (Das et al., 2017; Mostafazadeh et al., 2017; De Vries et al., 2017). While the task of generating a sentence from a single image, i.e., image captioning, has been well studied in the literature, generating a long form sequence of sentences from a sequence of images has only recently been catching attention. Hence, the natural next step is towards long form sequential generation in the form of stories, procedures, etc., i.e., visual narrative telling.

    Visual Storytelling: Huang et al. (2016) ventured into sequential step-wise generation of stories by introducing visual storytelling (ViST). Recent methods have tackled ViST using adversarial learning, reinforcement learning (Wang et al., 2018; Huang et al., 2019; Hu et al., 2020), modality fusion (Smilevski et al., 2018), traditional seq2seq models (Kim et al., 2018; Jung et al., 2020; Hsu et al., 2018) and explicit structures (Bosselut et al., 2016; Bisk et al., 2019). Chandu et al. (2019) also proposed a dataset of 16k recipes in a similar form. While these are all cooking recipes, the ViPT dataset comprises a mixture of ten different domains. Also, our dataset is about 2.8 times larger than the storyboarding dataset, with almost double the number of procedures in the domain of cooking recipes alone. Though the stories in ViST demonstrate a sense of continuity, the overarching sequential context is feeble. Procedures such as cooking recipes (Salvador et al., 2019; Wang et al., 2019), on the other hand, demonstrate this characteristic inviolably. This ensures a coherent underlying context and structure in the narrative. Hence, we present a large scale ViPT dataset to encourage research in this direction.

References
  • Prithviraj Ammanabrolu, William Broniec, Alex Mueller, Jeremy Paul, and Mark O. Riedl. 2019. Toward automated quest generation in text-adventure games. arXiv preprint arXiv:1909.06283.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
  • Yonatan Bisk, Jan Buys, Karl Pichotta, and Yejin Choi. 2019. Benchmarking hierarchical script knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4077–4085. Association for Computational Linguistics.
  • Antoine Bosselut, Jianfu Chen, David Warren, Hannaneh Hajishirzi, and Yejin Choi. 2016. Learning prototypical event structure from photo albums. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Khyathi Chandu, Eric Nyberg, and Alan W. Black. 2019. Storyboarding of recipes: Grounded contextual generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6040–6046.
  • Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260.
  • Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2.
  • Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR, volume 1, page 3.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. arXiv preprint arXiv:2005.05339.
  • Ruo-Ping Dong, Khyathi Raghavi Chandu, and Alan W. Black. 2019. Induction and reference of entities in a visual story. arXiv preprint arXiv:1909.09699.
  • John J. Dudley, Keith Vertanen, and Per Ola Kristensson. 2018. Fast and precise touch-based text entry for head-mounted augmented reality with variable occlusion. ACM Transactions on Computer-Human Interaction (TOCHI), 25(6):30.
  • Angela Fan, Mike Lewis, and Yann N. Dauphin. 2019. Strategies for structuring story generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 2650–2660. Association for Computational Linguistics.
  • William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In International Conference on Learning Representations.
  • Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Spandana Gella, Mike Lewis, and Marcus Rohrbach. 2018. A dataset for telling the stories of social media videos. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 968–974.
  • Aleksandra Hollingshead. 2018. Designing engaging online environments: Universal design for learning principles. In Cultivating Diverse Online Classrooms Through Effective Instructional Design, pages 280–298. IGI Global.
  • MD Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):118.
  • Chao-Chun Hsu, Szu-Min Chen, Ming-Hsun Hsieh, and Lun-Wei Ku. 2018. Using inter-sentence diverse beam search to reduce redundancy in visual storytelling. arXiv preprint arXiv:1805.11867.
  • Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. 2020. What makes a good story? Designing composite rewards for visual storytelling. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7969–7976. AAAI Press.
  • Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. 2019. Hierarchically structured reinforcement learning for topically coherent visual story generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8465–8472.
  • Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239.
  • Daphne Ippolito, David Grangier, Chris Callison-Burch, and Douglas Eck. 2019. Unsupervised hierarchical story infilling. In Proceedings of the First Workshop on Narrative Understanding, pages 37–43.
  • Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyungsu Kim, Sungjin Kim, and In So Kweon. 2020. Hide-and-tell: Learning to bridge photo streams for visual storytelling. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 11213–11220. AAAI Press.
  • Taehyeong Kim, Min-Oh Heo, Seonil Son, Kyung-Wha Park, and Byoung-Tak Zhang. 2018. GLAC Net: GLocal attention cascading networks for multi-image cued story generation. arXiv preprint arXiv:1805.10973.
  • Kiyoshi Kurihara, Atsushi Imai, Nobumasa Seiyama, Toshihiro Shimizu, Shoei Sato, Ichiro Yamada, Tadashi Kumano, Reiko Tako, Taro Miyazaki, Manon Ichiki, et al. 2019. Automatic generation of audio descriptions for sports programs. SMPTE Motion Imaging Journal, 128(1):41–47.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Xiaoxiao Liu, Qingyang Xu, and Ning Wang. 2019. A survey on deep neural network-based image captioning. The Visual Computer, 35(3):445–470.
  • Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios P. Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. arXiv preprint arXiv:1701.08251.
  • Amaia Salvador, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. 2019. Inverse cooking: Recipe generation from food images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 10453–10462. Computer Vision Foundation / IEEE.
  • Marko Smilevski, Ilija Lalkovski, and Gjorgi Madzarov. 2018. Stories for images-in-sequence by using visual and narrative components. arXiv preprint arXiv:1805.05622.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433.
  • Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-Peng Lim, and Steven C. H. Hoi. 2019. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 11572–11581. Computer Vision Foundation / IEEE.
  • Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 899–909.
  • Sam Wiseman, Stuart Shieber, and Alexander Rush. 2018. Learning neural templates for text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187.
  • Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun. 2018. A skeleton-based model for promoting coherence among sentences in narrative story generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4306–4315.
  • Pengcheng Yang, Fuli Luo, Peng Chen, Lei Li, Zhiyi Yin, Xiaodong He, and Xu Sun. 2019. Knowledgeable storyteller: A commonsense-driven generative model for visual storytelling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5356–5362. AAAI Press.
  • Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.
Author
Khyathi Raghavi Chandu
Ruo-Ping Dong