A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

ACL 2020, pp. 4871–4884.


Abstract:

Many high-level procedural tasks can be decomposed into sequences of instructions that vary in their order and choice of tools. In the cooking domain, the web offers many partially-overlapping text and video recipes (i.e. procedures) that describe how to make the same dish (i.e. high-level task). Aligning instructions for the same dish …
Introduction
  • While machine learning has seen tremendous recent success in challenging game environments such as Go (Schrittwieser et al., 2019), DOTA (OpenAI, 2019), and StarCraft (DeepMind, 2019), the authors have not seen similar progress toward algorithms that might one day help humans perform everyday tasks like assembling furniture, applying makeup, or cooking. Consider, for example, the following recipe:
  • 1. In a pot, add 1 cup of rice and 2 cups of water; cook for 15 minutes.
  • 2. Heat cooking fat in a large skillet on medium heat.
  • 3. Add onion, garlic, peas and carrots.
  • 4. Crack an egg and scramble it in the same pan, mixing it throughout the vegetables.
Highlights
  • While machine learning has seen tremendous recent success in challenging game environments such as Go (Schrittwieser et al., 2019), DOTA (OpenAI, 2019), and StarCraft (DeepMind, 2019), we have not seen similar progress toward algorithms that might one day help humans perform everyday tasks like assembling furniture, applying makeup, or cooking
  • Our goal is to find joint alignments between multiple text recipes and multiple video recipes for the same dish
  • We describe our graph algorithm, which derives joint alignments between multiple text and video recipes given the pairwise alignments (a schematic sketch of the merging idea appears after this list)
  • For text-video alignments, Table 2 shows the results of our pairwise alignment algorithm compared with baselines on 1,625 human-aligned text-video recipe pairs from YouCook2
  • We introduce a novel two-stage unsupervised algorithm for aligning multiple text and multiple video recipes
  • We release a large-scale dataset constructed using this algorithm consisting of joint alignments between multiple text and video recipes along with useful commonsense information such as textual and visual paraphrases; and single-step to multi-step breakdown
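The two-stage algorithm first computes pairwise alignments and then merges them into joint alignments with a graph algorithm. As a rough illustration of the merging idea (not the paper's exact procedure), one can treat each (recipe, step) pair as a node, each pairwise alignment as an edge, and take connected components as joint clusters:

```python
# Sketch: merging pairwise alignments into joint clusters via connected
# components. This illustrates the general idea only; the paper's graph
# algorithm is described in the full text.
from collections import defaultdict

def joint_clusters(pairwise_alignments):
    """pairwise_alignments: iterable of ((recipe_id, step_idx),
    (recipe_id, step_idx)) pairs from the pairwise stage. Returns a list
    of sets of transitively aligned (recipe_id, step_idx) nodes."""
    graph = defaultdict(set)
    for a, b in pairwise_alignments:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for node in list(graph):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # iterative DFS over the alignment graph
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Two text recipes (t1, t2) and one video recipe (v1) of the same dish:
pairs = [(("t1", 0), ("t2", 0)), (("t2", 0), ("v1", 1))]
print(joint_clusters(pairs))  # [{('t1', 0), ('t2', 0), ('v1', 1)}]
```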
Methods
  • The pairwise alignment methods compared are: Random; Uniform alignment; BM25 retrieval; textual similarity (exact word match, TF-IDF, GloVe, BERT, and RoBERTa); and HMM+IBM1, ablated over the words used for alignment (nouns, nouns+verbs, all words).
  • BM25 retrieval: The authors use BM25 (Robertson et al., 2009) as the information retrieval baseline (see the first sketch after this list).
  • Given a source and a target recipe pair, the authors construct a corpus using all instructions in the target recipe.
  • The authors use each source instruction as a query to retrieve the top-ranked instruction from the target instruction corpus and align the source instruction to the retrieved target instruction.
  • Textual similarity: Given a source recipe instruction and a target recipe instruction, the authors define a measure of textual similarity between the two instructions using the five methods listed above.
  • For the embedding-based methods, the similarity of two instructions is defined as the cosine similarity of their embedding vectors (see the second sketch after this list)
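As a concrete illustration of the BM25 baseline, the sketch below indexes a target recipe's instructions as a small corpus and aligns each source instruction to its top-ranked hit. It uses the third-party rank_bm25 package as a convenience; the paper does not name its BM25 implementation.

```python
# Sketch of the BM25 retrieval baseline: align each source instruction to
# the highest-scoring target instruction (pip install rank-bm25).
from rank_bm25 import BM25Okapi

source = ["add rice and water to a pot and cook",
          "scramble an egg in the same pan"]
target = ["cook one cup of rice in two cups of water",
          "heat oil in a skillet",
          "crack an egg into the pan and scramble"]

bm25 = BM25Okapi([t.split() for t in target])  # target instructions as corpus
alignment = []
for i, src in enumerate(source):
    scores = bm25.get_scores(src.split())  # BM25 score against each target step
    j = max(range(len(target)), key=lambda k: scores[k])
    alignment.append((i, j))
print(alignment)  # e.g. [(0, 0), (1, 2)]
```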
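Similarly, a minimal sketch of the embedding-based similarity: each instruction is mapped to a vector and two instructions are scored by the cosine of their vectors. The random table below is a stand-in for pretrained word vectors; for BERT/RoBERTa one would instead use a single sentence embedding, e.g. via Sentence-BERT (Reimers and Gurevych, 2019).

```python
# Sketch: instruction similarity as the cosine of averaged word vectors.
import numpy as np

def embed(instruction, embeddings, dim=300):
    vecs = [embeddings[w] for w in instruction.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy table; in practice, load pretrained vectors (e.g. GloVe) instead.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(300) for w in
              "add rice water pot cook egg pan scramble".split()}
print(round(cosine(embed("add rice to a pot", embeddings),
                   embed("cook rice in water", embeddings)), 3))
```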
Results
  • The authors describe how they evaluate the pairwise alignment algorithm and the research questions their experiments are designed to answer.
  • Human-aligned evaluation set: The authors evaluate the pairwise alignment algorithm on two human-annotated datasets. The first is YouCook2 text-video recipe pairs: the YouCook2 dataset (Zhou et al., 2018a) consists of 1,625 cooking videos paired with human-written descriptions for each video segment.
  • These span 90 different dishes.
  • Under ablations of the HMM+IBM1 model, using all words to learn alignments works best (a schematic decoding sketch follows)
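To make the HMM+IBM1 model concrete: in HMM-based alignment (Vogel et al., 1996), hidden states correspond to target-recipe instructions and transitions penalize large jumps, while emissions can score a source instruction with IBM Model 1 lexical translation probabilities (Brown et al., 1993). The Viterbi decoder below is a schematic sketch with toy placeholder parameters (the translation table `t` and the jump penalty), not the paper's trained model.

```python
# Schematic Viterbi decoding for an HMM whose hidden states are target
# instructions; emissions use IBM Model 1 lexical probabilities.
import math

def ibm1_log_emission(src_words, tgt_words, t, eps=1e-6):
    # IBM Model 1: P(src | tgt) = prod_f (1/|tgt|) * sum_e t(f | e)
    return sum(math.log(sum(t.get((f, e), eps) for e in tgt_words) / len(tgt_words))
               for f in src_words)

def viterbi_align(source, target, t, jump_penalty=0.5):
    n, m = len(source), len(target)
    emis = [[ibm1_log_emission(s.split(), g.split(), t) for g in target]
            for s in source]
    trans = lambda j, k: -jump_penalty * abs(k - j)  # prefer small jumps
    dp, back = [emis[0][:]], []
    for i in range(1, n):
        row, ptr = [], []
        for k in range(m):
            best_j = max(range(m), key=lambda j: dp[-1][j] + trans(j, k))
            row.append(dp[-1][best_j] + trans(best_j, k) + emis[i][k])
            ptr.append(best_j)
        dp.append(row)
        back.append(ptr)
    k = max(range(m), key=lambda j: dp[-1][j])
    path = [k]
    for ptr in reversed(back):  # backtrace the best state sequence
        k = ptr[k]
        path.append(k)
    return list(reversed(path))  # path[i] = target step aligned to source step i

t = {("rice", "rice"): 0.9, ("egg", "egg"): 0.9}  # toy translation table
print(viterbi_align(["cook rice", "scramble egg"], ["rice in pot", "egg in pan"], t))
# -> [0, 1]
```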
Conclusion
  • The authors introduce a novel two-stage unsupervised algorithm for aligning multiple text and multiple video recipes.
  • The authors release a large-scale dataset constructed using this algorithm consisting of joint alignments between multiple text and video recipes along with useful commonsense information such as textual and visual paraphrases; and single-step to multi-step breakdown.
  • The authors envision extending this work by including audio and video features to enhance the quality of the alignment algorithm.
  • The authors believe this work will further the goal of building agents that can work with human collaborators to carry out complex tasks in the real world.
Tables
  • Table 1: Statistics of our recipe pairs data (Section 2.3)
  • Table 2: Results for text-video recipe alignments on the YouCook2 dataset
  • Table 3: Results for text-text recipe alignment on the Common Crawl dataset
  • Table 4: Three examples of single-step to multi-step breakdown from the pairwise alignments
  • Table 5: Alignment between two text recipes of chocolate chip cookies with their sentence-level probabilities
Related work
  • Alignment Algorithms: Our unsupervised alignment algorithm is based on Naim et al. (2014), who propose a hierarchical alignment model using nouns and objects to align text instructions to videos. Song et al. (2016) further build on this work to make use of action codewords and verbs. Bojanowski et al. (2015) view the alignment task as a temporal assignment problem and solve it using an efficient conditional gradient algorithm. Malmaud et al. (2015) use an HMM-based method to align recipe instructions to cooking video transcriptions that follow the same order. Our work contrasts with these works in two ways: we learn alignments between instructions that do not necessarily follow the same order, and our algorithm is trained on a much larger-scale dataset.

    Multi-modal Instructional Datasets: Marin et al. (2019) introduce a corpus of 1 million cooking recipes paired with 13 million food images for the task of retrieving a recipe given an image. The YouCook2 dataset (Zhou et al., 2018a) consists of 2,000 recipe videos with human-written descriptions for each video segment. The How2 dataset (Sanabria et al., 2018) consists of 79,114 instructional videos with English subtitles and crowdsourced Portuguese translations. The COIN dataset (Tang et al., 2019) consists of 11,827 videos of 180 tasks in 12 daily life domains. YouMakeup (Wang et al., 2019) consists of 2,800 YouTube videos, annotated with natural language descriptions for instructional steps, grounded in temporal video range and spatial facial areas.
References
  • Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 16–23, Athens, Greece. Association for Computational Linguistics.
  • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised Learning from Narrated Instruction Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4575–4583.
  • Piotr Bojanowski, Rémi Lajugie, Édouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, and Cordelia Schmid. 2015. Weakly-Supervised Alignment of Video with Text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4462–4470.
  • Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
  • DeepMind. 2019. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii.
  • Francis Grégoire and Philippe Langlais. 2018. Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1442–1453, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Jack Hessel, Bo Pang, Zhenhai Zhu, and Radu Soricut. 2019. A case study on combining ASR and visual features for generating instructional video captions. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 419–429, Hong Kong, China. Association for Computational Linguistics.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, and David Konopnicki. 2019. TalkSumm: A dataset and scalable annotation method for scientific paper summarization based on conference talks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2125–2131, Florence, Italy. Association for Computational Linguistics.
  • Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. 2018. Jointly Localizing and Describing Events for Dense Video Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7492–7500.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
  • Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What's cookin'? Interpreting cooking videos using text, speech and vision. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 143–152, Denver, Colorado. Association for Computational Linguistics.
  • Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.
  • Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, and Daniel Gildea. 2014. Unsupervised Alignment of Natural Language Instructions with Video Segments. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
  • OpenAI. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  • Stephen Robertson, Hugo Zaragoza, et al. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
  • Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. 2017. Learning Cross-modal Embeddings for Cooking Recipes and Food Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. In NeurIPS.
  • Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. 2019. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. arXiv preprint arXiv:1911.08265.
  • Ozan Sener, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2015. Unsupervised Semantic Parsing of Video Collections. In Proceedings of the IEEE International Conference on Computer Vision, pages 4480–4488.
  • Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 403–411, Los Angeles, California. Association for Computational Linguistics.
  • Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea, and Henry A. Kautz. 2016. Unsupervised Alignment of Actions in Video with Text Descriptions. In IJCAI, pages 2025–2031.
  • Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1207–1216.
  • Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics.
  • Weiying Wang, Yongcheng Wang, Shizhe Chen, and Qin Jin. 2019. YouMakeup: A large-scale domain-specific multimodal dataset for fine-grained semantic comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5133–5143, Hong Kong, China. Association for Computational Linguistics.
  • Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018a. Towards Automatic Learning of Procedures from Web Instructional Videos. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 7590–7598.
  • Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018b. End-to-End Dense Video Captioning with Masked Transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
  • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gökberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3537–3545.
Appendix: Chat/Content Classifier
  • We train our classifier using the YouCook2 dataset (Zhou et al., 2018a) of 1,500 videos across 90 dishes. This dataset was created by asking humans to identify segments of a video that correspond to an instruction and to annotate each segment with an imperative statement describing the action being executed in the video segment. We make the assumption that transcript sentences that fall within an annotated video segment are instructional, whereas those that do not are non-instructional. We first transcribe all 1,500 videos in the dataset using a commercial transcription web service. We split the transcription into sentences using a sentence tokenizer. We label a transcript sentence with 1 if the corresponding video segment was annotated and with 0 if it was not (a schematic sketch of this labelling step follows). We get a total of 90,927 labelled transcript sentences, which we split by dish into training (73,728 examples), validation (7,767 examples) and test (9,432 examples) sets.
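A minimal sketch of this weak-labelling step, assuming the transcription service returns timestamped sentences (an assumption about its output format): a sentence is labelled 1 if its time span lies inside an annotated instructional segment, else 0.

```python
# Sketch of the weak-labelling step described above. Timestamped
# sentences are an assumption about the transcription output format.
def label_transcript(sentences, segments):
    """sentences: list of (text, start_sec, end_sec) transcript sentences.
    segments:  list of (start_sec, end_sec) human-annotated instruction spans.
    Returns a list of (text, label) pairs."""
    labelled = []
    for text, s, e in sentences:
        inside = any(seg_s <= s and e <= seg_e for seg_s, seg_e in segments)
        labelled.append((text, 1 if inside else 0))
    return labelled

sents = [("today we're making fried rice", 0.0, 3.0),
         ("add the rice to the pot", 12.0, 15.0)]
segs = [(10.0, 30.0)]  # annotated instructional segment
print(label_transcript(sents, segs))  # second sentence is labelled 1
```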
  • We use an LSTM (long short-term memory) model (Hochreiter and Schmidhuber, 1997) with attention (Luong et al., 2015) to train a binary classifier on this data. We initialize (and freeze) our 300-dimensional word embeddings using GloVe (Pennington et al., 2014) vectors trained on 330 million tokens that we obtain by combining all text recipes and transcript sentences. We use the validation set to tune the hyperparameters of our LSTM classifier (hidden size: 64, learning rate: 0.00001, batch size: 64, number of layers: 1). Our chat/content classifier achieves 86.76 precision, 84.26 recall and 85.01 F1 score on the held-out test set. A schematic model sketch follows.
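A minimal PyTorch sketch of a classifier matching the stated configuration (frozen 300-dimensional embeddings, single-layer LSTM with hidden size 64, attention pooling, binary output). The exact attention formulation is an assumption; the appendix cites Luong et al. (2015).

```python
# Sketch of the chat/content classifier: frozen pretrained embeddings,
# one-layer LSTM (hidden 64), attention pooling, binary logit. Train with
# BCEWithLogitsLoss, batch size 64, learning rate 1e-5 (per the appendix).
import torch
import torch.nn as nn

class ChatContentClassifier(nn.Module):
    def __init__(self, pretrained_vectors, hidden=64):
        super().__init__()
        # Frozen pretrained embeddings (e.g. the 300-d GloVe vectors above).
        self.emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.lstm = nn.LSTM(pretrained_vectors.size(1), hidden,
                            num_layers=1, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # scores each timestep
        self.out = nn.Linear(hidden, 1)   # binary chat/content logit

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)     # attention-pooled sentence vector
        return self.out(context).squeeze(-1)   # one logit per sentence

vectors = torch.randn(5000, 300)                   # stand-in for GloVe vectors
model = ChatContentClassifier(vectors)
logits = model(torch.randint(0, 5000, (64, 20)))   # batch of 64 sentences
print(logits.shape)  # torch.Size([64])
```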