Learning to Segment Actions from Observation and Narration

ACL, pp. 2569-2588, 2020.


Abstract:

We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos.

Introduction
  • Finding boundaries in a continuous stream is a crucial process for human cognition (Martin and Tversky, 2003; Zacks and Swallow, 2007; Levine et al., 2019; Unal et al., 2019).
  • More than 70% of the frames in one of the YouTube instructional video datasets, CrossTask (Zhukov et al., 2019), consist of background regions, which do not correspond to any of the steps for the video’s task.
  • These datasets are interesting because they provide (1) narrative language that roughly corresponds to the activities demonstrated in the videos and (2) structured task scripts that define a strong signal of the order in which steps in a task are typically performed.
  • How much do unsupervised models improve when given implicit supervision from task structure and language, and which types of supervision help most? Are discriminative or generative models better suited for the task? Does explicit structure modeling improve the quality of segmentation? To answer these questions, the authors compare two existing models with a generative hidden semi-Markov model (HSMM), varying the degree of supervision (see the sketch below).
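The HSMM treats a video as a sequence of labeled segments (steps or background) with explicit durations, rather than labeling frames independently. Below is a minimal, hypothetical sketch of segmental (semi-Markov) Viterbi decoding of this kind; it is not the authors' implementation, and the score, transition, and duration arrays are assumed inputs.

```python
import numpy as np

def semi_markov_viterbi(scores, trans, dur_logprob, max_dur):
    """Sketch of segmental (semi-Markov) Viterbi decoding.

    scores:      (T, K) per-timestep log-potentials for K step/background labels
    trans:       (K, K) log transition scores between consecutive segment labels
    dur_logprob: (K, max_dur) log-probability that a segment with label k lasts d steps
    Returns the best segmentation as (start, end, label) triples.
    """
    T, K = scores.shape
    # prefix sums let a segment's emission score be a difference of cumulative scores
    prefix = np.vstack([np.zeros(K), np.cumsum(scores, axis=0)])  # (T + 1, K)

    alpha = np.full((T + 1, K), -np.inf)  # best score of a segmentation ending at t in label k
    alpha[0] = 0.0                        # uniform start (no initial distribution in this sketch)
    back = {}                             # (t, k) -> (segment start, previous label)

    for t in range(1, T + 1):
        for k in range(K):
            for d in range(1, min(max_dur, t) + 1):
                s = t - d
                emit = prefix[t, k] - prefix[s, k] + dur_logprob[k, d - 1]
                if s == 0:
                    cand, prev = emit, None
                else:
                    prev = int(np.argmax(alpha[s] + trans[:, k]))
                    cand = alpha[s, prev] + trans[prev, k] + emit
                if cand > alpha[t, k]:
                    alpha[t, k], back[(t, k)] = cand, (s, prev)

    # trace back the highest-scoring segmentation
    segments, t, k = [], T, int(np.argmax(alpha[T]))
    while t > 0:
        s, prev = back[(t, k)]
        segments.append((s, t, k))
        t, k = s, (prev if prev is not None else k)
    return segments[::-1]
```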
Highlights
  • Finding boundaries in a continuous stream is a crucial process for human cognition (Martin and Tversky, 2003; Zacks and Swallow, 2007; Levine et al., 2019; Unal et al., 2019)
  • On a challenging and naturalistic dataset of instructional videos (Zhukov et al., 2019), we find that our model and models from past work both benefit substantially from the weak supervision provided by task structure and narrative language, even on top of rich features from state-of-the-art pretrained action and object classifiers
  • Our analysis shows that: (1) generative models tend to do better than discriminative models of the same or similar model class at learning the full range of step types, which benefits action segmentation; (2) task structure affords strong, feature-agnostic baselines that are difficult for existing systems to surpass; (3) reporting multiple metrics is necessary to understand each model's effectiveness for action segmentation, since we can devise feature-agnostic baselines that perform well on single metrics despite producing low-quality action segments
  • We find that unsupervised action segmentation in naturalistic instructional videos is greatly aided by the inductive bias given by typical step orderings within a task, and narrative language describing the actions being done
  • Our results illustrate the importance of strong baselines: without weak supervision from step orderings and narrative language, even state-of-the-art unsupervised action segmentation models operating on rich video features underperform feature-agnostic baselines
  • While action segmentation in videos from diverse domains remains challenging (videos contain both a large variety of depicted action types and high visual variety in how the actions are portrayed), we find that structured generative models provide a strong benchmark for the task due to their ability to capture the full diversity of action types and to benefit from weak supervision
Results
  • The authors first define several baselines based on dataset statistics (Sec. 7.1), which they find to be strong in comparison to past work.
  • Under ordering + narration supervision, the compared systems are U6 (ORDEREDDISCRIM) and U7 (HSMM + Narr + Ord.).
  • Table 2 shows baselines that do not use video features, but predict steps according to overall statistics of the training data.
  • These demonstrate characteristics of the data, and the importance of using multiple metrics.
  • Predict background (B1): Since most timesteps are background, a model that predicts background everywhere can obtain high overall label accuracy, showing the importance of using step label accuracy as a metric for action segmentation.
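A toy, hypothetical illustration of this point (not the paper's evaluation code): with roughly 70% background frames, the predict-background baseline looks strong under all-label accuracy while recovering no steps at all.

```python
import numpy as np

def all_label_accuracy(pred, gold):
    """Accuracy over every timestep, background included."""
    return float(np.mean(pred == gold))

def step_label_accuracy(pred, gold, background=0):
    """Accuracy restricted to timesteps whose reference label is a step."""
    mask = gold != background
    return float(np.mean(pred[mask] == gold[mask]))

# toy video: label 0 is background (~70% of frames), labels 1 and 2 are steps
gold = np.array([0, 0, 0, 1, 1, 0, 0, 2, 0, 0])
pred = np.zeros_like(gold)              # B1: always predict background

print(all_label_accuracy(pred, gold))   # 0.7 -- looks strong
print(step_label_accuracy(pred, gold))  # 0.0 -- no steps recovered
```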
Conclusion
  • The authors find that unsupervised action segmentation in naturalistic instructional videos is greatly aided by the inductive bias given by typical step orderings within a task, and narrative language describing the actions being done.
  • The authors' results illustrate the importance of strong baselines: without weak supervision from step orderings and narrative language, even state-of-the-art unsupervised action segmentation models operating on rich video features underperform feature-agnostic baselines.
  • Future work might explore methods for incorporating richer learned representations, both of the diverse visual observations in videos and of their narrative descriptions, into such models.
Objectives
  • While previous work mostly focuses on building action segmentation models that perform well on a few metrics (Richard et al., 2018; Zhukov et al., 2019), the authors aim to provide insight into how various modeling choices impact action segmentation.
Tables
  • Table 1: Characteristics of each model we compare
  • Table 2: Model comparison on the CrossTask validation data. We evaluate primarily using all label accuracy and step label accuracy for action segmentation, and step recall for step recognition
  • Table 3: Unsupervised and weakly supervised results in the cross-validation setting
  • Table 4: Comparison of the semi-Markov and hidden semi-Markov models (SMM and HSMM) with the Markov and hidden Markov models (MM and HMM), which ablate the semi-Markov models' duration model
  • Table 5: Performance of key supervised and weakly-supervised models on the validation data when adding narration vectors as features. Numbers in parentheses give the change from adding narration vectors to the systems from Table 2
Funding
  • DF is supported by a Google PhD Fellowship
Reference
  • Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
  • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon LacosteJulien. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Dare A Baldwin, Jodie A Baird, Megan M Saylor, and M Angela Clark. 2001. Infants parse dynamic action. Child Development, 72(3):708–717.
  • Piotr Bojanowski, Remi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. 2014. Weakly supervised action labeling in videos under ordering constraints. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
  • Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. 2019. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li Ding and Chenliang Xu. 2018. Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Jason Eisner. 2016. Inside-outside and forwardbackward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP.
  • Ehsan Elhamifar and Zwe Naing. 2019. Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Yazan Abu Farha and Jurgen Gall. 2019. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2016. Connectionist temporal modeling for weakly supervised action labeling. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  • Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hilde Kuehne, Alexander Richard, and Juergen Gall. 2017. Weakly supervised learning of actions from transcripts. Computer Vision and Image Understanding (CVIU).
  • Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.
  • Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. 2019. Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Dani Levine, Daphna Buchsbaum, Kathy Hirsh-Pasek, and Roberta M Golinkoff. 2019. Finding events in a continuous world: A developmental account. Developmental Psychobiology, 61(3):376–389.
  • Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
  • Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605.
  • Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What’s cookin’? interpreting cooking videos using text, speech and vision. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Bridgette A Martin and Barbara Tversky. 2003. Segmenting ambiguous events. In Proceedings of the Annual Meeting of the Cognitive Science Society.
  • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).
  • Kevin Murphy. 2002. Hidden semi-Markov models. Unpublished tutorial.
  • Janne Pylkkonen and Mikko Kurimo. 2004. Duration modeling techniques for continuous speech recognition. In Eighth International Conference on Spoken Language Processing.
  • Alexander Richard, Hilde Kuehne, and Juergen Gall. 2017. Weakly supervised action learning with RNN based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Alexander Richard, Hilde Kuehne, and Juergen Gall. 2018. Action sets: Weakly supervised action segmentation without ordering constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252.
  • Roger C Schank and Robert P Abelson. 1977. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum.
  • Fadime Sener and Angela Yao. 2018. Unsupervised learning and segmentation of complex activities from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • O. Sener, A. Zamir, S. Savarese, and A. Saxena. 2015. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Tanya Sharon and Karen Wynn. 1998. Individuation of actions from continuous motion. Psychological Science, 9(5):357–362.
  • Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR).
  • Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bidirectional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.
  • Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ercenur Unal, Yue Ji, and Anna Papafragou. 2019. From event representation to linguistic meaning. Topics in Cognitive Science.
  • Huijuan Xu, Abir Das, and Kate Saenko. 2017. RC3D: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV), 126(2-4):375–389.
  • Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. 2014. Instructional videos for unsupervised harvesting and learning of action examples. In Proceedings of the ACM International Conference on Multimedia (MM).
  • Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.
  • Jeffrey M Zacks and Khena M Swallow. 2007. Event segmentation. Current Directions in Psychological Science, 16(2):80–84.
  • Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Conference on Artificial Intelligence (AAAI).
  • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • For both training conditions of our semi-Markov models that require gradient descent (generative unsupervised and discriminative supervised), we initialize parameters randomly and use Adam (Kingma and Ba, 2015) with an initial learning rate of 5e-3 and a batch size of 5 videos, decaying the learning rate when training log-likelihood does not decrease for more than one epoch.
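A minimal sketch of this optimization setup, assuming a PyTorch model that exposes a log_likelihood method (a hypothetical interface, not the authors' code); it reduces the learning rate when the epoch-level training loss (negative log-likelihood) stops decreasing, and the decay factor is an assumption.

```python
import torch

def train(model, video_batches, num_epochs=50):
    """Sketch of the training loop described above; `model` and `video_batches` are placeholders."""
    opt = torch.optim.Adam(model.parameters(), lr=5e-3)
    # halve the learning rate if the training objective fails to decrease
    # for more than one epoch (patience=1); the factor of 0.5 is an assumption
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=1)
    for epoch in range(num_epochs):
        total_nll = 0.0
        for batch in video_batches:               # each batch holds 5 videos
            opt.zero_grad()
            nll = -model.log_likelihood(batch)    # hypothetical model interface
            nll.backward()
            opt.step()
            total_nll += nll.item()
        sched.step(total_nll)                     # epoch-level training objective
```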
  • For our features x1:T, we use the same base features as Zhukov et al. (2019). There are three feature types: activity recognition features, produced by an I3D model (Carreira and Zisserman, 2017) trained on the Kinetics-400 dataset (Kay et al., 2017); object classification features, from a ResNet-152 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015); and audio classification features from the VGG model (Simonyan and Zisserman, 2015) trained by Hershey et al. (2017) on a preliminary version of the YouTube-8M dataset (Abu-El-Haija et al., 2016).
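For illustration only: the per-timestep feature vector x_t concatenates the three pretrained feature streams. The dimensionalities below are placeholders (common output sizes for these architectures), not necessarily the ones used in this pipeline.

```python
import numpy as np

def build_features(i3d_feats, resnet_feats, audio_feats):
    """Concatenate the three feature streams along the feature dimension.

    Each input is a (T, d) array aligned to the same T video timesteps.
    """
    return np.concatenate([i3d_feats, resnet_feats, audio_feats], axis=1)

T = 200  # toy number of timesteps
x = build_features(np.random.randn(T, 1024),   # I3D activity-recognition features
                   np.random.randn(T, 2048),   # ResNet-152 object-classification features
                   np.random.randn(T, 128))    # VGG-style audio features
assert x.shape == (T, 1024 + 2048 + 128)
```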
  • In the unsupervised setting, the model's induced states have no a priori correspondence to the actual region labels (which can be step types, or background) for our task. Just as with other unsupervised tasks and models (e.g., part-of-speech induction), we need a mapping from these classes to step types (and background) in order to evaluate the model's predictions. We follow the evaluation procedure of past work (Sener and Yao, 2018; Sener et al., 2015) by finding the mapping from model states to region labels that maximizes label accuracy, averaged across all videos in the task, using the Hungarian method (Kuhn, 1955). This evaluation condition is only used in the “Unsupervised” section of Table 2 (in the rows marked with optimal accuracy assignment).
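A sketch of this optimal-assignment evaluation, pooled over timesteps for brevity (the paper maximizes accuracy averaged across a task's videos); it relies on SciPy's Hungarian-algorithm solver, and all variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_state_to_label_mapping(pred_states, gold_labels, num_states, num_labels):
    """Map induced model states to region labels so as to maximize matched timesteps."""
    # counts[s, l] = number of timesteps where state s co-occurs with gold label l
    counts = np.zeros((num_states, num_labels), dtype=np.int64)
    for s, l in zip(pred_states, gold_labels):
        counts[s, l] += 1
    # maximizing matched counts == minimizing negated counts (Hungarian method)
    rows, cols = linear_sum_assignment(-counts)
    return dict(zip(rows.tolist(), cols.tolist()))

pred_states = np.array([0, 0, 1, 1, 2, 2, 2])   # induced states
gold_labels = np.array([3, 3, 0, 0, 1, 1, 0])   # reference region labels
mapping = best_state_to_label_mapping(pred_states, gold_labels, num_states=3, num_labels=4)
mapped = np.array([mapping[s] for s in pred_states])
accuracy = float(np.mean(mapped == gold_labels))  # label accuracy under the best mapping
```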
  • Label accuracy: The standard metric for action segmentation (Sener and Yao, 2018; Richard et al., 2018) is timestep label accuracy or, in datasets with a large amount of background, label accuracy on non-background timesteps. The CrossTask dataset has multiple reference step labels in the ground truth for around 1% of timesteps, due to noisy region annotations that overlap slightly. We obtain a single reference label for these timesteps by taking the step that appears first in the canonical step ordering for the task. We then compute accuracy of the model predictions against these reference labels across all timesteps and all videos for a task (in the all label accuracy condition), or by filtering to those timesteps which have a step label (non-background) in the reference, to focus on the model's ability to accurately predict step labels (in the step label accuracy condition).
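A small sketch of the tie-breaking rule for timesteps that carry more than one reference label, using a hypothetical canonical step ordering; it is illustrative, not the paper's code.

```python
def resolve_reference(labels_at_t, canonical_order):
    """Pick a single reference label for a timestep with overlapping annotations.

    labels_at_t: collection of reference labels at one timestep (may include "background")
    canonical_order: the task's canonical step ordering
    """
    steps = [label for label in labels_at_t if label != "background"]
    if not steps:
        return "background"
    # keep the step that appears first in the canonical ordering for the task
    return min(steps, key=canonical_order.index)

canonical_order = ["pour water", "add coffee grounds", "stir"]
print(resolve_reference({"stir", "add coffee grounds"}, canonical_order))  # "add coffee grounds"
```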