Procedure Planning in Instructional Videos

European Conference on Computer Vision, pp. 334-350, 2019.

Abstract:

We propose a new challenging task: procedure planning in instructional videos. Unlike existing planning problems, where both the state and the action spaces are well-defined, the key challenge of planning in instructional videos is that both the state and the action spaces are open-vocabulary. We address this challenge with latent space…

Introduction
  • Humans possess a natural ability to plan and reason in everyday tasks. They can picture what effects their actions would have and plan multiple steps ahead to achieve an intended goal.
  • One can imagine an indefinitely growing semantic state space, which prevents the application of classical symbolic planning approaches [6] that require a given set of predicates for a well-defined state space
  • This challenge is amplified by the fact that the effects of all possible actions cannot be assumed to be known.
  • How does the agent know that pouring the eggs into the pan will lead to cooked eggs? Without this knowledge, it is impossible to plan in this space
Highlights
  • Humans possess a natural ability to plan and reason in everyday tasks
  • As the action space in instructional videos is not continuous, the gradient-based planner does not work well. This makes Universal Planning Networks (UPN) perform similarly to Ours w/o T, which resembles a recurrent neural network (RNN) goal-conditioned policy trained with imitation objectives
  • Our full model combines the strengths of the planning and action imitation objectives as conjugate constraints, which enables us to learn plannable representations from real-world videos and outperform all the baseline approaches on all metrics
  • While UPN aims to directly leverage the goal in its algorithm, the non-differentiable action space prevents gradient-based planning from succeeding
  • We presented a framework for procedure planning in real-world instructional videos
  • We address the challenge of open-vocabulary state and action spaces by learning plannable representations with conjugate constraints on the latent space
Methods
  • The key challenge is that both state and action spaces are open-vocabulary.
  • The authors first define the procedure planning problem setup and how to address it with a latent space planning approach.
  • They then discuss how the latent space is learned and how the conjugate relationships between states and actions are leveraged to avoid trivial solutions to the optimization.
  • Finally, they present the algorithms for procedure planning and walkthrough planning [14] in the learned plannable space; a sketch of the latent-space formulation is given below
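  The equations themselves do not appear on this summary page. The following is a minimal LaTeX sketch of how such a latent-space procedure-planning setup is commonly written; only the forward model T (cf. "Ours w/o T" above) is named in the text, while the state encoder f, action embedding g, and conjugate model P are illustrative notation rather than the paper's.

    % Procedure planning (sketch): given a start observation o_t and a goal observation o_g,
    % output a plan of T actions whose predicted latent rollout reaches the goal state.
    \pi = \{a_1, \dots, a_T\}
      \quad \text{s.t.} \quad
      x_1 = f(o_t), \qquad
      x_{\tau+1} = \mathcal{T}\big(x_\tau, g(a_\tau)\big), \qquad
      x_{T+1} \approx f(o_g)

    % Conjugate relationship (sketch): the action should also be recoverable from the state
    % transition it induces, which rules out trivial, collapsed latent representations.
    \hat{a}_\tau = \mathcal{P}(x_\tau, x_{\tau+1}), \qquad \hat{a}_\tau \approx a_\tau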
Results
  • As the action space in instructional videos is not continuous, the gradient-based planner does not work well
  • This makes UPN perform similarly to Ours w/o T, which resembles an RNN goal-conditioned policy trained with imitation objectives.
  • The authors' full model combines the strengths of the planning and action imitation objectives as conjugate constraints, which enables them to learn plannable representations from real-world videos and outperform all the baseline approaches on all metrics.
  • Neither of the baseline models is able to understand that, to perform the rest of the steps, the person needs to get the tools first; a search-based planner sketch follows this list
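  Because the actions here are categorical labels (e.g. "pour egg") rather than continuous controls, a gradient-descent planner such as UPN cannot back-propagate into them; searching over the action vocabulary in the learned latent space is the natural alternative. Below is a hedged Python sketch of such a planner; encoder, transition, and action_vocab are hypothetical stand-ins for learned components and are not the paper's released code.

    import torch

    def plan_discrete(encoder, transition, o_start, o_goal, action_vocab,
                      horizon=3, beam_size=5):
        """Beam search over a discrete action vocabulary in a learned latent space.

        Instead of taking gradient steps on the actions, enumerate candidates and keep
        the beams whose predicted latent state lands closest to the encoded goal.
        """
        with torch.no_grad():
            x_goal = encoder(o_goal)              # latent state of the goal observation
            beams = [(encoder(o_start), [])]      # (current latent state, actions so far)
            for _ in range(horizon):
                candidates = []
                for x, actions in beams:
                    for a in action_vocab:
                        x_next = transition(x, a)           # predicted next latent state
                        dist = torch.norm(x_next - x_goal)  # distance to the goal embedding
                        candidates.append((dist.item(), x_next, actions + [a]))
                candidates.sort(key=lambda c: c[0])         # keep the closest beams
                beams = [(x, acts) for _, x, acts in candidates[:beam_size]]
        return beams[0][1]                                  # action sequence of the best beam

  For the qualitative example described under "Study subjects and analysis" (planning 4 actions between observations o3 and o7), this sketch would be called with horizon=4.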
Conclusion
  • The authors address the challenge of open-vocabulary state and action spaces by learning plannable representations with conjugate constraints on the latent space.
  • The authors' experimental results show that the framework learns plannable high-level semantic representations and significantly outperforms the baselines across different metrics on two challenging tasks: procedure planning and walkthrough planning.
  • The authors intend to incorporate object-oriented models to further explore relations between objects and predicates in complex visual dynamics data
Objectives
  • The authors aim to answer the following questions in the experiments: (i) Can plannable representations be learned from real-world instructional videos, and how does the approach compare to existing latent space planning methods? (ii) How important are the conjugate constraints? (iii) Can intermediate visual subgoals be retrieved with the same model? The first two questions are answered with ablation studies on challenging real-world instructional videos.
Tables
  • Table 1: Results for procedure planning. Our full model significantly outperforms the baselines. With around a 10% improvement in accuracy, our full model improves the success rate by 8 times compared to Ours w/o T, which shows the importance of reasoning over the full sequence (the distinction between the two metrics is sketched below)
  • Table 2: Results for walkthrough planning. Our model trained for procedure planning can also address walkthrough planning. It significantly outperforms the baseline through explicit sequential reasoning about which actions need to be performed first, and is less distracted by visual appearance
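  Table 1 contrasts per-step accuracy with success rate: a plan counts as successful only if every predicted step is correct, so a modest accuracy gain can compound into a much larger success-rate gain. A minimal, self-contained Python illustration of the two metrics (the exact definitions here are assumptions inferred from the caption, not quoted from the paper):

    def step_accuracy(pred_plans, gt_plans):
        """Fraction of individual predicted actions that match the ground truth."""
        correct = sum(p == g for pred, gt in zip(pred_plans, gt_plans)
                      for p, g in zip(pred, gt))
        total = sum(len(gt) for gt in gt_plans)
        return correct / total

    def success_rate(pred_plans, gt_plans):
        """Fraction of plans in which every step matches the ground truth."""
        exact = sum(pred == gt for pred, gt in zip(pred_plans, gt_plans))
        return exact / len(gt_plans)

    # A modest per-step accuracy gain can translate into a large success-rate gain,
    # because a single wrong step anywhere invalidates the whole plan.
    pred = [["get pan", "pour egg", "stir"], ["get pan", "melt butter", "stir"]]
    gt   = [["get pan", "pour egg", "stir"], ["get pan", "pour egg", "stir"]]
    print(step_accuracy(pred, gt))   # 5/6 ≈ 0.83
    print(success_rate(pred, gt))    # 1/2 = 0.5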
Related work
  • Planning in Videos. In the planning literature, most studies rely on a prescribed set of state and action representations and symbols for the task [6, 13]. We are interested in works that perform planning in the visual space without well-defined state and action spaces. The complexity of instructional videos prevents the direct application of approaches that plan directly in the visual space [2, 4], so it is necessary to learn plannable representations. In addition to unsupervised and semi-supervised approaches [18, 23], Universal Planning Networks [19] use a gradient descent planner to learn representations for planning; however, they assume the action space to be differentiable. Alternatively, one can learn the latent dynamics by optimizing the data log-likelihood given the actions [8]. We use a similar formulation and further propose conjugate constraints to expedite the latent space learning; a sketch of such a combined objective is given after this paragraph. Without using explicit action supervision, Causal InfoGAN [14] extracts state representations by learning salient features that describe the causal structure of the data in simple domains. In contrast to [14], our model operates directly on high-dimensional video demonstrations and handles the semantics of actions with sequential learning.
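  The loss functions themselves are not quoted on this page. Under the same illustrative notation as the sketch in Methods, a joint objective that combines a latent forward-dynamics fit in the spirit of [8] with a conjugate action-imitation term might look as follows; the weighting lambda and the cross-entropy form are assumptions, not the paper's stated losses.

    % Illustrative joint objective: fit the latent forward dynamics while requiring that the
    % action be predictable (cross-entropy, CE) from the state transition it induces.
    \mathcal{L} =
        \big\| \mathcal{T}\big(f(o_t), g(a_t)\big) - f(o_{t+1}) \big\|_2^2
        + \lambda \,\mathrm{CE}\Big( \mathcal{P}\big(f(o_t), f(o_{t+1})\big),\; a_t \Big)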
Funding
  • This work was partially funded by Toyota Research Institute (TRI)
Study subjects and analysis
  • Figure 3 shows the qualitative results of our procedure planning. The first row shows the 9 snapshot observations in making French toast. In the second row, we pick o3 and o7 as the start and goal observations and ask the model to plan the 4 actions in between.

References
  • [1] Yazan Abu Farha, Alexander Richard, and Juergen Gall. When will you do what? Anticipating temporal occurrences of activities. In CVPR, 2018.
  • [2] Pulkit Agrawal, Ashvin V. Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NeurIPS, 2016.
  • [3] Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Joint discovery of object states and manipulation actions. In ICCV, 2017.
  • [4] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
  • [5] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In CVPR, 2016.
  • [6] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and Practice. Elsevier, 2004.
  • [7] Sergio Guadarrama, Erik Rodner, Kate Saenko, Ning Zhang, Ryan Farrell, Jeff Donahue, and Trevor Darrell. Open-vocabulary object retrieval. In Robotics: Science and Systems, volume 2, page 6, 2014.
  • [8] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.
  • [9] Bradley Hayes and Brian Scassellati. Autonomously constructing hierarchical task networks for planning and human-robot collaboration. In ICRA, 2016.
  • [10] De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. Finding "it": Weakly supervised reference-aware visual grounding in instructional videos. In CVPR, 2018.
  • [11] De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. In CVPR, 2019.
  • [12] Dinesh Jayaraman, Frederik Ebert, Alexei A. Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
  • [13] George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61:215–289, 2018.
  • [14] Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart J. Russell, and Pieter Abbeel. Learning plannable representations with causal InfoGAN. In NeurIPS, 2018.
  • [15] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
  • [16] Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
  • [17] Nicholas Rhinehart and Kris M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017.
  • [18] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
  • [19] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. In ICML, 2018.
  • [20] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019.
  • [21] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. arXiv preprint arXiv:1903.02874, 2019.
  • [22] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
  • [23] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NeurIPS, 2015.
  • [24] Kuo-Hao Zeng, William B. Shen, De-An Huang, Min Sun, and Juan Carlos Niebles. Visual forecasting by imitating dynamics in natural sequences. In ICCV, 2017.
  • [25] Hang Zhao, Xavier Puig, Bolei Zhou, Sanja Fidler, and Antonio Torralba. Open vocabulary scene parsing. In ICCV, 2017.
  • [26] Luowei Zhou, Chenliang Xu, and Jason J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
  • [27] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. arXiv preprint arXiv:1903.08225, 2019.