Procedure Planning in Instructional Videos
European Conference on Computer Vision (ECCV), pp. 334-350, 2020.
Abstract:
We propose a new challenging task: procedure planning in instructional videos. Unlike existing planning problems, where both the state and the action spaces are well-defined, the key challenge of planning in instructional videos is that both the state and the action spaces are open-vocabulary. We address this challenge with latent space...
Introduction
- Humans possess a natural ability to plan and reason in everyday tasks: they can picture what effects their actions would have and plan multiple steps ahead to achieve an intended goal.
- One can imagine an indefinitely growing semantic state space, which prevents the application of classical symbolic planning approaches [6] that require a given set of predicates for a well-defined state space.
- This challenge is amplified by the fact that the authors do not assume knowledge of the effects of all possible actions.
- How does the agent know that pouring eggs into the pan will result in cooked eggs? Without this knowledge, it is impossible to plan in this space.
Highlights
- Humans possess a natural ability to plan and reason in everyday tasks
- Because the action space in instructional videos is not continuous, a gradient-based planner does not work well. This makes Universal Planning Networks (UPN) perform similarly to Ours w/o T, which is essentially a recurrent neural network (RNN) goal-conditioned policy trained with imitation objectives
- Our full model combines the strengths of the planning and action-imitation objectives as conjugate constraints, which enables us to learn plannable representations from real-world videos and to outperform all baseline approaches on all metrics
- While UPN aims to directly leverage the goal in its algorithm, the non-differentiable action space prevents gradient-based planning from succeeding
- We presented a framework for procedure planning in real-world instructional videos
- We address the challenge of open-vocabulary state and action spaces by learning plannable representations with conjugate constraints on the latent space
Methods
- The key challenge is that both state and action spaces are open-vocabulary.
- The authors first define the procedure planning problem setup and show how to address it with a latent space planning approach.
- They then discuss how to learn the latent space and how to leverage the conjugate relationships between states and actions to avoid trivial solutions to the optimization.
- Finally, they present the algorithms for procedure planning and walkthrough planning [14] in the learned plannable space (a minimal model sketch follows this list)
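This summary does not include the authors' code, so the following is a minimal PyTorch sketch of the kind of latent-space model the Methods bullets describe: an encoder mapping visual observations to latent states, a forward dynamics model over that space, and a conjugate (inverse) model that predicts the action connecting two states. The module names (StateEncoder, ForwardDynamics, ConjugateDynamics), dimensions, and equal loss weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of latent-space planning with conjugate constraints.
# Module names, dimensions, and the equal loss weighting are assumptions
# for illustration, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, LATENT_DIM, ACTION_VOCAB = 2048, 128, 100  # assumed sizes

class StateEncoder(nn.Module):
    """Maps a precomputed visual feature o_t to a latent state x_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.ReLU(),
                                 nn.Linear(512, LATENT_DIM))

    def forward(self, obs):
        return self.net(obs)

class ForwardDynamics(nn.Module):
    """Predicts the next latent state from (x_t, a_t)."""
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(ACTION_VOCAB, 64)
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + 64, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))

    def forward(self, x_t, a_t):
        return self.net(torch.cat([x_t, self.action_emb(a_t)], dim=-1))

class ConjugateDynamics(nn.Module):
    """Predicts the action that connects two consecutive latent states."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, ACTION_VOCAB))

    def forward(self, x_t, x_next):
        return self.net(torch.cat([x_t, x_next], dim=-1))

def training_losses(enc, fwd, conj, obs_t, obs_next, action):
    """Couple forward prediction with action imitation so the latent space
    cannot collapse to a trivial solution."""
    x_t, x_next = enc(obs_t), enc(obs_next)
    dyn_loss = F.mse_loss(fwd(x_t, action), x_next.detach())   # transition term
    imit_loss = F.cross_entropy(conj(x_t, x_next), action)     # conjugate/action term
    return dyn_loss + imit_loss
```

Coupling the transition loss with the action-imitation loss is what the bullets above call conjugate constraints: the latent space must be predictive of both the next state and the action that caused it, which rules out degenerate encodings.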
Results
- Because the action space in instructional videos is not continuous, a gradient-based planner does not work well (the sketch after this list illustrates why)
- This makes UPN perform similarly to Ours w/o T, which is essentially an RNN goal-conditioned policy trained with imitation objectives.
- The authors' full model combines the strengths of the planning and action-imitation objectives as conjugate constraints, which enables them to learn plannable representations from real-world videos and to outperform all baseline approaches on all metrics.
- Neither baseline model understands that, before performing the remaining steps, the person needs to get the tools first
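To make the UPN comparison concrete, the snippet below sketches why a gradient-descent planner presumes a continuous, differentiable action space: the action sequence itself is the optimization variable, refined through a rolled-out dynamics model. Both the planner structure and the continuous-action dynamics model `fwd` are generic illustrations and assumptions, not UPN's exact algorithm.

```python
# Why a gradient-based planner (UPN-style) presumes continuous actions:
# the action sequence itself is the optimization variable.
# Illustrative sketch with an assumed continuous-action dynamics model `fwd`;
# not UPN's exact algorithm.
import torch

def gradient_plan(fwd, x_start, x_goal, horizon=4, steps=100, lr=0.1, action_dim=8):
    # Continuous action sequence, refined directly by gradient descent.
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        x = x_start
        for t in range(horizon):
            x = fwd(x, actions[t])          # differentiable rollout
        loss = torch.norm(x - x_goal) ** 2  # distance to the goal latent state
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()

# With a discrete action vocabulary (indices into steps such as "crack egg",
# "pour milk"), `actions` cannot carry gradients, so this inner optimization
# degenerates and the model behaves like a goal-conditioned imitation policy,
# matching the UPN vs. Ours w/o T observation above.
```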
Conclusion
- The authors address the challenge of open-vocabulary state and action spaces by learning plannable representations with conjugate constraints on the latent space.
- The authors' experimental results show that the framework learns high-level semantic representations that are plannable and significantly outperforms the baselines across different metrics on two challenging tasks: procedure planning and walkthrough planning.
- The authors intend to incorporate object-oriented models to further explore object and predicate relations in complex visual dynamics data
Summary
Introduction:
Humans possess a natural ability to plan and reason in everyday tasks: they can picture what effects their actions would have and plan multiple steps ahead to achieve an intended goal.
- One can imagine an indefinitely growing semantic state space, which prevents the application of classical symbolic planning approaches [6] that require a given set of predicates for a well-defined state space.
- This challenge is amplified by the fact that the authors do not assume knowledge of the effects of all possible actions.
- How does the agent know that pouring eggs into the pan will result in cooked eggs? Without this knowledge, it is impossible to plan in this space
Objectives:
The authors aim to answer the following questions in the experiments: (i) Can plannable representations be learned from real-world instructional videos, and how does the approach compare to existing latent space planning methods? (ii) How important are the conjugate constraints? (iii) Can the same model further retrieve intermediate visual subgoals? The authors answer the first two questions with ablation studies on challenging real-world instructional videos.
Methods:
The key challenge is that both state and action spaces are open-vocabulary.
- The authors first define the procedure planning problem setup and show how to address it with a latent space planning approach.
- They then discuss how to learn the latent space and how to leverage the conjugate relationships between states and actions to avoid trivial solutions to the optimization.
- Finally, they present the algorithms for procedure planning and walkthrough planning [14] in the learned plannable space
Results:
Because the action space in instructional videos is not continuous, a gradient-based planner does not work well.
- This makes UPN perform similarly to Ours w/o T, which is essentially an RNN goal-conditioned policy trained with imitation objectives.
- The authors' full model combines the strengths of the planning and action-imitation objectives as conjugate constraints, which enables them to learn plannable representations from real-world videos and to outperform all baseline approaches on all metrics.
- Neither baseline model understands that, before performing the remaining steps, the person needs to get the tools first
Conclusion:
The authors address the challenge of open-vocabulary state and action spaces by learning plannable representations with conjugate constraints on the latent space.
- Their experimental results show that the framework learns high-level semantic representations that are plannable and significantly outperforms the baselines across different metrics on two challenging tasks: procedure planning and walkthrough planning.
- The authors intend to incorporate object-oriented models to further explore object and predicate relations in complex visual dynamics data
Tables
- Table 1: Results for procedure planning. Our full model significantly outperforms the baselines. With around a 10% improvement in accuracy, our full model improves the success rate by 8 times compared to Ours w/o T, which shows the importance of reasoning over the full sequence.
- Table 2: Results for walkthrough planning. Our model trained for procedure planning can also address walkthrough planning. It significantly outperforms the baseline through explicit sequential reasoning about which actions need to be performed first, and is less distracted by visual appearance (a sketch of these sequence-level metrics follows).
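Tables 1 and 2 report accuracy and success rate. The helper below illustrates one natural reading of the gap between the two: accuracy counts correct steps, while success requires the entire plan to be correct. The paper's exact metric definitions (and any IoU-style variant) may differ; the function names here are hypothetical.

```python
# Hedged sketch of sequence-level metrics of the kind reported in Tables 1-2.
# The paper's exact metric definitions may differ; this only illustrates why
# success rate is much stricter than per-step accuracy.
from typing import List

def per_step_accuracy(pred: List[int], gt: List[int]) -> float:
    """Fraction of time steps where the predicted action id matches."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def sequence_success(pred: List[int], gt: List[int]) -> bool:
    """Succeeds only if the entire planned action sequence is correct."""
    return pred == gt

# One wrong step keeps accuracy high but fails the whole plan:
print(per_step_accuracy([3, 7, 7, 9], [3, 7, 2, 9]))  # 0.75
print(sequence_success([3, 7, 7, 9], [3, 7, 2, 9]))   # False
```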
Related work
- Planning in Videos. In the planning literature, most studies rely on a prescribed set of state and action representations and symbols for the task [6, 13]. We are interested in works that perform planning in the visual space without well-defined state and action spaces. In addition, the complexity of instructional videos prevents the direct application of approaches that plan directly in the visual space [2, 4]. It is thus necessary to learn plannable representations. In addition to unsupervised and semi-supervised approaches [18, 23], Universal Planning Networks [19] use a gradient descent planner to learn representations for planning; however, this assumes the action space is differentiable. Alternatively, one can learn the latent dynamics by optimizing the data log-likelihood given the actions [8]. We use a similar formulation and further propose conjugate constraints to expedite the latent space learning. Without explicit action supervision, Causal InfoGAN [14] extracts state representations by learning salient features that describe the causal structure of the data in simple domains. In contrast to [14], our model operates directly on high-dimensional video demonstrations and handles the semantics of actions with sequential learning.
Funding
- This work was partially funded by Toyota Research Institute (TRI)
Study subjects and analysis
snapshot observations: 9
Figure 3 shows the qualitative results of our procedure planning. The first row shows the 9 snapshot observations in making French toast. In the second row, we pick o3 and o7 as the start and goal observations and ask the model to plan the 4 actions in between (a latent-space planner of this kind is sketched below).
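The Figure 3 query (plan a fixed number of actions between a chosen start and goal observation) can be served by a simple search over learned latent dynamics. The beam-search routine below is an illustrative assumption about how such inference could run on top of an encoder and forward model like those sketched in the Methods section; it is not the paper's actual planner, and the beam width and action-vocabulary size are made up.

```python
# Illustrative latent-space planner for the Figure 3 style query: given the
# encoded start and goal observations, beam-search an action sequence of
# fixed horizon. `enc` and `fwd` are assumed to behave like the StateEncoder
# and ForwardDynamics sketched earlier; the paper's planner may differ.
import torch

@torch.no_grad()
def plan(enc, fwd, obs_start, obs_goal, horizon=4, beam=10, n_actions=100):
    x_goal = enc(obs_goal)
    beams = [(enc(obs_start), [])]                   # (latent state, action prefix)
    for _ in range(horizon):
        candidates = []
        for x, prefix in beams:
            for a in range(n_actions):
                a_t = torch.tensor([a])
                x_next = fwd(x.unsqueeze(0), a_t).squeeze(0)
                candidates.append((x_next, prefix + [a]))
        # Keep the prefixes whose rollouts stay closest to the goal state.
        candidates.sort(key=lambda c: torch.norm(c[0] - x_goal).item())
        beams = candidates[:beam]
    return beams[0][1]                               # best horizon-step action plan
```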
Reference
- Yazan Abu Farha, Alexander Richard, and Juergen Gall. When will you do what?-anticipating temporal occurrences of activities. In CVPR, 2018.
- Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NeurIPS, 2016.
- Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Joint discovery of object states and manipulation actions. In ICCV, 2017.
- Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
- Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In CVPR, 2016.
- Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: theory and practice. Elsevier, 2004.
- Sergio Guadarrama, Erik Rodner, Kate Saenko, Ning Zhang, Ryan Farrell, Jeff Donahue, and Trevor Darrell. Open-vocabulary object retrieval. In Robotics: science and systems, volume 2, page 6, 2014.
- Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.
- Bradley Hayes and Brian Scassellati. Autonomously constructing hierarchical task networks for planning and human-robot collaboration. In ICRA, 2016.
- De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. Finding “it”: Weakly supervised reference-aware visual grounding in instructional videos. In CVPR, 2018.
- De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. In CVPR, 2019.
- Dinesh Jayaraman, Frederik Ebert, Alexei A Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
- George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61: 215–289, 2018.
- Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart J Russell, and Pieter Abbeel. Learning plannable representations with causal infogan. In NeurIPS, 2018.
- Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
- Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
- Nicholas Rhinehart and Kris M Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017.
- Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018.
- Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. In ICML, 2018.
- Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019.
- Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. arXiv preprint arXiv:1903.02874, 2019.
- Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
- Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NeurIPS, 2015.
- Kuo-Hao Zeng, William B Shen, De-An Huang, Min Sun, and Juan Carlos Niebles. Visual forecasting by imitating dynamics in natural sequences. In ICCV, 2017.
- Hang Zhao, Xavier Puig, Bolei Zhou, Sanja Fidler, and Antonio Torralba. Open vocabulary scene parsing. In ICCV, 2017.
- Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
- Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. arXiv preprint arXiv:1903.08225, 2019.