Learning Affordance Landscapes for Interaction Exploration in 3D Environments

NeurIPS 2020.

Abstract:

Embodied agents operating in human spaces must be able to master how their environment works: what objects can the agent use, and how can it use them? We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (…)

Introduction
  • The ability to interact with the environment is an essential skill for embodied agents operating in human spaces.
  • An agent learns to navigate to specified objects [18], a dexterous hand learns to solve a Rubik’s cube [4], a robot learns to manipulate a rope [40].
  • In these cases and many others, it is known a priori what objects are relevant for the interactions and what the goal of the interaction is, whether expressed through expert demonstrations or a reward crafted to elicit the desired behavior.
  • The resulting agents remain specialized to the target interactions and objects for which they were taught.
Highlights
  • The ability to interact with the environment is an essential skill for embodied agents operating in human spaces.
  • We introduce the exploration for interaction problem: a mobile agent in a 3D environment must autonomously discover the objects with which it can physically interact, and what actions are valid as interactions with them.
  • We propose a deep reinforcement learning (RL) approach in which the agent discovers the affordance landscape of a new, unmapped 3D environment (a minimal sketch of the underlying exploration reward follows this list).
  • We proposed the task of “interaction exploration” and developed agents that can learn to efficiently act in new environments to prepare for downstream interaction tasks, while simultaneously building an internal model of object affordances.
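To make the exploration objective concrete, below is a minimal sketch of an intrinsic reward that pays the agent only for novel, successful (action, object) interactions. The step interface (an action, object ID, and success flag per interaction attempt) and the reward scale are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of an interaction-exploration reward (illustrative assumptions,
# not the authors' exact formulation).
class InteractionExplorationReward:
    """Reward +1 the first time a distinct (action, object) interaction succeeds."""

    def __init__(self, bonus=1.0):
        self.bonus = bonus
        self.discovered = set()   # (action, object_id) pairs already rewarded

    def reset(self):
        self.discovered.clear()   # call at the start of each episode

    def __call__(self, action, object_id, success):
        if not success or object_id is None:
            return 0.0            # failed or non-interaction steps earn nothing
        key = (action, object_id)
        if key in self.discovered:
            return 0.0            # repeating a known interaction earns nothing
        self.discovered.add(key)
        return self.bonus         # novel interaction discovered
```

Under such a reward, interaction coverage is simply the number of discovered pairs relative to all interactions the scene affords.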
Methods
  • The authors evaluate agents’ ability to interact with as many objects as possible (Sec. 4.1) and to enhance policy learning on downstream tasks (Sec. 4.2).
  • Simulation environment: The authors experiment with AI2-iTHOR [30], since it supports context-specific interactions that can change object states, versus the simpler physics-based interactions in other 3D indoor environments [59, 8].
  • The authors use all kitchen scenes; kitchens are a valuable domain since many diverse interactions with objects are possible, as emphasized in prior work [14, 38, 21].
  • The authors randomize objects’ positions and states, the agent’s start location, and the camera viewpoint when sampling episodes (a minimal setup sketch follows this list).
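To ground the simulation setup, here is a minimal sketch of sampling one randomized AI2-iTHOR kitchen episode with the public `ai2thor` Python package. The scene list, grid size, randomization calls, and metadata keys below follow the public API but are assumptions for illustration; they vary across AI2-THOR versions and are not the authors' exact configuration.

```python
import random

from ai2thor.controller import Controller

# Sketch: sample a randomized AI2-iTHOR kitchen episode (assumed settings).
KITCHENS = [f"FloorPlan{i}" for i in range(1, 31)]  # AI2-THOR kitchen scenes

controller = Controller(scene=random.choice(KITCHENS), gridSize=0.25)

# Shuffle pickupable object placement for this episode.
controller.step(action="InitialRandomSpawn", randomSeed=random.randint(0, 2**20))

# Randomize the agent's start position and camera heading.
positions = controller.step(action="GetReachablePositions").metadata["actionReturn"]
controller.step(
    action="Teleport",
    position=random.choice(positions),
    rotation=dict(x=0, y=random.choice([0, 90, 180, 270]), z=0),
)

# One exploration step: move, then try to open a visible, closed, openable object.
event = controller.step(action="MoveAhead")
candidates = [
    obj for obj in event.metadata["objects"]
    if obj["visible"] and obj["openable"] and not obj["isOpen"]
]
if candidates:
    event = controller.step(action="OpenObject", objectId=candidates[0]["objectId"])
    print("OpenObject succeeded:", event.metadata["lastActionSuccess"])
```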
Results
  • The authors' agents can quickly seek out new objects to interact with in new environments, matching the performance of the best exploration method in 42% fewer timesteps and surpassing it to discover 1.33× more interactions when fully trained.
  • The authors' full model with learned affordance maps leads to the best interaction exploration policies and discovers 1.33× more unique object interactions than the strongest baseline.
  • It also performs these interactions quickly, discovering the same number of interactions as RANDOM+ in 63% fewer timesteps.
Conclusion
  • The authors proposed the task of “interaction exploration” and developed agents that can learn to efficiently act in new environments to prepare for downstream interaction tasks, while simultaneously building an internal model of object affordances.
  • Embodied agents that can explore environments in the absence of humans have broader applications in service robotics and assistive technology.
  • Such robots could survey a space and give new users a quick rundown of it, for example alerting them to the appliances in a workspace, which of them are functional, and how they can be activated.
  • They could also warn users to avoid interacting with objects that are sharp, hot, or otherwise dangerous, based on the robot’s own interactions with them.
Objectives
  • The authors' goal is to train an interaction exploration agent to enter a new, unseen environment and successfully interact with all objects present.
Tables
  • Table 1: Exploration performance per interaction. Our policy is both more precise (prec) and discovers more interactions (cov) than all other methods. Methods that cycle through actions eventually succeed, but at the cost of interaction failures along the way (a sketch of one way to compute these metrics follows).
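One plausible reading of the prec and cov columns: precision is the fraction of attempted interactions that succeed, and coverage is the fraction of the scene's possible interactions that were successfully executed. The sketch below computes both from a hypothetical episode log; the log format and the metric definitions here are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Sketch: precision and coverage of interaction exploration from an episode log.
# `attempts` and `possible` are hypothetical structures used only for illustration.
def interaction_metrics(attempts, possible):
    """attempts: list of ((action, object_id), success) tuples from one episode.
    possible: set of all (action, object_id) interactions the scene affords."""
    succeeded = {key for key, ok in attempts if ok}
    precision = sum(ok for _, ok in attempts) / max(len(attempts), 1)
    coverage = len(succeeded & possible) / max(len(possible), 1)
    return precision, coverage


prec, cov = interaction_metrics(
    attempts=[(("open", "Fridge|1"), True), (("toggle", "Faucet|2"), False)],
    possible={("open", "Fridge|1"), ("toggle", "Faucet|2"), ("take", "Mug|3")},
)
print(f"prec={prec:.2f}, cov={cov:.2f}")  # prec=0.50, cov=0.33
```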
Related work
  • Visual affordances: An affordance is the potential for action [22]. In computer vision, visual affordances are explored in various forms: predicting where to grasp an object from images and video [31, 32, 64, 38, 19, 62, 15, 5], inferring how people might use a space [48, 39] or tool [65], and priors for human body poses [26, 52, 58, 17]. Our work offers a new perspective on learning visual affordances. Rather than learn them passively from a static dataset, the proposed agent actively seeks new affordances via exploratory interactions with a dynamic environment. Furthermore, unlike prior work, our approach yields not just an image model, but also a policy for exploring interactions, which we show accelerates learning new downstream tasks for an embodied agent.

    Exploration for navigation in 3D environments: Recent embodied AI work in 3D simulators [36, 56, 60, 10] tackles navigation: the agent moves intelligently in an unmapped but static environment to reach a goal (e.g., [12, 11, 36, 6]). Exploration policies for visual navigation efficiently map the environment in an unsupervised “preview” stage [12, 50, 18, 11, 47, 46]. The agent is rewarded for maximizing the area covered in its inferred occupancy map [12, 11, 18], the novelty of the states visited [51], pushing the frontier of explored areas [46], and related metrics [47]. For a game setting in VizDoom, classic frontier-based exploration is improved by learning the visual appearance of hazardous regions (e.g., enemies, lava) where the agent’s health score has previously declined [46]. A minimal sketch of such a count-based novelty bonus appears below.
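For contrast with the interaction-driven reward used in this paper, a count-based state-novelty bonus of the kind used by the navigation-exploration methods above can be sketched as follows; the position discretization and bonus scale are illustrative assumptions.

```python
import math
from collections import defaultdict

# Sketch: count-based novelty bonus for visited locations (illustrative only).
class NoveltyBonus:
    def __init__(self, scale=1.0, cell=0.5):
        self.scale = scale              # bonus magnitude for a first visit
        self.cell = cell                # grid resolution (meters) for discretizing position
        self.counts = defaultdict(int)  # visit counts per grid cell

    def __call__(self, x, z):
        key = (round(x / self.cell), round(z / self.cell))
        self.counts[key] += 1
        return self.scale / math.sqrt(self.counts[key])  # decays with repeat visits
```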
Reference
  • Gibson sim2real challenge. http://svl.stanford.edu/igibson/challenge.html, 2020.
  • RoboTHOR. https://ai2thor.allenai.org/robothor/challenge/, 2020.
  • P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016.
  • I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • J.-B. Alayrac, J. Sivic, I. Laptev, and S. Lacoste-Julien. Joint discovery of object states and manipulation actions. In ICCV, 2017.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
  • S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
  • Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. In ICLR, 2019.
  • A. Chang, A. Dai, T. Funkhouser, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov. Learning to explore using active neural SLAM. In ICLR, 2020.
  • T. Chen, S. Gupta, and A. Gupta. Learning exploration policies for navigation. In ICLR, 2019.
  • M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. A deep multi-level network for saliency prediction. In ICPR, 2016.
  • D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
  • D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. W. Mayol-Cuevas. You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In BMVC, 2014.
  • A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In CVPR, 2018.
  • V. Delaitre, D. F. Fouhey, I. Laptev, J. Sivic, A. Gupta, and A. A. Efros. Scene semantics from long-term observation of people. In ECCV, 2012.
  • K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In CVPR, 2019.
  • K. Fang, T.-L. Wu, D. Yang, S. Savarese, and J. J. Lim. Demo2Vec: Reasoning object affordances from online videos. In CVPR, 2018.
  • D. Gandhi, L. Pinto, and A. Gupta. Learning to fly by crashing. In IROS, 2017.
  • Z. Gao, R. Gong, T. Shu, X. Xie, S. Wang, and S. C. Zhu. VRKitchen: An interactive 3D virtual environment for task-oriented learning. arXiv preprint arXiv:1903.05757, 2019.
  • J. J. Gibson. The ecological approach to visual perception: Classic edition. Psychology Press, 1979.
  • L. K. L. Goff, G. Mukhtar, A. Coninx, and S. Doncieux. Bootstrapping robotic ecological perception from a limited set of hypotheses through interactive perception. arXiv preprint arXiv:1901.10968, 2019.
  • L. K. L. Goff, O. Yaakoubi, A. Coninx, and S. Doncieux. Building an affordances map with interactive perception. arXiv preprint arXiv:1903.04413, 2019.
  • D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. TPAMI, 2009.
  • N. Haber, D. Mrowca, S. Wang, L. F. Fei-Fei, and D. L. Yamins. Learning to play with intrinsically-motivated, self-aware agents. In NIPS, 2018.
  • D. I. Kim and G. S. Sukhatme. Interactive affordance map building for a robotic task. In IROS, 2015.
  • E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
  • H. S. Koppula and A. Saxena. Physically grounded spatio-temporal object affordances. In ECCV, 2014.
  • H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. TPAMI, 2016.
  • S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 2018.
  • C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In CoRL, 2019.
  • J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In RSS, 2017.
  • M. Savva*, A. Kadian*, O. Maksymets*, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A platform for embodied AI research. In ICCV, 2019.
  • S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015.
  • T. Nagarajan, C. Feichtenhofer, and K. Grauman. Grounded human-object interaction hotspots from video. In ICCV, 2019.
  • T. Nagarajan, Y. Li, C. Feichtenhofer, and K. Grauman. EGO-TOPO: Environment affordances from egocentric video. In CVPR, 2020.
  • A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In ICRA, 2017.
  • J. Oberlin and S. Tellex. Learning to pick up objects through active exploration. In ICDL-EpiRob, 2015.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
  • L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
  • X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba. VirtualHome: Simulating household activities via programs. In CVPR, 2018.
  • W. Qi, R. T. Mullapudi, S. Gupta, and D. Ramanan. Learning to move with affordance maps. In ICLR, 2020.
  • S. K. Ramakrishnan, D. Jayaraman, and K. Grauman. An exploration of embodied visual exploration. arXiv preprint arXiv:2001.02192, 2020.
  • N. Rhinehart and K. M. Kitani. Learning action maps of large environments via first-person vision. In CVPR, 2016.
  • O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap, and S. Gelly. Episodic curiosity through reachability. In ICLR, 2019.
  • M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. SceneGrok: Inferring action maps in 3D environments. TOG, 2014.
  • J. Schmidhuber. Curious model-building control systems. In IJCNN, 1991.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In CVPR, 2020.
  • J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. JCSS, 2008.
  • X. Wang, R. Girdhar, and A. Gupta. Binge watching: Scaling affordance learning from sitcoms. In CVPR, 2017.
  • F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese. Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments. RA-L, 2020.
  • F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In CVPR, 2018.
  • A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In ICRA, 2018.
  • Y. Zhou, B. Ni, R. Hong, X. Yang, and Q. Tian. Cascaded interactional targeting network for egocentric video analysis. In CVPR, 2016.
  • Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual semantic planning using deep successor representations. In ICCV, 2017.
  • Y. Zhu, C. Jiang, Y. Zhao, D. Terzopoulos, and S.-C. Zhu. Inferring forces and learning human utilities from videos. In CVPR, 2016.
  • Y. Zhu, Y. Zhao, and S.-C. Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In CVPR, 2015.