Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning.
As a step towards developing zero-shot task generalization capabilities in reinforcement learning (RL), we introduce a new RL problem where the agent should learn to execute sequences of instructions after learning useful skills that solve subtasks. In this problem, we consider two types of generalizations: to previously unseen instructions […]
- The ability to understand and follow instructions allows them to perform a large number of new complex sequential tasks even without additional learning.
- Developing the ability to execute instructions can potentially allow reinforcement learning (RL) agents to generalize quickly over tasks for which such instructions are available.
- Factory-trained household robots could execute novel tasks in a new house following a human user’s instructions.
- In addition to generalization over instructions, an intelligent agent should be able to handle unexpected events while executing instructions.
- The agent should not blindly execute instructions sequentially but should sometimes deviate from them depending on circumstances, which requires balancing two different objectives.
- We explored a type of zero-shot task generalization in reinforcement learning with a new problem where the agent is required to execute and generalize over sequences of instructions.
- We proposed an analogy-making objective which enables generalization over unseen parameterized tasks in various scenarios.
- We proposed a novel way to learn the time-scale of the meta controller that proved more efficient and flexible than alternative approaches for interrupting subtasks and for dealing with delayed sequential decision problems.
- Our empirical results on a stochastic 3D domain showed that our architecture generalizes well to longer sequences of instructions as well as to unseen instructions.
- Although our hierarchical reinforcement learning architecture was demonstrated in the simple setting where instructions must be executed sequentially, we believe that our key ideas are not limited to this setting and can be extended to richer forms of instructions.
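The analogy-making objective mentioned above is built on a contrastive loss (Hadsell et al., 2006): embeddings of analogous task pairs should have matching differences, while non-analogous pairs are pushed at least a margin apart. The following is a minimal sketch of that idea, not the paper's implementation; the embedding function `g` and the vector arithmetic are illustrative assumptions.

```python
import math

def embed_diff_norm(g, a, b, c, d):
    """Euclidean norm of (g(a) - g(b)) - (g(c) - g(d)).

    g is a (hypothetical) task-embedding function; a, b, c, d are
    task parameter vectors forming a candidate analogy [a : b :: c : d].
    """
    ga, gb, gc, gd = g(a), g(b), g(c), g(d)
    return math.sqrt(sum(((x - y) - (u - v)) ** 2
                         for x, y, u, v in zip(ga, gb, gc, gd)))

def analogy_loss(g, a, b, c, d, is_analogy, margin=1.0):
    """Contrastive analogy loss sketch (after Hadsell et al., 2006)."""
    dist = embed_diff_norm(g, a, b, c, d)
    if is_analogy:
        return dist ** 2                      # pull analogous differences together
    return max(0.0, margin - dist) ** 2       # push non-analogies beyond the margin
```

With an identity embedding, `[2,0] : [1,0] :: [3,0] : [2,0]` has matching differences, so the analogy loss is zero, while labeling the same quadruple a non-analogy incurs the full margin penalty.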
- The authors developed a 3D visual environment using Minecraft based on Oh et al. (2016), as shown in Figure 1.
- The agent has 9 actions: Look (Left/Right/Up/Down), Move (Forward/Backward), Pick up, Transform, and No operation.
- Pick up removes the object in front of the agent, and Transform changes the object in front of the agent to ice.
- The authors used the same Minecraft domain as in Section 3.3.
- Throughout an episode, a box randomly appears with probability 0.03, and transforming a box gives a +0.9 reward.
- The authors used the best-performing parameterized skill throughout this experiment.
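The environment dynamics described in these bullets can be sketched as a toy stand-in for the stochastic Minecraft domain. Class and action names here are hypothetical, not the paper's code; only the 9-action set, the 0.03 box-appearance probability, and the +0.9 transform reward come from the text above.

```python
import random

# The 9 actions described above.
ACTIONS = ["LookLeft", "LookRight", "LookUp", "LookDown",
           "MoveForward", "MoveBackward", "PickUp", "Transform", "NoOp"]

class BoxWorldSketch:
    """Toy stand-in for the stochastic domain: a box appears in front of
    the agent with probability 0.03 per step; transforming it yields +0.9."""

    BOX_APPEAR_PROB = 0.03
    TRANSFORM_REWARD = 0.9

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.box_in_front = False

    def step(self, action):
        reward = 0.0
        if action == "Transform" and self.box_in_front:
            reward = self.TRANSFORM_REWARD   # box turns to ice
            self.box_in_front = False
        elif action == "PickUp" and self.box_in_front:
            self.box_in_front = False        # object removed, no reward
        # Stochastic event: a new box may appear at any step.
        if not self.box_in_front and self.rng.random() < self.BOX_APPEAR_PROB:
            self.box_in_front = True
        return reward
```

Because boxes can appear at any time, an agent that deviates from its current instruction to transform a box earns extra reward, which is exactly the interruption behavior the meta controller must learn.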
- The authors developed a 3D visual environment using Minecraft based on Oh et al. (2016) where the agent can interact with many objects.
- The authors' results on multiple sets of parameterized subtasks show that the proposed analogy-making objective can generalize successfully.
- The authors' results on multiple instruction execution problems show that the meta controller’s ability to learn when to update the subtask plays a key role in solving the overall problem and outperforms several hierarchical baselines.
- Section 3 presents the analogy-making objective for generalization to parameterized tasks and demonstrates its application to different generalization scenarios.
- The authors demonstrate the agent's ability to generalize over sequences of instructions, and provide a comparison to several alternative approaches.
- Table 1: Performance on parameterized tasks. Each entry shows 'Average reward (Success rate)'; the analogy-making objectives are based on contrastive loss (Hadsell et al., 2006). We assume an episode is successful only when the agent solves the task correctly.
- Table 2: Performance on instruction execution. Each entry shows average reward and success rate. 'Hierarchical-Dynamic' is our approach that learns when to update the subtask. An episode is successful only when the agent solves all instructions correctly.
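The 'Hierarchical-Dynamic' idea in Table 2 — a meta controller that learns *when* to update the current subtask rather than updating at a fixed time-scale — can be sketched as a gated loop. All function names here are hypothetical placeholders for learned components (the update gate, the subtask proposer, and the subtask policy), not the paper's architecture.

```python
def run_meta_controller(update_gate, propose_subtask, subtask_policy,
                        obs_stream, init_subtask):
    """Sketch of a meta controller with a learned update time-scale.

    At each step a binary gate decides whether to keep the current
    subtask or to re-read the instructions and propose a new one; the
    low-level policy then acts conditioned on the active subtask.
    """
    subtask = init_subtask
    actions = []
    for obs in obs_stream:
        if update_gate(obs, subtask):              # decide: update the subtask now?
            subtask = propose_subtask(obs, subtask)
        actions.append(subtask_policy(obs, subtask))
    return actions
```

With toy callables — a gate that fires when the observation signals subtask completion, and a proposer that advances to the next subtask index — the loop switches subtasks only at the gated steps, which is what lets the architecture interrupt a subtask early or hold it across many steps.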
- Hierarchical RL. A number of hierarchical RL approaches are designed to deal with sequential tasks. Typically these have the form of a meta controller and a set of lower-level controllers for subtasks (Sutton et al., 1999; Dietterich, 2000; Parr and Russell, 1997; Ghavamzadeh and Mahadevan, 2003; Konidaris et al., 2012; Konidaris and Barto, 2007). However, much of the previous work assumes that the overall task is fixed (e.g., the Taxi domain (Dietterich, 2000)). In other words, the optimal sequence of subtasks is fixed during evaluation (e.g., picking up a passenger followed by navigating to a destination in the Taxi domain). This makes it hard to evaluate the agent's ability to compose pre-learned policies to solve previously unseen sequential tasks in a zero-shot fashion unless we re-train the agent on the new tasks in a transfer learning setting (Singh, 1991; 1992; McGovern and Barto, 2002). Our work is also closely related to Programmable HAMs (PHAMs) (Andre and Russell, 2000; 2002) in that a PHAM is designed to execute a given program. However, in PHAMs the program explicitly specifies the policy, which effectively reduces the state-action search space. In contrast, in our work instructions are a description of the task, which means that the agent should learn to use the instructions to maximize its reward.
- This work was supported by NSF grant IIS-1526059
- D. Andre and S. J. Russell. Programmable reinforcement learning agents. In NIPS, 2000.
- D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, 2002.
- J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. CoRR, abs/1611.01796, 2016.
- P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In AAAI, 2017.
- S. R. K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. Reinforcement learning for mapping instructions to actions. In ACL/IJCNLP, 2009.
- D. L. Chen and R. J. Mooney. Learning to interpret natural language navigation instructions from observations. In AAAI, 2011.
- J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
- B. C. da Silva, G. Konidaris, and A. G. Barto. Learning parameterized skills. In ICML, 2012.
- C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In ICRA, 2017.
- T. G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
- C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
- M. Ghavamzadeh and S. Mahadevan. Hierarchical policy gradient algorithms. In ICML, 2003.
- A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
- N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- D. Isele, M. Rostami, and E. Eaton. Using task features for zero-shot knowledge transfer in lifelong learning. In IJCAI, 2016.
- G. Konidaris and A. G. Barto. Building portable options: Skill transfer in reinforcement learning. In IJCAI, 2007.
- G. Konidaris, I. Scheidwasser, and A. G. Barto. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13:1333–1371, 2012.
- J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork rnn. In ICML, 2014.
- T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
- M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI, 2006.
- A. McGovern and A. G. Barto. Autonomous discovery of temporal abstractions from interaction with an environment. PhD thesis, University of Massachusetts, 2002.
- H. Mei, M. Bansal, and M. R. Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. arXiv preprint arXiv:1506.04089, 2015.
- T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
- J. Oh, V. Chockalingam, S. Singh, and H. Lee. Memorybased control of active perception and action in minecraft. In ICML, 2016.
- E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In ICLR, 2016.
- R. Parr and S. J. Russell. Reinforcement learning with hierarchies of machines. In NIPS, 1997.
- S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.
- A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. In ICLR, 2016.
- T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In ICML, 2015.
- J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
- S. P. Singh. The efficient learning of multiple task sequences. In NIPS, 1991.
- S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4): 323–339, 1992.
- S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus. Mazebase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
- R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1): 181–211, 1999.
- S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
- S. Tellex, R. A. Knepper, A. Li, D. Rus, and N. Roy. Asking for help using inverse semantics. In RSS, 2014.
- C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. In AAAI, 2017.
- W. Zaremba and I. Sutskever. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521, 2015.