Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning.

ICML, 2017

Cited by 175 | Views 195 | EI

Abstract

As a step towards developing zero-shot task generalization capabilities in reinforcement learning (RL), we introduce a new RL problem where the agent should learn to execute sequences of instructions after learning useful skills that solve subtasks. In this problem, we consider two types of generalizations: to previously unseen instructions and to longer sequences of instructions.

Introduction
  • The ability to understand and follow instructions allows humans to perform a large number of new complex sequential tasks even without additional learning.
  • Developing the ability to execute instructions can potentially allow reinforcement learning (RL) agents to generalize quickly over tasks for which such instructions are available.
  • Factory-trained household robots could execute novel tasks in a new house following a human user’s instructions.
  • In addition to generalization over instructions, an intelligent agent should be able to handle unexpected events while executing instructions.
  • The agent should not blindly execute instructions sequentially but sometimes deviate from them depending on circumstances, which requires balancing two different objectives.
Highlights
  • The ability to understand and follow instructions allows us to perform a large number of new complex sequential tasks even without additional learning
  • We explored a type of zero-shot task generalization in reinforcement learning with a new problem where the agent is required to execute and generalize over sequences of instructions
  • We proposed an analogy-making objective which enables generalization over unseen parameterized tasks in various scenarios
  • We proposed a novel way to learn the time-scale of the meta controller that proved to be more efficient and flexible than alternative approaches for interrupting subtasks and for dealing with delayed sequential decision problems (a minimal sketch of this control flow appears after this list)
  • Our empirical results on a stochastic 3D domain showed that our architecture generalizes well to longer sequences of instructions as well as unseen instructions
  • Although our hierarchical reinforcement learning architecture was demonstrated in the simple setting where the set of instructions must be executed sequentially, we believe that our key ideas are not limited to this setting but can be extended to richer forms of instructions
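
The update mechanism highlighted above can be illustrated with a small Python sketch. This is not the paper's implementation: the real meta controller and parameterized skill are learned networks conditioned on instructions and observations, whereas every name below (`ToyEnv`, `MetaController`, `ParameterizedSkill`, the `subtask_done` signal, the 0.1 update probability) is a hypothetical placeholder. The sketch only shows the control flow: at every step the meta controller makes a binary decision about whether to keep the current subtask or commit to a new one read from the instruction list, while the low-level skill keeps choosing primitive actions.

```python
import random

class ToyEnv:
    """Tiny stand-in environment; NOT the paper's Minecraft domain."""
    def reset(self):
        self.t = 0
        return {"t": self.t}

    def step(self, action):
        self.t += 1
        obs = {"t": self.t}
        reward = 0.0
        done = self.t >= 50
        # Hypothetical signal telling the meta level that the current subtask finished.
        info = {"subtask_done": random.random() < 0.05}
        return obs, reward, done, info

class ParameterizedSkill:
    """Stand-in for the low-level controller that executes one parameterized subtask."""
    def act(self, obs, subtask):
        return random.randrange(9)  # the domain has 9 primitive actions

class MetaController:
    """Stand-in meta controller: each step it decides WHETHER to update the subtask,
    and only then reads the instruction list to pick the next subtask."""
    def decide_update(self, obs, subtask):
        return random.random() < 0.1  # placeholder for a learned binary decision

    def select_subtask(self, obs, instructions, pointer):
        return instructions[min(pointer, len(instructions) - 1)]

def run_episode(env, meta, skill, instructions, max_steps=200):
    obs = env.reset()
    subtask, pointer, total = None, 0, 0.0
    for _ in range(max_steps):
        # The meta controller's time-scale is dynamic: it switches subtasks only
        # when its update decision fires, not on a fixed schedule.
        if subtask is None or meta.decide_update(obs, subtask):
            subtask = meta.select_subtask(obs, instructions, pointer)
        obs, reward, done, info = env.step(skill.act(obs, subtask))
        total += reward
        if info["subtask_done"]:
            pointer += 1  # move on to the next instruction
        if done:
            break
    return total

print(run_episode(ToyEnv(), MetaController(), ParameterizedSkill(),
                  ["Pick up X", "Transform Y"]))
```

In the paper this update decision is itself learned, which is what allows the meta controller to operate at a flexible time-scale instead of switching subtasks on a fixed schedule.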
Methods
  • The authors developed a 3D visual environment using Minecraft based on Oh et al. (2016), as shown in Figure 1.
  • The agent has 9 actions: Look (Left/Right/Up/Down), Move (Forward/Backward), Pick up, Transform, and No operation (see the action-space sketch after this list).
  • Pick up removes the object in front of the agent, and Transform changes the object in front of the agent to ice.
  • The authors used the same Minecraft domain as in Section 3.3.
  • Throughout an episode, a box randomly appears with probability 0.03, and transforming a box gives a +0.9 reward.
  • The authors used the best-performing parameterized skill throughout this experiment.
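
For concreteness, the action set and the stochastic box event described above can be written down as a minimal sketch. Only the facts stated in this summary are encoded (9 actions, a box appearing with probability 0.03, a +0.9 reward for transforming a box); the `Action` names, the `step_box_event` helper, and the assumption that a transformed box disappears are illustrative, not the released environment code.

```python
import random
from enum import Enum

class Action(Enum):
    """The 9 primitive actions available to the agent."""
    LOOK_LEFT = 0
    LOOK_RIGHT = 1
    LOOK_UP = 2
    LOOK_DOWN = 3
    MOVE_FORWARD = 4
    MOVE_BACKWARD = 5
    PICK_UP = 6    # removes the object in front of the agent
    TRANSFORM = 7  # changes the object in front of the agent to ice
    NO_OP = 8

def step_box_event(box_present, action, rng=random):
    """Toy model of the stochastic box event only (not the full environment):
    a box appears with probability 0.03, and transforming a box yields +0.9 reward."""
    reward = 0.0
    if box_present and action is Action.TRANSFORM:
        reward, box_present = 0.9, False  # simplification: the transformed box is gone
    if not box_present and rng.random() < 0.03:
        box_present = True
    return box_present, reward

# Example: run the event model for a few steps with random actions.
box = False
for _ in range(5):
    box, r = step_box_event(box, random.choice(list(Action)))
```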
Results
  • The authors developed a 3D visual environment using Minecraft based on Oh et al. (2016) where the agent can interact with many objects.
  • The authors’ results on multiple sets of parameterized subtasks show that the proposed analogy-making objective enables successful generalization (a sketch of one plausible form of this objective follows this list).
  • The authors' results on multiple instruction execution problems show that the meta controller’s ability to learn when to update the subtask plays a key role in solving the overall problem and outperforms several hierarchical baselines.
  • Section 3 presents the analogy-making objective for generalization to parameterized tasks and demonstrates its application to different generalization scenarios.
  • The authors demonstrate the agent’s ability to generalize over sequences of instructions, as well as provide a comparison to several alternative approaches
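
The analogy-making objective is only named in this summary, so the snippet below is a hedged guess at one plausible form rather than the paper's exact formulation. Assuming task embeddings phi(·) and quadruples of task parameters for which the analogy A:B :: C:D should (or should not) hold, it applies a contrastive loss in the spirit of Hadsell et al. (2006) to the difference of embedding differences; the function name, the margin value, and the random embeddings are all illustrative.

```python
import numpy as np

def analogy_contrastive_loss(phi_a, phi_b, phi_c, phi_d, analogous, margin=1.0):
    """Contrastive-style analogy term (a sketch, not the paper's exact objective).

    dist = ||(phi(A) - phi(B)) - (phi(C) - phi(D))||. For analogous task pairs the two
    embedding differences should match (pull the distance toward zero); for
    non-analogous pairs they should stay at least `margin` apart (hinge term).
    """
    dist = np.linalg.norm((phi_a - phi_b) - (phi_c - phi_d))
    if analogous:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# Illustrative usage with random 8-dimensional task embeddings.
rng = np.random.default_rng(0)
a, b, c, d = (rng.normal(size=8) for _ in range(4))
print(analogy_contrastive_loss(a, b, c, d, analogous=True))   # pulled toward 0
print(analogy_contrastive_loss(a, b, c, d, analogous=False))  # hinge with margin 1.0
```

Minimizing such a term over many task quadruples encourages the embedding space to respect the relational structure among parameterized tasks, which is the property this summary credits for generalization to unseen task parameters.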
Conclusion
  • The authors explored a type of zero-shot task generalization in RL with a new problem where the agent is required to execute and generalize over sequences of instructions.
  • The authors proposed an analogy-making objective which enables generalization over unseen parameterized tasks in various scenarios.
  • The authors' empirical results on a stochastic 3D domain showed that the architecture generalizes well to longer sequences of instructions as well as unseen instructions.
  • Although the authors’ hierarchical RL architecture was demonstrated in the simple setting where the set of instructions must be executed sequentially, they believe that the key ideas are not limited to this setting but can be extended to richer forms of instructions.
Tables
  • Table 1: Performance on parameterized tasks. Each entry shows ‘Average reward (Success rate)’; the compared analogy-making objectives are based on contrastive loss (Hadsell et al., 2006). We assume an episode is successful only when the agent finishes the given task correctly.
  • Table 2: Performance on instruction execution. Each entry shows average reward and success rate. ‘Hierarchical-Dynamic’ is our approach that learns when to update the subtask. An episode is successful only when the agent solves all instructions correctly.
Related Work
  • Hierarchical RL. A number of hierarchical RL approaches are designed to deal with sequential tasks. Typically these have the form of a meta controller and a set of lower-level controllers for subtasks (Sutton et al., 1999; Dietterich, 2000; Parr and Russell, 1997; Ghavamzadeh and Mahadevan, 2003; Konidaris et al., 2012; Konidaris and Barto, 2007). However, much of the previous work assumes that the overall task is fixed (e.g., the Taxi domain (Dietterich, 2000)). In other words, the optimal sequence of subtasks is fixed during evaluation (e.g., picking up a passenger followed by navigating to a destination in the Taxi domain). This makes it hard to evaluate the agent’s ability to compose pre-learned policies to solve previously unseen sequential tasks in a zero-shot fashion unless we re-train the agent on the new tasks in a transfer learning setting (Singh, 1991; 1992; McGovern and Barto, 2002). Our work is also closely related to Programmable HAMs (PHAMs) (Andre and Russell, 2000; 2002) in that a PHAM is designed to execute a given program. However, the program explicitly specifies the policy in PHAMs, which effectively reduces the state-action search space. In contrast, instructions are a description of the task in our work, which means that the agent should learn to use the instructions to maximize its reward.
Funding
  • This work was supported by NSF grant IIS-1526059
References
  • D. Andre and S. J. Russell. Programmable reinforcement learning agents. In NIPS, 2000.
  • D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, 2002.
  • J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. CoRR, abs/1611.01796, 2016.
  • P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In AAAI, 2017.
  • S. R. K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. Reinforcement learning for mapping instructions to actions. In ACL/IJCNLP, 2009.
  • D. L. Chen and R. J. Mooney. Learning to interpret natural language navigation instructions from observations. In AAAI, 2011.
  • J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
  • B. C. da Silva, G. Konidaris, and A. G. Barto. Learning parameterized skills. In ICML, 2012.
  • C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In ICRA, 2017.
  • T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In ICLR, 2017.
  • M. Ghavamzadeh and S. Mahadevan. Hierarchical policy gradient algorithms. In ICML, 2003.
  • A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
  • R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • D. Isele, M. Rostami, and E. Eaton. Using task features for zero-shot knowledge transfer in lifelong learning. In IJCAI, 2016.
  • G. Konidaris and A. G. Barto. Building portable options: Skill transfer in reinforcement learning. In IJCAI, 2007.
  • G. Konidaris, I. Scheidwasser, and A. G. Barto. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13:1333–1371, 2012.
  • J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork RNN. In ICML, 2014.
  • T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint arXiv:1604.06057, 2016.
  • M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI, 2006.
  • A. McGovern and A. G. Barto. Autonomous discovery of temporal abstractions from interaction with an environment. PhD thesis, University of Massachusetts, 2002.
  • H. Mei, M. Bansal, and M. R. Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. arXiv preprint arXiv:1506.04089, 2015.
  • T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
  • J. Oh, V. Chockalingam, S. Singh, and H. Lee. Memory-based control of active perception and action in Minecraft. In ICML, 2016.
  • E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-Mimic: Deep multitask and transfer reinforcement learning. In ICLR, 2016.
  • R. Parr and S. J. Russell. Reinforcement learning with hierarchies of machines. In NIPS, 1997.
  • S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.
  • A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. In ICLR, 2016.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In ICML, 2015.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
  • S. P. Singh. The efficient learning of multiple task sequences. In NIPS, 1991.
  • S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339, 1992.
  • S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus. MazeBase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
  • R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
  • S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
  • S. Tellex, R. A. Knepper, A. Li, D. Rus, and N. Roy. Asking for help using inverse semantics. In RSS, 2014.
  • C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In AAAI, 2017.
  • W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 2015.