Fast Adaptation to New Environments via Policy-Dynamics Value Functions

Max Goldstein
Arthur Szlam

ICML, pp. 7920-7931, 2020.


Abstract:

Standard RL algorithms assume fixed environment dynamics and require a significant amount of interaction to adapt to new environments. We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. PD-VF explicitly estimates the cumulative reward in ...

Introduction
  • Deep reinforcement learning (RL) has achieved impressive results on a wide range of complex tasks (Mnih et al., 2015; Silver et al., 2016; 2017; 2018; Jaderberg et al., 2019; Berner et al., 2019; Vinyals et al., 2019).
  • However, standard RL algorithms assume fixed environment dynamics and require a significant amount of interaction to adapt to new environments: a self-driving car might have to adjust its behavior depending on weather conditions, or a prosthetic control system might have to adapt to a new human.
  • In these cases it is crucial for RL agents to find and execute appropriate policies as quickly as possible.
  • In this work, the authors aim to learn a value function conditioned on elements of a space of policies and tasks, where a “task” is specified by the transition function of the MDP rather than by the reward function (a formulation is sketched in the next bullet).
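  • A formulation sketch, using assumed notation rather than the paper’s own symbols: writing z_pi for a policy embedding and z_d for a dynamics embedding, PD-VF learns a value function over both spaces and selects a policy at test time by maximizing it under the inferred dynamics embedding.

    V(z_\pi, z_d) \approx \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} r_t \,\middle|\, a_t \sim \pi_{z_\pi}(\cdot \mid s_t),\ s_{t+1} \sim P_{z_d}(\cdot \mid s_t, a_t) \right]

    z_\pi^{*} = \arg\max_{z_\pi} V(z_\pi, \hat{z}_d), \qquad \hat{z}_d \text{ inferred from a few test-time transitions}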
Highlights
  • In some cases, our approach is comparable to the PPOenv upper bound, which was directly trained on the respective test environment
  • We propose Policy-Dynamics Value Functions (PD-VF), a novel framework for fast adaptation to new environment dynamics
  • The environment embedding can be inferred from only a few interactions, which allows the selection of a policy that maximizes the learned value function (see the code sketch after this list)
  • PD-VF has a number of desirable properties: it leverages the structure in both the policy and the dynamics spaces to estimate the expected return; it needs only a small number of steps to adapt to unseen dynamics; it does not update any parameters at test time; and it does not require dense rewards or long rollouts to find an effective policy in a new environment
  • As noted by Precup et al. (2001), Sutton et al. (2011), and White et al. (2012), learning about multiple policies in parallel via general value functions can be useful for lifelong learning
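  • A minimal code sketch of this test-time procedure is given below; the interfaces (env_encoder, value_fn, policy_decoder, candidate_policy_embs) are hypothetical names, and a discrete candidate set stands in for whatever optimization over the policy-embedding space the paper performs.

    # Sketch of PD-VF-style test-time adaptation. All module names are assumed,
    # not taken from the paper's code; the classic gym step() API is assumed too.
    import torch

    def adapt(env, env_encoder, value_fn, policy_decoder,
              candidate_policy_embs, n_probe_steps=5):
        # 1) Collect a handful of transitions with an arbitrary probe policy.
        transitions = []
        obs = env.reset()
        for _ in range(n_probe_steps):
            action = env.action_space.sample()
            next_obs, reward, done, _ = env.step(action)
            transitions.append((obs, action, next_obs))
            obs = env.reset() if done else next_obs

        with torch.no_grad():
            # 2) Infer the dynamics embedding from the short rollout;
            #    no parameters are updated at test time.
            z_env = env_encoder(transitions)

            # 3) Score each candidate policy embedding under the inferred
            #    dynamics and keep the one with the highest predicted return.
            scores = torch.stack([value_fn(z_pi, z_env)
                                  for z_pi in candidate_policy_embs])
            best_emb = candidate_policy_embs[scores.argmax().item()]

        # 4) Decode the selected embedding into an executable policy.
        return policy_decoder(best_emb)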
Methods
  • The authors evaluate PD-VF on four continuous control domains, and compare it with an upper bound, four baselines, and four ablations.
  • The authors create a number of environments with different dynamics.
  • The authors split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics (a sketch of this split is given after this list).
  • [Table: comparison of the PPOenv upper bound, PD-VF, RL2, MAML, PPOdyn, and PPOall across the Swimmer, Spaceship, Ant-Wind, and Ant-Legs environments.]
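  • The split itself can be sketched as below, assuming the environment family is indexed by a single dynamics parameter (e.g. a wind direction); the paper's concrete construction may differ, and make_env / param_values are hypothetical.

    # Hypothetical helper: split a parameterized family of environments into
    # training and held-out test sets so that test-time dynamics are unseen.
    import random

    def split_env_family(param_values, make_env, train_fraction=0.8, seed=0):
        rng = random.Random(seed)
        params = list(param_values)
        rng.shuffle(params)
        n_train = int(train_fraction * len(params))
        train_envs = [make_env(p) for p in params[:n_train]]
        test_envs = [make_env(p) for p in params[n_train:]]  # held out at test time
        return train_envs, test_envs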
Results
  • While the strength of PD-VF lies in quickly adapting to new dynamics, its performance on training environments is still comparable to that of the other baselines, as shown in Figure 6.
  • This result is not surprising since current state-of-the-art RL algorithms such as PPO can generally learn good policies for the environments they are trained on, given enough interactions, updates, and the right hyperparameters.
  • Even meta-learning approaches like MAML or RL2 struggle to adapt when they are allowed to use only a short trajectory for updating the policy at test time, as is the case here.
Conclusion
  • Discussion and Future Work: In this work, the authors propose Policy-Dynamics Value Functions (PD-VF), a novel framework for fast adaptation to new environment dynamics.
  • The PD-VF framework can, in principle, be used to evaluate a family of policies and environments on other metrics of interest besides the expected return, such as reward variance, agent prosociality, or deviation from expert behavior.
  • Another interesting direction is to integrate additional constraints into the optimization problem.
  • PD-VF can also be applied to multi-agent settings, adapting to different opponents or teammates whose behaviors determine the environment dynamics.
Objectives
  • The authors aim to learn a value function conditioned on elements of a space of policies and tasks, where a “task” is specified by the transition function of the MDP rather than by the reward function.
  • The authors aim to design an approach that can quickly find a good policy in an environment with new and unknown dynamics, after being trained on a family of environments with related dynamics.
Funding
  • Roberta and Max were supported by the DARPA L2M grant.
Reference
  • Ammar, H. B., Tuyls, K., Taylor, M. E., Driessens, K., and Weiss, G. Reinforcement learning transfer via sparse coding. In Proceedings of the 11th international conference on autonomous agents and multiagent systems, volume 1, pp. 383–390. International Foundation for Autonomous Agents and Multiagent Systems..., 2012.
  • Ammar, H. B., Eaton, E., Taylor, M. E., Mocanu, D. C., Driessens, K., Weiss, G., and Tuyls, K. An automated measure of mdp similarity for transfer in reinforcement learning. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pp. 166–175, 2017.
  • Arnekvist, I., Kragic, D., and Stork, J. A. Vpe: Variational policy embedding for transfer reinforcement learning. 2019 International Conference on Robotics and Automation (ICRA), pp. 36–42, 2018.
  • Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.
  • Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Jozefowicz, R., Gray, S., Olsson, C., Pachocki, J. W., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. ArXiv, abs/1912.06680, 2019.
  • Borsa, D., Graepel, T., and Shawe-Taylor, J. Learning shared representations in multi-task reinforcement learning. arXiv preprint arXiv:1603.02041, 2016.
  • Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., van Hasselt, H., Silver, D., and Schaul, T. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018.
  • Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In ICML, 2018.
  • Cully, A., Clune, J., Tarapore, D., and Mouret, J.-B. Robots that can adapt like animals. Nature, 521:503–507, 2015.
  • Da Silva, B., Konidaris, G., and Barto, A. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.
  • Devin, C., Gupta, A., Darrell, T., Abbeel, P., and Levine, S. Learning modular neural network policies for multi-task and multi-robot transfer. 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176, 2016.
  • Doshi-Velez, F. and Konidaris, G. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. IJCAI: proceedings of the conference, 2016:1432–1440, 2013.
  • Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2: Fast reinforcement learning via slow reinforcement learning. ArXiv, abs/1611.02779, 2016.
  • Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. In NIPS, 2017.
  • Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
  • Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.
  • Hansen, S., Dabney, W., Barreto, A., Van de Wiele, T., Warde-Farley, D., and Mnih, V. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.
  • Hausman, K., Springenberg, J. T., Wang, Z., Heess, N. M. O., and Riedmiller, M. A. Learning an embedding space for transferable robot skills. In ICLR, 2018.
  • He, Z., Julian, R., Heiden, E., Zhang, H., Schaal, S., Lim, J. J., Sukhatme, G., and Hausman, K. Zero-shot skill composition and simulation-to-real transfer by learning task representations. arXiv preprint arXiv:1810.02422, 2018.
  • Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
  • Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M. M., Blundell, C., and Lerchner, A. Darla: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
  • Houthooft, R., Chen, R. Y., Isola, P., Stadie, B. C., Wolski, F., Ho, J., and Abbeel, P. Evolved policy gradients. ArXiv, abs/1802.04821, 2018.
  • Humplik, J., Galashov, A., Hasenclever, L., Ortega, P. A., Teh, Y. W., and Heess, N. Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424, 2019.
  • Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • Killian, T. W., Konidaris, G., and Doshi-Velez, F. Robust and efficient transfer learning with hidden parameter markov decision processes. Advances in neural information processing systems, 30:6250–6261, 2017.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
  • Madjiheurem, S. and Toni, L. State2vec: Off-policy successor features approximators. arXiv preprint arXiv:1910.10277, 2019.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533, 2015.
  • Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
  • Oh, J., Singh, S., Lee, H., and Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
  • Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
  • Paul, S., Osborne, M. A., and Whiteson, S. Fingerprint policy optimisation for robust reinforcement learning. In ICML, 2018.
  • Perez, C. F., Such, F. P., and Karaletsos, T. Efficient transfer learning and online adaptation with latent variable models for continuous control. ArXiv, abs/1812.03399, 2018.
  • Petangoda, J. C., Pascual-Diaz, S., Adam, V., Vrancx, P., and Grau-Moya, J. Disentangled skill embeddings for reinforcement learning. ArXiv, abs/1906.09223, 2019.
  • Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In ICML, 2017.
  • Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.
  • Raileanu, R. and Rocktaschel, T. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. ArXiv, abs/2002.12292, 2020.
  • Rajeswaran, A., Lowrey, K., Todorov, E. V., and Kakade, S. M. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561, 2017.
  • Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331–5340, 2019.
  • Sæmundsson, S., Hofmann, K., and Deisenroth, M. P. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.
  • Sahni, H., Kumar, S., Tejani, F., and Isbell, C. Learning to compose skills. arXiv preprint arXiv:1711.11289, 2017.
  • Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • Siriwardhana, S., Weerasakera, R., Matthies, D. J., and Nanayakkara, S. Vusfa: Variational universal successor features approximator to improve transfer drl for target driven visual navigation. arXiv preprint arXiv:1908.06376, 2019.
  • Song, X., Jiang, Y., Tu, S., Du, Y., and Neyshabur, B. Observational overfitting in reinforcement learning. ArXiv, abs/1912.02975, 2020.
  • Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 2011.
  • Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
  • Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
  • Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
  • van der Maaten, L. and Hinton, G. E. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575 (7782):350–354, 2019.
  • Wang, J. X., Kurth-Nelson, Z., Soyer, H., Leibo, J. Z., Tirumala, D., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. M. Learning to reinforcement learn. ArXiv, abs/1611.05763, 2016.
  • Wang, Z., Merel, J., Reed, S. E., de Freitas, N., Wayne, G., and Heess, N. M. O. Robust imitation of diverse behaviors. In NIPS, 2017.
  • White, A., Modayil, J., and Sutton, R. S. Scaling lifelong off-policy learning. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–6. IEEE, 2012.
  • Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127, 2011.
  • Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning. In NeurIPS, 2018.
  • Yang, J., Petersen, B., Zha, H., and Faissol, D. Single episode policy transfer in reinforcement learning. arXiv preprint arXiv:1910.07719, 2019.
  • Yao, J., Killian, T. W., Konidaris, G., and Doshi-Velez, F. Direct policy transfer via hidden parameter markov decision processes. 2018.
  • Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018a.
  • Zhang, A., Satija, H., and Pineau, J. Decoupling dynamics and reward for transfer learning. ArXiv, abs/1804.10689, 2018b.
  • Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018c.
  • Zhang, J., Springenberg, J. T., Boedecker, J., and Burgard, W. Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2371–2378. IEEE, 2017.
  • Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In ICML, 2018.