# Fast Adaptation to New Environments via Policy-Dynamics Value Functions

ICML, pp. 7920-7931, 2020.

Abstract:

Standard RL algorithms assume fixed environment dynamics and require a significant amount of interaction to adapt to new environments. We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. PD-VF explicitly estimates the cumulative reward in …


Introduction

- Deep reinforcement learning (RL) has achieved impressive results on a wide range of complex tasks (Mnih et al., 2015; Silver et al., 2016; 2017; 2018; Jaderberg et al., 2019; Berner et al., 2019; Vinyals et al., 2019).
- A self-driving car might have to adjust its behavior depending on weather conditions, or a prosthetic control system might have to adapt to a new human.
- In these cases it is crucial for RL agents to find and execute appropriate policies as quickly as possible.
- In this work, the authors aim to learn a value function conditioned on elements of a space of policies and tasks; here, a “task” is specified by the transition function of the MDP rather than the reward function.

Highlights

- Deep reinforcement learning (RL) has achieved impressive results on a wide range of complex tasks (Mnih et al., 2015; Silver et al., 2016; 2017; 2018; Jaderberg et al., 2019; Berner et al., 2019; Vinyals et al., 2019).
- In some cases, our approach is comparable to the PPOenv upper bound, which was directly trained on the respective test environment.
- We propose policy-dynamics value functions (PD-VF), a novel framework for fast adaptation to new environment dynamics.
- The environment embedding can be inferred from only a few interactions, which allows the selection of a policy that maximizes the learned value function.
- PD-VF has a number of desirable properties: it leverages the structure in both the policy and the dynamics space to estimate the expected return, it needs only a small number of steps to adapt to unseen dynamics, it does not update any parameters at test time, and it does not require dense rewards or long rollouts to find an effective policy in a new environment.
- As noted by Precup et al. (2001), Sutton et al. (2011), and White et al. (2012), learning about multiple policies in parallel via general value functions can be useful for lifelong learning.
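The adaptation procedure described in the highlights (infer an environment embedding from a few interactions, then select the policy that maximizes the learned value function, with no parameter updates at test time) can be sketched as follows. This is a minimal illustration; the function names and embedding representations are assumptions, not the paper's actual code.

```python
# Hedged sketch of PD-VF-style test-time adaptation: given a learned value
# model V(z_pi, z_env), infer the environment embedding from a short rollout,
# then pick the candidate policy embedding with the highest predicted return.

def select_policy(value_fn, infer_env_embedding, candidate_policies, transitions):
    """Pick the policy embedding with the highest predicted return.

    value_fn: maps (policy_embedding, env_embedding) -> scalar return estimate
    infer_env_embedding: maps a short list of transitions -> env embedding
    candidate_policies: iterable of policy embeddings to choose among
    transitions: a few (state, action, next_state) tuples from the new env
    """
    z_env = infer_env_embedding(transitions)
    # No gradient updates at test time: evaluate each candidate and take argmax.
    return max(candidate_policies, key=lambda z_pi: value_fn(z_pi, z_env))


# Toy usage: embeddings are 2-d tuples, the "value function" is a dot product,
# and inference just returns a fixed embedding (both stand-ins for learned models).
toy_value = lambda z_pi, z_env: z_pi[0] * z_env[0] + z_pi[1] * z_env[1]
toy_infer = lambda trans: (1.0, 0.0)
best = select_policy(toy_value, toy_infer, [(0.2, 0.9), (0.8, 0.1)], [])
# best == (0.8, 0.1): highest predicted return under z_env = (1.0, 0.0)
```

Note that the selection step is a maximization over a known candidate set here; the paper optimizes over a continuous policy-embedding space, which this sketch deliberately simplifies.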

Methods

- The authors evaluate PD-VF on four continuous control domains, and compare it with an upper bound, four baselines, and four ablations.
- The authors create a number of environments with different dynamics.
- The authors split the set of environments into training and test subsets, so that at test time, the agent has to find a policy that behaves well on unseen dynamics.
- [Results figure residue: PD-VF is compared against PPOenv, RL², MAML, PPOdyn, and PPOall on the Swimmer, Spaceship, Ant-Wind, and Ant-Legs environments; reported scores include Ant-Wind mean 695 (SD 291), Spaceship mean 862 (SD 18), and Ant-Legs mean 374 (SD 52).]
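As a toy illustration of how an environment with unseen dynamics might be summarized from a short rollout: the paper learns this encoder jointly with the value function, so the hand-coded feature below (mean per-dimension state change) is purely an assumption used to keep the sketch self-contained.

```python
# Illustrative stand-in for an environment encoder: embed a new environment
# by averaging the observed state deltas over a few transitions. A real
# PD-VF-style encoder would be learned, not hand-coded like this.

def embed_environment(transitions):
    """Map a short rollout [(state, action, next_state), ...] to an embedding.

    States are equal-length tuples of floats; the embedding is the mean
    per-dimension state change, a crude summary of the dynamics (e.g. a
    constant wind shows up as a consistent drift along one axis).
    """
    if not transitions:
        raise ValueError("need at least one transition to infer dynamics")
    dim = len(transitions[0][0])
    totals = [0.0] * dim
    for state, _action, next_state in transitions:
        for i in range(dim):
            totals[i] += next_state[i] - state[i]
    return tuple(t / len(transitions) for t in totals)


# Two transitions that both drift in +x (as a wind pushing the agent might):
z_env = embed_environment([((0.0, 0.0), None, (1.0, 0.0)),
                           ((1.0, 0.0), None, (3.0, 0.0))])
# z_env == (1.5, 0.0): average drift of +1.5 along x, none along y
```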

Results

- While the strength of PD-VF lies in quickly adapting to new dynamics, its performance on training environments is still comparable to that of the other baselines, as shown in Figure 6.
- This result is not surprising, since current state-of-the-art RL algorithms such as PPO can generally learn good policies for the environments they are trained on, given enough interactions, updates, and the right hyperparameters.
- Even meta-learning approaches like MAML or RL² struggle to adapt when they are allowed to use only a short trajectory for updating the policy at test time, as is the case here.

Conclusion

**Discussion and Future Work**

- In this work, the authors propose policy-dynamics value functions (PD-VF), a novel framework for fast adaptation to new environment dynamics.
- The PD-VF framework can, in principle, be used to evaluate a family of policies and environments on metrics of interest besides the expected return, such as reward variance, agent prosociality, or deviation from expert behavior.
- Another interesting direction is to integrate additional constraints into the optimization problem.
- PD-VF can be applied to multi-agent settings for adapting to different opponents or teammates whose behaviors determine the environment dynamics.

Summary

## Objectives:

- The authors aim to learn a value function conditioned on elements of a space of policies and tasks, where a “task” is specified by the transition function of the MDP rather than the reward function.
- The authors aim to design an approach that can quickly find a good policy in an environment with new and unknown dynamics, after being trained on a family of environments with related dynamics.

Related work

- Our work draws inspiration from multiple research areas such as transfer learning (Taylor & Stone, 2009; Higgins et al., 2017), skill and task embedding (Devin et al., 2016; Zhang et al., 2018b; Hausman et al., 2018; Petangoda et al., 2019), and general value functions (Precup et al., 2001; Sutton et al., 2011; White et al., 2012).

Multi-Task and Transfer Learning. Taylor & Stone (2009) present an overview of transfer learning methods in RL. A popular approach for transfer in RL is multi-task learning (Taylor & Stone, 2009; Teh et al., 2017), a paradigm in which an agent is trained on a family of related tasks. By simultaneously learning about different tasks, the agent can exploit their common structure, which can lead to faster learning and better generalization to unseen tasks from the same family (Taylor & Stone, 2009; Lazaric, 2012; Ammar et al., 2012; 2014; Parisotto et al., 2015; Borsa et al., 2016; Gupta et al., 2017; Andreas et al., 2017; Oh et al., 2017; Hessel et al., 2019). A large body of work has been inspired by the Horde architecture (Sutton et al., 2011), which consists of a number of independent RL agents with different policies and goals. Each agent is tasked with estimating the value function of a particular policy on a given task, thus collectively representing knowledge about the world. Building on these ideas, other methods leverage the shared dynamics of the tasks (Barreto et al., 2017; Zhang et al., 2017; Madjiheurem & Toni, 2019) or the similarity among value functions and the associated optimal policies (Schaul et al., 2015; Borsa et al., 2018; Hansen et al., 2019; Siriwardhana et al., 2019). However, all these approaches assume the same underlying transition function for all tasks. In contrast, we focus on transferring knowledge across tasks with different dynamics.

Funding

- Roberta and Max were supported by the DARPA L2M grant.

References

- Ammar, H. B., Tuyls, K., Taylor, M. E., Driessens, K., and Weiss, G. Reinforcement learning transfer via sparse coding. In Proceedings of the 11th international conference on autonomous agents and multiagent systems, volume 1, pp. 383–390. International Foundation for Autonomous Agents and Multiagent Systems..., 2012.
- Ammar, H. B., Eaton, E., Taylor, M. E., Mocanu, D. C., Driessens, K., Weiss, G., and Tuyls, K. An automated measure of mdp similarity for transfer in reinforcement learning. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
- Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pp. 166–175, 2017.
- Arnekvist, I., Kragic, D., and Stork, J. A. Vpe: Variational policy embedding for transfer reinforcement learning. 2019 International Conference on Robotics and Automation (ICRA), pp. 36–42, 2018.
- Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.
- Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Jozefowicz, R., Gray, S., Olsson, C., Pachocki, J. W., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. ArXiv, abs/1912.06680, 2019.
- Borsa, D., Graepel, T., and Shawe-Taylor, J. Learning shared representations in multi-task reinforcement learning. arXiv preprint arXiv:1603.02041, 2016.
- Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., van Hasselt, H., Silver, D., and Schaul, T. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018.
- Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In ICML, 2018.
- Cully, A., Clune, J., Tarapore, D., and Mouret, J.-B. Robots that can adapt like animals. Nature, 521:503–507, 2015.
- Da Silva, B., Konidaris, G., and Barto, A. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.
- Devin, C., Gupta, A., Darrell, T., Abbeel, P., and Levine, S. Learning modular neural network policies for multi-task and multi-robot transfer. 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176, 2016.
- Doshi-Velez, F. and Konidaris, G. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. IJCAI: proceedings of the conference, 2016:1432–1440, 2013.
- Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2: Fast reinforcement learning via slow reinforcement learning. ArXiv, abs/1611.02779, 2016.
- Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. Oneshot imitation learning. In NIPS, 2017.
- Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
- Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.
- Hansen, S., Dabney, W., Barreto, A., Van de Wiele, T., Warde-Farley, D., and Mnih, V. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.
- Hausman, K., Springenberg, J. T., Wang, Z., Heess, N. M. O., and Riedmiller, M. A. Learning an embedding space for transferable robot skills. In ICLR, 2018.
- He, Z., Julian, R., Heiden, E., Zhang, H., Schaal, S., Lim, J. J., Sukhatme, G., and Hausman, K. Zero-shot skill composition and simulation-to-real transfer by learning task representations. arXiv preprint arXiv:1810.02422, 2018.
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
- Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M. M., Blundell, C., and Lerchner, A. Darla: Improving zero-shot transfer in reinforcement learning. In ICML, 2017.
- Houthooft, R., Chen, R. Y., Isola, P., Stadie, B. C., Wolski, F., Ho, J., and Abbeel, P. Evolved policy gradients. ArXiv, abs/1802.04821, 2018.
- Humplik, J., Galashov, A., Hasenclever, L., Ortega, P. A., Teh, Y. W., and Heess, N. Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424, 2019.
- Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3d multiplayer games with populationbased reinforcement learning. Science, 364(6443):859– 865, 2019.
- Killian, T. W., Konidaris, G., and Doshi-Velez, F. Robust and efficient transfer learning with hidden parameter markov decision processes. Advances in neural information processing systems, 30:6250–6261, 2017.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
- Madjiheurem, S. and Toni, L. State2vec: Off-policy successor features approximators. arXiv preprint arXiv:1910.10277, 2019.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533, 2015.
- Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
- Oh, J., Singh, S., Lee, H., and Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
- Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
- Paul, S., Osborne, M. A., and Whiteson, S. Fingerprint policy optimisation for robust reinforcement learning. In ICML, 2018.
- Perez, C. F., Such, F. P., and Karaletsos, T. Efficient transfer learning and online adaptation with latent variable models for continuous control. ArXiv, abs/1812.03399, 2018.
- Petangoda, J. C., Pascual-Diaz, S., Adam, V., Vrancx, P., and Grau-Moya, J. Disentangled skill embeddings for reinforcement learning. ArXiv, abs/1906.09223, 2019.
- Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In ICML, 2017.
- Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.
- Raileanu, R. and Rocktaschel, T. Ride: Rewarding impactdriven exploration for procedurally-generated environments. ArXiv, abs/2002.12292, 2020.
- Rajeswaran, A., Lowrey, K., Todorov, E. V., and Kakade, S. M. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561, 2017.
- Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331–5340, 2019.
- Sæmundsson, S., Hofmann, K., and Deisenroth, M. P. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.
- Sahni, H., Kumar, S., Tejani, F., and Isbell, C. Learning to compose skills. arXiv preprint arXiv:1711.11289, 2017.
- Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
- Siriwardhana, S., Weerasakera, R., Matthies, D. J., and Nanayakkara, S. Vusfa: Variational universal successor features approximator to improve transfer drl for target driven visual navigation. arXiv preprint arXiv:1908.06376, 2019.
- Song, X., Jiang, Y., Tu, S., Du, Y., and Neyshabur, B. Observational overfitting in reinforcement learning. ArXiv, abs/1912.02975, 2020.
- Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 2011.
- Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
- Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
- Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
- van der Maaten, L. and Hinton, G. E. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
- Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575 (7782):350–354, 2019.
- Wang, J. X., Kurth-Nelson, Z., Soyer, H., Leibo, J. Z., Tirumala, D., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. M. Learning to reinforcement learn. ArXiv, abs/1611.05763, 2016.
- Wang, Z., Merel, J., Reed, S. E., de Freitas, N., Wayne, G., and Heess, N. M. O. Robust imitation of diverse behaviors. In NIPS, 2017.
- White, A., Modayil, J., and Sutton, R. S. Scaling lifelong off-policy learning. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–6. IEEE, 2012.
- Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127, 2011.
- Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning. In NeurIPS, 2018.
- Yang, J., Petersen, B., Zha, H., and Faissol, D. Single episode policy transfer in reinforcement learning. arXiv preprint arXiv:1910.07719, 2019.
- Yao, J., Killian, T. W., Konidaris, G., and Doshi-Velez, F. Direct policy transfer via hidden parameter markov decision processes. 2018.
- Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018a.
- Zhang, A., Satija, H., and Pineau, J. Decoupling dynamics and reward for transfer learning. ArXiv, abs/1804.10689, 2018b.
- Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018c.
- Zhang, J., Springenberg, J. T., Boedecker, J., and Burgard, W. Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2371–2378. IEEE, 2017.
- Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In ICML, 2018.
