Policy Improvement via Imitation of Multiple Oracles

NeurIPS 2020.


Abstract:

Despite its promise, reinforcement learning's real-world adoption has been hampered by its need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an expert policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner ...

Introduction
  • Reinforcement learning (RL) promises to bring self-improving decision-making capability to many applications, including robotics [1], computer systems [2], recommender systems [3] and user interfaces [4].
  • The authors show that the issues mentioned above can be addressed by performing policy improvement upon the state-wise maximum over the experts’ values, i.e. the max-aggregated baseline (see the sketch after this list).
  • To compete with f^max yet without the assumption above, the authors design an IL algorithm by a reduction to online learning [29], a technique used in many prior works in the single-expert setting [12, 13, 14, 15, 16, 30].
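The max-aggregated baseline is the state-wise maximum over the experts' (estimated) value functions, f^max(s) = max_k V^k(s). Below is a minimal sketch of this aggregation, assuming the learner holds a list of fitted expert value estimators; the names `max_aggregated_baseline` and `expert_values` are illustrative and not taken from the paper's code.

```python
import numpy as np

def max_aggregated_baseline(state, expert_values):
    """Return f^max(s) = max_k V^k(s), the state-wise maximum over
    the experts' value estimates."""
    # expert_values: list of callables, each mapping a state to a scalar
    # estimate of that expert's value at the state.
    return max(v(state) for v in expert_values)

# Toy usage with two crude "expert" value estimates on a 1-D state.
experts = [
    lambda s: -float(np.sum(s ** 2)),          # expert 1: prefers states near 0
    lambda s: -float(np.sum((s - 1.0) ** 2)),  # expert 2: prefers states near 1
]
s = np.array([0.3])
print(max_aggregated_baseline(s, experts))     # expert 1 wins here: about -0.09
```

In the paper these value estimates are not assumed to be given exactly; they are learned from data, which is why the Results section notes that the learner must spend time learning the experts' value functions.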
Highlights
  • Reinforcement learning (RL) promises to bring self-improving decision-making capability to many applications, including robotics [1], computer systems [2], recommender systems [3] and user interfaces [4]
  • Deploying RL in any of these domains is fraught with numerous difficulties, as vanilla RL agents need to do a large amount of trial-and-error exploration before discovering good decision policies [5]
  • Summary: We conclude this paper by revisiting Fig. 1, which showcases the best multi-expert settings in Figs. 2c and 7a together with an additional heuristic extension of MAMBA that replaces the max-aggregated baseline f^max with a mean-aggregated baseline. Overall these results support the benefits of imitation learning (IL) from multiple experts and the new generalized advantage estimation (GAE)-style IL gradient
  • While our current theoretical results do not fully explain this phenomenon, a plausible hypothesis is the use of the GAE-style gradient and the fact that we average values instead of actions
  • We study how the conflicts between different experts can be resolved through the max-aggregated baseline and propose a new GAE-style gradient for the IL setting, which can be used to improve the robustness and performance of existing single-expert IL algorithms (a sketch of this gradient follows this list)
  • The experimental results show that MAMBA is able to improve upon multiple, very suboptimal expert policies to achieve optimal performance faster than both the pure RL method (PG-GAE [17]) and the single-expert IL algorithm (AggreVaTeD [14])
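The GAE-style IL gradient referred to above reweights multi-step advantages that are measured against the aggregated baseline f rather than against the learner's own value function. The following is a rough sketch of the λ-weighted advantage computation along a single rollout, in the spirit of GAE [17]; the inputs `rewards`, `f_values`, `gamma`, and `lam` are assumed, and this is an illustration rather than the paper's exact estimator (14).

```python
import numpy as np

def lambda_advantages(rewards, f_values, gamma=0.99, lam=0.95):
    """Lambda-weighted (GAE-style) advantages computed against a baseline f.

    rewards:  array of shape (T,)   -- rewards r_t along one trajectory
    f_values: array of shape (T+1,) -- baseline f(s_t) at the T+1 visited states,
                                       e.g. the max-aggregated expert values f^max(s_t)
    """
    T = len(rewards)
    # One-step TD errors measured against f rather than the learner's own value.
    deltas = rewards + gamma * f_values[1:] - f_values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # backward lambda-weighted accumulation
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

The heuristic extension mentioned above only changes how `f_values` is produced (a mean over the experts' values instead of the max); the λ-weighted recursion itself is unchanged.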
Results
  • Ideal setting with known values: if the MDP dynamics and rewards are unknown, the authors can treat d^{π_n} as the adversary in online learning and define the online loss in the n-th round as ℓ_n(π) := −T E_{s∼d^{π_n}}[A^max(s, π)] (see the schematic after this list).
  • When π^max ∈ Π, running a no-regret algorithm to solve this online learning problem will guarantee producing a policy whose performance is at least E_{s∼d_0}[max_{k∈[K]} V^k(s)] + Δ_N + o(N) after N rounds.
  • The above reduction in (7) generalizes AggreVaTe [13] from using f = V^{π_e} in A_f to define the online loss for the single-expert case to f = f^max, which is applicable to multiple experts.
  • The authors compare MAMBA with two representative algorithms: GAE Policy Gradient [17] for direct RL and AggreVaTeD [14] for IL with a single expert.
  • Because the authors can view these algorithms as different first-order oracles for policy optimization, comparing their performance allows them to study two important questions: 1) whether the proposed GAE-style gradient in (14) is an effective update direction for IL and 2) whether using multiple experts helps the agent learn faster.
  • This is because DoubleInvertedPendulum is a harder domain for exploration than CartPole; while using more experts can potentially yield higher performance, the learner needs to spend more time learning the experts’ value functions.
  • The authors found that replacing the max-aggregated baseline with the mean-aggregated baseline in MAMBA can still improve upon the results of AggreVaTeD, which uses the single best expert.
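For reference, the reduction above turns each round into a surrogate-loss minimization: the previous policy's state distribution d^{π_n} is treated as fixed by the adversary, and the learner minimizes ℓ_n(π) = −T E_{s∼d^{π_n}}[A^max(s, π)]. The following schematic of a single round is a hedged sketch under assumed inputs; `states` and `A_max` stand in for quantities the paper estimates from rollouts.

```python
import numpy as np

def round_loss(policy, states, A_max, horizon):
    """Monte-Carlo estimate of the n-th round online loss
    ell_n(policy) = -T * E_{s ~ d^{pi_n}}[A^max(s, policy)].

    states:  states sampled from the previous policy's distribution d^{pi_n}
    A_max:   callable (state, policy) -> advantage of `policy` at `state`,
             measured against the max-aggregated baseline f^max
    horizon: problem horizon T
    """
    return -horizon * float(np.mean([A_max(s, policy) for s in states]))

# A no-regret online learner (e.g. gradient descent on the policy parameters)
# is then run on the sequence of losses ell_1, ell_2, ... produced this way.
```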
Conclusion
  • As long as most experts have similar values in a state, the mean-aggregated baseline can still provide a meaningful direction for policy improvement.
  • The authors study how the conflicts between different experts can be resolved through the max-aggregated baseline and propose a new GAE-style gradient for the IL setting, which can be used to improve the robustness and performance of existing single-expert IL algorithms.
  • The experimental results show that MAMBA is able to improve upon multiple, very suboptimal expert policies to achieve optimal performance faster than both the pure RL method (PG-GAE [17]) and the single-expert IL algorithm (AggreVaTeD [14]).
References
  • Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
  • Nguyen Cong Luong, Dinh Thai Hoang, Shimin Gong, Dusit Niyato, Ping Wang, Ying-Chang Liang, and Dong In Kim. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21, 2019.
  • Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, 2017.
  • Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018.
  • Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018.
  • Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018. ISSN 1935-8261.
  • Dean Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1989.
  • Pieter Abbeel and Andrew Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.
  • Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.
  • Chelsea Finn and Sergey Levine. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 2016.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.
  • Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
  • Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pages 3309–3318. JMLR.org, 2017.
  • Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Hal Daumé III. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.
  • Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018.
  • John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • Jung-Yeon Baek, Georges Kaddoum, Sahil Garg, Kuljeet Kaur, and Vivianne Gravel. Managing fog networks using reinforcement learning based load balancing algorithm. In IEEE Wireless Communications and Networking Conference, 2019.
  • Guohao Li, Matthias Mueller, Vincent Casser, Neil Smith, Dominik L. Michels, and Bernard Ghanem. OIL: Observational imitation learning. In Robotics: Science and Systems, 2018.
  • Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, 2017.
  • Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg. AC-Teach: A Bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In Conference on Robot Learning, 2019.
  • Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, pages 1801–1809, 2018.
  • Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
  • Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
  • Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, volume 99, pages 278–287, 1999.
  • Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
  • Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
  • Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003.
  • Ching-An Cheng, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Accelerating imitation learning with predictive models. In International Conference on Artificial Intelligence and Statistics, pages 3187–3196, 2019.
  • Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. In International Conference on Learning Representations, 2018.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Jeongseok Lee, Michael Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha Srinivasa, Mike Stilman, and C. Karen Liu. DART: Dynamic animation and robotics toolkit. Journal of Open Source Software, 3(22):500, 2018.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, 2017.
  • Aviv Tamar, Khashayar Rohanimanesh, Yinlam Chow, Chris Vigorito, Ben Goodrich, Michael Kahane, and Derik Pridmore. Imitation learning from visual data with multiple intentions. In International Conference on Learning Representations, 2018.
  • Michael Gimelfarb, Scott Sanner, and Chi-Guhn Lee. Reinforcement learning with multiple experts: A Bayesian model combination approach. In Advances in Neural Information Processing Systems, 2018.
  • Ching-An Cheng, Xinyan Yan, Nathan Ratliff, and Byron Boots. Predictor-corrector policy optimization. In International Conference on Machine Learning, pages 1151–1161, 2019.