# Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

NeurIPS 2020

## Abstract

We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps. [...]

## Introduction

- In settings where the reward signal is not informative enough — e.g., too sparse, time-varying or even absent — a reinforcement learning (RL) agent needs to explore the environment driven by objectives other than reward maximization, see [e.g., 2, 3, 4, 5].
- The authors strengthen the objective of incremental exploration and require the agent to learn ε-optimal goal-conditioned policies for any L-controllable state.
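The strengthened objective above can be sketched as follows (notation adapted for this summary and may differ from the paper's; $\mathcal{S}_L$ denotes the set of incrementally $L$-controllable states and $v^{\pi}(s_0 \to s)$ the expected number of steps policy $\pi$ takes to reach $s$ from $s_0$):

```latex
% Sketch of the strengthened objective: for every incrementally
% L-controllable state s, the returned goal-conditioned policy \pi_s
% must be \epsilon-optimal among policies restricted on S_L:
\forall s \in \mathcal{S}_L : \quad
  v^{\pi_s}(s_0 \to s) \;\le\; \min_{\pi} v^{\pi}(s_0 \to s) + \epsilon
```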

## Highlights

- While the authors primarily focus the analysis of DisCo on the tabular case, they believe that the formal definition of Autonomous eXploration (AX) problems and the general structure of DisCo may serve as a theoretical grounding for many recent approaches to unsupervised exploration.
- Go-Explore assumes that the world is deterministic and resettable, meaning that one can reset the state of the simulator to a previous visit to that cell
- Very recently [14], the same authors proposed a way to relax this requirement by training goal-conditioned policies to reliably return to cells in the archive during the exploration phase
- Interesting directions for future investigation include: 1) Deriving a lower bound for the AX problems; 2) Integrating DisCo into the meta-algorithm MNM [31] which deals with incremental exploration for AXL in non-stationary environments; 3) Extending the problem to continuous state space and function approximation; 4) Relaxing the definition of incrementally controllable states and relaxing the performance definition towards allowing the agent to have a non-zero but limited sample complexity of learning a shortest-path policy for any state at test time

## Results

- AXL is the original objective introduced in [1]: it requires the agent to discover all the incrementally L-controllable states as fast as possible. At the end of the learning process, for each state s the agent should return a policy that can reach s from s0 in at most L + ε steps.
- The algorithm proceeds through rounds, which are indexed by k and incremented whenever a state in U gets transferred to the set K, i.e., when the transition model reaches a level of accuracy sufficient to compute a policy to control one of the states encountered before.
- The algorithm considers a restricted set Wk ⊆ Uk: if the estimated probability p̂k of reaching a state s ∈ Uk from any of the controllable states in Kk is lower than (1 − ε/2)/L, then no shortest-path policy restricted on Kk could get to s from s0 in less than L + ε steps on average.
- Since the true transition model is unknown, the agent runs optimistic value iteration for SSP (OVISSP) [25, 26] to obtain, for each candidate state s', a value function ũs' and its associated greedy policy π̃s' restricted on Kk. The agent then chooses the candidate goal state s† for which the value ũ† := ũs†(s0) is smallest.
- The algorithm terminates and, using the current estimates of the model, recomputes an optimistic shortest-path policy πs restricted on the final set KK for each state s ∈ KK.
- The expansion of the set of so-far controllable states may alter and refine the optimal goal-reaching policies restricted on it.
- The better dependency on ε both improves the quality of the output goal-reaching policies and reduces the number of incrementally (L + ε)-controllable states returned by the algorithm.
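The round structure described in the bullets above can be sketched on a toy deterministic chain MDP. All names (`K`, `U`, `W`, the promotion rule) are illustrative, not the authors' implementation: in DisCo the reaching probabilities are estimated from samples and goals are scored by optimistic SSP value iteration, whereas here the transition probabilities are known exactly and distances are computed directly.

```python
# Toy sketch of the DisCo round structure: expand the set K of controllable
# states one state per round, filtering candidates by the (1 - eps/2)/L rule.

L, eps = 3, 0.2
threshold = (1 - eps / 2) / L           # candidate-filtering threshold from the text

# Deterministic chain 0 -> 1 -> 2 -> ...: the only action moves s to s + 1.
n_states = 6
def p(s_next, s):                       # known transition probabilities (no estimation here)
    return 1.0 if s_next == s + 1 else 0.0

K = {0}                                 # controllable states (start state s0 = 0)
dist = {0: 0}                           # steps from s0, restricted to policies on K
round_k = 0                             # round index k

while True:
    # U: states encountered (reachable in one step from K) but not yet controllable.
    U = {s for s in range(n_states) if s not in K
         and any(p(s, k) > 0 for k in K)}
    # W: candidates whose reaching probability clears the threshold and whose
    # distance through K stays within the L-step budget.
    W = {s for s in U
         if max(p(s, k) for k in K) >= threshold
         and min(dist[k] for k in K if p(s, k) > 0) + 1 <= L}
    if not W:
        break                           # no viable candidate left: terminate
    # Promote the candidate with the smallest estimated distance (the goal s†).
    s_dag = min(W, key=lambda s: min(dist[k] for k in K if p(s, k) > 0) + 1)
    dist[s_dag] = min(dist[k] for k in K if p(s_dag, k) > 0) + 1
    K.add(s_dag)
    round_k += 1                        # a round ends when a state moves U -> K

print(sorted(K))                        # -> [0, 1, 2, 3]
```

With L = 3, states 1, 2, 3 (at distances 1, 2, 3 from s0) are promoted over three rounds, while state 4 fails the L-step budget and the loop terminates, mirroring how the set of incrementally L-controllable states is discovered.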

## Conclusion

- The authors investigated the theoretical dimension of this direction, by provably learning such goal-conditioned policies for the set of incrementally controllable states.


## References

- Shiau Hong Lim and Peter Auer. Autonomous exploration for navigating in MDPs. In Conference on Learning Theory, pages 40.1–40.24, 2012.
- Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227, 1991.
- Nuttapong Chentanez, Andrew G. Barto, and Satinder P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
- Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606. Cognitive Science Society, 2009.
- Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.
- Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.
- Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
- Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires, Jean-Bastien Grill, Florent Altché, and Rémi Munos. World discovery models. arXiv preprint arXiv:1902.07685, 2019.
- Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
- Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
- Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giró-i-Nieto, and Jordi Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In International Conference on Machine Learning, 2020.
- Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. First return then explore. arXiv preprint arXiv:2004.12919, 2020.
- Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pages 1515–1528, 2018.
- David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In International Conference on Learning Representations, 2019.
- Vitchyr H. Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-Fit: State-covering self-supervised reinforcement learning. In International Conference on Machine Learning, 2020.
- Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681–2691, 2019.
- Jean Tarbouriech and Alessandro Lazaric. Active exploration in Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 974–982, 2019.
- Wang Chi Cheung. Exploration-exploitation trade-off in reinforcement learning on online Markov decision processes with global concave rewards. arXiv preprint arXiv:1905.06466, 2019.
- Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, and Alessandro Lazaric. Active model estimation in Markov decision processes. In Conference on Uncertainty in Artificial Intelligence, 2020.
- Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, 2020.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Dimitri Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 2012.
- Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, 2020.
- Alon Cohen, Haim Kaplan, Yishay Mansour, and Aviv Rosenberg. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, 2020.
- Dimitri P. Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
- Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.
- Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- Pratik Gajane, Ronald Ortner, Peter Auer, and Csaba Szepesvári. Autonomous exploration for navigating in non-stationary CMPs. arXiv preprint arXiv:1910.08446, 2019.
- Blai Bonet. On the speed of convergence of value iteration on stochastic shortest-path problems. Mathematics of Operations Research, 32(2):365–373, 2007.
- Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, pages 150–165, 2007.
- Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
- Dimitri P. Bertsekas and John N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
- Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical Bernstein inequality. arXiv preprint arXiv:2007.05456, 2020.
- Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.
