# Noisy Networks for Exploration

international conference on learning representations, 2018.

EI

Weibo:

Abstract:

We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent’s policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward t...More

Code:

Data:

Introduction

- Despite the wealth of research into efficient methods for exploration in Reinforcement Learning (RL) (Kearns & Singh, 2002; Jaksch et al, 2010), most exploration heuristics rely on random perturbations of the agent’s policy, such as -greedy (Sutton & Barto, 1998) or entropy regularisation (Williams, 1992), to induce novel behaviours.

Highlights

- Despite the wealth of research into efficient methods for exploration in Reinforcement Learning (RL) (Kearns & Singh, 2002; Jaksch et al, 2010), most exploration heuristics rely on random perturbations of the agent’s policy, such as -greedy (Sutton & Barto, 1998) or entropy regularisation (Williams, 1992), to induce novel behaviours
- Optimism in the face of uncertainty is a common exploration heuristic in reinforcement learning. Various forms of this heuristic often come with theoretical guarantees on agent performance (Azar et al, 2017; Lattimore et al, 2013; Jaksch et al, 2010; Auer & Ortner, 2007; Kearns & Singh, 2002). These methods are often limited to small state-action spaces or to linear function approximations and are not applied with more complicated function approximators such as neural networks (except from work by (Geist & Pietquin, 2010a;b) but it doesn’t come with convergence guarantees)
- NoisyNet can be adapted to any deep Reinforcement Learning algorithm and we demonstrate this versatility by providing NoisyNet versions of Deep Q-Networks (Mnih et al, 2015), Dueling (Wang et al, 2016) and A3C (Mnih et al, 2016) algorithms
- We have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures
- We observe that in games such as Beam rider, Asteroids and Freeway that the standard Deep Q-Networks, Dueling and A3C perform poorly compared with the human player, NoisyNet-Deep Q-Networks, NoisyNet-Dueling and NoisyNet-A3C achieve super human performance, respectively
- Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects. Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the Reinforcement Learning algorithm

Results

- The authors used the random start no-ops scheme for training and evaluation as described the original DQN paper (Mnih et al, 2015).
- The raw average scores of the agents are evaluated during training, every 1M frames in the environment, by suspending (a) Improvement in percentage of NoisyNet-DQN over DQN (Mnih et al, 2015).
- For the NoisyNet variants the authors used the same hyper parameters as in the respective original paper for the baseline

Conclusion

- The authors have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures.
- Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects
- Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the RL algorithm.
- This alleviates the need for any hyper parameter tuning
- This is in contrast to many other methods that add intrinsic motivation signals that may destabilise learning or change the optimal policy.
- A similar randomisation technique can be applied to LSTM units (Fortunato et al, 2017) and is extended to reinforcement learning, the authors leave this as future work

Summary

## Introduction:

Despite the wealth of research into efficient methods for exploration in Reinforcement Learning (RL) (Kearns & Singh, 2002; Jaksch et al, 2010), most exploration heuristics rely on random perturbations of the agent’s policy, such as -greedy (Sutton & Barto, 1998) or entropy regularisation (Williams, 1992), to induce novel behaviours.## Results:

The authors used the random start no-ops scheme for training and evaluation as described the original DQN paper (Mnih et al, 2015).- The raw average scores of the agents are evaluated during training, every 1M frames in the environment, by suspending (a) Improvement in percentage of NoisyNet-DQN over DQN (Mnih et al, 2015).
- For the NoisyNet variants the authors used the same hyper parameters as in the respective original paper for the baseline
## Conclusion:

The authors have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures.- Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects
- Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the RL algorithm.
- This alleviates the need for any hyper parameter tuning
- This is in contrast to many other methods that add intrinsic motivation signals that may destabilise learning or change the optimal policy.
- A similar randomisation technique can be applied to LSTM units (Fortunato et al, 2017) and is extended to reinforcement learning, the authors leave this as future work

- Table1: Comparison between the baseline DQN, Dueling and A3C and their NoisyNet version in terms of median and mean human-normalised scores defined in Eq (18). We report on the last column the percentage improvement on the baseline in terms of median human-normalised score
- Table2: Comparison between the baseline DQN, Dueling and A3C and their NoisyNet version in terms of median and mean human-normalised scores defined in Eq (18). In the case of A3C we inculde both factorised and non-factorised variant of the algorithm. We report on the last column the percentage improvement on the baseline in terms of median human-normalised score
- Table3: Raw scores across all games with random starts

Reference

- Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in Neural Information Processing Systems, 19:49, 2007.
- Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
- Marc Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
- Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.
- Richard Bellman and Robert Kalaba. Dynamic programming and modern control theory. Academic Press New York, 1965.
- Dimitri Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
- Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural computation, 7 (1):108–116, 1995.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1613–1622, 2015.
- Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
- Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050–1059, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/gal16.html.
- Matthieu Geist and Olivier Pietquin. Kalman temporal differences. Journal of artificial intelligence research, 39:483–532, 2010a.
- Matthieu Geist and Olivier Pietquin. Managing uncertainty within value function approximation in reinforcement learning. In Active Learning and Experimental Design workshop (collocated with AISTATS 2010), Sardinia, Italy, volume 92, 2010b.
- Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.
- Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pp. 1833–1841, 2016.
- Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.
- Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
- Tor Lattimore, Marcus Hutter, and Peter Sunehag. The sample-complexity of general reinforcement learning. In Proceedings of The 30th International Conference on Machine Learning, pp. 28–36, 2013.
- Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Zachary C Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. Efficient exploration for dialogue policy learning with BBQ networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
- Hossein Mobahi. Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114, 2016.
- David E Moriarty, Alan C Schultz, and John J Grefenstette. Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research, 11:241–276, 1999.
- Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances In Neural Information Processing Systems, pp. 4026–4034, 2016.
- Ian Osband, Daniel Russo, Zheng Wen, and Benjamin Van Roy. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
- Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Remi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
- Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in neurorobotics, 1, 2007.
- Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
- Martin Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- Tim Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. ArXiv e-prints, 2017.
- Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proc. of ICML, pp. 1889–1897, 2015.
- Satinder P Singh, Andrew G Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In NIPS, volume 17, pp. 1281–1288, 2004.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Cambridge Univ Press, 1998.
- Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. of NIPS, volume 99, pp. 1057–1063, 1999.
- William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double qlearning. In Proc. of AAAI, pp. 2094–2100, 2016.
- Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.
- Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- In contrast with value-based algorithms, policy-based methods such as A3C (Mnih et al., 2016)
- parameterise the policy π(a|x; θπ) directly and update the parameters θπ by performing a gradient ascent on the mean value-function Ex∼D[V π(·|·;θπ)(x)] (also called the expected return) (Sutton et al., 1999). A3C uses a deep neural network with weights θ = θπ ∪θV to parameterise the policy π and the value V. The network has one softmax output for the policy-head π(·|·; θπ) and one linear output for the value-head V (·; θV ), with all non-output layers shared. The parameters θπ (resp. θV ) are relative to the shared layers and the policy head (resp. the value head). A3C is an asynchronous and online algorithm that uses roll-outs of size k + 1 of the current policy to perform a policy improvement step.
- For simplicity, here we present the A3C version with only one thread. For a multi-thread implementation, refer to the pseudo-code C.2 or to the original A3C paper (Mnih et al., 2016). In order to train the policy-head, an approximation of the policy-gradient is computed for each state of the roll-out (xt+i, at+i ∼ π(·|xt+i; θπ), rt+i)ki=0:

Full Text

Tags

Comments