# Evaluating the Performance of Reinforcement Learning Algorithms

ICML, pp. 4962-4973, 2020.

Abstract:

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. ...

Introduction

- When applying reinforcement learning (RL) to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention.
- Existing RL algorithms are difficult to apply to real-world applications (Dulac-Arnold et al., 2019).
- To both make and track progress towards developing reliable and easy-to-use algorithms, the authors propose a principled evaluation procedure that quantifies the difficulty of using an algorithm.
- The performance metric captures the usability of the algorithm over a wide variety of environments.
- An evaluation procedure should be computationally tractable, meaning that a typical researcher should be able to run the procedure and repeat experiments found in the literature.

Highlights

- When applying reinforcement learning (RL) to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention.
- Current evaluation practices do not properly account for the uncertainty in the results (Henderson et al., 2018) and neglect the difficulty of applying reinforcement learning algorithms to a given problem.
- We include three versions of Sarsa(λ), Q(λ), and Actor-Critic with eligibility traces: a base version, a version that scales the step size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size.
- The evaluation framework that we propose provides a principled method for evaluating reinforcement learning algorithms.
- By developing a method to establish high-confidence bounds over this approach, we provide the framework necessary for reliable comparisons.
- We hope that our provided implementations will allow other researchers to leverage this approach to report the performances of the algorithms they create.

Results

- As it is crucial to quantify the uncertainty of all claimed performance measures, the authors first discuss how to compute confidence intervals for both single-environment and aggregate measures, and then give details on displaying the results.

- There are three parts to the method: answering the stated hypothesis; providing tables and plots showing the performance and ranking of algorithms for all environments and for the aggregate score; and, for each performance measure, providing confidence intervals to convey uncertainty.
- The authors include three versions of Sarsa(λ), Q(λ), and Actor-Critic (AC): a base version, a version that scales the step size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size.
- Since none of these algorithms has an existing complete definition, the authors create one for each by randomly sampling hyperparameters from fixed ranges.
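The idea of completing an algorithm's definition by sampling its hyperparameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `SarsaLambdaConfig`, the helper names, and every numeric range are hypothetical placeholders chosen for the example.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SarsaLambdaConfig:
    """Hypothetical hyperparameter bundle for one trial of Sarsa(lambda)."""
    step_size: float    # alpha
    trace_decay: float  # lambda
    epsilon: float      # exploration rate


def log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Sample x so that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng: random.Random) -> SarsaLambdaConfig:
    # The ranges below are illustrative assumptions, not the paper's values.
    return SarsaLambdaConfig(
        step_size=log_uniform(rng, 1e-3, 1.0),
        trace_decay=rng.uniform(0.0, 1.0),
        epsilon=rng.uniform(0.0, 0.1),
    )


def complete_algorithm(seed: int) -> SarsaLambdaConfig:
    """A complete definition needs no user input beyond a random seed."""
    return sample_config(random.Random(seed))


# Each trial draws its own configuration, so tuning effort is part of the
# algorithm being evaluated rather than hidden work done by the researcher.
configs = [complete_algorithm(seed) for seed in range(5)]
```

Under this view, a user who wants better performance has to change the sampling distributions themselves, which makes the knowledge built into the algorithm explicit.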

Conclusion

- The evaluation framework that the authors propose provides a principled method for evaluating RL algorithms.
- This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting.
- By developing a method to establish high-confidence bounds over this approach, the authors provide the framework necessary for reliable comparisons.
- The authors hope that the provided implementations will allow other researchers to leverage this approach to report the performances of the algorithms they create.

Summary

## Objectives:

- Recall that the objective is to capture the difficulty of using a particular algorithm.
- While this number of trials may seem excessive, the goal is to detect a statistically meaningful result.
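The link between the number of trials and what can be detected can be illustrated with a distribution-free bound. The sketch below is an assumption-laden illustration rather than the paper's procedure: it supposes performance lies in a known interval `[lower, upper]`, uses the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant to bound the empirical CDF, and converts that band into bounds on the mean in the style of Anderson (1969). The function names are hypothetical.

```python
import math


def dkw_epsilon(n_trials: int, delta: float = 0.05) -> float:
    """Half-width of the two-sided DKW band on the empirical CDF."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_trials))


def dkw_mean_bounds(samples, lower, upper, delta=0.05):
    """Bounds on the mean that hold with probability at least 1 - delta.

    Uses E[X] = upper - integral of F(x) over [lower, upper], with F
    replaced by the upper and lower edges of the DKW band.
    """
    n = len(samples)
    eps = dkw_epsilon(n, delta)
    xs = [lower] + sorted(samples) + [upper]
    area_upper_band = 0.0  # integral of min(1, F_n + eps)
    area_lower_band = 0.0  # integral of max(0, F_n - eps)
    for i in range(n + 1):
        width = xs[i + 1] - xs[i]
        f_hat = i / n  # empirical CDF on [xs[i], xs[i + 1])
        area_upper_band += width * min(1.0, f_hat + eps)
        area_lower_band += width * max(0.0, f_hat - eps)
    return upper - area_upper_band, upper - area_lower_band


# The band half-width shrinks like 1 / sqrt(n): four times as many trials
# are needed to halve it.
for n in (10, 100, 1000, 10000):
    print(n, round(dkw_epsilon(n), 4))
```

Because the guarantee makes no distributional assumption, the intervals are wide for small samples, which is one reason a large number of trials may be needed before two algorithms can be separated.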

- Table 1: Aggregate performance measures for each algorithm and their rank. The parentheses contain the intervals computed using PBP, which together all hold with 95% confidence. The bolded numbers identify the best-ranked statistically significant differences.
- Table 2: Failure rate (FR) and proportion of significant pairwise comparisons (SIG) identified for δ = 0.05 using different bounding techniques and sample sizes. The first column gives the sample size. The second, third, and fourth columns give the results for the PBP, PBP-t, and bootstrap bound methods, respectively. For each sample size, 1,000 experiments were conducted.
- Table 3: List of symbols used to create confidence intervals on the aggregate performance.
- Table 4: Distributions from which each hyperparameter is sampled. The algorithm label "All" means the hyperparameter and distribution were used for all algorithms. Step sizes are labeled with various αs. The discount factor γ that an algorithm uses is scaled down from the Γ specified by the environment. For all environments used in this work, Γ = 1.0. PPO uses the same learning rate for both the policy and the value function. The maximum dependent order of the Fourier basis is limited such that no more than 10,000 features are generated as a result of dorder.
- Table 5: List of every environment used in this paper, along with the number of episodes each algorithm was allowed to interact with the environment and its type of state space.
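The failure-rate column described for Table 2 can be understood through a small simulation. The sketch below is illustrative only: it uses a plain percentile bootstrap on synthetic Uniform(0, 1) data whose true mean is known, and it does not reproduce PBP, PBP-t, or the paper's experimental setup. All function names and default values are assumptions.

```python
import random
import statistics


def bootstrap_mean_interval(samples, delta=0.05, n_resamples=300, rng=None):
    """Percentile-bootstrap (1 - delta) confidence interval for the mean."""
    if rng is None:
        rng = random.Random()
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((delta / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - delta / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def failure_rate(sample_size, n_experiments=200, delta=0.05, seed=0):
    """Fraction of experiments whose interval misses the true mean."""
    rng = random.Random(seed)
    true_mean = 0.5  # mean of Uniform(0, 1), the synthetic "performance"
    failures = 0
    for _ in range(n_experiments):
        samples = [rng.random() for _ in range(sample_size)]
        low, high = bootstrap_mean_interval(samples, delta, rng=rng)
        if not (low <= true_mean <= high):
            failures += 1
    return failures / n_experiments


# A valid bounding method should keep the failure rate at or below delta;
# comparing sample sizes shows how close a method comes to that target.
for size in (10, 30):
    print(size, failure_rate(size))
```

A method whose failure rate exceeds δ is overconfident, while one far below δ is valid but may identify fewer significant comparisons.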

Related work

- This paper is not the first to investigate and address issues in empirically evaluating algorithms. The evaluation of algorithms has become a significant enough topic to spawn its own field of study, known as experimental algorithmics (Fleischer et al., 2002; McGeoch, 2012).

In RL, there have been significant efforts to discuss and improve the evaluation of algorithms (Whiteson & Littman, 2011). One common theme has been to produce shared benchmark environments, such as those in the annual reinforcement learning competitions (Whiteson et al., 2010; Dimitrakakis et al., 2014), the Arcade Learning Environment (Bellemare et al., 2013), and many others, too numerous to list here. Recently, there has been a trend of explicit investigations into the reproducibility of reported results (Henderson et al., 2018; Islam et al., 2017; Khetarpal et al., 2018; Colas et al., 2018). These efforts are in part due to inadequate experimental practices and reporting in RL and in machine learning generally (Pineau et al., 2020; Lipton & Steinhardt, 2018). Similar to these studies, this work has been motivated by the need for a more reliable evaluation procedure to compare algorithms. The primary difference between our work and these studies is that the knowledge required to use an algorithm is included in the performance metric.

Funding

- Additionally, we would like to thank the reviewers and metareviewers for their comments, which helped improve this paper. This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.
- This work was supported in part by a gift from Adobe.
- This work was supported in part by the Center for Intelligent Information Retrieval.
- Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA).

References

- Anderson, T. W. Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function. Bulletin of The International and Statistical Institute, 43:249–251, 1969.
- Atrey, A., Clary, K., and Jensen, D. D. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
- Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. J. (eds.), Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, 1995.
- Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Reevaluating evaluation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pp. 3272–3283, 2018.
- Barreto, A. M. S., Bernardino, H. S., and Barbosa, H. J. C. Probabilistic performance profiles for the experimental evaluation of stochastic algorithms. In Pelikan, M. and Branke, J. (eds.), Genetic and Evolutionary Computation Conference, GECCO, pp. 751–758. ACM, 2010.
- Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013.
- Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
- Cohen, D., Jordan, S. M., and Croft, W. B. Distributed evaluations: Ending neural point metrics. CoRR, abs/1806.03790, 2018.
- Colas, C., Sigaud, O., and Oudeyer, P. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. CoRR, abs/1806.08295, 2018.
- Csaji, B. C., Jungers, R. M., and Blondel, V. D. Pagerank optimization by edge selection. Discrete Applied Mathematics, 169:73–87, 2014.
- Dabney, W. C. Adaptive step-sizes for reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2014.
- de Kerchove, C., Ninove, L., and Dooren, P. V. Maximizing pagerank via outlinks. CoRR, abs/0711.2867, 2007.
- Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In American Control Conference, ACC, pp. 2177–2182, 2012.
- Dimitrakakis, C., Li, G., and Tziortziotis, N. The reinforcement learning competition 2014. AI Magazine, 35(3):61–65, 2014.
- Dodge, Y. and Commenges, D. The Oxford dictionary of statistical terms. Oxford University Press on Demand, 2006.
- Dolan, E. D. and More, J. J. Benchmarking optimization software with performance profiles. Math. Program., 91(2):201–213, 2002.
- Dulac-Arnold, G., Mankowitz, D. J., and Hester, T. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.
- Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of a sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669, 1956.
- Farahmand, A. M., Ahmadabadi, M. N., Lucas, C., and Araabi, B. N. Interaction of culture-based learning and cooperative co-evolution and its application to automatic behavior-based system design. IEEE Trans. Evolutionary Computation, 14(1):23–57, 2010.
- Fercoq, O., Akian, M., Bouhtou, M., and Gaubert, S. Ergodic control and polyhedral approaches to pagerank optimization. IEEE Trans. Automat. Contr., 58(1):134–148, 2013.
- Fleischer, R., Moret, B. M. E., and Schmidt, E. M. (eds.). Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software [Dagstuhl seminar, September 2000], volume 2547 of Lecture Notes in Computer Science, 2002. Springer.
- Fleming, P. J. and Wallace, J. J. How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM, 29(3):218–221, 1986.
- Florian, R. V. Correct equations for the dynamics of the cartpole system. Center for Cognitive and Neural Studies (Coneural), Romania, 2007.
- Foley, J., Tosch, E., Clary, K., and Jensen, D. Toybox: Better Atari Environments for Testing Reinforcement Learning Agents. In NeurIPS 2018 Workshop on Systems for ML, 2018.
- Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. RLPy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pp. 3207–3214, 2018.
- Hooker, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics, 1(1):33–42, 1995.
- Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. S. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv: Statistics Theory, 2018.
- Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133, 2017.
- Jordan, S. M., Cohen, D., and Thomas, P. S. Using cumulative distribution based performance analysis to benchmark models. In Critiquing and Correcting Trends in Machine Learning Workshop at Neural Information Processing Systems, 2018.
- Khetarpal, K., Ahmed, Z., Cianflone, A., Islam, R., and Pineau, J. Re-evaluate: Reproducibility in evaluating reinforcement learning algorithms. 2018.
- Konidaris, G., Osentoski, S., and Thomas, P. S. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI, 2011.
- Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pp. 1015–1023. Curran Associates, Inc., 2009.
- Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. CoRR, abs/1807.03341, 2018.
- Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31, pp. 698–707, 2018.
- Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4504–4511. AAAI Press, 2019.
- Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562, 2018.
- Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
- McGeoch, C. C. A Guide to Experimental Algorithmics. Cambridge University Press, 2012.
- Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, 2018.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and Its Applications, pp. 256–263, 2005.
- Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-rank: Multi-agent evaluation by evolution. Scientific reports, 9(1):1–29, 2019.
- Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
- Perkins, T. J. and Precup, D. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pp. 1595–1602. MIT Press, 2002.
- Pineau, J., Vincent-Lamarre, P., Sinha, K., Lariviere, V., Beygelzimer, A., d’Alche-Buc, F., Fox, E. B., and Larochelle, H. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). CoRR, abs/2003.12206, 2020.
- Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.
- Reimers, N. and Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 338–348, 2017.
- Rowland, M., Omidshafiei, S., Tuyls, K., Perolat, J., Valko, M., Piliouras, G., and Munos, R. Multiagent evaluation under incomplete information. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 12270–12282, 2019.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
- Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044, 1995.
- Sutton, R. S. and Barto, A. G. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998.
- Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Thomas, P. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML, pp. 441–448, 2014.
- Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5022–5031, 2018.
- Whiteson, S. and Littman, M. L. Introduction to the special issue on empirical evaluations in reinforcement learning. Mach. Learn., 84(1-2):1–6, 2011.
- Whiteson, S., Tanner, B., and White, A. M. Report on the 2008 reinforcement learning competition. AI Magazine, 31(2):81–94, 2010.
- Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming And Reinforcement Learning, ADPRL, pp. 120–127. IEEE, 2011.
- Wiering, M. Convergence and divergence in standard and averaging reinforcement learning. In Machine Learning: ECML 2004, 15th European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pp. 477–488, 2004.
- Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. 1993.
- Witty, S., Lee, J. K., Tosch, E., Atrey, A., Littman, M., and Jensen, D. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.
