Evaluating the Performance of Reinforcement Learning Algorithms

Scott Jordan
Yash Chandak
Daniel Cohen
Mengxue Zhang

ICML, pp. 4962-4973, 2020.

Abstract:

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. ...

Introduction
  • When applying reinforcement learning (RL) to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention.
  • Existing RL algorithms are difficult to apply to real-world applications (Dulac-Arnold et al., 2019).
  • To both make and track progress towards developing reliable and easy-to-use algorithms, the authors propose a principled evaluation procedure that quantifies the difficulty of using an algorithm.
  • The performance metric captures the usability of the algorithm over a wide variety of environments.
  • An evaluation procedure should be computationally tractable, meaning that a typical researcher should be able to run the procedure and repeat experiments found in the literature.
Highlights
  • When applying reinforcement learning (RL) to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention.
  • Current evaluation practices do not properly account for the uncertainty in the results (Henderson et al., 2018) and neglect the difficulty of applying reinforcement learning algorithms to a given problem.
  • We include three versions of Sarsa(λ), Q(λ), and Actor-Critic with eligibility traces: a base version, a version that scales the step-size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size.
  • The evaluation framework that we propose provides a principled method for evaluating reinforcement learning algorithms.
  • By developing a method to establish high-confidence bounds over this approach, we provide the framework necessary for reliable comparisons.
  • We hope that our provided implementations will allow other researchers to leverage this approach to report the performances of the algorithms they create.
Results
  • As it is crucial to quantify the uncertainty of all claimed performance measures, the authors first discuss how to compute confidence intervals for both single-environment and aggregate measures, and give details on displaying the results (see the sketch after this list).
  • There are three parts to the method: answering the stated hypothesis; providing tables and plots showing the performance and ranking of algorithms for all environments and the aggregate score; and, for each performance measure, providing confidence intervals to convey uncertainty.
  • The authors include three versions of Sarsa(λ), Q(λ), and AC: a base version, a version that scales the step-size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size.
  • Since none of these algorithms have an existing complete definition, the authors create one by randomly sampling hyperparameters from fixed ranges.
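To make the interval computation concrete, below is a minimal sketch of two of the bounding techniques the paper compares: a distribution-free Anderson (1969)-style confidence interval on mean performance built from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, and a percentile bootstrap interval. This is not the authors' PBP procedure; the function names, the [0, 1] normalization, and the synthetic Beta-distributed data are illustrative assumptions.

    import numpy as np

    def anderson_mean_ci(samples, lo, hi, delta=0.05):
        """Distribution-free CI on the mean of a performance measure bounded in
        [lo, hi], built from the DKW inequality as in Anderson (1969).
        The pair (lower, upper) holds with probability at least 1 - delta."""
        x = np.sort(np.asarray(samples, dtype=float))
        n = x.size
        eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))  # one-sided deviation, delta/2 per tail
        i = np.arange(n)
        # Lower bound: shift the empirical CDF up by eps, then integrate the survival function.
        gaps_lo = np.diff(np.concatenate(([lo], x)))    # widths of [x_(i), x_(i+1)) with x_(0) = lo
        lower = lo + np.sum(gaps_lo * np.clip(1.0 - i / n - eps, 0.0, 1.0))
        # Upper bound: shift the empirical CDF down by eps (mirror argument).
        gaps_hi = np.diff(np.concatenate((x, [hi])))    # widths of [x_(i+1), x_(i+2)) with x_(n+1) = hi
        upper = hi - np.sum(gaps_hi * np.clip((i + 1.0) / n - eps, 0.0, 1.0))
        return lower, upper

    def bootstrap_percentile_ci(samples, delta=0.05, n_boot=10_000, seed=0):
        """Percentile bootstrap CI on the mean; no finite-sample guarantee, included
        only because the paper also compares against bootstrap bounds (Table 2)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(samples, dtype=float)
        means = rng.choice(x, size=(n_boot, x.size), replace=True).mean(axis=1)
        return tuple(np.quantile(means, [delta / 2.0, 1.0 - delta / 2.0]))

    if __name__ == "__main__":
        # Synthetic stand-in for one (algorithm, environment) pair:
        # 100 trials of average return, normalized to [0, 1].
        perf = np.random.default_rng(1).beta(2.0, 5.0, size=100)
        print("Anderson/DKW:", anderson_mean_ci(perf, lo=0.0, hi=1.0))
        print("Bootstrap   :", bootstrap_percentile_ci(perf))

The distribution-free interval is wider than the bootstrap interval but has guaranteed coverage for bounded performance measures; the failure rates in Table 2 quantify exactly this trade-off.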
Conclusion
  • The evaluation framework that the authors propose provides a principled method for evaluating RL algorithms.
  • This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting.
  • By developing a method to establish high-confidence bounds over this approach, the authors provide the framework necessary for reliable comparisons.
  • The authors hope that the provided implementations will allow other researchers to leverage this approach to report the performances of the algorithms they create.
Summary
  • Introduction:

    When applying reinforcement learning (RL) to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention.
  • Existing RL algorithms are difficult to apply to real-world applications (Dulac-Arnold et al., 2019).
  • To both make and track progress towards developing reliable and easy-to-use algorithms, the authors propose a principled evaluation procedure that quantifies the difficulty of using an algorithm.
  • The performance metric captures the usability of the algorithm over a wide variety of environments.
  • An evaluation procedure should be computationally tractable, meaning that a typical researcher should be able to run the procedure and repeat experiments found in the literature.
  • Objectives:

    Recall that the objective is to capture the difficulty of using a particular algorithm.
  • While this number of trials may seem excessive, the goal is to detect a statistically meaningful result (a rough sample-size calculation is sketched after this summary).
  • Results:

    As it is crucial to quantify the uncertainty of all claimed performance measures, the authors first discuss how to compute confidence intervals for both single-environment and aggregate measures, and give details on displaying the results.
  • There are three parts to the method: answering the stated hypothesis; providing tables and plots showing the performance and ranking of algorithms for all environments and the aggregate score; and, for each performance measure, providing confidence intervals to convey uncertainty.
  • The authors include three versions of Sarsa(λ), Q(λ), and AC: a base version, a version that scales the step-size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step size.
  • Since none of these algorithms have an existing complete definition, the authors create one by randomly sampling hyperparameters from fixed ranges.
  • Conclusion:

    The evaluation framework that the authors propose provides a principled method for evaluating RL algorithms.
  • This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting.
  • By developing a method to establish high-confidence bounds over this approach, the authors provide the framework necessary for reliable comparisons.
  • The authors hope that the provided implementations will allow other researchers to leverage this approach to report the performances of the algorithms they create.
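On the sample-size point in the Objectives above, one rough way to see why many trials are needed is to invert a Hoeffding-style confidence interval for a bounded performance measure. This is a back-of-the-envelope calculation under assumed targets, not the authors' experimental design.

    import math

    def trials_for_half_width(half_width, delta=0.05, value_range=1.0):
        """Smallest n such that a Hoeffding-style interval on the mean of a
        performance measure with range `value_range` has half-width at most
        `half_width` with confidence 1 - delta."""
        return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * half_width ** 2))

    # Resolving mean normalized performance to within ±0.01 at 95% confidence:
    print(trials_for_half_width(0.01, delta=0.05))  # prints 18445

Tens of thousands of trials per algorithm-environment pair are therefore unsurprising once small performance differences must be detected with high confidence.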
Tables
  • Table1: Aggregate performance measures for each algorithm and their rank. The parentheses contain the intervals computed using PBP, and together all intervals hold with 95% confidence. Bold numbers identify the best-ranked algorithms whose differences are statistically significant.
  • Table2: The failure rate (FR) and the proportion of significant pairwise comparisons (SIG) identified for δ = 0.05 using different bounding techniques and sample sizes. The first column gives the sample size; the second, third, and fourth columns give the results for the PBP, PBP-t, and bootstrap bound methods, respectively. For each sample size, 1,000 experiments were conducted (a toy coverage simulation is sketched after this list).
  • Table3: List of symbols used to create confidence intervals on the aggregate performance.
  • Table4: The distributions from which each hyperparameter is sampled (an illustrative sampling sketch follows this list). An "All" entry means the hyperparameter and distribution were used for all algorithms. Step sizes are labeled with various αs. The discount factor γ an algorithm uses is scaled down from the Γ specified by the environment; for all environments used in this work, Γ = 1.0. PPO uses the same learning rate for both the policy and value function. The maximum dependent order of the Fourier basis is limited so that no more than 10,000 features are generated.
  • Table5: Every environment used in this paper, along with the number of episodes each algorithm was allowed to interact with it and its type of state space.
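As a companion to Table 2, the following toy simulation shows how the failure rate (FR) of a bounding technique can be estimated: repeatedly draw samples from a distribution with a known mean and count how often the interval misses it. The interval used here is a simple Hoeffding-style bound and the Beta sampling distribution is an assumption; neither is the paper's PBP method or experimental setup.

    import numpy as np

    def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
        # Hoeffding-style interval on the mean of samples bounded in [lo, hi].
        x = np.asarray(x, dtype=float)
        half_width = (hi - lo) * np.sqrt(np.log(2.0 / delta) / (2.0 * x.size))
        return max(lo, x.mean() - half_width), min(hi, x.mean() + half_width)

    def failure_rate(ci_fn, true_mean, sampler, n_samples, n_experiments=1000, seed=0):
        """Fraction of synthetic experiments in which the interval returned by
        ci_fn fails to contain true_mean (the quantity Table 2 reports as FR)."""
        rng = np.random.default_rng(seed)
        misses = 0
        for _ in range(n_experiments):
            lower, upper = ci_fn(sampler(rng, n_samples))
            misses += not (lower <= true_mean <= upper)
        return misses / n_experiments

    # Toy check: Beta(2, 5) "performance" samples on [0, 1]; the true mean is 2/7.
    fr = failure_rate(hoeffding_ci, true_mean=2.0 / 7.0,
                      sampler=lambda rng, n: rng.beta(2.0, 5.0, size=n),
                      n_samples=30, n_experiments=1000)
    print(f"empirical failure rate: {fr:.3f} (target: at most 0.05)")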
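Table 4's idea, giving an algorithm a "complete definition" by drawing its hyperparameters from fixed distributions rather than tuning them per environment, can be sketched as follows. The specific distributions and ranges here (a log-uniform step size, uniform eligibility-trace decay, and a Fourier-basis order capped by a feature budget) are illustrative placeholders, not the values from Table 4.

    import numpy as np

    def sample_hyperparameters(num_state_vars, max_features=10_000, seed=None):
        """Draw one hyperparameter setting for a Sarsa(lambda)-style agent with a
        Fourier basis. Each trial of the evaluation gets a fresh draw, so no
        tuning expertise is needed to run the algorithm. Ranges are assumptions."""
        rng = np.random.default_rng(seed)
        alpha = 10.0 ** rng.uniform(-4.0, -1.0)   # step size, log-uniform on [1e-4, 1e-1]
        lam = rng.uniform(0.0, 1.0)               # eligibility-trace decay
        gamma = rng.uniform(0.95, 1.0) * 1.0      # discount, scaled down from Gamma = 1.0
        # Largest dependent order whose feature count (order + 1) ** num_state_vars
        # stays within the budget, mirroring Table 4's 10,000-feature cap.
        max_order = 0
        while (max_order + 2) ** num_state_vars <= max_features:
            max_order += 1
        order = int(rng.integers(0, max_order + 1))
        return {"alpha": alpha, "lambda": lam, "gamma": gamma, "fourier_order": order}

    print(sample_hyperparameters(num_state_vars=4, seed=0))

Because every trial uses an independently sampled setting, the reported performance reflects how an algorithm behaves without expert tuning, which is the usability the aggregate metric is meant to capture.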
Related work
  • This paper is not the first to investigate and address issues in empirically evaluating algorithms. The evaluation of algorithms has become a significant enough topic to spawn its own field of study, known as experimental algorithmics (Fleischer et al., 2002; McGeoch, 2012).

    In RL, there have been significant efforts to discuss and improve the evaluation of algorithms (Whiteson & Littman, 2011). One common theme has been to produce shared benchmark environments, such as those in the annual reinforcement learning competitions (Whiteson et al., 2010; Dimitrakakis et al., 2014), the Arcade Learning Environment (Bellemare et al., 2013), and numerous others too long to list here. Recently, there has been a trend of explicit investigations into the reproducibility of reported results (Henderson et al., 2018; Islam et al., 2017; Khetarpal et al., 2018; Colas et al., 2018). These efforts are in part due to inadequate experimental practices and reporting in RL and machine learning more generally (Pineau et al., 2020; Lipton & Steinhardt, 2018). Similar to these studies, this work has been motivated by the need for a more reliable evaluation procedure to compare algorithms. The primary difference between our work and these is that the knowledge required to use an algorithm is included in the performance metric.
Funding
  • Additionally, we would like to thank the reviewers and meta-reviewers for their comments, which helped improve this paper. This work was performed in part using high-performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.
  • This work was supported in part by a gift from Adobe
  • This work was supported in part by the Center for Intelligent Information Retrieval
  • Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)
References
  • Anderson, T. W. Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function. Bulletin of the International Statistical Institute, 43:249–251, 1969.
  • Atrey, A., Clary, K., and Jensen, D. D. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
  • Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. J. (eds.), Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, 1995.
  • Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pp. 3272–3283, 2018.
  • Barreto, A. M. S., Bernardino, H. S., and Barbosa, H. J. C. Probabilistic performance profiles for the experimental evaluation of stochastic algorithms. In Pelikan, M. and Branke, J. (eds.), Genetic and Evolutionary Computation Conference, GECCO, pp. 751–758. ACM, 2010.
  • Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
  • Cohen, D., Jordan, S. M., and Croft, W. B. Distributed evaluations: Ending neural point metrics. CoRR, abs/1806.03790, 2018.
  • Colas, C., Sigaud, O., and Oudeyer, P. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. CoRR, abs/1806.08295, 2018.
  • Csáji, B. C., Jungers, R. M., and Blondel, V. D. Pagerank optimization by edge selection. Discrete Applied Mathematics, 169:73–87, 2014.
  • Dabney, W. C. Adaptive step-sizes for reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2014.
  • de Kerchove, C., Ninove, L., and Dooren, P. V. Maximizing pagerank via outlinks. CoRR, abs/0711.2867, 2007.
  • Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In American Control Conference, ACC, pp. 2177–2182, 2012.
  • Dimitrakakis, C., Li, G., and Tziortziotis, N. The reinforcement learning competition 2014. AI Magazine, 35(3):61–65, 2014.
  • Dodge, Y. and Commenges, D. The Oxford Dictionary of Statistical Terms. Oxford University Press on Demand, 2006.
  • Dolan, E. D. and Moré, J. J. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
  • Dulac-Arnold, G., Mankowitz, D. J., and Hester, T. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.
  • Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669, 1956.
  • Farahmand, A. M., Ahmadabadi, M. N., Lucas, C., and Araabi, B. N. Interaction of culture-based learning and cooperative co-evolution and its application to automatic behavior-based system design. IEEE Transactions on Evolutionary Computation, 14(1):23–57, 2010.
  • Fercoq, O., Akian, M., Bouhtou, M., and Gaubert, S. Ergodic control and polyhedral approaches to pagerank optimization. IEEE Transactions on Automatic Control, 58(1):134–148, 2013.
  • Fleischer, R., Moret, B. M. E., and Schmidt, E. M. (eds.). Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software [Dagstuhl Seminar, September 2000], volume 2547 of Lecture Notes in Computer Science. Springer, 2002.
  • Fleming, P. J. and Wallace, J. J. How not to lie with statistics: The correct way to summarize benchmark results. Communications of the ACM, 29(3):218–221, 1986.
  • Florian, R. V. Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania, 2007.
  • Foley, J., Tosch, E., Clary, K., and Jensen, D. Toybox: Better Atari environments for testing reinforcement learning agents. In NeurIPS 2018 Workshop on Systems for ML, 2018.
  • Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. RLPy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.
  • Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 3207–3214, 2018.
  • Hooker, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics, 1(1):33–42, 1995.
  • Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. S. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv: Statistics Theory, 2018.
  • Islam, R., Henderson, P., Gomrokchi, M., and Precup, D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133, 2017.
  • Jordan, S. M., Cohen, D., and Thomas, P. S. Using cumulative distribution based performance analysis to benchmark models. In Critiquing and Correcting Trends in Machine Learning Workshop at Neural Information Processing Systems, 2018.
  • Khetarpal, K., Ahmed, Z., Cianflone, A., Islam, R., and Pineau, J. Re-evaluate: Reproducibility in evaluating reinforcement learning algorithms. 2018.
  • Konidaris, G., Osentoski, S., and Thomas, P. S. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI, 2011.
  • Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pp. 1015–1023. Curran Associates, Inc., 2009.
  • Lipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. CoRR, abs/1807.03341, 2018.
  • Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31, pp. 698–707, 2018.
  • Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 4504–4511. AAAI Press, 2019.
  • Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
  • McGeoch, C. C. A Guide to Experimental Algorithmics. Cambridge University Press, 2012.
  • Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR. OpenReview.net, 2018.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and Its Applications, pp. 256–263, 2005.
  • Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):1–29, 2019.
  • Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • Perkins, T. J. and Precup, D. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pp. 1595–1602. MIT Press, 2002.
  • Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Larochelle, H. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). CoRR, abs/2003.12206, 2020.
  • Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.
  • Reimers, N. and Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 338–348, 2017.
  • Rowland, M., Omidshafiei, S., Tuyls, K., Perolat, J., Valko, M., Piliouras, G., and Munos, R. Multiagent evaluation under incomplete information. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 12270–12282, 2019.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044, 1995.
  • Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.
  • Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Thomas, P. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML, pp. 441–448, 2014.
  • Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5022–5031, 2018.
  • Whiteson, S. and Littman, M. L. Introduction to the special issue on empirical evaluations in reinforcement learning. Machine Learning, 84(1-2):1–6, 2011.
  • Whiteson, S., Tanner, B., and White, A. M. Report on the 2008 reinforcement learning competition. AI Magazine, 31(2):81–94, 2010.
  • Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL, pp. 120–127. IEEE, 2011.
  • Wiering, M. Convergence and divergence in standard and averaging reinforcement learning. In Machine Learning: ECML 2004, 15th European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pp. 477–488. Springer, 2004.
  • Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. 1993.
  • Witty, S., Lee, J. K., Tosch, E., Atrey, A., Littman, M., and Jensen, D. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.