CoinDICE: Off-Policy Confidence Interval Estimation

NeurIPS 2020.


Abstract:

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of...

Introduction
  • One of the major barriers to applying reinforcement learning (RL) in practice is the difficulty of evaluating new policies reliably before deployment, a problem generally known as off-policy evaluation (OPE).
  • The primary challenge with importance-correction-based approaches is the high variance that results from multiplying per-step importance ratios in long-horizon problems (see the formula after this list).
  • They also typically require full knowledge of the behavior policy, which is not available in behavior-agnostic OPE settings (Nachum et al., 2019a).
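    For concreteness, the trajectory-level importance-sampling estimator these bullets refer to has the standard form (generic notation, not necessarily the paper's):

    $$\hat{\rho}_{\mathrm{IS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(\prod_{t=0}^{T_i-1}\frac{\pi(a^{(i)}_t \mid s^{(i)}_t)}{\mu(a^{(i)}_t \mid s^{(i)}_t)}\Big)\sum_{t=0}^{T_i-1}\gamma^{t}\, r^{(i)}_t,$$

    where $\mu$ is the behavior policy. The product of per-step ratios can grow or shrink exponentially with the horizon $T_i$, which is the variance issue noted above, and evaluating it at all requires knowledge of $\mu$.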
Highlights
  • One of the major barriers to applying reinforcement learning (RL) in practice is the difficulty of evaluating new policies reliably before deployment, a problem generally known as off-policy evaluation (OPE).
  • Most existing high-confidence off-policy evaluation algorithms in RL (Bottou et al., 2013; Thomas et al., 2015a,b; Hanna et al., 2017) construct such intervals using statistical techniques such as concentration inequalities and the bootstrap, applied to importance-corrected estimates of policy value (a sketch of one such baseline follows this list).
  • We evaluate the empirical performance of CoinDICE, comparing it to a number of existing confidence interval estimators for OPE based on concentration inequalities.
  • We have developed CoinDICE, a novel and efficient confidence interval estimator applicable to the behavior-agnostic offline setting.
  • The algorithm builds on a few technical components, including a new feature-embedded Q-LP and a generalized empirical likelihood approach to confidence interval estimation.
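    As an illustration of the concentration-inequality baselines mentioned above, the following is a minimal sketch of an empirical-Bernstein interval over per-trajectory importance-corrected returns, assuming the returns are clipped to a known range; the function name and the clipping convention are illustrative assumptions, not the paper's implementation. The constants follow Maurer and Pontil (2009), with delta split between the two sides.

      import numpy as np

      def empirical_bernstein_interval(values, delta=0.05, value_range=1.0):
          # Two-sided (1 - delta) interval for the mean of i.i.d. values bounded
          # in an interval of length `value_range` (empirical Bernstein bound).
          values = np.asarray(values, dtype=float)
          n = len(values)
          mean = values.mean()
          var = values.var(ddof=1)            # sample variance
          log_term = np.log(4.0 / delta)      # ln(2 / (delta / 2)) per side
          half_width = (np.sqrt(2.0 * var * log_term / n)
                        + 7.0 * value_range * log_term / (3.0 * (n - 1)))
          return mean - half_width, mean + half_width

      # Hypothetical usage: per-trajectory importance-corrected returns,
      # clipped to [0, 10] so the boundedness assumption holds.
      returns = np.clip(np.random.rand(500) * 10.0, 0.0, 10.0)
      low, high = empirical_bernstein_interval(returns, delta=0.1, value_range=10.0)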
Methods
  • The authors evaluate the empirical performance of CoinDICE, comparing it to a number of existing confidence interval estimators for OPE based on concentration inequalities.
  • Given a dataset of logged trajectories, the authors first use weighted step-wise importance sampling (Precup et al., 2000) to calculate a separate estimate of the target policy value for each trajectory (see the baseline sketch after this list).
  • Figure 2 shows that the intervals produced by CoinDICE achieve an empirical coverage close to the intended coverage.
  • Although the coverages of the Student's t and bootstrap intervals are close to correct in this simple bandit setting, they suffer more in the low-data regime.
  • The intervals produced by CoinDICE are especially narrow while maintaining accurate coverage.
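    The baseline pipeline described above can be sketched as follows; the trajectory format, function names, and the use of the plain (unnormalized) per-decision weighting are illustrative assumptions, not the authors' code (the weighted variant additionally normalizes the cumulative ratios across trajectories).

      import numpy as np
      from scipy import stats

      def per_trajectory_stepwise_is(trajectories, pi_prob, mu_prob, gamma=0.99):
          # One step-wise (per-decision) importance-sampling estimate per trajectory.
          # Each trajectory is a list of (state, action, reward) tuples; pi_prob and
          # mu_prob return the target and behavior action probabilities.
          estimates = []
          for traj in trajectories:
              ratio, value = 1.0, 0.0
              for t, (s, a, r) in enumerate(traj):
                  ratio *= pi_prob(s, a) / mu_prob(s, a)   # cumulative importance ratio
                  value += (gamma ** t) * ratio * r
              estimates.append(value)
          return np.array(estimates)

      def student_t_interval(estimates, alpha=0.05):
          # Student's t confidence interval over the per-trajectory estimates.
          n = len(estimates)
          se = estimates.std(ddof=1) / np.sqrt(n)
          half = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1) * se
          return estimates.mean() - half, estimates.mean() + half

      def bootstrap_interval(estimates, alpha=0.05, num_resamples=1000, seed=0):
          # Percentile-bootstrap confidence interval over the per-trajectory estimates.
          rng = np.random.default_rng(seed)
          means = np.array([rng.choice(estimates, size=len(estimates)).mean()
                            for _ in range(num_resamples)])
          return np.quantile(means, alpha / 2.0), np.quantile(means, 1.0 - alpha / 2.0)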
Conclusion
  • The authors have developed CoinDICE, a novel and efficient confidence interval estimator applicable to the behavior-agnostic offline setting.
  • The algorithm builds on a few technical components, including a new feature-embedded Q-LP and a generalized empirical likelihood approach to confidence interval estimation (the standard construction it generalizes is sketched after this list).
  • The authors analyzed the asymptotic coverage of CoinDICE's estimate and provided a finite-sample bound.
  • On a variety of off-policy benchmarks, the authors empirically compared the new algorithm with several strong baselines and found it to be superior to them.
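    For background, the classical empirical-likelihood confidence set that this approach generalizes (Owen, 2001; Qin and Lawless, 1994) is, in generic notation rather than the paper's feature-embedded formulation: given i.i.d. samples $x_1,\dots,x_n$ and a moment condition $\mathbb{E}[g(x;\theta)] = 0$, the profile empirical likelihood ratio is

    $$R(\theta) \;=\; \max_{w \in \Delta_n}\Big\{ \prod_{i=1}^{n} n\,w_i \;:\; \sum_{i=1}^{n} w_i\, g(x_i;\theta) = 0 \Big\},$$

    where $\Delta_n$ is the probability simplex over the $n$ samples, and for a scalar parameter of interest $\{\theta : -2\log R(\theta) \le \chi^2_{1,1-\alpha}\}$ is an asymptotically valid $(1-\alpha)$ confidence set. As described in the abstract and highlights, CoinDICE applies a generalized version of this construction (allowing general $f$-divergences in place of the likelihood ratio) to the estimating-equation constraints arising from the feature-embedded Q-LP.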
Funding
  • Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii, and NSERC.
References
  • András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.
  • P. Auer. Using upper confidence bounds for online learning. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 270–279. IEEE Computer Society Press, Los Alamitos, CA, 2000.
  • Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in neural information processing systems, pages 89–96, 2009.
  • P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Michel Broniatowski and Amor Keziou. Divergences and duality for estimation and test under moment condition models. Journal of Statistical Planning and Inference, 142(9):2554–2573, 2012.
  • Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.
  • Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.
  • Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
  • Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464, 2019.
  • X. Chen, T. M. Christensen, and E. Tamer. Monte Carlo confidence sets for identified sets. Econometrica, 86 (6):1965–2018, 2018.
  • Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
  • Michael K Cohen and Marcus Hutter. Pessimism about unknown unknowns inspires conservatism. In Conference on Learning Theory, pages 1344–1373. PMLR, 2020.
  • Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2017.
  • Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations research, 51(6):850–865, 2003.
  • Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proc. Intl. Conf. Machine Learning, pages 118–126. Morgan Kaufmann, San Francisco, CA, 1998.
  • John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
  • Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011. CoRR abs/1103.4601.
  • Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. SIAM, 1999.
  • Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.
  • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
  • Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li-wei H. Lehman, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018. arXiv:1805.12298.
  • Josiah P. Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4933–4934, 2017.
  • Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79, 2010.
  • Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • Nan Jiang and Jiawei Huang. Minimax confidence interval for off-policy evaluation and policy optimization, 2020. arXiv:2002.02081.
  • Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.
  • Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019a.
  • Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pages 3320–3329, 2019b.
  • Nikos Karampatziakis, John Langford, and Paul Mineiro. Empirical likelihood for contextual bandits. arXiv preprint arXiv:1906.03323, 2019.
  • Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794, 2019.
  • Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.
  • Chandrashekar Lakshminarayanan, Shalabh Bhatnagar, and Csaba Szepesvari. A linearly relaxed approximate linear program for Markov decision processes. arXiv preprint arXiv:1704.02544, 2017.
  • Henry Lam and Enlu Zhou. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307, 2017.
  • Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661. PMLR, 2019.
  • Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.
  • Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-banditbased news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306. ACM, 2011.
  • Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 608–616, 2015.
  • Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. CoRR, abs/1906.00331, 2019.
  • Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon offpolicy estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5356–5366. Curran Associates, Inc., 2018.
  • Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling, 2019. arXiv:1910.06508.
  • Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2014.
  • Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456): 1410–1423, 2001.
  • Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
  • Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32, pages 2315–2325, 2019a.
  • Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.
  • Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in neural information processing systems, pages 2208–2216, 2016.
  • Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in neural information processing systems, pages 2971–2980, 2017.
  • Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • Art B Owen. Empirical likelihood. Chapman and Hall/CRC, 2001.
  • Jason Pazis and Ronald Parr. Non-parametric approximate linear programming for MDPs. In AAAI, 2011.
  • Doina Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proc. Intl. Conf. Machine Learning, pages 759–766. Morgan Kaufmann, San Francisco, CA, 2000.
  • Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. the Annals of Statistics, pages 300–325, 1994.
  • R Tyrrell Rockafellar. Augmented lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268–285, 1974.
  • Werner Römisch. Delta method, infinite dimensional. Wiley StatsRef: Statistics Reference Online, 2014.
  • Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations, 2016.
  • B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.
  • Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
  • Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.
  • Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
  • Istvan Szita and Csaba Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038. Omnipress, 2010.
  • Aviv Tamar, Huan Xu, and Shie Mannor. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189, 2013.
  • Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015a.
  • Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
  • Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2139–2148, 2016.
  • Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015b.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
  • Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.
  • A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
  • Weiran Wang and Miguel A Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
  • Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pages 9665–9675, 2019.
  • Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.
  • Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.
  • Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values, 2020b. arXiv:2001.11113.