Off-Policy Evaluation via the Regularized Lagrangian

NeurIPS 2020.


Abstract:

The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we …

Introduction
  • One of the most fundamental problems in reinforcement learning (RL) is policy evaluation, where the authors seek to estimate the expected long-term payoff of a given target policy in a decision-making environment.
  • An important variant of this problem, off-policy evaluation (OPE) (Precup et al., 2000), is motivated by applications where deploying a policy in a live environment entails significant cost or risk (Murphy et al., 2001; Thomas et al., 2015).
  • To circumvent these issues, OPE attempts to estimate the value of a target policy by referring only to a dataset of experience previously gathered by other policies in the environment; the quantity being estimated is sketched below.
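For reference, the policy value targeted by OPE can be written as the normalized expected discounted return. This is a sketch using the (1−γ) normalization convention common in the DICE literature, not necessarily the paper's exact notation:

```latex
% Policy value of the target policy \pi, with initial distribution \mu_0,
% dynamics P, reward R, and discount \gamma:
\[
\rho(\pi) \;=\; (1-\gamma)\,\mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t} R(s_t,a_t)\;\middle|\;
  s_0\sim\mu_0,\; a_t\sim\pi(\cdot\mid s_t),\; s_{t+1}\sim P(\cdot\mid s_t,a_t)\right].
\]
```

In OPE, ρ(π) must be estimated from a fixed dataset of transitions (s, a, r, s′) whose (unknown) sampling distribution is written d^D below.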
Highlights
  • One of the most fundamental problems in reinforcement learning (RL) is policy evaluation, where we seek to estimate the expected long-term payoff of a given target policy in a decision-making environment
  • An important variant of this problem, off-policy evaluation (OPE) (Precup et al., 2000), is motivated by applications where deploying a policy in a live environment entails significant cost or risk (Murphy et al., 2001; Thomas et al., 2015)
  • OPE attempts to estimate the value of a target policy by referring only to a dataset of experience previously gathered by other policies in the environment
  • We show that the previous distribution correction estimation (DICE) formulations are all equivalent to regularized Lagrangians of the same linear program (LP); the d-LP and its Lagrangian are sketched after this list
  • We have proposed a unified view of off-policy evaluation via the regularized Lagrangian of the d-LP
  • By systematically studying the mathematical properties and empirical effects of these choices, we have found that the dual estimates offer greater flexibility in incorporating optimization stabilizers while preserving asymptotic unbiasedness, in comparison to the primal estimates
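For concreteness, the d-LP referred to above and one form of its regularized Lagrangian can be sketched as follows. This is a paraphrase of the general DICE setup, not the paper's exact equations: the signs, the min/max orientation, and the particular regularizers shown here (a single stabilizer α_ζ f(ζ) and a normalization multiplier λ) are only one of the configurations the paper studies.

```latex
% d-LP over the discounted occupancy d(s,a) of the target policy \pi:
\[
\max_{d\ge 0}\;\sum_{s,a} d(s,a)\,R(s,a)
\quad\text{s.t.}\quad
d(s,a) \;=\; (1-\gamma)\,\mu_0(s)\,\pi(a\mid s)
  \;+\;\gamma\sum_{\bar s,\bar a}\pi(a\mid s)\,P(s\mid \bar s,\bar a)\,d(\bar s,\bar a)
  \quad\forall (s,a).
\]
% Writing d(s,a) = \zeta(s,a)\, d^{\mathcal D}(s,a) against the data distribution d^{\mathcal D},
% introducing multipliers \nu(s,a) for the flow constraints and \lambda for the redundant
% normalization E_{d^{\mathcal D}}[\zeta] = 1, and adding a regularizer on \zeta gives
\[
\min_{\nu,\lambda}\;\max_{\zeta\ge 0}\;
(1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi}\!\big[\nu(s_0,a_0)\big]
+\mathbb{E}_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim\pi(s')}\!\big[\zeta(s,a)\big(\alpha_R\,r+\gamma\,\nu(s',a')-\nu(s,a)-\lambda\big)\big]
+\lambda-\alpha_\zeta\,\mathbb{E}_{d^{\mathcal D}}\!\big[f(\zeta(s,a))\big].
\]
```

Different DICE estimators then correspond to different choices of which regularizers and redundant constraints to include, and of how the optimized (ν, ζ, λ) are converted into a value estimate.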
Methods
  • The authors empirically verify the theoretical findings. The authors evaluate different choices of estimators, regularizers, and constraints on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural-network parametrizations, with offline data collected from behavior policies with different noise levels (π1 and π2).
  • The authors perform Lagrangian optimization with regularization chosen according to Theorem 2 so as not to bias the resulting estimator.
  • The authors use αR = 1 and include the redundant constraints (normalization, via the multiplier λ, and positivity ζ ≥ 0) in the dual estimator; a minimal sketch of such an optimization appears after this list.
  • The authors also evaluated combinations of regularizations that can bias the estimator and found that these generally performed worse; see Section 4.2 for a subset of these experiments.
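As an illustration of the kind of optimization described above, here is a minimal, self-contained sketch of gradient descent-ascent on a regularized Lagrangian with a tabular parametrization, reading out the dual (ζ-based) value estimate. It is not the authors' implementation: the synthetic MDP, the dataset construction, the softplus positivity parametrization, and the ζ²/2 stabilizer (weight α_ζ) are illustrative assumptions, and this particular regularizer need not satisfy the unbiasedness conditions of the paper's Theorem 2.

```python
# Minimal sketch (not the authors' code): gradient descent-ascent on a regularized
# Lagrangian of the d-LP, tabular parametrization, dual (zeta-based) value estimate.
# The MDP, dataset, regularizer, and hyperparameters below are illustrative assumptions.
import jax
import jax.numpy as jnp
import numpy as np

rng = np.random.default_rng(0)

# --- Tiny synthetic MDP and i.i.d. offline dataset (illustrative) ---
nS, nA, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # transition kernel P[s, a, s']
R = rng.uniform(size=(nS, nA))                       # reward table
pi = rng.dirichlet(np.ones(nA), size=nS)             # target policy pi(a|s)
mu0 = np.full(nS, 1.0 / nS)                          # initial state distribution

N = 10000                                            # behavior data: uniform-random policy
s = rng.integers(nS, size=N)
a = rng.integers(nA, size=N)
r = R[s, a]
s2 = np.array([rng.choice(nS, p=P[si, ai]) for si, ai in zip(s, a)])
a2 = np.array([rng.choice(nA, p=pi[sj]) for sj in s2])   # a' ~ pi(.|s')
data = tuple(map(jnp.asarray, (s, a, r, s2, a2)))

alpha_R, alpha_zeta = 1.0, 1.0                       # alpha_R = 1; zeta^2/2 stabilizer weight

def lagrangian(params, data):
    """Regularized Lagrangian: minimized over (nu, lam), maximized over zeta >= 0."""
    nu, zeta_raw, lam = params
    s, a, r, s2, a2 = data
    zeta = jax.nn.softplus(zeta_raw)                 # redundant positivity constraint zeta >= 0
    z = zeta[s, a]
    resid = alpha_R * r + gamma * nu[s2, a2] - nu[s, a] - lam
    return ((1 - gamma) * jnp.sum(mu0[:, None] * pi * nu)  # (1-g) E_{mu0,pi}[nu(s0,a0)]
            + jnp.mean(z * resid) + lam              # equals E[z*(aR*r + g*nu' - nu)] + lam*(1 - E[z])
            - 0.5 * alpha_zeta * jnp.mean(z ** 2))   # illustrative stabilizer on zeta

grads = jax.jit(jax.grad(lagrangian))                # gradients w.r.t. params (first argument)
nu, zeta_raw, lam = jnp.zeros((nS, nA)), jnp.zeros((nS, nA)), jnp.array(0.0)
lr = 0.05
for _ in range(3000):
    g_nu, g_zeta_raw, g_lam = grads((nu, zeta_raw, lam), data)
    nu, lam = nu - lr * g_nu, lam - lr * g_lam       # descent on the primal side (nu, lam)
    zeta_raw = zeta_raw + lr * g_zeta_raw            # ascent on the dual side (zeta)

# Dual estimate of the policy value: E_{d^D}[zeta(s, a) * r(s, a)]
zeta = jax.nn.softplus(zeta_raw)
print("dual (zeta-based) estimate:", float(jnp.mean(zeta[data[0], data[1]] * data[2])))
```

Swapping which of ν and ζ is regularized, whether αR, the positivity constraint, or the normalization multiplier λ is included, and which estimate (primal, dual, or Lagrangian) is read out reproduces the different configurations the paper compares.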
Conclusion
  • The authors have proposed a unified view of off-policy evaluation via the regularized Lagrangian of the d-LP.
  • Under this unification, existing DICE algorithms are recovered by specific choices of regularizers, constraints, and ways to convert optimized solutions to policy values (the standard conversions are sketched below).
  • The authors' study reveals alternative estimators not previously identified in the literature that exhibit improved performance.
  • Overall, these findings suggest promising new directions of focus for OPE research in the offline setting.
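The conversions from optimized solutions to policy values mentioned above take roughly the following form in DICE-style notation; this is a sketch, and the paper's exact definitions may differ in detail:

```latex
% Primal estimate: read the optimized \hat\nu as a Q-like value function.
% Dual estimate: read the optimized \hat\zeta as a stationary distribution correction.
\[
\hat\rho_{\text{primal}} \;=\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi}\!\big[\hat\nu(s_0,a_0)\big],
\qquad
\hat\rho_{\text{dual}} \;=\; \mathbb{E}_{(s,a,r)\sim d^{\mathcal D}}\!\big[\hat\zeta(s,a)\,r\big],
\]
```

with a third option that evaluates the (regularized) Lagrangian itself at the optimized (ν̂, ζ̂, λ̂). The finding quoted above is that the dual read-out tolerates more optimization stabilizers without losing asymptotic unbiasedness.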
Tables
  • Table 1: Optimal solutions for all configurations. Configurations with new proofs are shaded in gray.
References
  • Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.
  • Joan Bas-Serrano and Gergely Neu. Faster saddle-point optimization for solving large-scale Markov decision processes. arXiv preprint arXiv:1909.10904, 2019.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.
  • Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.
  • Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the 35th International Conference on Machine Learning, pages 1133–1142, 2018.
  • Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pages 1049–1058, 2017.
  • Yaqi Duan and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. arXiv preprint arXiv:2002.09516, 2020.
  • Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
  • Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.
  • H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
  • Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
  • Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526, 2019a.
  • Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019b.
  • Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
  • Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands, 2015.
  • Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
  • Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
  • R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • S. Murphy, M. van der Laan, and J. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
  • Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2315–2325, 2019a.
  • Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience, 2019b.
  • Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
  • Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
  • R. Tyrrell Rockafellar. Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM Journal on Control, 12(2):268–285, 1974.
  • Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the 29th Conference on Artificial Intelligence, 2015.
  • Masatoshi Uehara and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.
  • Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear (sometimes sublinear) running time. arXiv preprint arXiv:1704.01869, 2017.
  • Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020a.
  • Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020b.