OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

ICML 2021, pp. 6120-6130

Abstract

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection.

Introduction
  • The availability of large-scale datasets has been one of the important factors contributing to the recent success in machine learning for real-world tasks such as computer vision (Deng et al, 2009; Krizhevsky et al, 2012) and natural language processing (Devlin et al, 2019).
  • The standard workflow in developing systems for typical machine learning tasks is to train and validate the model on the dataset, and to deploy the model with its parameters fixed once the authors are satisfied with training.
  • This offline training allows them to address various operational requirements of the system without actual deployment, such as an acceptable level of prediction accuracy once the system goes online.
  • This workflow is not straightforwardly applicable to the standard setting of reinforcement learning (RL) (Sutton & Barto, 1998) because of the online learning assumption: the RL agent needs to continuously explore the environment and learn from its trial-and-error experiences to be properly trained.
Highlights
  • The availability of large-scale datasets has been one of the important factors contributing to the recent success in machine learning for real-world tasks such as computer vision (Deng et al, 2009; Krizhevsky et al, 2012) and natural language processing (Devlin et al, 2019)
  • We present an offline reinforcement learning (RL) algorithm that essentially eliminates the need to evaluate out-of-distribution actions, avoiding the problematic overestimation of values
  • We consider the reinforcement learning problem with the environment modeled as a Markov Decision Process (MDP)
  • We presented OptiDICE, an offline RL algorithm that aims to estimate stationary distribution corrections between the optimal policy’s stationary distribution and the dataset distribution
  • We formulated the estimation problem as a minimax optimization that does not involve sampling from the target policy, which essentially circumvents the overestimation issue caused by bootstrapped targets with out-of-distribution actions, a practice common to most model-free offline RL algorithms (a formulation sketch follows this list)
  • We demonstrated that OptiDICE performs competitively with the state-of-the-art offline RL baselines
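
As a rough sketch of the formulation summarized above (our notation and reconstruction, not a verbatim statement from the paper), OptiDICE can be read as directly optimizing a stationary distribution d, regularized by an f-divergence toward the dataset distribution d^D and constrained by the Bellman flow equations; dualizing the constraints yields the minimax problem, and solving its inner maximization in closed form leaves a convex problem over the dual variable alone:

    % LaTeX sketch (our reconstruction; alpha > 0 is the regularization strength,
    % T the transition kernel, p_0 the initial-state distribution, gamma the discount)
    \begin{align*}
    \max_{d \ge 0}\quad & \mathbb{E}_{(s,a) \sim d}[R(s,a)] \;-\; \alpha\, D_f\!\left(d \,\Vert\, d^D\right) \\
    \text{s.t.}\quad & \sum_{a} d(s,a) \;=\; (1-\gamma)\, p_0(s) \;+\; \gamma \sum_{s',a'} T(s \mid s', a')\, d(s',a') \qquad \forall s.
    \end{align*}
    % Writing w(s,a) = d(s,a) / d^D(s,a) and introducing Lagrange multipliers \nu(s)
    % for the flow constraints gives \min_{\nu} \max_{w \ge 0} L(w, \nu), whose inner
    % maximization has a closed-form solution, leaving a convex minimization over \nu.

The exact objective, the handling of the normalization constraint, and the subsequent policy-extraction step are as specified in the paper.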
Methods
  • The authors evaluate OptiDICE for both tabular and continuous MDPs. For the f-divergence, they choose f(x) = ½(x − 1)², i.e. the χ²-divergence, for the tabular-MDP experiment, while using a softened version of it for continuous MDPs (see Appendix E for details; see the code sketch after this list).
  • Random MDPs. The authors validate tabular OptiDICE’s efficiency and robustness using randomly generated MDPs by following the experimental protocol from Laroche et al (2019) and Lee et al (2020) (See Appendix F.1.).
  • The authors consider a data-collection policy πD characterized by a behavior optimality parameter ζ, such that πD's performance satisfies V^πD(s0) = ζ V^*(s0) + (1 − ζ) V^πunif(s0), where πunif denotes the uniformly random policy.
  • [Figure, panel (a): results shown as a function of the number of trajectories in D]
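
To make the χ²-divergence choice concrete, below is a minimal NumPy sketch (ours, not the authors' released code) of the generator f(x) = ½(x − 1)² and the resulting closed-form correction w*(s, a) = max(0, (f′)⁻¹(e_ν(s, a)/α)), where e_ν(s, a) = R(s, a) + γν(s′) − ν(s) is the residual that appears in the dual objective; the residual values used below are purely illustrative.

    import numpy as np

    # Sketch (ours): chi^2 generator f(x) = 0.5 * (x - 1)^2, as assumed for the tabular experiments.
    def f_chi2(x):
        return 0.5 * (x - 1.0) ** 2

    def f_prime_inv(y):
        # f'(x) = x - 1, so the inverse of the derivative is (f')^{-1}(y) = y + 1.
        return y + 1.0

    def closed_form_correction(e_nu, alpha):
        """Maximizer of w * e_nu - alpha * f(w) over w >= 0 (the inner problem)."""
        return np.maximum(0.0, f_prime_inv(e_nu / alpha))

    # Toy usage with hypothetical residuals e_nu(s, a):
    e_nu = np.array([-0.3, 0.1, 0.8])
    print(closed_form_correction(e_nu, alpha=1.0))  # [0.7 1.1 1.8]

Under this choice the correction is just a clipped, shifted residual, which is what makes the reduction from the nested minimax problem to a single convex minimization over ν tractable.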
Conclusion
  • The authors presented OptiDICE, an offline RL algorithm that aims to estimate stationary distribution corrections between the optimal policy’s stationary distribution and the dataset distribution.
  • The authors formulated the estimation problem as a minimax optimization that does not involve sampling from the target policy, which essentially circumvents the overestimation issue caused by bootstrapped targets with out-of-distribution actions, a practice common to most model-free offline RL algorithms.
  • Deriving the closed-form solution of the inner optimization, the authors simplified the nested minimax optimization for obtaining the optimal policy to a convex minimization problem.
  • The authors demonstrated that OptiDICE performs competitively with the state-of-the-art offline RL baselines.
Tables
  • Table 1: Normalized performance of OptiDICE compared with the best model-free baseline in the D4RL benchmark tasks (Fu et al., 2021). The Best baseline column reports the algorithm with the best performance among 8 algorithms (offline SAC (Haarnoja et al., 2018), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AWR (Peng et al., 2019), cREM (Agarwal et al., 2020), BCQ (Fujimoto et al., 2019), AlgaeDICE (Nachum et al., 2019b), CQL (Kumar et al., 2020)), taken from Fu et al. (2021). OptiDICE achieved the highest score in 7 tasks (see the normalization sketch after this list).
  • Table 2: Hyperparameters
  • Table 3: Normalized performance of OptiDICE compared with baselines. Mean scores for the baselines BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AlgaeDICE (Nachum et al., 2019b), and CQL (Kumar et al., 2020) come from the D4RL benchmark. We also report the performance of CQL (Kumar et al., 2020) obtained by running the code released by its authors (denoted CQL (ours) in the table). OptiDICE achieves the best performance on 6 tasks compared to our baselines. Note that 3-run mean scores without confidence intervals were reported for each task by Fu et al. (2021). For CQL (ours) and OptiDICE, we use 5 runs and report means and 95% confidence intervals.
  • Table 4: Hyperparameters for importance-weighted BC
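
For context on how the normalized scores in Tables 1 and 3 are computed, D4RL rescales raw returns so that a random policy maps to 0 and a reference expert maps to 100. A minimal sketch, assuming the d4rl package and its get_normalized_score helper (the environment name and the raw return below are illustrative only):

    import gym
    import d4rl  # registers the D4RL offline environments

    # Sketch: convert a raw evaluation return to the 0-100 normalized scale used in the tables.
    env = gym.make("halfcheetah-medium-v0")
    raw_return = 4500.0  # hypothetical average episode return of a trained policy
    normalized_score = env.get_normalized_score(raw_return) * 100.0
    print(f"D4RL-normalized score: {normalized_score:.1f}")

Here get_normalized_score returns (return − random_return) / (expert_return − random_return), hence the factor of 100 to match the 0-100 scale reported in the tables.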
Funding
  • This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019M3F2A1072238 and NRF-2019R1A2C1087634), and the Ministry of Science and ICT (MSIT) of Korea (IITP No. 2019-0-00075, IITP No. 2020-0-00940, and IITP No. 2017-0-01779 XAI).
  • We also acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute of Advanced Research (CIFAR)
References
  • Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  • Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML), 1995.
  • Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvari, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
  • Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 2005.
  • Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI), 2016.
  • Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning, 2021. URL https://openreview.net/forum?id=px0-N3_KjA.
  • Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680, 2014.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 2005.
  • Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog, 2019.
  • Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Kostrikov, I., Agrawal, K. K., Dwibedi, D., Levine, S., and Tompson, J. Discriminator-Actor-Critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019a.
  • Kostrikov, I., Nachum, O., and Tompson, J. Imitation learning via off-policy distribution matching. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019b.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
  • Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Lange, S., Gabel, T., and Riedmiller, M. Reinforcement learning: State-of-the-art. Springer Berlin Heidelberg, 2012.
  • Laroche, R., Trichelair, P., and Des Combes, R. T. Safe policy improvement with baseline bootstrapping. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • Lee, B.-J., Lee, J., Vrancx, P., Kim, D., and Kim, K.-E. Batch reinforcement learning with hyperparameter gradients. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  • Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020.
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
  • Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems (NeurIPS), 2019a.
  • Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.
  • Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 2005.
  • Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019.
  • Petrik, M., Ghavamzadeh, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1st edition, 1994.
  • Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning, 2017.
  • Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 1998.
  • Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
  • Szita, I. and Lorincz, A. The many faces of optimism: A unifying approach. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.
  • Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning, 2019.
  • Yang, M., Nachum, O., Dai, B., Li, L., and Schuurmans, D. Off-policy evaluation via the regularized lagrangian. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Yu, C., Liu, J., and Nemati, S. Reinforcement learning in healthcare: A survey, 2020a.
  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., and Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitask learning. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020b.
  • Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine, S., Finn, C., and Ma, T. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2020c.
  • Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020a.
  • Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2020b.