# Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Abstract:

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines.

Introduction

- Reinforcement learning provides a mathematical formalism for learning-based control. By utilizing reinforcement learning, the authors can automatically acquire near-optimal behavioral skills, represented by policies, for optimizing user-specified reward functions.
- The process of reinforcement learning involves iteratively collecting experience by interacting with the environment, typically with the latest learned policy, and using that experience to improve the policy (Sutton and Barto, 1998).
- In many settings, this sort of online interaction is impractical, either because data collection is expensive or because it is dangerous.
- Even in domains where online interaction is feasible, the authors might still prefer to utilize previously collected data instead – for example, if the domain is complex and effective generalization requires large datasets

Highlights

- Reinforcement learning provides a mathematical formalism for learning-based control
- In the absence of well-developed evaluation protocols, one approach employed in recent work is to utilize training data collected via a standard online reinforcement learning algorithm, using either the entire replay buffer of an off-policy algorithm for training (Kumar et al., 2019a; Agarwal et al., 2019; Fujimoto et al., 2018), or even data from the optimal policy
- This evaluation setting is rather unrealistic, since the entire point of utilizing offline reinforcement learning algorithms in the real world is to obtain a policy that is better than the best behavior in the dataset, potentially in settings where running reinforcement learning online is impractical due to cost or safety concerns
- A simple compromise solution is to utilize data from a “suboptimal” online reinforcement learning run, for example by stopping the online process early, saving out the buffer, and using this buffer as the dataset for offline RL (Kumar et al., 2019a). Even this formulation does not fully evaluate the capabilities of offline reinforcement learning methods, and the statistics of the training data have a considerable effect on the difficulty of offline RL (Fu et al., 2020), including how concentrated the data distribution is around a specific set of trajectories, and how multi-modal the trajectory data is. (Figure 4: an example of exploiting the compositional structure in trajectories.)
- Offline reinforcement learning offers the possibility of turning reinforcement learning – which is conventionally viewed as a fundamentally active learning paradigm – into a data-driven discipline, such that it can benefit from the same kind of “blessing of scale” that has proven so effective across a range of supervised learning application areas (LeCun et al, 2015)
- A reasonable question we might ask in regard to datasets for offline RL is: in which situations might we expect offline RL to yield a policy that is significantly better than any trajectory in the training set? While we cannot expect offline RL to discover actions that are better than any action illustrated in the data, we can expect it to effectively utilize the compositional structure inherent in any temporal process
- The standard off-policy training methods in these two categories have generally proven unsuitable for the kinds of complex domains typically studied in modern deep reinforcement learning
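The "compositional structure" point above can be made concrete with a minimal sketch: the toy chain MDP, the data, and all names below are our own illustration, not from the article. The offline dataset contains two separate suboptimal trajectories, one from state 0 to state 2 (which never sees reward) and one from state 2 to the goal; offline Q-iteration over the fixed dataset stitches them into a policy that reaches the goal from state 0, even though no single trajectory in the data does so.

```python
import numpy as np

# Toy deterministic chain MDP: states 0..4, actions 0 (left) / 1 (right),
# reward 1 for arriving at terminal state 4, 0 otherwise.
GAMMA = 0.9

def step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def rollout(start, actions):
    s, data = start, []
    for a in actions:
        s2, r, done = step(s, a)
        data.append((s, a, r, s2, done))
        s = s2
    return data

# Two suboptimal trajectories: 0 -> 2 (no reward seen) and 2 -> 4.
dataset = rollout(0, [1, 1]) + rollout(2, [1, 1])

# Tabular offline Q-iteration: repeated Bellman backups over the fixed
# dataset only -- no environment interaction during learning. Actions
# absent from the data keep Q = 0, which sidesteps the extrapolation
# error that function approximation would introduce.
Q = np.zeros((5, 2))
for _ in range(100):
    for s, a, r, s2, done in dataset:
        Q[s, a] = r if done else r + GAMMA * Q[s2].max()

# The greedy policy stitches the trajectories: from state 0 it reaches
# the goal in 4 steps, which neither trajectory demonstrated from 0.
s, steps = 0, 0
while s != 4 and steps < 10:
    s, _, _ = step(s, int(Q[s].argmax()))
    steps += 1
print(s, steps)  # -> 4 4
```

This is exactly the sense in which offline RL can exceed the best trajectory in the data: not by inventing unseen actions, but by recombining observed transitions.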

Results

**Evaluation and Benchmarks**

While individual application domains, such as recommender systems and healthcare, discussed below, have developed particular domain-specific evaluations, the general state of benchmarking for modern offline reinforcement learning research is less well established.

- A simple compromise solution is to utilize data from a “suboptimal” online reinforcement learning run, for example by stopping the online process early, saving out the buffer, and using this buffer as the dataset for offline RL (Kumar et al., 2019a)
- Even this formulation does not fully evaluate the capabilities of offline reinforcement learning methods, and the statistics of the training data have a considerable effect on the difficulty of offline RL (Fu et al., 2020), including how concentrated the data distribution is around a specific set of trajectories, and how multi-modal the trajectory data is. (Figure 4: an example of exploiting the compositional structure in trajectories.)
- The authors' recently proposed set of offline reinforcement learning benchmarks aims to provide standardized datasets and simulations that cover such difficult cases (Fu et al, 2020)
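The "suboptimal run" protocol described above can be sketched in a few lines; the chain environment, hyperparameters, and variable names below are our own toy choices, not from the cited work. An epsilon-greedy Q-learner is run online for deliberately few episodes, and its entire replay buffer is saved as the offline dataset.

```python
import random

random.seed(0)
N, GOAL, GAMMA, ALPHA, EPS = 6, 5, 0.9, 0.5, 0.3

def env_step(s, a):
    # Deterministic chain: action 1 moves right, action 0 moves left;
    # reward 1 for reaching the terminal goal state.
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    return s2, float(s2 == GOAL), s2 == GOAL

Q = [[0.0, 0.0] for _ in range(N)]
replay_buffer = []  # every transition is kept: this becomes the offline dataset
for _ in range(10):  # deliberately stopped early, long before convergence
    s, done, t = 0, False, 0
    while not done and t < 30:
        if random.random() < EPS or Q[s][0] == Q[s][1]:
            a = random.randrange(2)         # explore (or break ties randomly)
        else:
            a = int(Q[s][1] > Q[s][0])      # exploit current estimate
        s2, r, done = env_step(s, a)
        replay_buffer.append((s, a, r, s2, done))
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, t = s2, t + 1

# The saved buffer mixes exploratory and partially trained behavior --
# the kind of "suboptimal" data statistics that make offline RL non-trivial.
print(len(replay_buffer))
```

As the text notes, how concentrated or multi-modal such a buffer ends up being (how early the run is stopped, how much exploration noise it contains) strongly affects how hard the resulting offline problem is.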

Conclusion

- Offline reinforcement learning offers the possibility of turning reinforcement learning – which is conventionally viewed as a fundamentally active learning paradigm – into a data-driven discipline, such that it can benefit from the same kind of “blessing of scale” that has proven so effective across a range of supervised learning application areas (LeCun et al, 2015)
- Making this possible will require new innovations that bring to bear sophisticated statistical methods and combine them with the fundamentals of sequential decision making that are conventionally studied in reinforcement learning.
- These formulations have the potential to mitigate the shortcomings of early methods by explicitly accounting for the key challenge in offline RL: distributional shift due to differences between the learned policy and the behavior policy
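The distributional-shift failure mode can be illustrated with a stylized simulation (our own construction, not a method from the article): value estimates are reliable only for actions the behavior policy actually took, so maximizing over all actions preferentially selects the unreliable, extrapolated ones. A simple support constraint, maximizing only over in-dataset actions, mitigates this.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 1000, 4
true_Q = rng.uniform(0.0, 1.0, (S, A))   # ground-truth action values

seen = np.zeros((S, A), dtype=bool)      # behavior policy only took actions 0 and 1
seen[:, :2] = True

# Seen actions carry small estimation noise; unseen actions are pure
# extrapolation error, which can be arbitrarily optimistic.
Q_hat = np.where(seen,
                 true_Q + rng.normal(0.0, 0.05, (S, A)),
                 rng.uniform(0.0, 2.0, (S, A)))

naive_pi = Q_hat.argmax(axis=1)                                  # ignores support
constrained_pi = np.where(seen, Q_hat, -np.inf).argmax(axis=1)   # support constraint

naive_value = true_Q[np.arange(S), naive_pi].mean()
constrained_value = true_Q[np.arange(S), constrained_pi].mean()
print(constrained_value > naive_value)  # -> True
```

The naive maximizer is drawn to the spuriously high out-of-distribution estimates and achieves a lower true value; constraining the policy toward the behavior distribution is, in essence, what the newer offline RL formulations formalize.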

Summary

## Introduction:

Reinforcement learning provides a mathematical formalism for learning-based control. By utilizing reinforcement learning, the authors can automatically acquire near-optimal behavioral skills, represented by policies, for optimizing user-specified reward functions.

- The process of reinforcement learning involves iteratively collecting experience by interacting with the environment, typically with the latest learned policy, and using that experience to improve the policy (Sutton and Barto, 1998).
- In many settings, this sort of online interaction is impractical, either because data collection is expensive or because it is dangerous.
- Even in domains where online interaction is feasible, the authors might still prefer to utilize previously collected data instead – for example, if the domain is complex and effective generalization requires large datasets
## Objectives:

The authors aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection.

- The goal of this article is to provide the reader with the conceptual tools needed to get started on research in the field of offline reinforcement learning (also called batch reinforcement learning (Ernst et al., 2005; Lange et al., 2012)), so as to hopefully begin addressing some of these deficiencies
## Results:

**Evaluation and Benchmarks**

While individual application domains, such as recommender systems and healthcare, discussed below, have developed particular domain-specific evaluations, the general state of benchmarking for modern offline reinforcement learning research is less well established.

- A simple compromise solution is to utilize data from a “suboptimal” online reinforcement learning run, for example by stopping the online process early, saving out the buffer, and using this buffer as the dataset for offline RL (Kumar et al., 2019a)
- Even this formulation does not fully evaluate the capabilities of offline reinforcement learning methods, and the statistics of the training data have a considerable effect on the difficulty of offline RL (Fu et al., 2020), including how concentrated the data distribution is around a specific set of trajectories, and how multi-modal the trajectory data is. (Figure 4: an example of exploiting the compositional structure in trajectories.)
- The authors' recently proposed set of offline reinforcement learning benchmarks aims to provide standardized datasets and simulations that cover such difficult cases (Fu et al, 2020)
## Conclusion:

Offline reinforcement learning offers the possibility of turning reinforcement learning – which is conventionally viewed as a fundamentally active learning paradigm – into a data-driven discipline, such that it can benefit from the same kind of “blessing of scale” that has proven so effective across a range of supervised learning application areas (LeCun et al., 2015).

- Making this possible will require new innovations that bring to bear sophisticated statistical methods and combine them with the fundamentals of sequential decision making that are conventionally studied in reinforcement learning.
- These formulations have the potential to mitigate the shortcomings of early methods by explicitly accounting for the key challenge in offline RL: distributional shift due to differences between the learned policy and the behavior policy


References

- Agarwal, R., Schuurmans, D., and Norouzi, M. (2019). An optimistic perspective on offline reinforcement learning. arXiv preprint arXiv:1907.04543.
- Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
- Bagnell, J. A. and Schneider, J. G. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), volume 2, pages 1615–1620. IEEE.
- Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. (2016). Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510.
- Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918.
- Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
- Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260.
- Cabi, S., Colmenarejo, S. G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zołna, K., Aytar, Y., Budden, D., Vecerik, M., et al. (2019). A framework for data-driven robotics. arXiv preprint arXiv:1909.12200.
- Chebotar, Y., Handa, A., Makoviychuk, V., Macklin, M., Issac, J., Ratliff, N., and Fox, D. (2019). Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979. IEEE.
- Chen, Y. and Wang, M. (2016). Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516.
- Cheng, C.-A., Yan, X., and Boots, B. (2019). Trajectory-wise control variates for variance reduction in policy gradient methods. arXiv preprint arXiv:1908.03263.
- Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765.
- Codevilla, F., Miiller, M., López, A., Koltun, V., and Dosovitskiy, A. (2018). End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE.
- Dai, B., He, N., Pan, Y., Boots, B., and Song, L. (2016). Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579.
- Dai, B., Shaw, A., He, N., Li, L., and Song, L. (2017a). Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282.
- Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. (2017b). Sbeed: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285.
- Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., and Finn, C. (2019). Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215.
- Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
- Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472.
- Dosovitskiy, A. and Koltun, V. (2016). Learning to act by predicting the future. arXiv preprint arXiv:1611.01779.
- Dudík, M., Erhan, D., Langford, J., Li, L., et al. (2014). Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511.
- Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. (2018). Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568.
- Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
- Eysenbach, B., Gu, S., Ibarz, J., and Levine, S. (2017). Leave no trace: Learning to reset for safe and autonomous reinforcement learning. arXiv preprint arXiv:1711.06782.
- Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576.
- Farajtabar, M., Chow, Y., and Ghavamzadeh, M. (2018). More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493.
- Finn, C. and Levine, S. (2017). Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE.
- Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. (2020). D4rl: Datasets for deep data-driven reinforcement learning. In arXiv.
- Fu, J., Kumar, A., Soh, M., and Levine, S. (2019). Diagnosing bottlenecks in deep Q-learning algorithms. arXiv preprint arXiv:1902.10250.
- Fujimoto, S., Meger, D., and Precup, D. (2018). Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900.
- Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059.
- Garcin, F., Faltings, B., Donatsch, O., Alazzawi, A., Bruttin, C., and Huber, A. (2014). Offline and online evaluation of news recommender systems at swissinfo. ch. In Proceedings of the 8th ACM Conference on Recommender systems, pages 169–176.
- Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655.
- Ghavamzadeh, M., Petrik, M., and Chow, Y. (2016). Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306.
- Gilotte, A., Calauzènes, C., Nedelec, T., Abraham, A., and Dollé, S. (2018). Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206.
- Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. (2019). Guidelines for reinforcement learning in healthcare. Nat Med, 25(1):16–18.
- Gottesman, O., Johansson, F., Meier, J., Dent, J., Lee, D., Srinivasan, S., Zhang, L., Ding, Y., Wihl, D., Peng, X., et al. (2018). Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298.
- Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017a). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE.
- Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017b). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in neural information processing systems, pages 3846–3855.
- Guez, A., Vincent, R. D., Avoli, M., and Pineau, J. (2008). Adaptive treatment of epilepsy via batch-mode reinforcement learning. In AAAI, pages 1671–1678.
- Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energybased policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In arXiv.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
- Hafner, R. and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine learning, 84(1-2):137–169.
- Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1372–1383. JMLR. org.
- Hallak, A., Tamar, A., and Mannor, S. (2015). Emphatic td bellman operator is a contraction. arXiv preprint arXiv:1508.03411.
- Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2016). Generalized emphatic temporal difference learning: Bias-variance analysis. In Thirtieth AAAI Conference on Artificial Intelligence.
- Hanna, J. P., Stone, P., and Niekum, S. (2017). Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence.
- Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952.
- Henderson, J., Lemon, O., and Georgila, K. (2008). Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511.
- Huang, J. and Jiang, N. (2019). From importance sampling to doubly robust policy gradient. arXiv preprint arXiv:1910.09066.
- Imani, E., Graves, E., and White, M. (2018). An off-policy policy gradient theorem using emphatic weightings. In Advances in Neural Information Processing Systems, pages 96–106.
- Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
- Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509.
- Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. (2019). Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
- Jiang, N. and Li, L. (2015). Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722.
- Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035.
- Kahn, G., Abbeel, P., and Levine, S. (2020). Badgr: An autonomous self-supervised learning-based navigation system. arXiv preprint arXiv:2002.05700.
- Kahn, G., Villaflor, A., Abbeel, P., and Levine, S. (2018). Composable action-conditioned predictors: Flexible off-policy learning for robot navigation. arXiv preprint arXiv:1810.07167.
- Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. (2019a). Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374.
- Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. (2019b). Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374.
- Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), volume 2.
- Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538.
- Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. (2018). Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673.
- Kallus, N. and Uehara, M. (2019a). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
- Kallus, N. and Uehara, M. (2019b). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems, pages 3320– 3329.
- Kandasamy, K., Bachrach, Y., Tomioka, R., Tarlow, D., and Carter, D. (2017). Batch policy gradient methods for improving neural conversation models. arXiv preprint arXiv:1702.03334.
- Kendall, A. and Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584.
- Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A. (2019). Learning to drive in a day. In 2019 International Conference on Robotics and Automation (ICRA), pages 8248–8254. IEEE.
- Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589.
- Ko, J., Klein, D. J., Fox, D., and Haehnel, D. (2007). Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Proceedings 2007 ieee international conference on robotics and automation, pages 742–747. IEEE.
- Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014.
- Koppejan, R. and Whiteson, S. (2009). Neuroevolutionary reinforcement learning for generalized helicopter control. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 145–152.
- Kumar, A. (2019). Data-driven deep reinforcement learning. https://bair.berkeley.edu/blog/2019/12/05/bear/. BAIR Blog.
- Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019a). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11761–11771.
- Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019b). Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems (NeurIPS).
- Kumar, A. and Gupta, A. (2020). Does on-policy data collection fix errors in reinforcement learning? https://bair.berkeley.edu/blog/2020/03/16/discor/. BAIR Blog.
- Kumar, A., Gupta, A., and Levine, S. (2020). Discor: Corrective feedback in reinforcement learning via distribution correction. arXiv preprint arXiv:2003.07305.
- Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149.
- Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer.
- Langford, J., Strehl, A., and Wortman, J. (2008). Exploration scavenging. In Proceedings of the 25th international conference on Machine learning, pages 528–535.
- Laroche, R., Trichelair, P., and Combes, R. T. d. (2017). Safe policy improvement with baseline bootstrapping. arXiv preprint arXiv:1712.06924.
- LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–444.
- Lee, D. and He, N. (2018). Stochastic primal-dual q-learning. arXiv preprint arXiv:1810.08298.
- Lerer, A., Gross, S., and Fergus, R. (2016). Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312.
- Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.
- Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.
- Levine, S. and Koltun, V. (2013). Guided policy search. In International Conference on Machine Learning, pages 1–9.
- Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436.
- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321.
- Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366.
- Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2019). Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.
- Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2018). Algorithmic framework for modelbased deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858.
- Maddern, W., Pascoe, G., Linegar, C., and Newman, P. (2017). 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15.
- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Mo, K., Li, H., Lin, Z., and Lee, J.-Y. (2018). The adobeindoornav dataset: Towards deep reinforcement learning based real-world indoor robot visual navigation. arXiv preprint arXiv:1802.08824.
- Mousavi, A., Li, L., Liu, Q., and Zhou, D. (2020). Black-box off-policy estimation for infinite-horizon reinforcement learning. arXiv preprint arXiv:2003.11126.
- Murphy, S. A., van der Laan, M. J., Robins, J. M., and Group, C. P. P. R. (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423.
- Nachum, O., Chow, Y., Dai, B., and Li, L. (2019a). Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2315–2325.
- Nachum, O. and Dai, B. (2020). Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866.
- Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. (2019b). Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
- Nadjahi, K., Laroche, R., and Combes, R. T. d. (2019). Safe policy improvement with soft baseline bootstrapping. arXiv preprint arXiv:1907.05079.
- Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
- Nie, X., Brunskill, E., and Wager, S. (2019). Learning when-to-treat policies. arXiv preprint arXiv:1905.09751.
- Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279.
- Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871.
- Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.
- Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034.
- Osband, I. and Van Roy, B. (2017). Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org.
- O’Donoghue, B., Osband, I., Munos, R., and Mnih, V. (2018). The uncertainty bellman equation and exploration. In International Conference on Machine Learning, pages 3836–3845.
- Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. (2017). Agile autonomous driving using end-to-end deep imitation learning. arXiv preprint arXiv:1709.07174.
- Pankov, S. (2018). Reward-estimation variance elimination in sequential decision processes. arXiv preprint arXiv:1811.06225.
- Parmas, P., Rasmussen, C. E., Peters, J., and Doya, K. (2019). Pipps: Flexible model-based policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240.
- Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 752–759.
- Peng, X. B., Kumar, A., Zhang, G., and Levine, S. (2019). Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
- Peshkin, L. and Shelton, C. R. (2002). Learning from scarce experience. arXiv preprint cs/0204043.
- Pietquin, O., Geist, M., Chandramohan, S., and Frezza-Buet, H. (2011). Sample-efficient batch reinforcement learning for dialogue management optimization. ACM Transactions on Speech and Language Processing (TSLP), 7(3):1–21.
- Pinto, L. and Gupta, A. (2016). Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE.
- Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M., and Engelhardt, B. E. (2017). A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300.
- Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80.
- Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In ICML, pages 417–424.
- Raghu, A., Komorowski, M., Ahmed, I., Celi, L., Szolovits, P., and Ghassemi, M. (2017). Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602.
- Rhinehart, N., McAllister, R., and Levine, S. (2018). Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544.
- Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 661–668.
- Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 627–635.
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252.
- Sachdeva, N., Su, Y., and Joachims, T. (2020). Off-policy bandits with deficient support.
- Sadeghi, F. and Levine, S. (2017). CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems.
- Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. (2019). Distributionally robust neural networks. In International Conference on Learning Representations.
- Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S. (2017). Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76.
- Schölkopf, B. (2019). Causality for machine learning. arXiv preprint arXiv:1911.10500.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shortreed, S. M., Laber, E., Lizotte, D. J., Stroup, T. S., Pineau, J., and Murphy, S. A. (2011). Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine learning, 84(1-2):109–136.
- Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., and Riedmiller, M. (2020). Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396.
- Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML).
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354–359.
- Sinha, A., Namkoong, H., and Duchi, J. (2017). Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
- Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. (2009). On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698.
- Strehl, A., Langford, J., Li, L., and Kakade, S. M. (2010). Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pages 2217–2225.
- Sun, L., Peng, C., Zhan, W., and Tomizuka, M. (2018a). A fast integrated planning and control framework for autonomous driving via imitation learning. In Dynamic Systems and Control Conference, volume 51913, page V003T37A012. American Society of Mechanical Engineers.
- Sun, W., Gordon, G. J., Boots, B., and Bagnell, J. (2018b). Dual policy iteration. In Advances in Neural Information Processing Systems, pages 7059–7069.
- Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163.
- Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000.
- Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
- Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., and Zitouni, I. (2017). Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, pages 3632–3642.
- Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162.
- Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
- Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. (2019). Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186.
- Tassa, Y., Erez, T., and Todorov, E. (2012). Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE.
- Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215–219.
- Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. (2015). Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- Thomas, P. (2014). Bias in natural actor-critic algorithms. In International Conference on Machine Learning (ICML).
- Thomas, P. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148.
- Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. (2015a). High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. (2017). Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Twenty-Ninth IAAI Conference.
- Todorov, E. (2006). Linearly-solvable markov decision problems. In Advances in Neural Information Processing Systems (NIPS).
- Tseng, H.-H., Luo, Y., Cui, S., Chien, J.-T., Ten Haken, R. K., and El Naqa, I. (2017). Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705.
- Uehara, M. and Jiang, N. (2019). Minimax weight and q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809.
- Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
- Vanseijen, H. and Sutton, R. (2015). A deeper look at planning as learning from replay. In International conference on machine learning, pages 2314–2322.
- Wang, L., Zhang, W., He, X., and Zha, H. (2018). Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2447–2456.
- Wang, M. and Chen, Y. (2016). An online primal-dual method for discounted markov decision processes. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4516–4521. IEEE.
- Wang, Y.-X., Agarwal, A., and Dudik, M. (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3589–3597. JMLR.org.
- Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
- Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.
- Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. (2015). Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754.
- Wu, Y., Tucker, G., and Nachum, O. (2019a). Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- Wu, Y., Winston, E., Kaushik, D., and Lipton, Z. (2019c). Domain adaptation with asymmetrically-relaxed distribution alignment. arXiv preprint arXiv:1903.01689.
- Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., and Darrell, T. (2018). Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687.
- Yu, H. (2015). On convergence of emphatic temporal-difference learning. In Conference on Learning Theory, pages 1724–1751.
- Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. (2020). A survey of autonomous driving: Common practices and emerging technologies. IEEE Access.
- Zeng, A., Song, S., Welker, S., Lee, J., Rodriguez, A., and Funkhouser, T. (2018). Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE.
- Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. (2018). Solar: deep structured representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105.
- Zhang, R., Dai, B., Li, L., and Schuurmans, D. (2020). Gendice: Generalized offline estimation of stationary values. In International Conference on Learning Representations.
- Zhang, S., Boehmer, W., and Whiteson, S. (2019). Generalized off-policy actor-critic. In Advances in Neural Information Processing Systems, pages 1999–2009.
- Zhou, L., Small, K., Rokhlenko, O., and Elkan, C. (2017). End-to-end offline goal-oriented dialog policy learning via policy gradient. arXiv preprint arXiv:1712.02838.