# BRPO: Batch Residual Policy Optimization

IJCAI 2020, pp. 2824–2830.

Abstract:

In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exp…

Introduction

- Deep reinforcement learning (RL) methods are increasingly successful in domains such as games (Mnih et al, 2013), recommender systems (Gauci et al, 2018), and robotic manipulation (Nachum et al, 2019).
- Any off-policy RL algorithm (e.g., DDPG (Lillicrap et al, 2015), DDQN (Hasselt et al, 2016)) may be used in this batch fashion; but in practice, such methods have been shown to fail to learn when presented with arbitrary, static, off-policy data
- This can arise for several reasons, e.g., lack of exploration (Lange et al, 2012). Various techniques have been proposed to address these issues, many of which can be interpreted as constraining or regularizing the learned policy to be close to the behavior policy (Fujimoto et al, 2018; Kumar et al, 2019).
- In domains for which batch RL is well-suited, such closeness guarantees can be critical to the deployment of the resulting RL policies
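Many of the regularization techniques mentioned above amount to penalizing divergence from the behavior policy. A minimal NumPy sketch of such a KL-penalized objective (the function names and the penalty weight `alpha` are illustrative assumptions, not from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def regularized_objective(expected_return, learned_probs, behavior_probs, alpha=0.1):
    """Expected return penalized by the learned policy's KL deviation
    from the behavior policy, as in KL-regularized batch RL objectives."""
    return expected_return - alpha * kl_divergence(learned_probs, behavior_probs)
```

With identical distributions the penalty vanishes and the objective equals the expected return; any deviation from the behavior policy lowers it.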

Highlights

- Deep reinforcement learning (RL) methods are increasingly successful in domains such as games (Mnih et al, 2013), recommender systems (Gauci et al, 2018), and robotic manipulation (Nachum et al, 2019)
- To illustrate the effectiveness of BRPO, we compare against six baselines: DQN (Mnih et al, 2013), discrete Batch-Constrained Q-learning (BCQ) (Fujimoto et al, 2019), KL-regularized Q-learning (KL-Q) (Jaques et al, 2019), SPIBB (Laroche and Trichelair, 2017), Behavior Cloning (BC) (Kober and Peters, 2010), and BRPO-C, a simplified version of BRPO that uses a constant parameter as the confidence weight
- We have presented Batch Residual Policy Optimization (BRPO) for learning residual policies in batch reinforcement learning settings
- Inspired by conservative policy improvement, we derived learning rules for jointly optimizing both the candidate policy and state-action dependent confidence mixture of a residual policy to maximize a conservative lower bound on policy performance
- BRPO is more exploitative in areas of the state space that are well covered by the batch data, and more conservative elsewhere
- While we have shown successful application of BRPO to various benchmarks, future work includes deriving a finite-sample analysis of BRPO and applying it to more practical batch domains
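The state-action dependent confidence mixture at the heart of BRPO can be sketched as follows; the renormalization step and the concrete confidence values are illustrative assumptions, since the paper's exact parameterization is not reproduced here:

```python
import numpy as np

def mixture_policy(candidate_probs, behavior_probs, confidence):
    """Mix a candidate policy with the behavior policy using a
    (possibly state-action dependent) confidence weight lambda,
    then renormalize so the result is a valid distribution."""
    lam = np.asarray(confidence, float)
    mixed = (lam * np.asarray(candidate_probs, float)
             + (1.0 - lam) * np.asarray(behavior_probs, float))
    return mixed / mixed.sum()
```

With confidence 0 everywhere the agent falls back to the behavior policy; with confidence 1 it follows the candidate. A per-action confidence vector makes the mixture state-action dependent, which is what distinguishes BRPO from constant-λ mixing such as BRPO-C or classic CPI.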

Results

- To illustrate the effectiveness of BRPO, the authors compare against six baselines: DQN (Mnih et al, 2013), discrete BCQ (Fujimoto et al, 2019), KL-regularized Q-learning (KL-Q) (Jaques et al, 2019), SPIBB (Laroche and Trichelair, 2017), Behavior Cloning (BC) (Kober and Peters, 2010), and BRPO-C, a simplified version of BRPO that uses a constant parameter as the confidence weight.
- It is generally inferior to BRPO-C because candidate policy learning does not optimize the performance of the final mixture policy.
- The behavior policy in each environment is trained using standard DQN until it reaches 75% of optimal performance, similar to the process adopted in related work (e.g., Fujimoto et al (2018)).

Conclusion

- The authors have presented Batch Residual Policy Optimization (BRPO) for learning residual policies in batch RL settings.
- Inspired by conservative policy improvement (CPI), the authors derived learning rules for jointly optimizing both the candidate policy and the state-action dependent confidence mixture of a residual policy to maximize a conservative lower bound on policy performance.
- While the authors have shown successful application of BRPO to various benchmarks, future work includes deriving a finite-sample analysis of BRPO and applying it to more practical batch domains


- Table 1: The mean and standard deviation of average return with the best hyperparameter configuration (top-2 results boldfaced). Full training curves are given in Figure 1 in the appendix. For BRPO-C, the optimal confidence parameter is found by grid search
- Table 2: The range of hyperparameters swept over and the final hyperparameters used for the baselines (BC, BCQ, SARSA, DQN, and KL-Q)
- Table 3: The range of hyperparameters swept over and the final hyperparameters used for the proposed methods (BRPO and BRPO-C)

Related work

- Similar to the above policy formulation, CPI (Kakade and Langford, 2002) also develops a policy-mixing methodology that guarantees performance improvement when the confidence λ is a constant. However, CPI is an online algorithm, and it learns the candidate policy independently of (not jointly with) the mixing factor; thus, the extension of CPI to the offline, batch setting is unclear. Other existing work also deals with online residual policy learning without jointly learning mixing factors (Johannink et al, 2019; Silver et al, 2018). Common applications of CPI may treat λ as a hyperparameter specifying the maximum total-variation distance between the learned and behavior policy distributions (see standard proxies in Schulman et al (2015) and Pirotta et al (2013) for details).

Batch-Constrained Q-learning (BCQ) (Fujimoto et al, 2018, 2019) incorporates the behavior policy when defining the admissible action set in Q-learning, selecting the highest-valued actions that are similar to data samples in the batch. BEAR (Kumar et al, 2019) is motivated as a means to control the accumulation of out-of-distribution value errors, but its main algorithmic contribution is a regularizer added to the loss that measures the kernel maximum mean discrepancy (MMD) (Gretton et al, 2007) between the learned and behavior policies, similar to KL-control (Jaques et al, 2019). Algorithms such as SPI (Ghavamzadeh et al, 2016) and SPIBB (Laroche and Trichelair, 2017) bootstrap the learned policy with the behavior policy when the uncertainty in the update for the current state-action pair is high, where uncertainty is measured by the visitation frequency of state-action pairs in the batch data. While these methods work well in some applications, it is unclear whether they have any performance guarantees.
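The discrete BCQ action filter described above can be sketched as follows; the relative-probability threshold of 0.3 and the helper name are illustrative defaults, not taken verbatim from the paper:

```python
import numpy as np

def bcq_select_action(q_values, behavior_probs, threshold=0.3):
    """Discrete BCQ-style selection: keep only actions whose estimated
    behavior probability is at least `threshold` times the largest one,
    then act greedily on Q over that admissible set."""
    q_values = np.asarray(q_values, float)
    behavior_probs = np.asarray(behavior_probs, float)
    admissible = behavior_probs / behavior_probs.max() >= threshold
    # Mask out inadmissible actions before the greedy argmax.
    return int(np.argmax(np.where(admissible, q_values, -np.inf)))
```

Setting the threshold to 0 recovers plain greedy Q-learning; raising it constrains the agent ever more tightly to actions the behavior policy would plausibly take.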

Reference

- Y. Abbasi-Yadkori, P. Bartlett, X. Chen, and A. Malek. Large-scale Markov decision problems via the linear programming dual. arXiv:1901.01992, 2019.
- B. Amos and Z. Kolter. Optnet: Differentiable optimization as a layer in neural networks. ICML-17, pp.136–145. 2017.
- F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des sciences de Toulouse: Mathématiques, 14:331–352, 2005.
- S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
- P. De Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Op. Res., 51(6):850–865, 2003.
- L. Faybusovich and J. Moore. Infinite-dimensional quadratic optimization: interior-point methods and control applications. Appl. Math. and Opt., 36(1):43–66, 1997.
- S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. arXiv:1812.02900, 2018.
- S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv:1910.01708, 2019.
- J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv:1811.00260, 2018.
- M. Ghavamzadeh, M. Petrik, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. NeurIPS-16, pp.2298–2306, 2016.
- A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel approach to comparing distributions. AAAI-07, pp.1637–1641, 2007.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290, 2018.
- H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. AAAI-16, pp.2094–2200, 2016.
- D. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
- N. Jaques, A. Ghandeharioun, J. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv:1907.00456, 2019.
- T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. ICRA-19, pp.6023–6029, 2019.
- S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML-02, pp.267–274, 2002.
- E. Knight and O. Lerner. Natural gradient deep Q-learning. arXiv:1803.07482, 2018.
- J. Kober and J. Peters. Imitation and reinforcement learning. IEEE Rob. & Autom., 17(2):55–62, 2010.
- A. Kumar, J. Fu, G. Tucker, and S. Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv:1906.00949, 2019.
- R. Laroche and P. Trichelair. Safe policy improvement with baseline bootstrapping. arXiv:1712.06924, 2017.
- T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.
- A. Mahmood, H. van Hasselt, and R. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. NeurIPS-14, pp.3014–3022, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
- O. Nachum, M. Ahn, H. Ponte, S. Gu, and V. Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. CoRL-19, 2019.
- B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. Combining policy gradient and Q-learning. arXiv:1611.01626, 2016.
- M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. ICML-13, pp.307–315, 2013.
- A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv:1511.06295, 2015.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. ICML-15, pp.1889–1897, 2015.
- A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
- T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning. arXiv:1812.06298, 2018.
- R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
- Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imag. Sci., 6(3):1758–1789, 2013.
- This approach of combining the optimal Bellman operator with its on-policy counterpart belongs to the general class of hybrid on/off-policy RL algorithms (O’Donoghue et al, 2016). Therefore, we learn an advantage function W that is a weighted combination of Aβ and Aπ∗. Using the batch data B, the expected advantage Aβ can be learned with any critic-learning technique, such as SARSA (Sutton and Barto, 2018). We can learn Aπ∗ with DQN (Mnih et al, 2013) or another Q-learning algorithm. We provide pseudo-code of our BRPO algorithm in Algorithm 1.
- D.1 Behavior policy: We train the behavior policy using DQN, with the architecture and hyperparameters specified in Section D.2. The behavior policy was trained for each task until its performance reached around 75% of optimal, similar to Fujimoto et al (2018) and Kumar et al (2019). Specifically, we trained the behavior policy for 100,000 steps for LunarLander-v2, and 50,000 steps for CartPole-v1 and Acrobot-v1. We used a two-layer MLP with FC(32)-FC(16). The replay buffer size is 500,000 and the batch size is 64. The performance of the behavior policies is given in Table 1.
- Hyperparameters swept: soft target update rate (τ); soft target update period; discount factor; mini-batch size; Q-function learning rates; neural network optimizer; [BCQ] behavior policy threshold (τ in Fujimoto et al (2019)); [SPIBB] bootstrapping set threshold; [KL-Q] KL-regularization weight
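The advantage combination described in the bullets above, a weighted mix of the behavior advantage Aβ (e.g. from SARSA) and the optimal advantage Aπ∗ (e.g. from DQN), could be sketched as follows; the scalar interpolation weight is an illustrative free parameter, not the paper's learned quantity:

```python
import numpy as np

def combined_advantage(a_behavior, a_optimal, weight):
    """W(s, a) as a convex combination of the behavior-policy advantage
    (learned on-policy, e.g. by SARSA) and the optimal advantage
    (learned off-policy, e.g. by DQN)."""
    w = float(weight)
    return w * np.asarray(a_behavior, float) + (1.0 - w) * np.asarray(a_optimal, float)
```

At weight 1 the update trusts only the on-policy estimate; at weight 0 it reduces to the purely off-policy one, mirroring the hybrid on/off-policy view cited above.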
