# Constrained episodic reinforcement learning in concave-convex and knapsack settings

NeurIPS 2020.

Abstract:

We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear objectives and linear constraints, and does not handle the concave-convex and knapsack settings considered here.

Introduction

- Standard reinforcement learning (RL) approaches seek to maximize a scalar reward (Sutton and Barto, 1998, 2018; Schulman et al., 2015; Mnih et al., 2015), but in many settings this is insufficient, because the desired properties of the agent's behavior are better described using constraints.
- Concurrent approaches to constrained RL focus on linear reward objectives and linear constraints, and do not handle the concave-convex and knapsack settings that the authors consider.

Highlights

- Standard reinforcement learning (RL) approaches seek to maximize a scalar reward (Sutton and Barto, 1998, 2018; Schulman et al., 2015; Mnih et al., 2015), but in many settings this is insufficient, because the desired properties of the agent's behavior are better described using constraints
- In this paper we study constrained episodic reinforcement learning, which encompasses all of these applications
- Our learning algorithms optimize their actions with respect to a model based on the empirical statistics, while optimistically overestimating rewards and underestimating the resource consumption. This idea was previously introduced in multi-armed bandits (Agrawal and Devanur, 2014); extending it to episodic reinforcement learning poses additional challenges since the policy space is exponential in the episode horizon. Circumventing these challenges, we provide a modular way to analyze this approach in the basic setting where both rewards and constraints are linear (Section 3) and transfer this result to the more complicated concave-convex and knapsack settings (Sections 4 and 5)
- We introduce a simple algorithm that simultaneously bounds reward and consumption regret in the basic setting introduced in the previous section. Even in this basic setting, we provide the first sample-efficient guarantees in constrained episodic reinforcement learning
- Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments
- Analogous to (1), the learner wishes to compete against the following benchmark, which can be viewed as a reinforcement learning variant of the benchmark used by Agrawal and Devanur (2014) in multi-armed bandits: max_π f( E^{π,p}[ Σ_{h=1}^H r(s_h, a_h) ] ) subject to g( E^{π,p}[ Σ_{h=1}^H c(s_h, a_h) ] ) ≤ 0
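The optimistic-estimation idea above (overestimate rewards, underestimate consumption) can be sketched as follows. The Hoeffding-style count-based bonus and the clipping ranges are illustrative assumptions, not the paper's exact bonus:

```python
import numpy as np

def optimistic_estimates(emp_reward, emp_consumption, visit_counts, delta=0.05):
    """Optimism under uncertainty: overestimate per-step rewards and
    underestimate per-step resource consumption with a count-based bonus.
    The Hoeffding-style bonus below is illustrative, not the paper's exact choice."""
    bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(visit_counts, 1)))
    opt_reward = np.minimum(emp_reward + bonus, 1.0)            # optimistic, clipped to [0, 1]
    opt_consumption = np.maximum(emp_consumption - bonus, 0.0)  # pessimistic, clipped at 0
    return opt_reward, opt_consumption
```

The planner would then compute the best policy in the model defined by these estimates, so that under-explored state-action pairs look both rewarding and resource-cheap.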

Results

- In the basic setting (Section 3), the learner wishes to maximize reward while respecting the consumption constraints in expectation, competing favorably against the following benchmark: max_π E^{π,p}[ Σ_{h=1}^H r(s_h, a_h) ] subject to E^{π,p}[ Σ_{h=1}^H c_i(s_h, a_h) ] ≤ ξ_i for every resource i.
- The authors' main results hold more generally for concave reward objective and convex consumption constraints (Section 4) and extend to the knapsack setting where constraints are hard (Section 5).
- Even in this basic setting, the authors provide the first sample-efficient guarantees in constrained episodic reinforcement learning.
- The authors extend the algorithm and guarantees derived for the basic setting to the case where the objective is a concave function of the accumulated reward and the constraint is a convex function of the cumulative consumption.
- There is a concave reward-objective function f : R → R and a convex consumption-objective function g : R^d → R; the only assumption is that these functions are L-Lipschitz for some constant L, i.e., |f(x) − f(y)| ≤ L|x − y| for any x, y ∈ R, and |g(x) − g(y)| ≤ L‖x − y‖₁ for any x, y ∈ R^d. Analogous to (1), the learner wishes to compete against the following benchmark, which can be viewed as a reinforcement learning variant of the benchmark used by Agrawal and Devanur (2014) in multi-armed bandits: max_π f( E^{π,p}[ Σ_{h=1}^H r(s_h, a_h) ] ) subject to g( E^{π,p}[ Σ_{h=1}^H c(s_h, a_h) ] ) ≤ 0.
- To extend the guarantee of the basic setting to the concave-convex setting, the authors face an additional challenge: it is not immediately clear that the optimal policy π is feasible for the ConvexConPlanner program, since ConvexConPlanner is defined with respect to the empirical transition probabilities p(k). The authors use a novel application of the mean-value theorem to show that π is a feasible solution of that program.
- With probability 1 − δ, the algorithm in the concave-convex setting has reward and consumption regret upper bounded by L · RewReg and Ld · ConsReg respectively.
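As a concrete instance of such objectives (illustrative, not from the paper), the f below caps the usable reward at a budget B and the g measures overshoot of the total consumption beyond B; both satisfy the required Lipschitz conditions with L = 1:

```python
import numpy as np

B = 5.0
f = lambda x: min(x, B)                       # concave, 1-Lipschitz w.r.t. |x - y|
g = lambda x: max(float(np.sum(x)) - B, 0.0)  # convex, 1-Lipschitz w.r.t. the l1 norm

# Spot-check the Lipschitz conditions on random points.
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.uniform(0.0, 10.0, size=2)
    assert abs(f(x) - f(y)) <= abs(x - y) + 1e-9
    u = rng.uniform(0.0, 10.0, size=3)
    v = rng.uniform(0.0, 10.0, size=3)
    assert abs(g(u) - g(v)) <= float(np.abs(u - v).sum()) + 1e-9
```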

Conclusion

- As in most works on bandits with knapsacks, the algorithm is allowed to use a “null action” for an episode, i.e., an action that yields zero reward and zero consumption when selected at the beginning of the episode.
- Let AggReg(δ) be a bound on the aggregate reward or consumption regret for the soft-constraint setting (Theorem 3.4), where δ is its failure probability.
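The null-action safeguard described above can be sketched as follows; the budget check and all names here are hypothetical, for illustration only:

```python
NULL_ACTION = None  # hypothetical sentinel: zero reward, zero consumption for the episode

def choose_episode_start(remaining_budget, worst_case_episode_consumption, policy):
    """Before an episode starts, fall back to the null action whenever running
    the learned policy could overshoot the remaining hard budget; otherwise
    commit to the learned policy for the whole episode."""
    if worst_case_episode_consumption > remaining_budget:
        return NULL_ACTION
    return policy
```

This is what makes hard (knapsack) constraints satisfiable with certainty rather than only in expectation: once the budget can no longer absorb a worst-case episode, the learner stops consuming.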


- Table1: Considered Hyperparameters
- Table2: Selected Hyperparameters

Related work

- Sample-efficient exploration in constrained episodic reinforcement learning has only recently started to receive attention. Most previous works on episodic reinforcement learning focus on unconstrained settings (Jaksch et al., 2010; Azar et al., 2017; Dann et al., 2017). A notable exception is the work of Cheung (2019), which provides theoretical guarantees for the reinforcement learning setting with a single episode, but requires a strong reachability assumption that is not needed in the episodic setting studied here. Also, our results for the knapsack setting allow for a significantly smaller budget, as we illustrate in Section 5. Moreover, our approach is based on a tighter bonus, which leads to superior empirical performance (see Section 6). Recently, there have also been several concurrent and independent works on sample-efficient exploration for reinforcement learning with constraints (Singh et al., 2020; Efroni et al., 2020; Qiu et al., 2020; Ding et al., 2020). Unlike our work, all of these approaches focus on linear reward objectives and linear constraints and do not handle the concave-convex and knapsack settings that we consider.

Funding

- Work was supported by the National Science Foundation under Grant

References

- Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 22–31. JMLR.org.
- Agrawal, S. and Devanur, N. R. (2014). Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation (EC).
- Altman, E. (1999). Constrained Markov Decision Processes. Chapman and Hall.
- Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML).
- Babaioff, M., Dughmi, S., Kleinberg, R. D., and Slivkins, A. (2015). Dynamic pricing with limited supply. TEAC, 3(1):4. Special issue for 13th ACM EC, 2012.
- Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2018). Bandits with knapsacks. Journal of the ACM, 65(3):13:1–13:55. Preliminary version in FOCS 2013.
- Bellman, R. (1957). A Markovian decision process. Indiana Univ. Math. J., 6:679–684.
- Besbes, O. and Zeevi, A. (2009). Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420.
- Besbes, O. and Zeevi, A. (2011). On the minimax complexity of pricing in a changing environment. Operations Research, 59(1):66–79.
- Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
- Cheung, W. C. (2019). Regret minimization for reinforcement learning with vectorial feedback and complex objectives. In Advances in Neural Information Processing Systems (NeurIPS).
- Dann, C., Lattimore, T., and Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723.
- Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanović, M. R. (2020). Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534.
- Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189.
- Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
- Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232.
- Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. CoRR, abs/1903.08738.
- Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). Ai safety gridworlds. arXiv preprint arXiv:1711.09883.
- Mao, H., Alizadeh, M., Menache, I., and Kandula, S. (2016). Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, page 50–56, New York, NY, USA. Association for Computing Machinery.
- Miryoosefi, S., Brantley, K., Daume III, H., Dudik, M., and Schapire, R. E. (2019). Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems (NeurIPS).
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
- Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Qiu, S., Wei, X., Yang, Z., Ye, J., and Wang, Z. (2020). Upper confidence primal-dual optimization: Stochastically constrained markov decision processes with adversarial losses and unknown transitions. arXiv preprint arXiv:2003.00660.
- Ray, A., Achiam, J., and Amodei, D. (2020). Benchmarking safe exploration in deep reinforcement learning. https://cdn.openai.com/safexp-short.pdf. Accessed March 11, 2020.
- Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pages 5478–5486.
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. CoRR, abs/1502.05477.
- Singh, R., Gupta, A., and Shroff, N. B. (2020). Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435.
- Slivkins, A. (2019). Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286. Also available at https://arxiv.org/abs/1904.07272.
- Sun, W., Vemula, A., Boots, B., and Bagnell, J. A. (2019). Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948.
- Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull., 2(4):160–163.
- Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, first edition.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, second edition.
- Syed, U. and Schapire, R. E. (2007). A game-theoretic approach to apprenticeship learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, page 1449–1456, Red Hook, NY, USA. Curran Associates Inc.
- Tessler, C., Mankowitz, D. J., and Mannor, S. (2019). Reward constrained policy optimization. In International Conference on Learning Representations.
- Wang, Z., Deng, S., and Ye, Y. (2014). Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research, 62(2):318–331.
- Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438.
- Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (ICML).
- This optimization problem can be solved exactly since it is equivalent to the following linear program on occupation measures (Rosenberg and Mansour, 2019; Altman, 1999). The decision variables are ρ(s, a, h), i.e., the probability of the agent being at state-action pair (s, a) at time step h.
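A minimal sketch of that occupation-measure LP for a toy cMDP, solved with `scipy.optimize.linprog`; the problem sizes, random numbers, and the single loose consumption constraint are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

# Toy cMDP (illustrative numbers): S states, A actions, horizon H.
S, A, H = 2, 2, 3
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(S), size=(S, A))  # p[s, a] = next-state distribution
r = rng.uniform(size=(S, A))                # per-step reward for (s, a)
c = rng.uniform(0.0, 0.5, size=(S, A))      # per-step consumption for (s, a)
mu = np.array([1.0, 0.0])                   # initial state distribution
xi = 2.0                                    # expected-consumption budget (loose, so the LP is feasible)

idx = lambda s, a, h: (h * S + s) * A + a   # flatten rho(s, a, h) into one vector
n = S * A * H

# Flow constraints: sum_a rho(s, a, 1) = mu(s), and each later layer is the
# push-forward of the previous layer through the transition kernel p.
A_eq, b_eq = [], []
for h in range(H):
    for s2 in range(S):
        row = np.zeros(n)
        for a in range(A):
            row[idx(s2, a, h)] = 1.0
        if h == 0:
            b_eq.append(mu[s2])
        else:
            for s in range(S):
                for a in range(A):
                    row[idx(s, a, h - 1)] -= p[s, a, s2]
            b_eq.append(0.0)
        A_eq.append(row)

# Maximize expected reward (linprog minimizes, hence the sign flip) subject to
# the expected-consumption constraint and nonnegativity of rho.
obj = -np.array([r[s, a] for h in range(H) for s in range(S) for a in range(A)])
cons = np.array([c[s, a] for h in range(H) for s in range(S) for a in range(A)])
res = linprog(obj, A_ub=cons[None, :], b_ub=[xi], A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
rho = res.x.reshape(H, S, A)
```

Each layer of ρ sums to one, and normalizing ρ[h, s, ·] recovers the policy π(a | s, h).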
- To prove the Bellman-error regret decomposition, an essential piece is the so-called simulation lemma (Kearns and Singh, 2002), which the authors adapt to constrained settings: for any policy π, any cMDP M = (p, r, c), and any objective m ∈ {r} ∪ {c_i}_{i∈D} with corresponding true objective m⋆ ∈ {r⋆} ∪ {c_i⋆}_{i∈D}, the gap between the expected cumulative objective under M and under the true cMDP decomposes over the H steps of the episode.
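In its standard form, the simulation lemma reads roughly as follows; this is a reconstruction of the usual statement, so the paper's exact constrained version may differ in notation:

```latex
\mathbb{E}^{\pi,p}\Big[\textstyle\sum_{h=1}^{H} m(s_h,a_h)\Big]
 - \mathbb{E}^{\pi,p^\star}\Big[\textstyle\sum_{h=1}^{H} m^\star(s_h,a_h)\Big]
 = \mathbb{E}^{\pi,p^\star}\Big[\textstyle\sum_{h=1}^{H}
     \big(m - m^\star\big)(s_h,a_h)
     + \big(p(\cdot \mid s_h,a_h) - p^\star(\cdot \mid s_h,a_h)\big)^{\top} V^{\pi}_{h+1}\Big]
```

where V^π_{h+1} denotes the value function of π for objective m under the model (p, m); the per-step terms inside the right-hand expectation are the Bellman errors being summed.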
- This differs from Agrawal and Devanur (2014) since, in bandits, there are no transitions. In the proof above, to show that π is feasible in ConvexConPlanner, which is defined with respect to p(k), the authors leverage the fact that g(α) is continuous, together with a novel application of the mean-value theorem, to link π's performance in the optimistic model to its performance in the true model.
