Better Optimism By Bayes: Adaptive Planning with Rich Models
CoRR (2014)
Abstract
The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning, but only for simplistic models; or powerful Bayesian non-parametric models, but with simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible and truly beneficial to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of counterexamples to show formal problems with the over-optimism inherent in Thompson sampling. Then we leverage state-of-the-art techniques in efficient Bayes-adaptive planning and non-parametric Bayesian methods to perform qualitatively better than both existing conventional algorithms and Thompson sampling on two contextual bandit-like problems.

As computer power increases and statistical methods improve, there is an increasingly rich range and variety of probabilistic models of the world. Models embody inductive biases, allowing appropriately confident inferences to be drawn from limited observations. One domain that should benefit markedly from such models is planning and control, where models arbitrate the exquisite balance between safe exploration and lucrative exploitation. A general and powerful solution to this balancing act is forward-looking Bayesian planning in the face of partial observability, which treats the exploration-exploitation trade-off as an optimization problem and squeezes the greatest benefit from each choice. Unfortunately, such planning is notoriously computationally costly, particularly for complex models. This leaves open the possibility that it is not justified compared to heuristic approaches that may perform very similarly at a much reduced computational cost, for instance approaches that treat the trade-off as a learning problem in a regret setting and focus on the asymptotic requirement to discover the optimal solution (so as to avoid accumulating regret).

The motivation for this paper is to demonstrate the practical power of Bayesian planning. We show that, despite the arduous optimization problem, sample-based planning approximations can excel with rich models in realistic settings, here a challenging exploration-exploitation task derived from a real dataset (the UCI 'mushroom' task), even when the data have not been generated from the prior. By contrast, we show that the benefits of Bayesian inference can be squandered by more myopic forms of planning, such as the provably over-optimistic Thompson sampling, which fails to account for risk in this task and performs poorly. The experimental results highlight the fact that Bayes-optimal behavior adapts its exploration strategy as a non-trivial function of the cost, the horizon, and the uncertainty. We also consider an extension of the model to a case of more general subtasks, including subtasks that are themselves small MDPs (supplementary material, Section S4).
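To make the planning problem concrete, the following is a standard statement of the Bayes-adaptive Bellman equation (the notation is ours, not taken from the paper): the agent augments the physical state $s$ with its belief $b$ over the unknown model parameters $\theta$, and plans in the resulting belief MDP.

$$
V^*(s, b) \;=\; \max_a \Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a, b)\, V^*(s', b') \,\Big],
\qquad
P(s' \mid s, a, b) \;=\; \int P(s' \mid s, a, \theta)\, b(\theta)\, d\theta,
$$

where $b'$ is the Bayesian posterior over $\theta$ after observing the transition $(s, a, s')$. Because the belief component of the state is updated after every observation, the lookahead tree branches over both actions and posterior updates; this is why exact Bayes-adaptive planning is tractable only for very small problems, and why sample-based approximations are needed.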
The paper is organized as follows. First, we discuss model-based Bayesian reinforcement learning (RL), outline some existing planning algorithms for this setting, and show why Thompson sampling's over-optimism can be deleterious. Next, we introduce an exploration-exploitation domain that motivates a statistical model for a class of MDPs with shared structure across sequences of tasks. We then provide empirical results on a version of the domain that uses real data from a popular supervised-learning problem (mushroom classification), along with a simulated extension. Finally, we discuss related work.
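As a concrete reference point for the Thompson sampling critique above, here is a minimal sketch of the algorithm on a two-armed Bernoulli bandit (the arm means, horizon, and Beta(1,1) priors are illustrative assumptions, not details from the paper). Each step draws a single posterior sample per arm and acts greedily on it; nothing in the choice accounts for the remaining horizon or the downside risk of an unlucky draw, which is the myopia at issue.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed Bernoulli bandit; values are illustrative only.
true_means = np.array([0.50, 0.45])
horizon = 500

# Independent Beta(1, 1) priors over each arm's success probability.
alpha = np.ones(2)
beta = np.ones(2)

total_reward = 0.0
for t in range(horizon):
    # Thompson sampling: draw ONE sample from each arm's posterior and
    # act greedily on the sampled means. The draw ignores the remaining
    # horizon and the value of information, hence the myopia.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    reward = float(rng.random() < true_means[arm])
    total_reward += reward

    # Conjugate Beta-Bernoulli posterior update for the chosen arm.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

print(f"Total reward over {horizon} steps: {total_reward:.0f}")

A Bayes-adaptive planner would instead evaluate each action by its consequences for the whole posterior, trading off immediate reward against information and risk over the remaining horizon, at correspondingly greater computational cost.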