The $f$-Divergence Reinforcement Learning Framework

Semantic Scholar (2021)

Abstract
The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. This paper presents a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL). In FRL, the policy evaluation and policy improvement phases are performed simultaneously by minimizing the f-divergence between the learning policy and the sampling policy, which is distinct from conventional DRL algorithms that aim to maximize the expected cumulative rewards. We theoretically prove that minimizing such an f-divergence makes the learning policy converge to the optimal policy. Moreover, we convert the process of training agents in the FRL framework into a saddle-point optimization problem with a specific f function through the Fenchel conjugate, which yields new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, we demonstrate that the FRL framework has two advantages: (1) policy evaluation and policy improvement are performed simultaneously, and (2) the issue of overestimating the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games and show that agents trained in the FRL framework match or surpass the baseline DRL algorithms.

Introduction

Deep reinforcement learning (DRL) algorithms, which learn to make decisions from trial and error, have recently achieved successes in a wide variety of fields (Mnih et al. 2015; Levine et al. 2016; Silver et al. 2017). Researchers generally consider reinforcement learning (RL) from the views of dynamic programming (Sutton and Barto 2018), Bayesian inference (Ghavamzadeh et al. 2015), or linear programming (Nachum et al. 2019a,b). The majority of DRL algorithms train agents to learn a learning policy that maximizes the expected cumulative rewards from the trajectories generated by the sampling policy. The learning policy and the sampling policy are defined as follows.

Definition 1. The learning policy π is the target policy that an agent tries to learn, which is usually represented by a neural network model.

Definition 2. The sampling policy π̃ is the policy that generates the sampling trajectories during the training phases.

The sampling policy is induced by the learning policy through a specific transformation, e.g., ε-greedy, which is used in many previous works and makes the value function of the sampling policy always greater than that of the learning policy (Mnih et al. 2015). In the well-known Soft Actor-Critic (SAC) algorithm (Haarnoja et al. 2018b), the policy gradient formulation is derived from making the learning policy close to "old" policies; the latter are represented by a Boltzmann distribution whose energy function is defined as the negative state-action value function (Landau and Lifshitz 2013). In the Motivation section, we show that the sampling policy can be represented as such a Boltzmann distribution. As a result, the policy iteration in SAC is a process of minimizing the Kullback-Leibler (KL) divergence between the learning policy and the sampling policy (Haarnoja et al. 2018b).
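To make this relationship concrete, the following is a minimal NumPy sketch (not taken from the paper) of a Boltzmann sampling policy over a discrete action set and of the KL divergence that SAC-style policy improvement drives down; the Q-values, temperature, and learning-policy probabilities are illustrative placeholders.

```python
import numpy as np

def boltzmann_sampling_policy(q_values, temperature=1.0):
    """Boltzmann distribution whose energy is the negative state-action value,
    i.e. pi_tilde(a|s) proportional to exp(Q(s, a) / temperature)."""
    logits = q_values / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy example: Q-values for one state with 4 actions, plus a current learning policy.
q = np.array([1.0, 0.5, -0.2, 0.3])
pi_tilde = boltzmann_sampling_policy(q)      # sampling policy induced by Q
pi = np.array([0.4, 0.3, 0.1, 0.2])          # hypothetical learning policy
print(kl_divergence(pi, pi_tilde))           # SAC-style improvement shrinks this value
```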
Besides, we show that when the optimal policy of a Markov Decision Process (MDP) is obtained, the Bellman optimality equation, which is induced by the learning policy, and the Bellman expectation equation, which is induced by the sampling policy, have the same values (see the proof of Eq. (7)). This also inspires our proof (see Lemma 1) that minimizing the distance (i.e., the f-divergence) between the learning policy and the sampling policy leads the learning policy to converge to the optimal policy.

The above facts motivate us to present a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL), which trains agents by minimizing the f-divergence between the learning policy and the sampling policy instead of maximizing the expected cumulative rewards as conventional DRL algorithms do. We also show that the FRL framework is compatible with any f function that satisfies certain properties (convex, proper, and lower semi-continuous) (see Lemma 2). We mathematically show that the objective of minimizing the f-divergence can be transformed into a saddle-point optimization problem using the Fenchel conjugate (Nachum and Dai 2020), which enables the FRL framework to evaluate and improve the policy simultaneously, whereas conventional algorithms like SAC (Haarnoja et al. 2018b) perform policy evaluation and policy improvement separately. We also provide the algorithms for policy gradient and policy evaluation that train agents by solving this saddle-point optimization problem.

We conduct extensive experiments to evaluate the proposed framework from different perspectives. On the one hand, the experiments demonstrate that in the Atari 2600 (Mnih et al. 2013) environments, agents trained using FRL consistently match or outperform the baseline algorithms investigated. On the other hand, we empirically show that the notorious problem of overestimating value functions is alleviated in the FRL framework. These two points highlight the advantages of the FRL framework.

The contributions of this paper are summarised as follows. First, this paper proposes a novel framework called f-Divergence Reinforcement Learning that provides a new paradigm for training agents. Second, we provide detailed proofs of the theoretical foundations behind the framework. Third, we provide an algorithm for implementing the framework. Finally, experiments are conducted to show that the framework can alleviate the overestimation of value functions and outperforms (or at least matches) baseline algorithms. These results indicate that utilizing convex duality and the f-divergence concept can benefit DRL theory, hopefully providing new perspectives for its development.
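As a hedged illustration of the Fenchel-conjugate step (a sketch of the standard variational representation, not the paper's own derivation), note that D_f(p || q) = sup_T { E_p[T(x)] - E_q[f*(T(x))] }, where f* is the convex conjugate of f. Maximizing over a critic T while minimizing over the distribution p is what turns f-divergence minimization into a saddle-point problem. In the toy code below, the Gaussian samples and the fixed quadratic critic are illustrative assumptions; in FRL the critic would be a learned function.

```python
import numpy as np

def f_star_kl(u):
    """Fenchel conjugate of the KL generator f(t) = t*log(t): f*(u) = exp(u - 1)."""
    return np.exp(u - 1.0)

def variational_f_divergence(samples_p, samples_q, critic, f_star):
    """Lower bound  E_p[T(x)] - E_q[f*(T(x))]  on D_f(p || q).
    Maximizing over the critic T tightens the bound; minimizing over the
    distribution generating samples_p gives the other half of the saddle point."""
    return critic(samples_p).mean() - f_star(critic(samples_q)).mean()

# Toy example with two Gaussians standing in for the learning and sampling policies.
rng = np.random.default_rng(0)
xp = rng.normal(0.0, 1.0, size=10_000)            # samples from p ("learning")
xq = rng.normal(0.5, 1.0, size=10_000)            # samples from q ("sampling")
critic = lambda x: 1.0 + 0.5 * x - 0.1 * x**2     # hypothetical fixed critic T(x)
print(variational_f_divergence(xp, xq, critic, f_star_kl))
```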
Keywords
reinforcement learning