# SVQN: Sequential Variational Soft Q-Learning Networks

ICLR, 2020.

Abstract:

Partially Observable Markov Decision Processes (POMDPs) are popular and flexible models for real-world decision-making applications that require information from past observations to make optimal decisions. Standard reinforcement learning algorithms for solving Markov Decision Process (MDP) tasks are not applicable, as they cannot in...

Highlights

- In recent years, substantial progress has been made in deep reinforcement learning for solving various challenging tasks, including computer Go (Silver et al, 2016), Atari games (Mnih et al, 2015), StarCraft (Zambaldi et al, 2018; Pang et al, 2018) and first-person shooting (FPS) games (Lample & Chaplot, 2017; Wu & Tian, 2016; Huang et al, 2019)
- We propose a novel end-to-end neural network called the sequential variational soft Q-learning network (SVQN), which integrates the learning of hidden states and the optimization of the planning within the same framework
- We propose to tackle the difficulty of inferring the hidden state and to solve the problem of a conditional prior using generative models
- We summarize some related work in Partially Observable Markov Decision Processes and inference methods for sequential data
- Our work proposes generative models for the algorithm learning, which tackles the difficulty of the inference of hidden states and introduces inductive bias to the network structure
- We propose a novel algorithm named Sequential Variational Soft Q-Learning Networks (SVQN) to solve Partially Observable Markov Decision Processes with a discrete action space

Summary

- Substantial progress has been made in deep reinforcement learning for solving various challenging tasks, including computer Go (Silver et al, 2016), Atari games (Mnih et al, 2015), StarCraft (Zambaldi et al, 2018; Pang et al, 2018) and first-person shooting (FPS) games (Lample & Chaplot, 2017; Wu & Tian, 2016; Huang et al, 2019).
- To infer the hidden states and optimize the planning module jointly, we represent POMDPs as a unified probabilistic graphical model (PGM) and derive a single evidence lower bound (ELBO).
- Contributions: (1) We derive the variational lower bound for POMDPs, which allows us to integrate the optimization of the control problem and learning of the hidden state under a unified graphical model.
- (2) We propose to tackle the difficulty of the inference of the hidden state and solve the problem of a conditional prior using the generative models.
- Some researchers (Hausknecht & Stone, 2015; Zhu et al, 2018) used recurrent neural networks to capture the historical information, but they failed to utilize the Markov property of the state in POMDPs. Our work proposes generative models for the algorithm learning, which tackles the difficulty of the inference of hidden states and introduces inductive bias to the network structure.
- Unlike in MDPs, the state s of a POMDP in the PGM (shown in Fig. 1(b)) is unobservable and needs to be inferred from the action a and the observation o.
- We therefore need to derive a different variational lower bound for POMDPs, one that can be used to infer the hidden state and do planning jointly.
- We get two kinds of loss functions for the two generative models, i.e., the reconstruction loss $L^{\mathrm{MSE}} = L^{\mathrm{MSE}}_{\mathrm{inner}} + L^{\mathrm{MSE}}_{\mathrm{elbo}}$ and the KL-divergence loss $L^{\mathrm{KL}} = L^{\mathrm{KL}}_{\mathrm{inner}} + L^{\mathrm{KL}}_{\mathrm{elbo}}$. For the planning algorithm, we use the soft Q-learning algorithm (Levine, 2018).
- Deep Variational Reinforcement Learning (DVRL) (Igl et al, 2018): a method that combines sequential Monte Carlo and A2C (Dhariwal et al, 2017) to solve POMDPs.
- Compared with DRQN and ADRQN, our method introduces inductive bias to the network, which helps state estimation and RL planning for POMDPs. Compared with DVRL, our method can achieve competitive performance with lower sampling complexity.
- We propose a novel algorithm named Sequential Variational Soft Q-Learning Networks (SVQN) to solve POMDPs with a discrete action space.
- We apply generative models to deal with the conditional prior of hidden states and use a recurrent neural network to reduce the computational complexity, i.e., after training on short sequences, it can generalize to test sequences of arbitrary length.
- Our designed deep neural network can be trained end-to-end, which optimizes the planning and inference of hidden states jointly.
- We will try to develop algorithms for POMDP problems with a continuous action space.
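The summary above names soft Q-learning (Levine, 2018) as the planning algorithm for a discrete action space. As a minimal sketch of the standard soft Q-learning quantities, and not the authors' implementation, the following computes the soft value, the induced Boltzmann policy, and a one-step target from arrays of Q-values. The function names and the temperature `alpha` are hypothetical, and in SVQN the Q-values would be computed from the inferred hidden state rather than given directly.

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    z = q_values / alpha
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def soft_policy(q_values, alpha=1.0):
    """Boltzmann policy pi(a | s) = exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return np.exp((q_values - v[..., None]) / alpha)

def soft_td_target(reward, next_q_values, done, gamma=0.99, alpha=1.0):
    """One-step target r + gamma * V(s'), with V(s') dropped at terminal steps."""
    return reward + gamma * (1.0 - done) * soft_value(next_q_values, alpha)
```

The soft value replaces the hard max of ordinary Q-learning with a log-sum-exp, so as `alpha` approaches zero the target approaches the usual `r + gamma * max_a Q(s', a)`.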

- Table 1: Evaluation results of different models on flickering Atari. The values are the final evaluation scores after training for different algorithms. Values in parentheses indicate the standard deviation. Evaluations use the Mann-Whitney rank test, and bold numbers indicate statistical significance at the 5% level. Our algorithms outperform the other baselines on three of the games and achieve a score close to DVRL on ChopperCommand.
- Table 2: Evaluation results of different models on ViZDoom. The values are the final evaluation scores after training for different algorithms. Values in parentheses indicate the standard deviation. Evaluations use the Mann-Whitney rank test, and bold numbers indicate statistical significance at the 5% level. The SVQN models achieve the best performance on these three tasks.
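For readers unfamiliar with the flickering Atari benchmark in Table 1, the sketch below illustrates how such partial observability is commonly induced, following the setup of Hausknecht & Stone (2015): each frame is either shown in full or replaced by a blank frame. The helper name `flicker` and the default `p_blank=0.5` are assumptions for illustration; the excerpt does not state the exact setting used in these experiments.

```python
import numpy as np

def flicker(observation, rng, p_blank=0.5):
    """Return the observation, or an all-zero frame with probability p_blank."""
    if rng.random() < p_blank:
        # The agent sees a blank screen at this step and must rely on history.
        return np.zeros_like(observation)
    return observation

# Usage: obscure a hypothetical 84x84 grayscale frame.
rng = np.random.default_rng(0)
frame = np.ones((84, 84), dtype=np.uint8)
seen = flicker(frame, rng)
```

Because a single frame may carry no information at all, an agent on this benchmark has to aggregate past observations, which is what makes the task a POMDP.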

Related work

- We summarize some related work in POMDPs and inference methods for sequential data.

Model-Based and Model-Free methods for POMDPs: When the environment model is accessible, POMDPs can be solved by model-based methods. Egorov (2015) used model-based methods to solve POMDPs, but the agents need to know the belief-update function and the transition function. When the environment model is unknown, model-free methods should be applied. Recently, some researchers (Hausknecht & Stone, 2015; Zhu et al, 2018) used recurrent neural networks to capture the historical information, but they failed to utilize the Markov property of the state in POMDPs. Our work proposes generative models for the algorithm learning, which tackles the difficulty of the inference of hidden states and introduces inductive bias to the network structure. Igl et al (2018) applied sequential Monte Carlo (SMC) to POMDPs. Their method can infer the hidden state from the past observations online. However, it separates the planning algorithm from the inference of the hidden state. Our algorithm is derived from a unified graphical model, which allows the inference model and the planning algorithm to be trained jointly.
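To make concrete what "knowing the belief-update function and the transition function" requires in the model-based setting, here is a minimal sketch of the standard Bayes-filter belief update for a small discrete POMDP. It is a textbook illustration and is not taken from any of the cited works; the array layouts `T[a, s, s']` and `O[a, s', o]` and the function name are assumptions.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayes filter: b'(s') is proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    predicted = belief @ T[action]  # predicted next-state distribution, shape (S,)
    unnormalized = O[action, :, observation] * predicted
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Usage: two states, one action, a static world, and a noisy sensor.
T = np.array([np.eye(2)])                  # T[a, s, s']
O = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # O[a, s', o]
b = belief_update(np.array([0.5, 0.5]), action=0, observation=0, T=T, O=O)
```

Both `T` and `O` must be known in advance for this update, which is exactly the information a model-free method does not have and has to replace with learned inference.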

Funding

- This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration.
