DualSMC: Tunneling Differentiable Filtering and Planning under Continuous POMDPs

IJCAI 2020, pp. 4190-4198, 2020.

DOI: https://doi.org/10.24963/ijcai.2020/579

Abstract:

A major difficulty of solving continuous POMDPs is to infer the multi-modal distribution of the unobserved true states and to make the planning algorithm dependent on the perceived uncertainty. We cast POMDP filtering and planning problems as two closely related Sequential Monte Carlo (SMC) processes, one over the real states and the other […]
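As a rough, schematic illustration of this dual structure (a minimal sketch, not the paper's DualSMC network), the Python snippet below runs one SMC process over the real states (a bootstrap particle filter) and a second SMC-style search over candidate future trajectories, scored against the filtered belief. The functions transition_fn, observation_loglik, and reward_fn are toy 1-D stand-ins for the learned components described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-ins for the paper's learned models (hypothetical, not DualSMC).
def transition_fn(states, action):
    """Move particles by the chosen action plus Gaussian process noise."""
    return states + action + 0.05 * rng.standard_normal(states.shape)

def observation_loglik(states, obs):
    """Log-likelihood of a noisy position observation under each particle."""
    return -0.5 * ((obs - states) / 0.1) ** 2

def reward_fn(states, goal=1.0):
    """Negative distance to a goal position."""
    return -np.abs(goal - states)

def filtering_smc(particles, weights, action, obs):
    """SMC over the real states: predict, reweight by the observation, resample."""
    particles = transition_fn(particles, action)
    weights = weights * np.exp(observation_loglik(particles, obs))
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

def planning_smc(belief, candidate_actions, horizon=5, n_traj=64):
    """SMC-style search over future trajectories, scored against the belief."""
    first_actions = rng.choice(candidate_actions, size=n_traj)
    returns = np.zeros(n_traj)
    for i in range(n_traj):
        sim, a = belief.copy(), first_actions[i]   # roll out the whole belief
        for _ in range(horizon):
            sim = transition_fn(sim, a)
            returns[i] += reward_fn(sim).mean()    # average over belief particles
            a = rng.choice(candidate_actions)      # random continuation action
    # Resample trajectories by softmax-weighted return; commit to the first
    # action of a surviving trajectory.
    w = np.exp(returns - returns.max())
    return first_actions[rng.choice(n_traj, p=w / w.sum())]

# Closed loop: plan against the filtered belief, act, observe, filter.
true_state = 0.0
particles, weights = rng.uniform(-1, 1, 256), np.full(256, 1.0 / 256)
actions = np.array([-0.1, 0.0, 0.1])
for _ in range(20):
    a = planning_smc(particles, actions)
    true_state += a
    obs = true_state + 0.1 * rng.standard_normal()
    particles, weights = filtering_smc(particles, weights, a, obs)
print("belief mean:", particles.mean(), "vs. true state:", true_state)
```

In the actual DualSMC the two processes share learned networks and the planner consumes the full particle distribution of the belief; the sketch above only mirrors that coarse structure.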
Introduction
  • Partially Observable Markov Decision Processes (POMDPs) formulate reinforcement learning problems where the agent’s instant observation is insufficient for optimal decision making [Kaelbling et al., 1998].
  • Since conventional POMDP problems usually present an explicit state formulation, executing the planning algorithm in a latent space makes it difficult to adopt any useful prior knowledge.
  • Whenever these models fail to perform well, it is difficult to analyze which part causes the failure as they are less interpretable
Highlights
  • Partially Observable Markov Decision Processes (POMDPs) formulate reinforcement learning problems where the agent’s instant observation is insufficient for optimal decision making [Kaelbling et al., 1998]
  • Approximate solutions to Partially Observable Markov Decision Processes based on deep reinforcement learning can directly encode the history of past observations with deep models like RNNs [Hausknecht and Stone, 2015]
  • We present a simple but effective model named Dual Sequential Monte Carlo (DualSMC)
  • On the other hand, compared with the existing Bayesian reinforcement learning literature on Partially Observable Markov Decision Processes [Ross et al., 2008], our work focuses more on deep reinforcement learning solutions to continuous Partially Observable Markov Decision Processes
  • We provided an end-to-end neural network named Dual Sequential Monte Carlo to solve continuous Partially Observable Markov Decision Processes, which has three advantages
  • Dual Sequential Monte Carlo combines the richness of neural networks with the interpretability of classical sequential Monte Carlo methods
Methods
  • DVRL [Igl et al., 2018].
  • LSTM filter + SMCP [Piche et al., 2018].
  • Regressive PF (2, top-1) + SMCP; Regressive PF + PI-SMCP.
  • Adversarial PF + SMCP; Adversarial PF + PI-SMCP.
  • DualSMC with regressive PF (2); DualSMC with regressive PF; DualSMC w/o proposer; DualSMC with adversarial PF.
  • Each method is compared by its success rate and average number of steps (PF is short for particle filter, PI-SMCP for particle-independent SMCP).
Results
  • The authors can see that the adversarial PF significantly outperforms other differentiable state estimation approaches, such as (1) the existing DPFs that perform density estimation [Jonschkowski et al., 2018], and (2) the deterministic LSTM model that was previously used as a strong baseline in [Karkus et al., 2018; Jonschkowski et al., 2018].
Conclusion
  • The authors provided an end-to-end neural network named DualSMC to solve continuous POMDPs, which has three advantages.
  • It learns plausible belief states for high-dimensional POMDPs with an adversarial particle filter (a minimal sketch of this idea follows this list).
  • DualSMC plans future actions by considering the distributions of the learned belief states.
  • DualSMC combines the richness of neural networks with the interpretability of classical sequential Monte Carlo methods.
  • The authors empirically validated the effectiveness of DualSMC on different tasks including visual navigation and control
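To make the adversarial filtering idea above more concrete, the sketch below reweights particles with a discriminator-style compatibility score instead of an explicit observation-density model, in the spirit of GANs [Goodfellow et al., 2014]. This is a hedged illustration only: the discriminator is a fixed toy function (discriminator_score is a hypothetical name), its adversarial training is omitted, and nothing here is the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)

def discriminator_score(obs_feat, particle_states, w=-20.0, b=1.0):
    """Toy fixed 'discriminator': higher score means the particle looks more
    compatible with the observation feature. Stands in for a trained critic."""
    logits = w * (obs_feat - particle_states) ** 2 + b
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid, in (0, 1)

def adversarial_pf_step(particles, obs_feat):
    """Reweight and resample particles using discriminator scores as
    unnormalized importance weights instead of an explicit p(obs | state)."""
    scores = discriminator_score(obs_feat, particles)
    weights = scores / scores.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Usage: particles concentrate around the observed value 0.3.
particles = rng.uniform(-1.0, 1.0, size=500)
for _ in range(3):
    particles = adversarial_pf_step(particles, obs_feat=0.3)
print("belief mean ~", particles.mean())
```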
Tables
  • Table 1: Training hyper-parameters for the (A) floor positioning, (B) 3D light-dark, and (C) modified reacher domains
  • Table 2: The success rate and the average number of steps over 1,000 tests in the floor positioning domain (PF is short for particle filter)
  • Table 3: The average result of 100 tests for the 3D light-dark navigation planning part. The robot changes its plan from taking a detour, shown in Figure 5(a), to walking directly toward the target area, shown in Figure 5(b). It performs equally well to the standard SMCP, with a 100.0% success rate and an average of 21.3 steps (vs. 20.7 steps by SMCP). We may conclude that DualSMC provides policies based on the distribution of filtered particles, and that DualSMC trained under POMDPs generalizes well to similar tasks with less uncertainty
  • Table 4: Network details of each module in DualSMC
Related work
  • Planning under uncertainty. Due to the high computational cost of POMDPs, many previous approaches used sampling-based techniques for either belief update or planning, or both. For instance, a variety of Monte Carlo tree search methods have shown success in relatively large POMDPs by constructing a search tree of history based on rollout simulations [Silver and Veness, 2010; Somani et al., 2013; Seiler et al., 2015; Sunberg and Kochenderfer, 2018]. Later work further improved the efficiency by limiting the search space or reusing plans [Somani et al., 2013; Kurniawati and Yadav, 2016]. Although considerable progress has been made to enlarge the set of solvable POMDPs, it remains hard for pure sampling-based methods to deal with unknown dynamics and complex observations like visual inputs. Therefore, in this work, we provide one approach to combine the efficiency and interpretability of conventional sampling-based methods with the flexibility of deep learning networks for complex POMDP modeling.
Funding
  • This work is in part supported by ONR MURI N00014-16-12007.
Appendix A: Particle-Independent SMC Planning
  • As shown in Alg. 3, it takes the top-M particle states (for computational efficiency) and plans N future trajectories independently from each particle state (a hedged sketch of this idea follows).
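To illustrate the idea summarized above, the sketch below keeps the top-M weighted belief particles and plans independently from each of them. For brevity it replaces the trajectory-level SMC resampling of the actual planner with plain random shooting and an argmax over returns; all names (dynamics, reward, particle_independent_planner) are toy assumptions, not the paper's Alg. 3.

```python
import numpy as np

rng = np.random.default_rng(2)

def dynamics(state, action):
    """Toy stand-in for a learned transition model."""
    return state + action + 0.05 * rng.standard_normal()

def reward(state, goal=1.0):
    return -abs(goal - state)

def particle_independent_planner(particles, weights, actions,
                                 top_m=4, n_traj=32, horizon=5):
    """Keep only the top-M weighted belief particles and run N independent
    rollouts from each; return the first action of the best rollout."""
    top_idx = np.argsort(weights)[-top_m:]          # highest-weight particles
    best_action, best_return = None, -np.inf
    for s0 in particles[top_idx]:
        for _ in range(n_traj):
            plan = rng.choice(actions, size=horizon)
            s, ret = s0, 0.0
            for a in plan:                          # independent rollout
                s = dynamics(s, a)
                ret += reward(s)
            if ret > best_return:
                best_return, best_action = ret, plan[0]
    return best_action

# Usage with a toy belief over a 1-D state.
particles = rng.uniform(-1.0, 1.0, size=128)
weights = rng.dirichlet(np.ones(128))
print("chosen first action:",
      particle_independent_planner(particles, weights,
                                   actions=np.array([-0.1, 0.0, 0.1])))
```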
References
  • [Beattie et al., 2016] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Kuttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
  • [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • [Doucet and Johansen, 2009] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.
  • [Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
  • [Gordon et al., 1993] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), pages 107–113, 1993.
  • [Gu et al., 2015] Shixiang Shane Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte Carlo. In NeurIPS, pages 2629–2637, 2015.
  • [Hafner et al., 2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, pages 2555–2565, 2019.
  • [Hausknecht and Stone, 2015] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.
  • [Igl et al., 2018] Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In ICML, pages 2117–2126, 2018.
  • [Jonschkowski et al., 2018] Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. Differentiable particle filters: End-to-end learning with algorithmic priors. In RSS, 2018.
  • [Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
  • [Kappen et al., 2012] Hilbert J Kappen, Vicenc Gomez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.
  • [Karkus et al., 2017] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In NeurIPS, pages 4694–4704, 2017.
  • [Karkus et al., 2018] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In CoRL, 2018.
  • [Kempinska and Shawe-Taylor, 2017] Kira Kempinska and John Shawe-Taylor. Adversarial sequential Monte Carlo. In Bayesian Deep Learning (NeurIPS Workshop), 2017.
  • [Kingma and Ba, 2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [Kurniawati and Yadav, 2016] Hanna Kurniawati and Vinay Yadav. An online POMDP solver for uncertainty planning in dynamic environment. In Robotics Research, pages 611–629. 2016.
  • [Levine and Koltun, 2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In NeurIPS, pages 207–215, 2013.
  • [Levine, 2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • [Littman et al., 1995] Michael L Littman, Anthony R Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In ICML, 1995.
  • [Maddison et al., 2017] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In NeurIPS, pages 6573–6583, 2017.
  • [Naesseth et al., 2018] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. In AISTATS, 2018.
  • [Papadimitriou and Tsitsiklis, 1987] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
  • [Piche et al., 2018] Alexandre Piche, Valentin Thomas, Cyril Ibrahim, Yoshua Bengio, and Chris Pal. Probabilistic planning with sequential Monte Carlo methods. In ICLR, 2018.
  • [Platt Jr et al., 2010] Robert Platt Jr, Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. In RSS, 2010.
  • [Ross et al., 2008] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In NeurIPS, pages 1225–1232, 2008.
  • [Seiler et al., 2015] Konstantin M Seiler, Hanna Kurniawati, and Surya PN Singh. An online and approximate solver for POMDPs with continuous action space. In ICRA, pages 2290–2297, 2015.
  • [Silver and Veness, 2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NeurIPS, pages 2164–2172, 2010.
  • [Somani et al., 2013] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In NeurIPS, pages 1772–1780, 2013.
  • [Sunberg and Kochenderfer, 2018] Zachary N Sunberg and Mykel J Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In ICAPS, 2018.
  • [Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033, 2012.
  • [Todorov, 2008] Emanuel Todorov. General duality between optimal control and estimation. In CDC, pages 4286–4292, 2008.
  • [Toussaint, 2009] Marc Toussaint. Robot trajectory optimization using approximate inference. In ICML, pages 1049–1056, 2009.
  • [Zhu et al., 2018] Pengfei Zhu, Xin Li, Pascal Poupart, and Guanghui Miao. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309, 2018.