Generalized Neural Policies for Relational MDPs

Sankalp Garg
Aniket Bajpai

ICML 2020.

Keywords:
Machine Learning, Markov Decision Process, probabilistic planning, deep reactive policy, Academic Advising

Abstract:

A Relational Markov Decision Process (RMDP) is a first-order representation to express all instances of a single probabilistic planning domain with a possibly unbounded number of objects. Early work in RMDPs outputs generalized (instance-independent) first-order policies or value functions as a means to solve all instances of a domain at ...

Introduction
  • A Relational Markov Decision Process (RMDP) (Boutilier et al., 2001) is a first-order representation for expressing all instances of a probabilistic planning domain with a possibly unbounded number of objects.
  • Traditional RMDP planners attempted to find a generalized first-order value function or policy using symbolic dynamic programming (Boutilier et al., 2001), or by approximating them via a function over learned first-order basis functions (Guestrin et al., 2003; Fern et al., 2006; Sanner & Boutilier, 2009).
  • These methods met with rather limited success; e.g., no relational planner participated in the International Probabilistic Planning Competition (IPPC) after 2006, even though all competition domains were relational.
  • These factored models are compact: each state variable x_i depends only on a small number of other state variables (a minimal illustrative sketch follows this list).
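The compactness mentioned in the last bullet is that of a factored (DBN-style) transition model, in which each next-state variable looks only at a handful of parent variables. Below is a minimal, purely illustrative Python sketch of such a model; the variable names, the PARENTS table, and the toy conditional distribution are assumptions for illustration, not the paper's code or any specific RDDL domain.

```python
# Illustrative sketch only: a factored transition model in which each state
# variable x_i depends on a small set of parent variables.
import random

# Parents of each next-state variable: x_i' depends only on these current variables.
PARENTS = {
    "x0": ["x0", "x1"],
    "x1": ["x1"],
    "x2": ["x1", "x2"],
}

def step(state: dict, action: str) -> dict:
    """Sample a next state; each variable looks only at its few parents."""
    next_state = {}
    for var, parents in PARENTS.items():
        # Toy conditional distribution: the probability of being true grows with
        # the number of true parents (a stand-in for a domain's real CPTs).
        p_true = 0.1 + 0.8 * sum(state[p] for p in parents) / len(parents)
        next_state[var] = random.random() < p_true
    return next_state

if __name__ == "__main__":
    s = {"x0": True, "x1": False, "x2": True}
    print(step(s, "noop"))
```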
Highlights
  • A Relational Markov Decision Process (RMDP) (Boutilier et al., 2001) is a first-order representation for expressing all instances of a probabilistic planning domain with a possibly unbounded number of objects.
  • We present Symbolic NetWork (SYMNET), the first domain-independent neural planner for computing generalized policies for RMDPs that are expressed in the symbolic representation language RDDL (Relational Dynamic Influence Diagram Language) (Sanner, 2010).
  • We show all our results on nine RDDL domains used in the International Probabilistic Planning Competition (IPPC) 2014: Academic Advising (AA), Crossing Traffic (CT), Game of Life (GOL), Navigation (NAV), Skill Teaching (ST), Sysadmin (Sys), Tamarisk (Tam), Traffic (Tra), and Wildfire (Wild).
  • We see that SYMNET with no training achieves over 90% of the max reward on 43 instances and over 80% on 52 out of 54 instances.
  • This is our main result, and it highlights that SYMNET takes a major leap towards the goal of computing generalized policies for a whole RMDP domain, and can work on a new instance out of the box.
  • We present the first neural method for obtaining a generalized policy for an RMDP represented in RDDL.
Methods
  • The authors' goal is to estimate the effectiveness of the SYMNET out-of-the-box policy on a new problem in a domain.
  • To further understand the overall quality of the generalized policy, the authors compare it against several upper bounds that train neural models from scratch on the test instance.
  • The authors also compare it against the state-of-the-art online planner PROST (Keller & Eyerich, 2012).
Results
  • Comparison against Random Policy: The authors report the values of α_symnet(0) in Table 1 (see the note on this metric after this list).
  • The authors highlight the instances where the method achieves over 90% of the max reward obtained by any algorithm for that instance.
  • The authors show that the method performs the best out-of-the-box in 28 instances.
  • This is the main result, and it highlights that SYMNET takes a major leap towards the goal of computing generalized policies for a whole RMDP domain, and can work on a new instance out of the box.
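The exact definition of α_symnet(t) is given in the paper and is not reproduced on this page. A plausible reading, consistent with the "over 90% of the max reward" statements above, is SYMNET's reward after t hours of fine-tuning normalized between a random policy and the best algorithm on that instance; the formula below is an assumption, not a quotation from the paper.

```latex
% Assumed (not quoted) normalization: SYMNET's average reward after t hours of
% fine-tuning, rescaled so a random policy scores 0 and the best algorithm scores 1.
\[
  \alpha_{\text{symnet}}(t) \;=\;
  \frac{V_{\text{symnet}}(t) - V_{\text{random}}}{V_{\text{max}} - V_{\text{random}}}
\]
```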
Conclusion
  • The authors' method, named SYMNET, converts an RDDL problem instance into an instance graph, on which a graph neural network computes state embeddings and embeddings for important object tuples (a minimal illustrative sketch of this pipeline follows this list).
  • These are decoded into scores for each ground action.
  • Even when compared against deep reactive policies trained from scratch, SYMNET without training performs better than or on par with them in over half the problem instances.
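As a rough illustration of the pipeline described in this conclusion (instance graph → graph neural network → state and object-tuple embeddings → per-action scores), here is a minimal NumPy sketch. The toy graph, feature sizes, pooling choices, and single GCN-style message-passing step are all assumptions for illustration; the paper's actual architecture, parameter sharing, and training procedure are not reproduced here.

```python
# Illustrative sketch: instance graph -> message passing -> embeddings -> action scores.
import numpy as np

rng = np.random.default_rng(0)

# Toy instance graph: nodes stand for grounded state variables / objects, and
# edges connect nodes that influence each other in the (hypothetical) instance.
num_nodes, feat_dim, hid_dim = 5, 4, 8
adjacency = np.array([[0, 1, 0, 0, 1],
                      [1, 0, 1, 0, 0],
                      [0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1],
                      [1, 0, 0, 1, 0]], dtype=float)
node_feats = rng.normal(size=(num_nodes, feat_dim))  # current state + static features

# One round of GCN-style message passing (mean over neighbors).
W_self = rng.normal(size=(feat_dim, hid_dim))
W_nbr = rng.normal(size=(feat_dim, hid_dim))
deg = adjacency.sum(axis=1, keepdims=True)
neighbor_mean = adjacency @ node_feats / np.maximum(deg, 1.0)
node_emb = np.tanh(node_feats @ W_self + neighbor_mean @ W_nbr)

# State embedding: pool over all nodes. Object-tuple embeddings: pool over the
# nodes that each (hypothetical) ground action's parameters touch.
state_emb = node_emb.max(axis=0)
ground_actions = {"move(o1,o2)": [0, 1], "restart(o3)": [2], "noop": []}

# Decoder: score = linear map of [state embedding ; pooled action-node embedding].
W_dec = rng.normal(size=2 * hid_dim)
scores = {}
for action, nodes in ground_actions.items():
    tuple_emb = node_emb[nodes].max(axis=0) if nodes else np.zeros(hid_dim)
    scores[action] = float(np.concatenate([state_emb, tuple_emb]) @ W_dec)

print(max(scores, key=scores.get))  # greedy action under the (random) toy weights
```

The greedy argmax at the end merely stands in for action selection; in practice the scores would parameterize a policy trained with reinforcement learning.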
Tables
  • Table1: Values of α_symnet(0) (out-of-the-box SYMNET compared against a random policy) on the test instances.
  • Table2: Comparison of SYMNET against the SYMNET-s (SYM) architecture trained from scratch and the TORPIDO (TOR) architecture trained from scratch. We compare out-of-the-box SYMNET to the others after 4 hours of their training. INF is used when SYM or TOR achieved the minimum possible reward and hence SYMNET was infinitely better.
  • Table3: Comparison of TRAPSNET with SYMNET on three domains as published in (Garg et al., 2019). Labels: AA - Academic Advising, GOL - Game of Life, Sys - Sysadmin.
  • Table4: Comparison of PROST with SYMNET. INF is used when PROST returned a policy equal to or worse than a random policy
  • Table5: The statistics related to the domains listing the number of
  • Table6: The statistics of the domains, listing the number of UP (Un-Parameterized), Unary, and Multiple State Fluents (F) and Non-Fluents (NF) for each domain.
  • Table7: The statistics of the domain instances, listing the number of Objects, State Variables, and Action Variables for all instances of each domain. Instances 1, 2, 3 are used for training, 4 for validation, and 5, 6, 7, 8, 9, 10 for testing.
References
  • Atkeson, C. G. and Schaal, S. Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997, pp. 12–20, 1997.
  • Bajpai, A., Garg, S., and Mausam. Transfer of deep reactive policies for mdp planning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10965–10975. Curran Associates, Inc., 2018.
  • Bellman, R. A Markovian Decision Process. Indiana University Mathematics Journal, 1957.
  • Boutilier, C., Reiter, R., and Price, B. Symbolic dynamic programming for first-order mdps. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI 2001, Seattle, Washington, USA, August 4-10, 2001, pp. 690–700, 2001.
  • Fern, A., Yoon, S. W., and Givan, R. Approximate policy iteration with a policy language bias: Solving relational markov decision processes. J. Artif. Intell. Res., 25:75–118, 2006.
  • Garg, S., Bajpai, A., and Mausam. Size independent neural transfer for rddl planning. In Proceedings of the International Conference on Automated Planning and Scheduling, pp. 631–636, 2019.
  • Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016. URL http://arxiv.org/abs/1609.05518.
  • Groshev, E., Tamar, A., Goldstein, M., Srivastava, S., and Abbeel, P. Learning generalized reactive policies using deep neural networks. In ICAPS, 2018.
  • Grzes, M., Hoey, J., and Sanner, S. International Probabilistic Planning Competition (IPPC) 2014. In ICAPS, 2014. URL https://cs.uwaterloo.ca/~mgrzes/IPPC_2014/.
  • Guestrin, C., Koller, D., Gearhart, C., and Kanodia, N. Generalizing plans to new environments in relational mdps. In IJCAI, pp. 1003–1010, 2003.
  • Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1480–1490, 2017.
  • Issakkimuthu, M., Fern, A., and Tadepalli, P. Training deep reactive policies for probabilistic planning problems. In ICAPS, 2018.
  • Keller, T. and Eyerich, P. PROST: probabilistic planning based on UCT. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, Sao Paulo, Brazil, June 25-29, 2012, 2012. URL http://www.aaai.org/ocs/index.php/ICAPS/ICAPS12/paper/view/4715.
  • Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Kolobov, A., Mausam, and Weld, D. S. A theory of goaloriented mdps with dead ends. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, August 14-18, 2012, pp. 438–447, 2012.
  • Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. Teacher-student curriculum learning. CoRR, abs/1707.00183, 2017. URL http://arxiv.org/abs/1707.00183.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937, 2016.
  • Parisotto, E., Ba, L. J., and Salakhutdinov, R. Actormimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015. URL http://arxiv.org/abs/1511.06342.
  • Puterman, M. Markov Decision Processes. John Wiley & Sons, Inc., 1994.
  • Ruder, S. An overview of gradient descent optimization algorithms, 2016.
  • Sanner, S. Relational Dynamic Influence Diagram Language (RDDL): Language Description. 2010. URL http://users.cecs.anu.edu.au/
  • Sanner, S. and Boutilier, C. Practical solution techniques for first-order mdps. Artif. Intell., 173(5-6):748–788, 2009. doi: 10.1016/j.artint.2008.11.003. URL https://doi.org/10.1016/j.artint.2008.11.003.
  • Shen, W., Trevizan, F., Toyer, S., Thiebaux, S., and Xie, L. Guiding Search with Generalized Policies for Probabilistic Planning. In Proc. of 12th Annual Symp. on Combinatorial Search (SoCS), 2019. URL http://felipe.trevizan.org/papers/shen19b:guiding.pdf.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Sorg, J. and Singh, S. P. Transfer via soft homomorphisms. In 8th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), Budapest, Hungary, May 10-15, 2009, Volume 2, pp. 741–748, 2009.
  • Sridharan, N. S. (ed.). Proceedings of the 11th International Joint Conference on Artificial Intelligence. Detroit, MI, USA, August 1989, 1989. Morgan Kaufmann. ISBN 1-55860-094-9. URL http://ijcai.org/proceedings/1989-1.
  • Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009. doi: 10.1145/1577069.1755839. URL http://doi.acm.org/10.1145/1577069.1755839.
  • Toyer, S., Trevizan, F. W., Thiebaux, S., and Xie, L. Action schema networks: Generalised policies with deep learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
  • Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. CoRR, abs/1710.10903, 2017. URL http://arxiv.org/abs/1710.10903.
  • Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015. URL http://arxiv.org/abs/1505.00853.
  • Younes, H. L. S., Littman, M. L., Weissman, D., and Asmuth, J. The first probabilistic track of the international planning competition. J. Artif. Intell. Res., 24:851–887, 2005.