# Generalized Neural Policies for Relational MDPs

ICML 2020.

Keywords:

Machine Learning; Markov Decision Process; probabilistic planning; deep reactive policy; Academic Advising

Abstract:

A Relational Markov Decision Process (RMDP) is a first-order representation to express all instances of a single probabilistic planning domain with a possibly unbounded number of objects. Early work in RMDPs outputs generalized (instance-independent) first-order policies or value functions as a means to solve all instances of a domain at …


Introduction

- A Relational Markov Decision Process (RMDP) (Boutilier et al, 2001) is a first-order representation for expressing all instances of a probabilistic planning domain with a possibly unbounded number of objects.
- Traditional RMDP planners attempted to find a generalized first-order value function or policy using symbolic dynamic programming (Boutilier et al, 2001), or by approximating them via a function over learned first-order basis functions (Guestrin et al, 2003; Fern et al, 2006; Sanner & Boutilier, 2009).
- These methods met with rather limited success; e.g., no relational planner participated in the International Probabilistic Planning Competition (IPPC) after 2006, even though all competition domains were relational.
- These models are compact: each state variable x_i depends only on a small number of other state variables.
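The compactness claim above can be illustrated with a toy factored transition model: the next value of each state variable is sampled from a local distribution conditioned only on its few parent variables, so the joint transition factorizes over variables. The domain, variable names, and probabilities below are hypothetical illustrations, not taken from the paper.

```python
import random

# Hypothetical factored transition: each variable x_i depends only on a
# small parent set, so P(x' | x, a) = prod_i P(x_i' | parents(x_i), a).
PARENTS = {
    "up(c1)": ["up(c1)", "conn(c1,c2)"],   # sysadmin-style: a machine stays up
    "up(c2)": ["up(c2)", "conn(c1,c2)"],   # depending on itself and its link
}

def step_variable(var, state, action):
    """Sample the next value of one variable from its local model."""
    parent_vals = [state[p] for p in PARENTS[var]]
    p_up = 0.9 if all(parent_vals) else 0.3   # illustrative probabilities
    if action == f"reboot({var[3:-1]})":       # rebooting forces the machine up
        p_up = 1.0
    return random.random() < p_up

def step(state, action):
    """Transition the state by sampling each fluent from its local model."""
    return {v: step_variable(v, state, action) for v in PARENTS}

state = {"up(c1)": True, "up(c2)": True, "conn(c1,c2)": True}
# conn(c1,c2) is a non-fluent, so it is copied through unchanged
next_state = {**step(state, "noop"), "conn(c1,c2)": state["conn(c1,c2)"]}
```

Because every local model mentions only a handful of parents, the description stays small even as the number of objects (and hence ground variables) grows.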

Highlights

- A Relational Markov Decision Process (RMDP) (Boutilier et al, 2001) is a first-order representation for expressing all instances of a probabilistic planning domain with a possibly unbounded number of objects.
- We present Symbolic NetWork (SYMNET), the first domain-independent neural planner for computing generalized policies for RMDPs expressed in the symbolic Relational Dynamic Influence Diagram Language (RDDL) (Sanner, 2010).
- We show all our results on nine RDDL domains used in the International Probabilistic Planning Competition (IPPC) 2014: Academic Advising (AA), Crossing Traffic (CT), Game of Life (GOL), Navigation (NAV), Skill Teaching (ST), Sysadmin (Sys), Tamarisk (Tam), Traffic (Tra), and Wildfire (Wild).
- We see that SYMNET with no training achieves over 90% of the max reward on 43 instances, and over 80% on 52 out of 54 instances.
- This is our main result, and it highlights that SYMNET takes a major leap towards the goal of computing generalized policies for a whole RMDP domain, and can work on a new instance out of the box.
- We present the first neural method for obtaining a generalized policy for RMDPs represented in RDDL.

Methods

- The authors' goal is to estimate the effectiveness of SYMNET's out-of-the-box policy on a new problem in a domain.
- To further understand the overall quality of the generalized policy, the authors compare it against several upper bounds that train neural models from scratch on the test instance.
- The authors also compare it against the state-of-the-art online planner PROST (Keller & Eyerich, 2012).

Results

- Comparison against Random Policy: The authors report the values of αsymnet(0) in Table 1.
- The authors highlight the instances where the method achieves over 90% of the max reward obtained by any algorithm for that instance.
- The authors show that the method performs the best out-of-the-box in 28 instances.
- This is the main result, and it highlights that SYMNET takes a major leap towards the goal of computing generalized policies for the whole RMDP domain, and can work on a new instance out of the box
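The αsymnet(0) values referenced above are normalized scores. The exact formula is not quoted in this summary; a plausible reading, consistent with how this line of work (e.g., TRAPSNET) normalizes rewards, is that a random policy scores 0 and the best algorithm on an instance scores 1. The formula and all instance values below are therefore assumptions for illustration:

```python
def alpha(v_alg, v_rand, v_max):
    """Assumed normalized score: 0 for the random baseline, 1 for the
    best algorithm on this instance (illustrative, not the paper's exact
    definition)."""
    if v_max == v_rand:          # degenerate instance: nothing beat random
        return 0.0
    return (v_alg - v_rand) / (v_max - v_rand)

# Hypothetical instance values: SYMNET's zero-shot policy vs. baselines
v_random, v_best, v_symnet = -120.0, -40.0, -44.0
score = alpha(v_symnet, v_random, v_best)
print(f"alpha = {score:.2f}")    # 0.95 here: above the 90% threshold
```

Under this reading, "over 90% of the max reward" corresponds to α > 0.9 on that instance.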

Conclusion

- The authors' method, named SYMNET, converts an RDDL problem instance into an instance graph, on which a graph neural network computes state embeddings and embeddings for important object tuples.
- These are decoded into scores for each ground action.
- Even when compared against deep reactive policies trained from scratch, SYMNET without training performs better or on par in over half the problem instances.
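The pipeline described in the conclusion (instance graph → graph neural network embeddings → per-action scores) can be sketched with one round of toy message passing. The graph, embedding width, pooling, and single linear decoder below are illustrative assumptions, not the paper's actual RDDL-to-graph construction or architecture:

```python
import math
import random

random.seed(0)

# Toy instance graph over five object/fluent nodes (assumed structure)
ADJ = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 4], 4: [2, 3]}
D = 4                                          # embedding width (assumed)

feats = {v: [random.gauss(0, 1) for _ in range(D)] for v in ADJ}
w_dec = [random.gauss(0, 0.1) for _ in range(2 * D)]

def gnn_layer(h):
    """One round of mean-aggregation message passing with a ReLU."""
    out = {}
    for v, nbrs in ADJ.items():
        msg = [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(D)]
        out[v] = [max(0.0, h[v][k] + msg[k]) for k in range(D)]
    return out

h = gnn_layer(feats)
# State embedding: mean-pool the node embeddings
state_emb = [sum(h[v][k] for v in ADJ) / len(ADJ) for k in range(D)]

def score_action(v):
    """Decode a score for a ground action anchored at object node v."""
    pair = state_emb + h[v]                    # concatenate [state; object]
    return sum(p * w for p, w in zip(pair, w_dec))

scores = [score_action(v) for v in ADJ]
m = max(scores)
exp_scores = [math.exp(s - m) for s in scores]
z = sum(exp_scores)
policy = [e / z for e in exp_scores]           # softmax over ground actions
```

Because the scoring function is shared across object tuples, the same learned weights apply to instances with any number of objects, which is what makes the policy instance-independent.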


- Table 1: Table 1
- Table 2: Comparison of SYMNET against the SYMNET-s (SYM) architecture trained from scratch and the TORPIDO (TOR) architecture trained from scratch. We compare out-of-the-box SYMNET to the others after 4 hours of training. INF is used when SYM or TOR achieved the minimum possible reward, and hence SYMNET was infinitely better
- Table 3: Comparison of TRAPSNET with SYMNET on three domains as published in (Garg et al, 2019). Labels: AA - Academic Advising, GOL - Game of Life, Sys - Sysadmin
- Table 4: Comparison of PROST with SYMNET. INF is used when PROST returned a policy equal to or worse than a random policy
- Table 5: The statistics related to the domains listing the number of
- Table 6: The statistics related to the domains listing the number of UP (Un-Parameterized), Unary, and Multiple state fluents (F) and non-fluents (NF) for each domain
- Table 7: The statistics related to the domain instances listing the number of objects, state variables, and action variables for all the instances of the domains. Instances 1, 2, and 3 are used for training, 4 for validation, and 5, 6, 7, 8, 9, 10 for testing

Reference

- Atkeson, C. G. and Schaal, S. Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997, pp. 12–20, 1997.
- Bajpai, A., Garg, S., and Mausam. Transfer of deep reactive policies for mdp planning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10965–10975. Curran Associates, Inc., 2018.
- Bellman, R. A Markovian Decision Process. Indiana University Mathematics Journal, 1957.
- Boutilier, C., Reiter, R., and Price, B. Symbolic dynamic programming for first-order mdps. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI 2001, Seattle, Washington, USA, August 4-10, 2001, pp. 690–700, 2001.
- Fern, A., Yoon, S. W., and Givan, R. Approximate policy iteration with a policy language bias: Solving relational markov decision processes. J. Artif. Intell. Res., 25:75– 118, 2006.
- Garg, S., Bajpai, A., and Mausam. Size independent neural transfer for rddl planning. In Proceedings of the International Conference on Automated Planning and Scheduling, pp. 631–636, 2019.
- Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016. URL http://arxiv.org/abs/1609.05518.
- Groshev, E., Tamar, A., Goldstein, M., Srivastava, S., and Abbeel, P. Learning generalized reactive policies using deep neural networks. In ICAPS, 2018.
- Grzes, M., Hoey, J., and Sanner, S. International Probabilistic Planning Competition (IPPC) 2014. In ICAPS, 2014. URL https://cs.uwaterloo.ca/~mgrzes/IPPC_2014/.
- Guestrin, C., Koller, D., Gearhart, C., and Kanodia, N. Generalizing plans to new environments in relational mdps. In IJCAI, pp. 1003–1010, 2003.
- Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1480–1490, 2017.
- Issakkimuthu, M., Fern, A., and Tadepalli, P. Training deep reactive policies for probabilistic planning problems. In ICAPS, 2018.
- Keller, T. and Eyerich, P. PROST: probabilistic planning based on UCT. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, Sao Paulo, Brazil, June 25-29, 2012. URL http://www.aaai.org/ocs/index.php/ICAPS/ICAPS12/paper/view/4715.
- Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Kolobov, A., Mausam, and Weld, D. S. A theory of goaloriented mdps with dead ends. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, August 14-18, 2012, pp. 438–447, 2012.
- Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. Teacher-student curriculum learning. CoRR, abs/1707.00183, 2017. URL http://arxiv.org/abs/1707.00183.
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937, 2016.
- Parisotto, E., Ba, L. J., and Salakhutdinov, R. Actormimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015. URL http://arxiv.org/abs/1511.06342.
- Puterman, M. Markov Decision Processes. John Wiley & Sons, Inc., 1994.
- Ruder, S. An overview of gradient descent optimization algorithms, 2016.
- Sanner, S. Relational Dynamic Influence Diagram Language (RDDL): Language Description. 2010. URL http://users.cecs.anu.edu.au/
- Sanner, S. and Boutilier, C. Practical solution techniques for first-order mdps. Artif. Intell., 173(5-6):748–788, 2009. doi: 10.1016/j.artint.2008.11.003. URL https://doi.org/10.1016/j.artint.2008.11.003.
- Shen, W., Trevizan, F., Toyer, S., Thiebaux, S., and Xie, L. Guiding Search with Generalized Policies for Probabilistic Planning. In Proc. of 12th Annual Symp. on Combinatorial Search (SoCS), 2019. URL http://felipe.trevizan.org/papers/shen19b:guiding.pdf.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Sorg, J. and Singh, S. P. Transfer via soft homomorphisms. In 8th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), Budapest, Hungary, May 10-15, 2009, Volume 2, pp. 741–748, 2009.
- Sridharan, N. S. (ed.). Proceedings of the 11th International Joint Conference on Artificial Intelligence. Detroit, MI, USA, August 1989, 1989. Morgan Kaufmann. ISBN 1-55860-094-9. URL http://ijcai.org/proceedings/1989-1.
- Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009. doi: 10.1145/1577069.1755839. URL http://doi.acm.org/10.1145/1577069.1755839.
- Toyer, S., Trevizan, F. W., Thiebaux, S., and Xie, L. Action schema networks: Generalised policies with deep learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.
- Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. CoRR, abs/1710.10903, 2017. URL http://arxiv.org/abs/1710.10903.
- Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015. URL http://arxiv.org/abs/1505.00853.
- Younes, H. L. S., Littman, M. L., Weissman, D., and Asmuth, J. The first probabilistic track of the international planning competition. J. Artif. Intell. Res., 24:851–887, 2005.
