Lifelong Learning with a Changing Action Set
CoRR, 2019.
Abstract:
In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed.
Introduction
- In online retail, new products are constantly added to the stock, and in tutorial recommendation systems, new tutorials are regularly developed, thereby continuously increasing the number of available actions for a recommender engine
- These examples capture the broad idea that, for an agent deployed in real-world settings, the set of possible decisions it can make changes over time. This motivates the question the authors aim to answer: how can algorithms be developed that continually adapt to such changes in the action set over the agent's lifetime?
Highlights
- These examples capture the broad idea that, for an agent deployed in real-world settings, the set of possible decisions it can make changes over time. This motivates the question we aim to answer: how do we develop algorithms that can continually adapt to such changes in the action set over the agent's lifetime?
- We theoretically analyze the difference between what an algorithm can achieve with only the actions available at one point in time and the best it could achieve with access to the entire underlying space of actions. Leveraging insights from this analysis, we study how the structure of the underlying action space can be recovered from interactions with the environment, and how algorithms can use this structure to facilitate lifelong learning
- A trivial solution would be to ignore the newly available actions and continue using only the previously available ones. This is clearly myopic, and prevents the agent from achieving the better long-term returns that might be possible with the new actions. To address this parameterization problem, instead of having the policy, π, act directly in the observed action space, A, we propose an approach in which the agent reasons about the underlying structure of the problem in a way that makes its policy parameterization invariant to the number of available actions (a minimal sketch of this idea follows this list)
- In this work we established first steps towards developing the lifelong Markov decision process (MDP) setup for dealing with action sets that change over time
- Our proposed approach leveraged the structure in the action space such that an existing policy can be efficiently adapted to the new set of available actions
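Below is a minimal sketch (not the authors' exact LAICA implementation) of a policy whose learnable decision-making component is invariant to the number of available actions: the policy maps a state to a point in an action-representation space, and the executed action is the currently available action whose learned embedding is closest. All class and variable names, and the nearest-embedding lookup itself, are illustrative assumptions.

```python
import numpy as np

class InvariantPolicy:
    """Toy policy: beta maps states to an action-embedding space, so its
    parameters never depend on how many actions are currently available."""

    def __init__(self, state_dim, embed_dim, seed=0):
        self.rng = np.random.default_rng(seed)
        # beta: the decision-making component; its shape is independent of |A|.
        self.W = self.rng.normal(scale=0.1, size=(embed_dim, state_dim))
        self.action_embeddings = {}  # action id -> d-dimensional vector

    def add_actions(self, new_action_ids):
        """When the action set grows, only embeddings for the new actions are
        initialized; beta (self.W) is reused as-is."""
        d = self.W.shape[0]
        for a in new_action_ids:
            self.action_embeddings[a] = self.rng.normal(scale=0.1, size=d)

    def act(self, state):
        e = np.tanh(self.W @ state)  # beta(state): a point in the embedding space
        ids = list(self.action_embeddings)
        E = np.stack([self.action_embeddings[a] for a in ids])
        # Execute the available action whose embedding is closest to e.
        return ids[int(np.argmin(np.linalg.norm(E - e, axis=1)))]

# Usage: the same policy object keeps acting as the action set grows.
pi = InvariantPolicy(state_dim=4, embed_dim=3)
pi.add_actions(range(5))        # initial action set
a_old = pi.act(np.ones(4))
pi.add_actions(range(5, 12))    # new actions become available
a_new = pi.act(np.ones(4))
```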
Results
- The plots in Figures 3 and 4 present the evaluations on the domains considered. The advantage of LAICA over Baseline(1) can be attributed to its policy parameterization.
- The decision-making component of the policy, β, is invariant to the action cardinality and can therefore be reused after every change without being re-initialized
- This demonstrates that efficiently re-using past knowledge can improve data efficiency over the approach that learns from scratch every time.
- Note that even before the first addition of the new set of actions, the proposed method performs better than the baselines
- This can be attributed to the fact that the proposed method efficiently leverages the underlying structure in the action set and therefore learns faster (a toy structure-learning sketch follows this list).
- Similar observations have been made previously (Dulac-Arnold et al 2015; He et al 2015; Bajpai, Garg, and others 2018)
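As a companion to these observations, here is a hedged toy illustration of how structure in the action set can be inferred from interaction data: each action gets an embedding trained to predict which action produced a given (state, next state) transition, in the spirit of the action-representation works cited above. The objective, architecture, and hyper-parameters below are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def learn_action_embeddings(transitions, n_actions, state_dim,
                            embed_dim=3, lr=0.1, epochs=200, seed=0):
    """transitions: list of (s, a, s_next), with s and s_next as 1-D arrays
    of length state_dim and a an integer action id in [0, n_actions)."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(n_actions, embed_dim))      # action embeddings
    W = rng.normal(scale=0.1, size=(embed_dim, 2 * state_dim))  # transition encoder
    for _ in range(epochs):
        for s, a, s_next in transitions:
            x = np.concatenate([s, s_next])
            z = W @ x                      # encode the observed transition
            logits = E @ z                 # score every action for this transition
            p = np.exp(logits - logits.max())
            p /= p.sum()
            # Cross-entropy gradient: pull the taken action's embedding toward z.
            grad_logits = p.copy()
            grad_logits[a] -= 1.0
            E -= lr * np.outer(grad_logits, z)
            W -= lr * np.outer(E.T @ grad_logits, x)
    return E  # nearby rows correspond to actions with similar effects
```

Actions with similar effects on the state end up with nearby embeddings, which is the structure a cardinality-invariant policy can reuse when new actions arrive.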
Conclusion
- In this work the authors established first steps towards developing the lifelong MDP setup for dealing with action sets that change over time.
- The authors' proposed approach leveraged the structure in the action space so that an existing policy can be efficiently adapted to the new set of available actions.
- Superior performance on both synthetic and large-scale real-world environments demonstrates the benefits of the proposed LAICA algorithm.
Related work
- Lifelong learning is a well-studied problem (Thrun 1998; Ruvolo and Eaton 2013; Silver, Yang, and Li 2013; Chen and Liu 2016). Predominantly, prior methods aim to address catastrophic forgetting in order to leverage prior knowledge for new tasks (French 1999; Kirkpatrick et al 2017; Lopez-Paz and others 2017; Zenke, Poole, and Ganguli 2017). Several meta-reinforcement-learning methods address transfer learning, few-shot adaptation to new tasks after training on a distribution of similar tasks, and automated hyper-parameter tuning (Xu, van Hasselt, and Silver 2018; Gupta et al 2018; Wang et al 2017; Duan et al 2016; Finn, Abbeel, and Levine 2017). Alternatively, many lifelong RL methods consider learning online in the presence of continuously changing transition dynamics or reward functions (Neu 2013; Gajane, Ortner, and Auer 2018). In our work, we look at a complementary aspect of the lifelong learning problem, wherein the size of the action set available to the agent changes over its lifetime.
- Our work also draws inspiration from recent works that leverage action embeddings (Dulac-Arnold et al 2015; He et al 2015; Bajpai, Garg, and others 2018; Chandak et al 2019; Tennenholtz and Mannor 2019). Building upon their ideas, we present a new objective for learning structure in the action space and show that the performance of the policy resulting from using this inferred structure has bounded sub-optimality. Moreover, in contrast to their setup, where the size of the action set is fixed, we consider the lifelong MDP setting, where the number of actions changes over time.
Funding
- The research was supported by and partially conducted at Adobe Research
Reference
- [Achiam et al. 2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 22–31. JMLR.org.
- [Bajpai, Garg, and others 2018] Bajpai, A. N.; Garg, S.; et al. 2018. Transfer of deep reactive policies for mdp planning. In Advances in Neural Information Processing Systems, 10965–10975.
- [Boutilier et al. 2018] Boutilier, C.; Cohen, A.; Daniely, A.; Hassidim, A.; Mansour, Y.; Meshi, O.; Mladenov, M.; and Schuurmans, D. 2018. Planning and learning with stochastic action sets. In IJCAI.
- [Chandak et al. 2019] Chandak, Y.; Theocharous, G.; Kostas, J.; Jordan, S.; and Thomas, P. S. 2019. Learning action representations for reinforcement learning. International Conference on Machine Learning.
- [Chen and Liu 2016] Chen, Z., and Liu, B. 2016. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(3):1–145.
- [Devroye et al. 2017] Devroye, L.; Gyorfi, L.; Lugosi, G.; and Walk, H. 2017. On the measure of voronoi cells. Journal of Applied Probability.
- [Duan et al. 2016] Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
- [Dulac-Arnold et al. 2015] Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
- [Ferreira et al. 2017] Ferreira, L. A.; Bianchi, R. A.; Santos, P. E.; and de Mantaras, R. L. 2017. Answer set programming for non-stationary markov decision processes. Applied Intelligence 47(4):993–1007.
- [Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.
- [French 1999] French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3(4):128–135.
- [Gabel and Riedmiller 2008] Gabel, T., and Riedmiller, M. 2008. Reinforcement learning for dec-mdps with changing action sets and partially ordered dependencies. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 3, 1333–1336. International Foundation for Autonomous Agents and Multiagent Systems.
- [Gajane, Ortner, and Auer 2018] Gajane, P.; Ortner, R.; and Auer, P. 2018. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066.
- [Gupta et al. 2018] Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 5302–5311.
- [He et al. 2015] He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636.
- [Higgins et al. 2017] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3.
- [Kakade and Langford 2002] Kakade, S., and Langford, J. 2002. Approximately optimal approximate reinforcement learning. In ICML, volume 2, 267–274.
- [Kearns and Singh 2002] Kearns, M., and Singh, S. 2002. Nearoptimal reinforcement learning in polynomial time. Machine learning 49(2-3):209–232.
- [Kingma and Welling 2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Kirkpatrick et al. 2017] Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114(13):3521–3526.
- [Konidaris, Osentoski, and Thomas 2011] Konidaris, G.; Osentoski, S.; and Thomas, P. S. 2011. Value function approximation in reinforcement learning using the fourier basis. In AAAI, volume 6, 7.
- [Lopez-Paz and others 2017] Lopez-Paz, D., et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 6467–6476.
- [Mandel et al. 2017] Mandel, T.; Liu, Y.-E.; Brunskill, E.; and Popovic, Z. 2017. Where to add actions in human-in-the-loop reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- [Nachum et al. 2018] Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2018. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257.
- [Neu 2013] Neu, G. 2013. Online learning in non-stationary markov decision processes. CoRR.
- [Pirotta et al. 2013] Pirotta, M.; Restelli, M.; Pecorino, A.; and Calandriello, D. 2013. Safe policy iteration. In International Conference on Machine Learning, 307–315.
- [Puterman 2014] Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- [Ruvolo and Eaton 2013] Ruvolo, P., and Eaton, E. 2013. Ella: An efficient lifelong learning algorithm. In International Conference on Machine Learning, 507–515.
- [Shani, Heckerman, and Brafman 2005] Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An mdp-based recommender system. Journal of Machine Learning Research 6(Sep):1265–1295.
- [Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
- [Silver, Yang, and Li 2013] Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI spring symposium series.
- [Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
- [Sutton et al. 2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
- [Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.
- [Tennenholtz and Mannor 2019] Tennenholtz, G., and Mannor, S. 2019. The natural language of actions. International Conference on Machine Learning.
- [Thrun 1998] Thrun, S. 1998. Lifelong learning algorithms. In Learning to learn. Springer. 181–209.
- [Wang et al. 2017] Wang, J.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2017. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
- [Xu, van Hasselt, and Silver 2018] Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Meta-gradient reinforcement learning. In Advances in neural information processing systems.
- [Zenke, Poole, and Ganguli 2017] Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3987–3995. JMLR.org.
- For the purpose of our results, we require bounding the shift in the state distribution between two policies. Techniques for doing so have been previously studied in the literature (Kakade and Langford 2002; Kearns and Singh 2002; Pirotta et al. 2013; Achiam et al. 2017). Specifically, we cover this preliminary result based on the work by Achiam et al. (2017); a representative form of the bound is given below.
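For reference, the following is a commonly quoted form of the state-distribution shift bound following Achiam et al. (2017); the exact statement and constants used in the paper may differ slightly.

```latex
% State-distribution shift bound (following Achiam et al., 2017).
% d^{\pi} denotes the discounted state distribution induced by policy \pi,
% and D_{TV} is the total variation distance between action distributions.
\[
\left\lVert d^{\pi'} - d^{\pi} \right\rVert_1
\;\le\;
\frac{2\gamma}{1-\gamma}\,
\mathbb{E}_{s \sim d^{\pi}}\!\left[ D_{\mathrm{TV}}\!\left( \pi'(\cdot \mid s) \,\middle\|\, \pi(\cdot \mid s) \right) \right].
\]
```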
Best Paper
Best Paper of AAAI, 2019