Lifelong Learning with a Changing Action Set

Yash Chandak
Chris Nota

CoRR, 2019.


Abstract:

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. [...]

Introduction
  • In online retail, new products are constantly added to the stock, and in tutorial recommendation systems, new tutorials are regularly developed, thereby continuously increasing the number of available actions for a recommender engine.
  • These examples capture the broad idea that, for an agent deployed in real-world settings, the set of decisions it can make changes over time, and they motivate the question the authors aim to answer: how can algorithms be developed that continually adapt to such changes in the action set over the agent's lifetime?
Highlights
  • These examples capture the broad idea that, for an agent deployed in real-world settings, the set of decisions it can make changes over time, and they motivate the question we aim to answer: how do we develop algorithms that can continually adapt to such changes in the action set over the agent's lifetime?
  • We theoretically analyze the difference between what an algorithm can achieve with only the actions that are available at one point in time, and the best that it could achieve with access to the entire underlying space of actions. Leveraging insights from this analysis, we study how the structure of the underlying action space can be recovered from interactions with the environment, and how algorithms can be developed to use this structure to facilitate lifelong learning.
  • A trivial solution would be to ignore the newly available actions and continue using only the previously available ones. This is clearly myopic, and it will prevent the agent from achieving the better long-term returns that might be possible using the new actions. At the same time, a standard policy parameterization (e.g., a softmax over the currently available actions) is tied to the number of actions and cannot be reused directly when that number changes. To address this parameterization problem, instead of having the policy, π, act directly in the observed action space, A, we propose an approach wherein the agent reasons about the underlying structure of the problem in a way that makes its policy parameterization invariant to the number of actions that are available (see the sketch after this list).
  • In this work, we took the first steps towards developing the lifelong Markov decision process (MDP) setup for dealing with action sets that change over time.
  • Our proposed approach leverages the structure in the action space so that an existing policy can be efficiently adapted to the new set of available actions.
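
    To make the parameterization idea above concrete, here is a minimal, hypothetical sketch (illustrative names, not the authors' implementation): an internal decision component β maps the state to a point in a fixed d-dimensional action-representation space, and the executed action is the currently available action whose learned embedding lies closest to that point, so the trainable core never depends on how many actions exist.

```python
import numpy as np

class CardinalityInvariantPolicy:
    """Illustrative sketch (hypothetical class, not the paper's code):
    beta maps a state to a point in a d-dimensional action-representation
    space; the executed action is the available action whose learned
    embedding is nearest to that point, so beta is independent of |A|."""

    def __init__(self, state_dim, rep_dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.rep_dim = rep_dim
        # Trainable core beta: here just a linear map from states to the rep space.
        self.W = 0.01 * self.rng.standard_normal((rep_dim, state_dim))
        # One learned embedding per currently available action (grows over time).
        self.action_embeddings = np.empty((0, rep_dim))

    def add_actions(self, n_new):
        # New actions only append rows to the embedding table; beta is untouched.
        new = 0.01 * self.rng.standard_normal((n_new, self.rep_dim))
        self.action_embeddings = np.vstack([self.action_embeddings, new])

    def act(self, state):
        e_hat = self.W @ state                                  # beta(state)
        dists = np.linalg.norm(self.action_embeddings - e_hat, axis=1)
        return int(np.argmin(dists))                            # nearest available action
```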
Results
  • The plots in Figures 3 and 4 present the evaluations on the domains considered. The advantage of LAICA over Baseline(1) can be attributed to its policy parameterization.
  • Because the decision-making component of the policy, β, is invariant to the action cardinality, it can be readily reused after every change without having to be re-initialized (see the usage sketch after this list).
  • This demonstrates that efficiently re-using past knowledge can improve data efficiency over the approach that learns from scratch every time.
  • Note that even before the first addition of new actions, the proposed method performs better than the baselines.
  • This can be attributed to the fact that the proposed method efficiently leverages the underlying structure in the action set and learns faster.
  • Similar observations have been made previously (Dulac-Arnold et al. 2015; He et al. 2015; Bajpai, Garg, and others 2018).
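
    As a hypothetical illustration of what "reused after every change without re-initialization" means procedurally (building on the CardinalityInvariantPolicy sketch above, not the authors' experimental code): a change in the action set only grows the embedding table, while the weights of β carry over unchanged, whereas a from-scratch baseline would discard them.

```python
import numpy as np  # CardinalityInvariantPolicy is the illustrative class sketched above

policy = CardinalityInvariantPolicy(state_dim=4, rep_dim=2)
policy.add_actions(5)                        # phase 1: five actions available
W_trained = policy.W.copy()                  # stand-in for beta's weights after phase-1 training

policy.add_actions(3)                        # change point: three new actions become available
assert np.array_equal(policy.W, W_trained)   # beta is reused, not re-initialized
action = policy.act(np.ones(4))              # immediately applicable to all 8 actions

scratch = CardinalityInvariantPolicy(state_dim=4, rep_dim=2, seed=1)
scratch.add_actions(8)                       # baseline: beta must be re-learned from zero
```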
Conclusion
  • In this work, the authors took the first steps towards developing the lifelong MDP setup for dealing with action sets that change over time.
  • The authors' proposed approach leverages the structure in the action space so that an existing policy can be efficiently adapted to the new set of available actions.
  • Superior performance on both synthetic and large-scale real-world environments demonstrates the benefits of the proposed LAICA algorithm.
Funding
  • The research was supported by and partially conducted at Adobe Research.
Reference
  • [Achiam et al. 2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 22–31. JMLR.org.
  • [Bajpai, Garg, and others 2018] Bajpai, A. N.; Garg, S.; et al. 2018. Transfer of deep reactive policies for MDP planning. In Advances in Neural Information Processing Systems, 10965–10975.
  • [Boutilier et al. 2018] Boutilier, C.; Cohen, A.; Daniely, A.; Hassidim, A.; Mansour, Y.; Meshi, O.; Mladenov, M.; and Schuurmans, D. 2018. Planning and learning with stochastic action sets. In IJCAI.
  • [Chandak et al. 2019] Chandak, Y.; Theocharous, G.; Kostas, J.; Jordan, S.; and Thomas, P. S. 2019. Learning action representations for reinforcement learning. In International Conference on Machine Learning.
  • [Chen and Liu 2016] Chen, Z., and Liu, B. 2016. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(3):1–145.
  • [Devroye et al. 2017] Devroye, L.; Gyorfi, L.; Lugosi, G.; and Walk, H. 2017. On the measure of Voronoi cells. Journal of Applied Probability.
  • [Duan et al. 2016] Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL^2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
  • [Dulac-Arnold et al. 2015] Dulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber, T.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
  • [Ferreira et al. 2017] Ferreira, L. A.; Bianchi, R. A.; Santos, P. E.; and de Mantaras, R. L. 2017. Answer set programming for non-stationary Markov decision processes. Applied Intelligence 47(4):993–1007.
  • [Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.
  • [French 1999] French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3(4):128–135.
  • [Gabel and Riedmiller 2008] Gabel, T., and Riedmiller, M. 2008. Reinforcement learning for Dec-MDPs with changing action sets and partially ordered dependencies. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3, 1333–1336. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Gajane, Ortner, and Auer 2018] Gajane, P.; Ortner, R.; and Auer, P. 2018. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066.
  • [Gupta et al. 2018] Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 5302–5311.
  • [He et al. 2015] He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636.
  • [Higgins et al. 2017] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3.
  • [Kakade and Langford 2002] Kakade, S., and Langford, J. 2002. Approximately optimal approximate reinforcement learning. In ICML, volume 2, 267–274.
  • [Kearns and Singh 2002] Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2-3):209–232.
  • [Kingma and Welling 2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [Kirkpatrick et al. 2017] Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.
  • [Konidaris, Osentoski, and Thomas 2011] Konidaris, G.; Osentoski, S.; and Thomas, P. S. 2011. Value function approximation in reinforcement learning using the Fourier basis. In AAAI, volume 6, 7.
  • [Lopez-Paz and others 2017] Lopez-Paz, D., et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 6467–6476.
  • [Mandel et al. 2017] Mandel, T.; Liu, Y.-E.; Brunskill, E.; and Popovic, Z. 2017. Where to add actions in human-in-the-loop reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [Nachum et al. 2018] Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2018. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257.
  • [Neu 2013] Neu, G. 2013. Online learning in non-stationary Markov decision processes. CoRR.
  • [Pirotta et al. 2013] Pirotta, M.; Restelli, M.; Pecorino, A.; and Calandriello, D. 2013. Safe policy iteration. In International Conference on Machine Learning, 307–315.
  • [Puterman 2014] Puterman, M. L. 2014. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons.
  • [Ruvolo and Eaton 2013] Ruvolo, P., and Eaton, E. 2013. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, 507–515.
  • [Shani, Heckerman, and Brafman 2005] Shani, G.; Heckerman, D.; and Brafman, R. I. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6(Sep):1265–1295.
  • [Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
  • [Silver, Yang, and Li 2013] Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series.
  • [Sutton and Barto 2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
  • [Sutton et al. 2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
  • [Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2):181–211.
  • [Tennenholtz and Mannor 2019] Tennenholtz, G., and Mannor, S. 2019. The natural language of actions. In International Conference on Machine Learning.
  • [Thrun 1998] Thrun, S. 1998. Lifelong learning algorithms. In Learning to Learn. Springer. 181–209.
  • [Wang et al. 2017] Wang, J.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2017. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
  • [Xu, van Hasselt, and Silver 2018] Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems.
  • [Zenke, Poole, and Ganguli 2017] Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3987–3995. JMLR.org.
  • For the purpose of our results, we would require bounding the shift in the state distribution between two policies. Techniques for doing so have been studied previously in the literature (Kakade and Langford 2002; Kearns and Singh 2002; Pirotta et al. 2013; Achiam et al. 2017). Specifically, we cover this preliminary result, based on the work by Achiam et al. (2017), recalled below.
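    For reference, a bound of this form appears in Achiam et al. (2017); the notation here is ours, with $d^{\pi}$ the discounted state distribution under policy $\pi$, $\gamma$ the discount factor, and $D_{TV}$ the total variation distance:

    $$\big\lVert d^{\pi'} - d^{\pi} \big\rVert_1 \;\le\; \frac{2\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi}}\!\left[ D_{TV}\big(\pi'(\cdot \mid s) \,\big\|\, \pi(\cdot \mid s)\big) \right].$$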