Weakly-Supervised Reinforcement Learning for Controllable Behavior

NeurIPS, 2020.


Abstract:

Reinforcement learning (RL) is a powerful framework for learning to take actions to solve tasks. However, in many settings, an agent must winnow down the inconceivably large space of all possible tasks to the single task that it is currently being asked to solve. Can we instead constrain the space of tasks to those that are semantically meaningful? …

Introduction
  • A general purpose agent must be able to efficiently learn a diverse array of tasks through interacting with the real world.
  • The authors aim to accelerate the acquisition of goal-conditioned policies by narrowing the goal space through weak supervision.
  • Answering the question of whether the task space can be constrained to semantically meaningful tasks would allow an RL agent to prioritize exploring and learning those tasks, resulting in faster acquisition of behaviors for solving human-specified tasks.
Highlights
  • A general purpose agent must be able to efficiently learn a diverse array of tasks through interacting with the real world
  • We proposed weak supervision as a means to scalably introduce structure into goal-conditioned reinforcement learning
  • To leverage the weak supervision, we proposed a simple two-phase approach that first learns a disentangled representation and then uses it to guide exploration, propose goals, and inform a distance metric (a sketch of this two-phase recipe follows this list)
  • Our experimental results indicate that our approach, weakly-supervised control (WSC), substantially outperforms self-supervised methods that cannot cope with the breadth of the environments
  • Our comparisons suggest that our disentanglement-based approach is critical for effectively leveraging the weak supervision
  • While WSC can leverage weak labels that can be collected offline, for example via crowd compute, it requires a user to indicate the factors of variation that are relevant for downstream tasks, which may require expertise
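
A minimal sketch of this two-phase recipe is shown below. It is an illustration under assumptions, not the authors' released code: `encoder`, `policy`, `env`, `weak_label_batches`, and the helper names `weak_supervision_loss`, `sample_latent_goal`, and `encode` are all placeholders.

```python
import numpy as np

# ---- Phase 1: learn a disentangled representation from weak labels ----
def train_disentangled_encoder(encoder, weak_label_batches, num_steps=10_000):
    # Each weak label is a pair of states (s1, s2) plus a comparison vector y
    # indicating, per factor of variation, how the two states compare.
    for _, (s1, s2, y) in zip(range(num_steps), weak_label_batches):
        loss = encoder.weak_supervision_loss(s1, s2, y)  # placeholder objective
        encoder.optimizer_step(loss)
    return encoder

# ---- Phase 2: the frozen encoder proposes goals and shapes rewards ----
def latent_goal_reward(encoder, next_state, z_goal, factor_indices):
    # Negative squared L2 distance in the disentangled latent space,
    # restricted to the user-indicated factor dimensions I.
    z = encoder.encode(next_state)[factor_indices]
    return -float(np.sum((z - z_goal) ** 2))

def run_wsc(encoder, env, policy, factor_indices, episodes=100):
    for _ in range(episodes):
        z_goal = encoder.sample_latent_goal(factor_indices)  # goal drawn from p(Z_I)
        s = env.reset()
        done = False
        while not done:
            a = policy.act(s, z_goal)
            s_next, _, done, _ = env.step(a)  # the environment reward is ignored
            r = latent_goal_reward(encoder, s_next, z_goal, factor_indices)
            policy.store(s, a, r, s_next, z_goal)
            s = s_next
        policy.update()  # any off-policy goal-conditioned learner, e.g. SAC with relabelling
```

The point of the split is that the (possibly crowd-sourced) weak labels are consumed entirely in phase 1; phase 2 is ordinary goal-conditioned RL that only sees the learned latent space.
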
Methods
  • The authors sometimes relabel the transition with a corrected goal zg, sampled either from the goal distribution p(ZI) in Eq 2 or from a future state in the current trajectory (this relabelling step is sketched after this list).
  • The reward is the negative squared ℓ2-distance in the disentangled latent space: rt := Rzg(s′) := −‖eI(s′) − zg‖₂², where eI(s′) is the disentangled encoding of the state s′ restricted to the user-indicated factor indices I.
  • The authors' method outputs a goal-conditioned policy π(a | s, zg) which is trained to go to a state that is close to zg in the disentangled latent space
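
A sketch of the relabelling and reward computation described in the bullets above follows. Table 5 specifies only the fraction of goals relabelled with zg ∼ p(ZI); the second branch (relabelling with a future state) and all helper names (`sample_latent_goal`, `encode`) are illustrative assumptions rather than the authors' exact implementation.

```python
import random
import numpy as np

def relabel_transition(transition, trajectory, t, encoder, factor_indices,
                       p_goal=0.5, p_future=0.3):
    # transition = (s, a, r, s_next, z_goal); `trajectory` is the list of states
    # from the episode this transition came from, and `t` is its index there.
    s, a, _, s_next, z_goal = transition
    u = random.random()
    if u < p_goal:
        # Resample the goal from the learned latent goal distribution p(Z_I).
        z_goal = encoder.sample_latent_goal(factor_indices)
    elif u < p_goal + p_future:
        # Hindsight-style correction: encode a future state of the same trajectory.
        future_state = random.choice(trajectory[t:])
        z_goal = encoder.encode(future_state)[factor_indices]
    # Recompute the reward against the (possibly corrected) goal as the negative
    # squared L2 distance in the disentangled latent space.
    e_next = encoder.encode(s_next)[factor_indices]
    r = -float(np.sum((e_next - z_goal) ** 2))
    return (s, a, r, s_next, z_goal)
```
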
Conclusion
  • The authors proposed weak supervision as a means to scalably introduce structure into goal-conditioned reinforcement learning.
  • To leverage the weak supervision, the authors proposed a simple two-phase approach that first learns a disentangled representation and then uses it to guide exploration, propose goals, and inform a distance metric.
  • Because the weak supervision is currently collected offline before training, incorporating weak supervision online, in the loop of RL, could address this limitation and improve performance.
  • In such settings, the authors expect class imbalance and human-in-the-loop learning to present important but surmountable challenges.
Summary
  • Introduction:

    A general purpose agent must be able to efficiently learn a diverse array of tasks through interacting with the real world.
  • The authors aim to accelerate the acquisition of goal-conditioned policies by narrowing the goal space through weak supervision.
  • Answering the question of whether the task space can be constrained to semantically meaningful tasks would allow an RL agent to prioritize exploring and learning those tasks, resulting in faster acquisition of behaviors for solving human-specified tasks.
  • Objectives:

    The authors aim to accelerate the acquisition of goal-conditioned policies by narrowing the goal space through weak supervision.
  • Unlike standard RL, which requires hand-designed reward functions that are often expensive to obtain in complex environments, the authors aim to design the weakly-supervised RL problem in a way that provides a convenient form of supervision that scales with the collection of offline data.
  • The authors aim to first and foremost answer the core hypothesis: (1) Does weakly supervised control help guide exploration and learning, for increased performance over prior approaches? They further investigate: (2) What is the relative importance of the goal generation mechanism vs. the distance metric used in WSC? (3) Is weak supervision necessary for learning a disentangled state representation? (4) Is the policy’s behavior interpretable? and (5) How much weak supervision is needed to learn a sufficiently-disentangled state representation? Questions (1) through (4) are investigated in the main experiments, while question (5) is studied in Appendix A.1.
  • Methods:

    The authors sometimes relabel the transition with a corrected goal zg, sampled either from the goal distribution p(ZI) in Eq 2 or from a future state in the current trajectory.
  • The reward is the negative squared ℓ2-distance in the disentangled latent space: rt := Rzg(s′) := −‖eI(s′) − zg‖₂², where eI(s′) is the disentangled encoding of the state s′ restricted to the user-indicated factor indices I.
  • The authors' method outputs a goal-conditioned policy π(a | s, zg) which is trained to go to a state that is close to zg in the disentangled latent space
  • Conclusion:

    The authors proposed weak supervision as a means to scalably introduce structure into goal-conditioned reinforcement learning.
  • To leverage the weak supervision, the authors proposed a simple two-phase approach that first learns a disentangled representation and then uses it to guide exploration, propose goals, and inform a distance metric.
  • Because the weak supervision is currently collected offline before training, incorporating weak supervision online, in the loop of RL, could address this limitation and improve performance.
  • In such settings, the authors expect class imbalance and human-in-the-loop learning to present important but surmountable challenges.
Tables
  • Table1: Conceptual comparison between our method, weakly-supervised control (WSC), and prior visual goal-conditioned RL methods, with their respective latent goal distributions p(Z) and goal-conditioned reward functions Rzg(s′). Our method can be seen as an extension of prior work to the weakly-supervised setting.
  • Table2: Is the learned state representation disentangled? We measure the correlation between the true factor value of the input image vs. the latent dimension of the encoded image on the evaluation dataset. We show the 95% confidence interval over 5 seeds. We find that unsupervised VAEs are often insufficient for learning a disentangled representation
  • Table3: Is the learned policy interpretable? We investigate whether latent goals zg align directly with the final state of the trajectory after rolling out π(a | s, zg). We measure the correlation between the true factor value of the final state in the trajectory vs. the corresponding latent dimension of zg. We show the 95% confidence interval over 5 seeds. Our method attains higher correlation between latent goals and final states, meaning that it learns a more interpretable goal-conditioned policy
  • Table4: How many weak labels are needed to learn a sufficiently-disentangled state representation? We trained disentangled representations on varying numbers of weakly-labelled data samples {(s1(i), s2(i), y(i))} for i = 1, …, N (N ∈ {128, 256, …, 4096}), then evaluated how well they disentangled the true factors of variation in the data. On the evaluation dataset, we measure the Pearson correlation between the true factor value of the input image and the latent dimension of the encoded image (this check is sketched in code after the table list). For the VAE (obtained from SkewFit), we took the latent dimension that has the highest correlation with the true factor value. We report the 95% confidence interval over 5 seeds. Even with a small amount of weak supervision (e.g. around 1024 labels), we are able to attain a representation with good disentanglement.
  • Table5: Environment-specific hyperparameters: M is the number of training images. “WSC pgoal” is the percentage of goals that are relabelled with zg ∼ p(ZI) in WSC. αDR is the VAE reward coefficient for SkewFit+DR in Eq 4
  • Table6: Disentangled representation model architecture: We slightly modified the disentangled model architecture from Shu et al. (2019).
  • Table7: VAE architecture & hyperparameters: β is the KL regularization coefficient in the β-VAE loss. We found that a smaller VAE latent dim LVAE ∈ {4, 16} worked best for SkewFit, RIG, and HER (which use the VAE for both hindsight relabelling and for the actor & critic networks).
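
Tables 2-4 all rely on the same disentanglement check: correlate each true factor of variation with a latent dimension on a held-out evaluation set. A minimal sketch of that check follows, assuming a generic `encoder.encode` interface and arrays `eval_images` and `true_factors` that are not part of the paper.

```python
import numpy as np

def factor_latent_correlations(encoder, eval_images, true_factors):
    # For each ground-truth factor k, report the highest |Pearson r| between
    # that factor's values and any single latent dimension on the evaluation set.
    # `true_factors` has shape (N, K); `encoder.encode` maps an image to a latent vector.
    latents = np.stack([encoder.encode(img) for img in eval_images])  # shape (N, D)
    scores = []
    for k in range(true_factors.shape[1]):
        # For an unsupervised VAE there is no designated latent dimension per factor,
        # so take the dimension with the highest absolute Pearson correlation
        # (as done for the SkewFit VAE baseline in Table 4).
        best = max(abs(np.corrcoef(true_factors[:, k], latents[:, d])[0, 1])
                   for d in range(latents.shape[1]))
        scores.append(best)
    return scores
```

This sketch always picks the best-correlated dimension per factor, matching the unsupervised VAE procedure of Table 4; for the disentangled encoder of Tables 2-3 the designated latent dimension for each factor could be used directly instead.
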
Funding
  • LL is supported by the National Science Foundation (DGE1745016)
  • BE is supported by the Fannie and John Hertz Foundation and the National Science Foundation (DGE1745016)
  • RS is supported by NSF IIS1763562, ONR Grant N000141812861, and US Army
References
  • Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in neural information processing systems, pp. 5074–5082, 2016.
  • Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
  • Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.
  • Bengio, E., Thomas, V., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1703.07718, 2017.
  • Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Brown, D. S., Goo, W., Nagarajan, P., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. arXiv preprint arXiv:1904.06387, 2019.
  • Chen, J. and Batmanghelich, K. Weakly supervised disentanglement by pairwise similarities. arXiv preprint arXiv:1906.01044, 2019.
  • Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610– 2620, 2018.
  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
  • Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307, 2017.
  • Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., and Finn, C. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019.
  • Dilokthanakul, N., Kaplanis, C., Pawlowski, N., and Shanahan, M. Feature control as intrinsic motivation for hierarchical reinforcement learning. IEEE transactions on neural networks and learning systems, 30(11):3409–3418, 2019.
  • Dosovitskiy, A. and Koltun, V. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
  • Duan, Y., Andrychowicz, M., Stadie, B., Ho, O. J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098, 2017.
  • Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.
  • Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. IEEE, 2017.
  • Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016a.
  • Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. IEEE, 2016b.
  • Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
  • Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
  • Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for datadriven reward definition. In Advances in Neural Information Processing Systems, pp. 8538–8547, 2018.
  • Gabbay, A. and Hoshen, Y. Latent optimization for nonadversarial representation disentanglement. arXiv preprint arXiv:1906.11796, 2019.
  • Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736, 2019.
  • Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. Conference on Robot Learning (CoRL), 2019.
  • Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pp. 80–89. IEEE, 2018.
  • Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., and Farhadi, A. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098, 2018.
  • Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H., Levine, S., and Bengio, Y. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
  • Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. IEEE, 2017.
  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
  • Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
  • Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. 2018.
  • Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
  • Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.
  • Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
  • Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
  • Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
  • Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
  • Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.
  • Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019a.
  • Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019b.
  • Locatello, F., Bauer, S., Lucic, M., Ratsch, G., Gelly, S., Scholkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
  • Machado, M. C., Bellemare, M. G., and Bowling, M. A laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2295–2304. JMLR. org, 2017.
  • Mahadevan, S. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pp. 553–560. ACM, 2005.
  • Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790, 2018.
  • Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  • Nachum, O., Gu, S., Lee, H., and Levine, S. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018.
  • Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191– 9200, 2018.
  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiositydriven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
  • Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053, 2018.
  • Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
  • Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Pugh, J. K., Soros, L. B., and Stanley, K. O. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.
  • Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736, 2006.
  • Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.
  • Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
  • Shu, R., Chen, Y., Kumar, A., Ermon, S., and Poole, B. Weakly supervised disentanglement with guarantees. arXiv preprint arXiv:1910.09772, 2019.
  • Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Singh, A., Yang, L., Hartikainen, K., Finn, C., and Levine, S. End-to-end robotic reinforcement learning without reward engineering. arXiv preprint arXiv:1904.07854, 2019.
  • Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary computation, 10 (2):99–127, 2002.
  • Thomas, V., Pondard, J., Bengio, E., Sarfati, M., Beaudoin, P., Meurs, M.-J., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
  • van Steenkiste, S., Locatello, F., Schmidhuber, J., and Bachem, O. Are disentangled representations helpful for abstract visual reasoning? In Advances in Neural Information Processing Systems, pp. 14222–14235, 2019.
  • Vecerik, M., Sushkov, O., Barker, D., Rothorl, T., Hester, T., and Scholz, J. A practical approach to insertion with variable socket position using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 754–760. IEEE, 2019.
  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350– 354, 2019.
  • Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.
  • Xie, A., Singh, A., Levine, S., and Finn, C. Few-shot goal inference for visuomotor learning and planning. arXiv preprint arXiv:1810.00482, 2018.
  • Yarats, D., Zhang, A., Kostrikov, I., Amos, B., Pineau, J., and Fergus, R. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019.
  • Yu, T., Shevchuk, G., Sadigh, D., and Finn, C. Unsupervised visuomotor control through distributional planning networks. arXiv preprint arXiv:1902.05542, 2019.
  • Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105, 2018.