# Weakly-Supervised Reinforcement Learning for Controllable Behavior

NeurIPS, 2020.

Abstract:

Reinforcement learning (RL) is a powerful framework for learning to take actions to solve tasks. However, in many settings, an agent must winnow down the inconceivably large space of all possible tasks to the single task that it is currently being asked to solve. Can we instead constrain the space of tasks to those that are semantically meaningful?


Introduction

- A general purpose agent must be able to efficiently learn a diverse array of tasks through interacting with the real world.
- The authors aim to accelerate the acquisition of goal-conditioned policies by narrowing the goal space through weak supervision.
- Answering this question would allow an RL agent to prioritize exploring and learning meaningful tasks, resulting in faster acquisition of behaviors for solving human-specified tasks.

Highlights

- A general purpose agent must be able to efficiently learn a diverse array of tasks through interacting with the real world
- We proposed weak supervision as a means to scalably introduce structure into goal-conditioned reinforcement learning
- To leverage the weak supervision, we proposed a simple two-phase approach that first learns a disentangled representation, then uses it to guide exploration, propose goals, and inform a distance metric
- Our experimental results indicate that our approach, weakly-supervised control (WSC), substantially outperforms self-supervised methods that cannot cope with the breadth of the environments
- Our comparisons suggest that our disentanglement-based approach is critical for effectively leveraging the weak supervision
- While WSC has the ability to leverage weak labels that can be collected offline with approaches like crowd compute, WSC requires a user to indicate the factors of variation that are relevant for downstream tasks, which may require expertise
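The weak labels mentioned above are pairwise comparisons between states along user-chosen factors of variation (following Shu et al., 2019). A minimal sketch of how such a label could be produced, assuming states are represented here by their ground-truth factor vectors (in practice the states are images and only a human labeller can judge the factors); `weak_label` and the example factor values are illustrative assumptions, not the authors' code:

```python
def weak_label(s1, s2, relevant_factors):
    """Rank-pair weak label: for each user-chosen factor index k,
    y[k] = 1 if state s1 has the larger value of factor k, else 0."""
    return [1 if s1[k] > s2[k] else 0 for k in relevant_factors]

# Hypothetical factors: (object_x, object_y, lighting). The user marks only
# object_x and object_y as task-relevant, so lighting is never labelled.
s1 = (0.8, 0.1, 0.5)
s2 = (0.2, 0.4, 0.9)
print(weak_label(s1, s2, relevant_factors=[0, 1]))  # [1, 0]
```

This illustrates the limitation stated above: a user must decide which factor indices are relevant, which may require expertise.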

Methods

- The authors sometimes relabel the transition with a corrected goal z_g, sampled either from the goal distribution p(Z_I) in Eq 2 or from a future state in the current trajectory.
- The reward is the negative ℓ2-distance in the disentangled latent space: r_t := R_{z_g}(s_t) := −‖e_I(s_t) − z_g‖₂².
- The authors' method outputs a goal-conditioned policy π(a | s, z_g) which is trained to reach a state that is close to z_g in the disentangled latent space.
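The relabelling step and the latent-distance reward above can be sketched together. The following is a toy illustration, not the paper's implementation: states are assumed to be already encoded into the disentangled latent space, and `latent_distance_reward`, `relabel_goal`, and `p_future` are assumed names:

```python
import random

def latent_distance_reward(z_state, z_goal):
    """Negative squared L2 distance in the disentangled latent space:
    r_t = -||e_I(s_t) - z_g||_2^2 (both arguments are already encoded)."""
    return -sum((a - b) ** 2 for a, b in zip(z_state, z_goal))

def relabel_goal(transition, traj_latents, t, sample_latent_goal, p_future=0.5):
    """Hindsight relabelling: with probability p_future, replace the goal with
    the latent of a future state from the same trajectory; otherwise draw a
    fresh goal from the latent goal distribution p(Z_I)."""
    s, a, _ = transition
    if random.random() < p_future:
        z_g = random.choice(traj_latents[t:])
    else:
        z_g = sample_latent_goal()
    return (s, a, z_g)

# Reward is maximal (zero) when the encoded state coincides with the goal.
assert latent_distance_reward([0.5, 0.5], [0.5, 0.5]) == 0
assert latent_distance_reward([1.0, 0.0], [0.0, 0.0]) == -1.0
```

Relabelling reached future states as goals provides a dense learning signal even when sampled goals are rarely achieved, which is why the two goal sources are mixed.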

Conclusion

- The authors proposed weak supervision as a means to scalably introduce structure into goal-conditioned reinforcement learning.
- To leverage the weak supervision, the authors proposed a simple two-phase approach that first learns a disentangled representation, then uses it to guide exploration, propose goals, and inform a distance metric.
- Incorporating weak supervision online, in the loop of RL, could address this limitation and further improve performance.
- In such settings, the authors expect class imbalance and human-in-the-loop learning to present important, but surmountable challenges

Summary

## Objectives:

- The authors aim to accelerate the acquisition of goal-conditioned policies by narrowing the goal space through weak supervision.
- Unlike standard RL, which requires hand-designed reward functions that are often expensive to obtain in complex environments, the authors design the weakly-supervised RL problem to provide a convenient form of supervision that scales with the collection of offline data.
- The experiments first and foremost address the core hypothesis: (1) Does weakly-supervised control help guide exploration and learning, yielding increased performance over prior approaches? They further investigate: (2) What is the relative importance of the goal-generation mechanism vs. the distance metric used in WSC? (3) Is weak supervision necessary for learning a disentangled state representation? (4) Is the policy’s behavior interpretable? (5) How much weak supervision is needed to learn a sufficiently-disentangled state representation? Questions (1) through (4) are investigated in the main text, while question (5) is studied in Appendix A.1

- Table1: Conceptual comparison between our method, weakly-supervised control (WSC), and prior visual goal-conditioned RL methods, with their respective latent goal distributions p(Z) and goal-conditioned reward functions R_{z_g}(s). Our method can be seen as an extension of prior work to the weakly-supervised setting
- Table2: Is the learned state representation disentangled? We measure the correlation between the true factor value of the input image vs. the latent dimension of the encoded image on the evaluation dataset. We show the 95% confidence interval over 5 seeds. We find that unsupervised VAEs are often insufficient for learning a disentangled representation
- Table3: Is the learned policy interpretable? We investigate whether latent goals zg align directly with the final state of the trajectory after rolling out π(a | s, zg). We measure the correlation between the true factor value of the final state in the trajectory vs. the corresponding latent dimension of zg. We show the 95% confidence interval over 5 seeds. Our method attains higher correlation between latent goals and final states, meaning that it learns a more interpretable goal-conditioned policy
- Table4: How many weak labels are needed to learn a sufficiently-disentangled state representation? We trained disentangled representations on varying numbers of weakly-labelled data samples {(s_1^(i), s_2^(i), y^(i))}_{i=1}^N with N ∈ {128, 256, ..., 4096}, then evaluated how well they disentangled the true factors of variation in the data. On the evaluation dataset, we measure the Pearson correlation between the true factor value of the input image vs. the latent dimension of the encoded image. For the VAE (obtained from SkewFit), we took the latent dimension that has the highest correlation with the true factor value. We report the 95% confidence interval over 5 seeds. Even with a small amount of weak supervision (e.g. around 1024 labels), we are able to attain a representation with good disentanglement
- Table5: Environment-specific hyperparameters: M is the number of training images. “WSC p_goal” is the percentage of goals that are relabelled with z_g ∼ p(Z_I) in WSC. α_DR is the VAE reward coefficient for SkewFit+DR in Eq 4
- Table6: Disentangled representation model architecture: We slightly modified the disentangled model architecture from Shu et al. (2019)
- Table7: VAE architecture & hyperparameters: β is the KL regularization coefficient in the β-VAE loss. We found that a smaller VAE latent dim L_VAE ∈ {4, 16} worked best for SkewFit, RIG, and HER (which use the VAE for both hindsight relabelling and for the actor & critic)
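Tables 2, 3, and 4 all rely on the same measurement: the Pearson correlation between a true factor value and a single latent dimension across the evaluation set. A self-contained sketch of that metric (the toy factor and latent values below are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A well-disentangled latent dimension tracks its factor almost linearly,
# giving a correlation near 1; an entangled one correlates weakly.
true_factor = [0.1, 0.4, 0.5, 0.9]
latent_dim = [0.2, 0.8, 1.0, 1.8]  # roughly 2 * true_factor
print(pearson(true_factor, latent_dim))
```

For the unsupervised VAE baseline, no latent dimension is tied to a factor in advance, which is why Table 4 takes the dimension with the highest such correlation.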

Related work

- Reinforcement learning of complex behaviors in rich environments with high-dimensional observations remains an open problem. Many of the successful applications of RL in prior work Silver et al (2017); Berner et al (2019); Vinyals et al (2019); Gu et al (2017) effectively operate in a regime where the amount of data (i.e., interactions with the environment) dwarfs the complexity of the task at hand. Insofar as alternative forms of supervision are the key to success for RL methods, prior work has proposed a number of techniques for making use of various types of ancillary supervision.

A number of prior works incorporate additional supervision beyond rewards to accelerate RL. One common theme is to use the task dynamics itself as supervision, using either forward dynamics (Watter et al, 2015; Finn & Levine, 2017; Hafner et al, 2018; Zhang et al, 2018; Kaiser et al, 2019), some function of forward dynamics (Dosovitskiy & Koltun, 2016), or inverse dynamics (Pathak et al, 2017; Agrawal et al, 2016; Pathak et al, 2018) as a source of labels. Another approach explicitly predicts auxiliary labels (Jaderberg et al, 2016; Shelhamer et al, 2016; Gordon et al, 2018; Dilokthanakul et al, 2019). Compact state representations can also allow for faster learning and planning, and prior work has proposed a number of tools for learning these representations (Mahadevan, 2005; Machado et al, 2017; Finn et al, 2016b; Barreto et al, 2017; Nair et al, 2018; Gelada et al, 2019; Lee et al, 2019a; Yarats et al, 2019). Bengio et al (2017); Thomas et al (2017) propose learning representations using an independent controllability metric, but the joint RL and representation learning scheme has proven difficult to scale in environment complexity. Perhaps most related to our method is prior work that directly learns a compact representation of goals (Goyal et al, 2019; Pong et al, 2019; Nachum et al, 2018). Our work likewise learns a low-dimensional representation of goals, but crucially learns it in such a way that we “bake in” a bias towards meaningful goals, thereby avoiding the problem of accidentally discarding salient state dimensions.

Funding

- LL is supported by the National Science Foundation (DGE1745016)
- BE is supported by the Fannie and John Hertz Foundation and the National Science Foundation (DGE1745016)
- RS is supported by NSF IIS1763562, ONR Grant N000141812861, and US Army

Reference

- Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in neural information processing systems, pp. 5074–5082, 2016.
- Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
- Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.
- Bengio, E., Thomas, V., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1703.07718, 2017.
- Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Brown, D. S., Goo, W., Nagarajan, P., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. arXiv preprint arXiv:1904.06387, 2019.
- Chen, J. and Batmanghelich, K. Weakly supervised disentanglement by pairwise similarities. arXiv preprint arXiv:1906.01044, 2019.
- Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610– 2620, 2018.
- Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307, 2017.
- Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., and Finn, C. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019.
- Dilokthanakul, N., Kaplanis, C., Pawlowski, N., and Shanahan, M. Feature control as intrinsic motivation for hierarchical reinforcement learning. IEEE transactions on neural networks and learning systems, 30(11):3409–3418, 2019.
- Dosovitskiy, A. and Koltun, V. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
- Duan, Y., Andrychowicz, M., Stadie, B., Ho, O. J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098, 2017.
- Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.
- Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. IEEE, 2017.
- Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016a.
- Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519. IEEE, 2016b.
- Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
- Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
- Fu, J., Singh, A., Ghosh, D., Yang, L., and Levine, S. Variational inverse control with events: A general framework for datadriven reward definition. In Advances in Neural Information Processing Systems, pp. 8538–8547, 2018.
- Gabbay, A. and Hoshen, Y. Latent optimization for nonadversarial representation disentanglement. arXiv preprint arXiv:1906.11796, 2019.
- Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736, 2019.
- Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. Conference on Robot Learning (CoRL), 2019.
- Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., and Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pp. 80–89. IEEE, 2018.
- Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., and Farhadi, A. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098, 2018.
- Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H., Levine, S., and Bengio, Y. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
- Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. IEEE, 2017.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
- Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
- Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. 2018.
- Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
- Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. Iclr, 2(5):6, 2017.
- Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
- Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
- Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
- Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
- Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
- Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.
- Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019a.
- Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019b.
- Locatello, F., Bauer, S., Lucic, M., Ratsch, G., Gelly, S., Scholkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
- Machado, M. C., Bellemare, M. G., and Bowling, M. A laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2295–2304. JMLR. org, 2017.
- Mahadevan, S. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pp. 553–560. ACM, 2005.
- Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790, 2018.
- Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
- Nachum, O., Gu, S., Lee, H., and Levine, S. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018.
- Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191– 9200, 2018.
- Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiositydriven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
- Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053, 2018.
- Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
- Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
- Pugh, J. K., Soros, L. B., and Stanley, K. O. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.
- Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736, 2006.
- Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.
- Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
- Shu, R., Chen, Y., Kumar, A., Ermon, S., and Poole, B. Weakly supervised disentanglement with guarantees. arXiv preprint arXiv:1910.09772, 2019.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
- Singh, A., Yang, L., Hartikainen, K., Finn, C., and Levine, S. End-to-end robotic reinforcement learning without reward engineering. arXiv preprint arXiv:1904.07854, 2019.
- Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary computation, 10 (2):99–127, 2002.
- Thomas, V., Pondard, J., Bengio, E., Sarfati, M., Beaudoin, P., Meurs, M.-J., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
- van Steenkiste, S., Locatello, F., Schmidhuber, J., and Bachem, O. Are disentangled representations helpful for abstract visual reasoning? In Advances in Neural Information Processing Systems, pp. 14222–14235, 2019.
- Vecerik, M., Sushkov, O., Barker, D., Rothorl, T., Hester, T., and Scholz, J. A practical approach to insertion with variable socket position using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 754–760. IEEE, 2019.
- Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350– 354, 2019.
- Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.
- Xie, A., Singh, A., Levine, S., and Finn, C. Few-shot goal inference for visuomotor learning and planning. arXiv preprint arXiv:1810.00482, 2018.
- Yarats, D., Zhang, A., Kostrikov, I., Amos, B., Pineau, J., and Fergus, R. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019.
- Yu, T., Shevchuk, G., Sadigh, D., and Finn, C. Unsupervised visuomotor control through distributional planning networks. arXiv preprint arXiv:1902.05542, 2019.
- Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105, 2018.
