# Invariant Causal Prediction for Block MDPs

International Conference on Machine Learning (ICML), 2020.

Abstract:

Generalization across environments is critical to the successful application of reinforcement learning algorithms to real-world challenges. In this paper, we consider the problem of learning abstractions that generalize in block MDPs, families of environments with a shared latent state space and dynamics structure over that latent space, ...

Introduction

- The canonical reinforcement learning (RL) problem assumes an agent interacting with a single MDP with a fixed observation space and dynamics structure.
- This assumption is difficult to ensure in practice, where state spaces are often large and infeasible to explore entirely during training.
- One could define a bisimulation relation that equates observations based on the locations and soil levels of dishes in the room and ignores the wallpaper.
- Such bisimulation relations can be used to simplify the state space for tasks like policy transfer (Castro and Precup, 2010), and are intimately tied to state abstraction.
- The model-irrelevance abstraction described by Li et al. (2006) is precisely characterized as a bisimulation relation.
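
To make the dish example concrete, the following is a minimal sketch of how bisimulation classes could be computed by partition refinement on a tiny finite MDP. The toy states, the `wash` and `wait` actions, the reward values, and the helper name `bisimulation_classes` are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def bisimulation_classes(states, actions, reward, transition):
    """Return a dict mapping each state to the id of its bisimulation class.

    reward[s][a] is a float; transition[s][a] maps next states to probabilities.
    """
    block_of = {s: 0 for s in states}  # start with every state in one block
    while True:
        signature = {}
        for s in states:
            per_action = []
            for a in actions:
                # Probability mass that (s, a) sends into each current block.
                mass = defaultdict(float)
                for s_next, p in transition[s][a].items():
                    mass[block_of[s_next]] += p
                rounded = tuple(sorted((b, round(m, 9)) for b, m in mass.items()))
                per_action.append((reward[s][a], rounded))
            # Keeping the current block in the signature makes each pass a refinement.
            signature[s] = (block_of[s], tuple(per_action))
        ids = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(ids) == len(set(block_of.values())):
            return refined  # no block was split, so a fixed point is reached
        block_of = refined


# Hypothetical toy MDP: the dish status matters, the wallpaper colour does not.
states = [(dish, wall) for dish in ("dirty", "clean") for wall in ("blue", "red")]
actions = ["wash", "wait"]
reward = {s: {"wash": 1.0 if s[0] == "dirty" else 0.0, "wait": 0.0} for s in states}
transition = {
    s: {"wash": {("clean", s[1]): 1.0}, "wait": {s: 1.0}} for s in states
}
classes = bisimulation_classes(states, actions, reward, transition)
print(classes)
```

In this toy setting, states that differ only in wallpaper colour end up in the same class, which is the sense in which the relation "ignores the wallpaper".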

Highlights

- Though we demonstrate that causal inference methods can be applied to reinforcement learning, this requires some assumptions about how causal mechanisms are observed.
- We evaluate both linear and non-linear versions of model-irrelevance state abstractions (MISA), in corresponding Block MDP settings with both linear and non-linear dynamics.
- We look to imitation learning in a rich observation setting (Section 7.2) and show that non-linear model-irrelevance state abstractions generalize to new camera angles.
- This work has demonstrated that, given certain assumptions, we can use causal inference methods in reinforcement learning to learn an invariant causal representation that generalizes across environments with a shared causal structure.
- We have provided a framework for defining families of environments, and methods, for both the low-dimensional linear value function approximation setting and the deep reinforcement learning setting, which leverage invariant prediction to extract a causal representation of the state.
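
The idea of invariant prediction mentioned in the highlights can be illustrated with a small simulation. This is a simplified sketch, not the paper's algorithm and not the statistical test of Peters et al. (2016): it assumes a toy linear model in which a hypothetical `cause` variable drives the target through a fixed mechanism, while a `spurious` variable is an effect of the target whose scale differs per environment. All names and constants are assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope of the one-variable least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def sample_environment(spurious_scale, n, rng):
    """Toy environment: `cause` drives `target`; `spurious` is an effect of `target`."""
    cause, spurious, target = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)
        t = 2.0 * c + rng.gauss(0.0, 0.1)             # mechanism shared by all environments
        s = spurious_scale * t + rng.gauss(0.0, 0.1)  # mechanism that changes per environment
        cause.append(c)
        spurious.append(s)
        target.append(t)
    return cause, spurious, target


rng = random.Random(0)
slopes = {}
for name, scale in (("env_a", 1.0), ("env_b", 3.0)):
    cause, spurious, target = sample_environment(scale, 2000, rng)
    slopes[name] = (ols_slope(cause, target), ols_slope(spurious, target))

# A predictor built on the causal variable keeps the same coefficient everywhere;
# one built on the spurious variable does not.
causal_gap = abs(slopes["env_a"][0] - slopes["env_b"][0])
spurious_gap = abs(slopes["env_a"][1] - slopes["env_b"][1])
print(f"causal slope gap:   {causal_gap:.3f}")
print(f"spurious slope gap: {spurious_gap:.3f}")
```

The design choice here is to compare per-environment regression coefficients directly; a real invariance test would instead check whether the residual distribution is the same across environments.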

Results

- The authors evaluate both linear and non-linear versions of MISA, in corresponding Block MDP settings with both linear and non-linear dynamics.
- The authors examine model error in environments with low-dimensional (Section 7.1.1) and high-dimensional (Section 7.1.2) observations and demonstrate the ability of MISA to generalize zero-shot to unseen environments.
- The authors look to imitation learning in a rich observation setting (Section 7.2) and show that non-linear MISA generalizes to new camera angles.
- The authors explore end-to-end reinforcement learning in the low-dimensional observation setting with correlated noise (Section 7.3) and again show generalization capabilities where single-task and multi-task methods fail.

Conclusion

- This work has demonstrated that given certain assumptions, the authors can use causal inference methods in reinforcement learning to learn an invariant causal representation that generalizes across environments with a shared causal structure.
- The authors have provided a framework for defining families of environments, and methods, for both the low-dimensional linear value function approximation setting and the deep RL setting, which leverage invariant prediction to extract a causal representation of the state.
- The authors see this paper as a first step towards the more significant problem of learning useful representations for generalization across a broader class of environments.
- Some examples of potential applications include third-person imitation learning, sim2real transfer, and, related to sim2real transfer, the use of privileged information in one task as grounding and generalization to new observation spaces (Salter et al., 2019).

Summary

## Objectives:

The goal of this work is to produce representations that will generalize from the training environments to a novel test environment.

- Table 1: A complete overview of the hyperparameters used.

Related work

- 8.1. Prior Work on Generalization Bounds

Generalization bounds provide guarantees on the test set error attained by an algorithm. Most of these bounds are probabilistic and targeted at the supervised setting, falling into the PAC (Probably Approximately Correct) framework. PAC bounds give probabilistic guarantees on a model’s true error as a function of its train set error and the complexity of the function class encoded by the model. Many measures of hypothesis class complexity exist: the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1971), the Lipschitz constant and classification margin of a neural network (Bartlett et al., 2017b), and second-order properties of the loss landscape (Neyshabur et al., 2019) are just a few of many.

Analogous techniques can be applied to Bayesian methods, giving rise to PAC-Bayes bounds (McAllester, 1999). This family of bounds can be optimized to yield nonvacuous bounds on the test error of over-parametrized neural networks (Dziugaite and Roy, 2017), and has demonstrated strong empirical correlation with model generalization (Jiang et al., 2020). More recently, Amit and Meir (2018) and Yin et al. (2019) introduced PAC-Bayes bounds for the multi-task setting that depend on the number of tasks seen at training time.
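
As a rough illustration of how such a bound trades off train error against complexity, the sketch below evaluates one commonly quoted McAllester-style PAC-Bayes form, in which the true error is bounded by the train error plus $\sqrt{(\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta)) / (2n)}$. Constants differ between statements of this bound, and this is not the bound derived in the paper; the function name and the example numbers are assumptions.

```python
import math


def pac_bayes_bound(train_error, kl_divergence, n_samples, delta=0.05):
    """Upper bound on the true error in a commonly quoted McAllester-style form.

    bound = train_error + sqrt((KL + ln(2 * sqrt(n) / delta)) / (2 * n))
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    log_term = math.log(2.0 * math.sqrt(n_samples) / delta)
    complexity = (kl_divergence + log_term) / (2.0 * n_samples)
    return train_error + math.sqrt(complexity)


# Hypothetical numbers: the same posterior evaluated with more data,
# and a posterior that sits further from the prior.
small = pac_bayes_bound(train_error=0.02, kl_divergence=5.0, n_samples=100)
large = pac_bayes_bound(train_error=0.02, kl_divergence=5.0, n_samples=10_000)
complex_posterior = pac_bayes_bound(train_error=0.02, kl_divergence=50.0, n_samples=10_000)
print(small, large, complex_posterior)
```

The bound tightens as the sample size grows and loosens as the KL divergence between posterior and prior increases.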

Funding

- MK has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 834115)

Reference

- Amit, R. and Meir, R. (2018). Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 205–214, Stockholmsmässan, Stockholm, Sweden. PMLR.
- Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant Risk Minimization. arXiv e-prints.
- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In Krause, A. and Dy, J., editors, 35th International Conference on Machine Learning, ICML 2018, pages 390–418. International Machine Learning Society (IMLS).
- Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv e-prints.
- Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4055–4065. Curran Associates, Inc.
- Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017a). Spectrally-normalized margin bounds for neural networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 6240–6249. Curran Associates, Inc.
- Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017b). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
- Bertsekas, D. and Castanon, D. (1989). Adaptive aggregation for infinite horizon dynamic programming. Automatic Control, IEEE Transactions on, 34:589 – 598.
- Borsa, D., Graepel, T., and Shawe-Taylor, J. (2016). Learning shared representations in multi-task reinforcement learning. CoRR, abs/1603.02041.
- Brunskill, E. and Li, L. (2013). Sample complexity of multi-task reinforcement learning. Uncertainty in Artificial Intelligence - Proceedings of the 29th Conference, UAI 2013.
- Castro, P. S. and Precup, D. (2010). Using bisimulation for policy transfer in MDPs. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
- Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. (2018). Quantifying generalization in reinforcement learning. CoRR, abs/1812.02341.
- D’Eramo, C., Tateo, D., Bonarini, A., Restelli, M., and Peters, J. (2020). Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations.
- Du, S. S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudík, M., and Langford, J. (2019). Provably efficient RL with rich observations via latent state decoding. CoRR, abs/1901.09018.
- Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data.
- Eberhardt, F. and Scheines, R. (2007). Interventions and causal inference. Philosophy of Science, 74(5):981–995.
- Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. (2019). DeepMDP: Learning continuous latent space models for representation learning. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2179, Long Beach, California, USA. PMLR.
- Givan, R., Dean, T. L., and Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artif. Intell., 147:163–223.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870, Stockholmsmässan, Stockholm, Sweden. PMLR.
- Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563–1600.
- Jiang, Y., Neyshabur, B., Krishnan, D., Mobahi, H., and Bengio, S. (2020). Fantastic generalization measures and where to find them. In International Conference on Learning Representations.
- Larsen, K. G. and Skou, A. (1989). Bisimulation through probabilistic testing (preliminary report). In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’89, pages 344–352, New York, NY, USA. Association for Computing Machinery.
- Lattimore, T. and Hutter, M. (2012). PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer.
- Li, L., Walsh, T., and Littman, M. (2006). Towards a unified theory of state abstraction for MDPs.
- Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2019). Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.
- McAllester, D. A. (1999). PAC-Bayesian model averaging. In COLT, volume 99, pages 164–170. Citeseer.
- Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2019). The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.
- Pearl, J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition.
- Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (with discussion), 78(5):947–1012.
- Roy, B. V. (2006). Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res., 31(2):234–244.
- Salter, S., Rao, D., Wulfmeier, M., Hadsell, R., and Posner, I. (2019). Attention privileged reinforcement learning for domain transfer.
- Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 459–466, Madison, WI, USA. Omnipress.
- Song, X., Jiang, Y., Tu, S., Du, Y., and Neyshabur, B. (2020). Observational overfitting in reinforcement learning. In International Conference on Learning Representations.
- Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM.
- Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. (2018). DeepMind control suite. Technical report, DeepMind.
- Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. (2017). Distral: Robust multitask reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4496–4506. Curran Associates, Inc.
- Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, Los Alamitos, CA, USA. IEEE Computer Society.
- Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280.
- Yarats, D. and Kostrikov, I. (2020). Soft actor-critic (sac) implementation in pytorch. https://github.com/denisyarats/pytorch_sac.
- Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. (2019). Meta-learning without memorization. arXiv preprint arXiv:1912.03820.
- Zhang, A., Wu, Y., and Pineau, J. (2018a). Natural environment benchmarks for reinforcement learning. CoRR, abs/1811.06032.
- Zhang, C., Vinyals, O., Munos, R., and Bengio, S. (2018b). A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893.
- Proposition 1 (Identifiability and Uniqueness of Causal State Abstraction). In the setting of the previous theorem, assume the transition dynamics and reward are linear functions of the current state. If the training environment set $\mathcal{E}_{\mathrm{train}}$ satisfies any of the conditions of Theorem 2 (Peters et al., 2016) with respect to each variable in $\mathrm{AN}(R)$, then the causal feature set $\phi_S$ is identifiable. Conversely, if the training environments do not contain sufficient interventions, then there may exist a $\phi$ that is a model-irrelevance abstraction over $\mathcal{E}_{\mathrm{train}}$ but not over $\mathcal{E}$ globally.
- Proof. The proof of the first statement follows immediately from the iterative application of the identifiability result of Peters et al. (2016) to each variable in the causal variables set.
- For the model learning experiments, we use an encoder architecture almost identical to that of Tassa et al. (2018), with two differences. First, we add two more convolutional layers to the convnet trunk. Second, we use ReLU activations after each convolutional layer instead of ELU. We use kernels of size 3 × 3 with 32 channels for all the convolutional layers and set the stride to 1 everywhere except for the first convolutional layer, which has stride 2. We then take the output of the convolutional net and feed it into a single fully-connected layer normalized by LayerNorm (Ba et al., 2016). Finally, we add a tanh nonlinearity to the 50-dimensional output of the fully-connected layer.
- For the reinforcement learning experiments, we modify the Soft Actor-Critic PyTorch implementation by Yarats and Kostrikov (2020) and augment it with a shared encoder between the actor and critic, the general model $f_s$, and task-specific models $f_{\eta_e}$. The forward models are multi-layer perceptrons with ReLU non-linearities and two hidden layers of 200 neurons each. The encoder is a linear layer that maps to a 50-dimensional hidden representation. We also use L1 regularization on the S latent representation. We add two additional dimensions to the state space: a spurious correlation dimension that is a multiplicative factor of the last dimension of the ground truth state, and an environment id. We add Gaussian noise $\mathcal{N}(0, 0.01)$ to the original state dimension, similar to how Arjovsky et al. (2019) incorporate noise in the label to make the task harder for the baseline.
- Soft Actor-Critic (SAC) (Haarnoja et al., 2018) is an off-policy actor-critic method that uses the maximum entropy framework to derive soft policy iteration. At each iteration, SAC performs soft policy evaluation and improvement steps. The policy evaluation step fits a parametric soft Q-function $Q(x_t, a_t)$ using transitions sampled from the replay buffer $\mathcal{D}$ by minimizing the soft Bellman residual, $J(Q) = \mathbb{E}_{(x_t, a_t, r_t, x_{t+1}) \sim \mathcal{D}}\left[\left(Q(x_t, a_t) - r_t - \gamma V(x_{t+1})\right)^2\right]$.
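
The soft Bellman residual described above can be sketched in a few lines. The snippet below assumes the usual squared form of the residual and replaces the neural Q-function and soft value function with hypothetical lookup tables, so it only illustrates the arithmetic of the objective, not the implementation used in the experiments. In SAC the soft value would itself be computed from a target Q-function and the policy entropy, which is omitted here.

```python
def soft_bellman_residual(batch, q_values, v_values, gamma):
    """Mean squared soft Bellman residual over a batch of transitions.

    batch: iterable of (x, a, r, x_next) tuples sampled from a replay buffer
    q_values: dict mapping (x, a) to Q(x, a)
    v_values: dict mapping x to the soft value V(x)
    """
    errors = [
        (q_values[(x, a)] - r - gamma * v_values[x_next]) ** 2
        for x, a, r, x_next in batch
    ]
    return sum(errors) / len(errors)


# Hypothetical two-state example with a single action.
batch = [(0, 0, 1.0, 1), (1, 0, 0.0, 1)]
q_values = {(0, 0): 1.5, (1, 0): 0.5}
v_values = {0: 1.0, 1: 0.5}
loss = soft_bellman_residual(batch, q_values, v_values, gamma=0.5)
print(loss)  # each residual is 0.25, so the mean squared residual is 0.0625
```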
