Characterizing Optimal Mixed Policies: Where to Intervene and What to Observe

NeurIPS 2020.


Abstract:

Intelligent agents are continuously faced with the challenge of optimizing a policy based on what they can observe (see) and which actions they can take (do) in the environment where they are deployed. Most policies can be parametrized in terms of these two dimensions, i.e., as a function of what can be seen and done given a certain situation...

Introduction
  • Agents are deployed in complex and uncertain environments where they are bombarded with high volumes of information and are expected to operate efficiently, safely, and rationally.
  • Given an MPS S, there always exists a deterministic mixed policy π conforming to S whose expected reward attains the scope's optimum μ*_S.
  • Optimizing a mixed policy involves assessing the effectiveness of its scope, so that an agent can avoid intervening on unnecessary actions or observing unnecessary contexts (a minimal toy sketch of such scopes and their evaluation follows this list).
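To make the notion of a mixed policy scope concrete, below is a minimal, self-contained sketch (not the paper's formalism or notation): a toy SCM with an unobserved confounder U, an observable context Z, an action X, and a reward Y. The variable names, mechanisms, and the helper expected_reward are illustrative assumptions. The agent may leave X alone (pure observation), intervene on X without any context, or intervene on X as a function of the observed Z.

```python
import random

def expected_reward(policy=None, n=50_000, seed=0):
    """Monte-Carlo estimate of E[Y] in a toy SCM under a mixed policy.

    policy: maps the observed context tuple to a value for X, i.e., the agent
    performs do(X = policy(ctx)); policy=None leaves X to its natural,
    confounded mechanism (pure observation, no intervention).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.random() < 0.5                 # unobserved confounder
        z = int(u)                             # observable context (proxy for U)
        x = int(u) if policy is None else policy((z,))
        total += 1.0 if x == int(u) else 0.0   # reward: match the hidden state
    return total / n

# Scope A: intervene on X, observe nothing  -> best constant action.
best_blind = max(expected_reward(lambda _ctx: a) for a in (0, 1))
# Scope B: intervene on X, observe Z        -> context-dependent action.
context_aware = expected_reward(lambda ctx: ctx[0])
# No intervention at all: X follows its natural mechanism.
observe_only = expected_reward(None)

print(f"do(X=const): {best_blind:.2f}  do(X=pi(Z)): {context_aware:.2f}  "
      f"observe only: {observe_only:.2f}")
```

In this toy model the context-blind interventional scope caps out near 0.5 no matter how many samples are drawn, while the scope that also observes Z (and, here, even the purely observational behavior) attains 1.0, illustrating why the choice of where to intervene and what to observe matters before any amount of interaction can help.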
Highlights
  • Agents are deployed in complex and uncertain environments where they are bombarded with high volumes of information and are expected to operate efficiently, safely, and rationally
  • We studied the space of mixed policies that emerges through the empowerment of an agent to determine the mode it will interact with the environment — i.e., which variables to intervene on and which contexts it decides to look into
  • Facing new challenges in optimizing this new mode of interaction, which has many additional degrees of freedom, we studied the topological structure induced by the different mixed policies, which can in turn be leveraged to determine partial orders across the policy space w.r.t. the maximum expected rewards achievable (a small dominance-check sketch follows this list)
  • One of the surprising implications of this characterization provided here is that agents following a more standard approach may be hurting themselves, and may never be able to achieve an optimal performance regardless of the number of interactions performed
  • Our results provide a tool for AI engineers and researchers to identify where the inefficiency of a policy may be coming from, including potentially unintended side effects
  • It is not difficult to imagine that our work will share common problems with other automated decision-making tools and methods, such as (i) a system optimized for an ill-defined reward may cause harm through 'unknown unknowns', or (ii) optimization may become impossible due to the participation of adversarial players
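As a toy illustration of one such partial order (a hedged sketch under an assumed representation, not the paper's graphical criterion), the snippet below encodes an MPS as a mapping from each intervened action variable to the set of contexts it may consult, and checks a simple sufficient dominance condition: if two scopes intervene on exactly the same actions and one offers a superset of contexts for every action, every policy conforming to the smaller scope is also available under the larger one.

```python
# A hedged sketch: an MPS is represented as {action variable -> frozenset of contexts}.
# This encoding is an assumption made for illustration, not the paper's definition.
from typing import Dict, FrozenSet

MPS = Dict[str, FrozenSet[str]]

def weakly_dominates(s1: MPS, s2: MPS) -> bool:
    """True if every mixed policy conforming to s2 also conforms to s1,
    so the best expected reward under s1 is at least that under s2."""
    if set(s1) != set(s2):                  # must intervene on the same actions
        return False
    return all(s2[x] <= s1[x] for x in s2)  # s1 sees at least s2's contexts

s_small: MPS = {"X": frozenset({"W"})}
s_large: MPS = {"X": frozenset({"W", "Z"})}
assert weakly_dominates(s_large, s_small)      # extra context cannot hurt
assert not weakly_dominates(s_small, s_large)  # the converse does not hold
```

Scopes that intervene on different sets of actions are, in general, incomparable under this simple check, which is exactly where the paper's characterization of possibly-optimal scopes becomes necessary.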
Results
  • The authors characterize the non-redundancy of an MPS under optimality, which has practical implications for an agent adapting its suboptimal policy.
  • Given ⟨G, Y, X*, C*⟩, an MPS S is said to be non-redundant under optimality (NRO) if there exists an SCM M compatible with G such that, for every MPS S′ strictly subsumed by S, the optimal expected rewards satisfy μ*_S > μ*_{S′}.
  • Given a mixed policy π ∼ S optimal with respect to S, if there exist decision rules {π′(x | (x′ ∪ c′) ∩ c_X)}_{X∈X′} defined over a strictly subsumed scope whose induced value ∑_y y Q′_{x′}(y, c′) matches the optimal expected reward, then S is redundant and the corresponding actions or contexts can be dropped without loss.
  • This follows from the definition of non-redundancy under optimality and expected reward.
  • Given an MPS S, which satisfies non-redundancy (Thm. 1), let X′ ⊆ X(S) be the actions of interest and C′ ⊊ C_{X′} ∖ X′ a strict subset of the remaining contexts.
  • Given ⟨G, X*, C*, Y⟩, let S be a set of NRO MPSes. An MPS S ∈ S is said to be possibly-optimal if there exists an SCM compatible with G in which μ*_S attains the maximum expected reward over S.
  • Refining the space of MPSes: equipped with these characterizations, the authors can refine the space of MPSes (and hence the space of mixed policies) by filtering out MPSes that are either redundant or dominated by another MPS, eliciting a superset of the POMPSes in a given setting.
  • The authors investigate simplifying a mixed policy setting while preserving its POMPSes. First, one may think that the descendants of Y can be ignored, since neither does intervening on action variables among them change the reward, nor is observing contextualizable variables among them feasible (a minimal graph-side pruning sketch follows this list).
  • The authors studied the space of mixed policies that emerges when an agent is empowered to determine the mode in which it will interact with the environment, i.e., which variables to intervene on and which contexts to look into.
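The snippet below is a minimal graph-side sketch of the kind of pruning alluded to above (an illustration under stated assumptions, not the paper's algorithm): given a causal DAG, it enumerates candidate scopes while dropping action variables that are not ancestors of Y (intervening on them cannot change the reward) and, for each remaining action, contexts that are its descendants (which the agent could not have consulted before acting). The graph, variable names, and the networkx-based helpers are assumptions made for illustration.

```python
# A minimal sketch of graph-side pruning of candidate mixed policy scopes.
# Assumed representation: a scope is {action variable -> frozenset of contexts}.
from itertools import chain, combinations, product
import networkx as nx

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def candidate_scopes(G: nx.DiGraph, actions, contexts, reward="Y"):
    """Yield pruned candidate scopes as {action: frozenset(contexts)} dicts."""
    relevant = [x for x in actions if x in nx.ancestors(G, reward)]
    for xs in powerset(relevant):
        # For each chosen action, keep only contexts that are not its descendants.
        per_action = [
            [frozenset(cs) for cs in powerset(
                [c for c in contexts if c not in nx.descendants(G, x)])]
            for x in xs
        ]
        for choice in product(*per_action):
            yield dict(zip(xs, choice))

# Toy graph: Z -> X -> Y and W -> Y; S is a descendant of X (unusable as X's
# context); A has no directed path to Y, so intervening on it is pruned.
G = nx.DiGraph([("Z", "X"), ("X", "Y"), ("W", "Y"), ("X", "S"), ("A", "S")])
scopes = list(candidate_scopes(G, actions=["X", "A"], contexts=["Z", "W", "S"]))
print(len(scopes), "candidate scopes after pruning")   # 5 in this toy graph
```

In this toy graph, pruning already shrinks the candidate set to five scopes (including the empty one); the paper's non-redundancy and possible-optimality criteria would then be used to filter such a superset further down toward the POMPSes.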
Conclusion
  • Facing new challenges in optimizing this new mode of interaction, which has many additional degrees of freedom, the authors studied the topological structure induced by the different mixed policies, which can in turn be leveraged to determine partial orders across the policy space w.r.t. the maximum expected rewards achievable.
  • The authors provided a general characterization of the space of mixed policies with respect to properties that allow the agent to detect inefficient and suboptimal strategies.
  • The current work does not consider multiple adversarial participants, which is a subject of future research
Funding
  • Acknowledgments and Disclosure of Funding: this research is supported in part by grants from NSF (IIS-1704352 and IIS-1750807 (CAREER)).