Security Analysis of Safe and Seldonian Reinforcement Learning Algorithms
NeurIPS 2020
We analyze the extent to which existing methods rely on accurate training data for a specific class of reinforcement learning (RL) algorithms, known as Safe and Seldonian RL. We introduce a new measure of security to quantify the susceptibility to perturbations in training data by creating an attacker model that represents a worst-case …
- Reinforcement learning (RL) algorithms have been proposed for many high-risk applications, such as improving type 1 diabetes and sepsis treatments [44; 18].
- The safety test first computes estimates of the expected performance of a new policy from training data, using importance sampling (IS).
- The authors propose a new algorithm that is more robust to anomalies in training data, ensuring safety with high probability when an upper bound on the number of adversarially corrupt data points is known.
- We focus on a worst-case setting, where an attacker modifies training data to maximize the probability that the Seldonian RL algorithm returns an unsafe policy
- In Corollary 1, we show that when k is chosen correctly, our algorithm meets a user-specified level of security against an attacker, whose optimal strategy does not change even if they know we are using Panacea
- When viewed as a Markov decision process (MDP), actions correspond to directions the agent can move, and states correspond to its current location on the grid
- Our results show that corrupting D collected from adult#003 within the Type 1 Diabetes Mellitus Simulator (T1DMS) can cause a Seldonian RL algorithm to select a bad policy, i.e., a new distribution over policies with lower return than πb
- We evaluated the number of trajectories that must be added to D before φ incorrectly returns unsafe policies with probability more than δ
- For a given k, the optimal attack is to create a trajectory that maximizes the value of the IS weight and return.
- The authors describe the algorithm Panacea, named after the Greek goddess of healing, that provides α-security, with a user-specified α, if the number of corrupt trajectories in D is upper bounded.
- Let k⋆ denote the number of adversarial trajectories added by the attacker, and k denote the input provided to Panacea by the user.
- When this value is upper bounded correctly, k = k⋆, and Panacea computes a clipping weight for the given estimator, denoted by c∗, using Table 1.
- If the number of adversarial trajectories in D is upper bounded, rewriting (6) in terms of c∗ and solving (6) for a clean expression for c∗, by substituting α, k, and L∗ with the user-specified inputs, yields the clipping weights found in Table 1 for different estimators.
- Instead of finding a trajectory with the largest IS weight and return, the optimal attack strategy is selecting a CR and CF such that the ratio of their joint probability under πe to that under πb is maximized, i.e., arg max_{CR∈[5,50], CF∈[1,31]} πe(CR, CF, θ3, θ4)/πb(CR, CF, θ1, θ2).
- For Panacea, the authors computed the clipping weights, found in Table 1 per estimator, using the user-specified α and the actual number of adversarial trajectories added by the attacker.
- When creating safe AI algorithms that can directly impact people’s lives, the authors argue we should ensure performance guarantees with high probability and develop metrics that evaluate the “quality” of training data, which often reflects systemic biases and human error.
- Table 1: α-security of current methods (center); settings for clipping weight, c, for α-security written in terms of a user-specified k and α (right). The minimum IS weight is denoted by i_min. By Theorem 1, the security of Panacea is
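The IS-based safety test summarized in the bullets above can be sketched in code. This is a minimal illustration, not the authors' implementation: returns are assumed normalized to [0, 1], the lower confidence bound uses Hoeffding's inequality with a user-supplied bound `b_max` on each per-trajectory IS return, and all names (`is_estimate`, `safety_test`, `j_baseline`) are our own.

```python
import math

def is_estimate(traj):
    """Ordinary importance sampling (IS) estimate for one trajectory.

    traj = (steps, G): steps is a list of (p_e, p_b) pairs giving the
    probability of each taken action under the evaluation policy πe and
    the behavior policy πb; G is the trajectory's normalized return.
    """
    steps, G = traj
    weight = 1.0
    for p_e, p_b in steps:
        weight *= p_e / p_b
    return weight * G

def safety_test(data, j_baseline, delta, b_max):
    """Pass (return True) iff a (1 - delta)-confidence Hoeffding lower
    bound on the IS estimate of πe's performance exceeds the baseline
    performance j_baseline. b_max bounds each per-trajectory IS return;
    in the worst case it grows exponentially with the horizon, which is
    exactly what an attacker can exploit."""
    n = len(data)
    mean = sum(is_estimate(t) for t in data) / n
    lower = mean - b_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return lower >= j_baseline
```

With clean data the bound is conservative; the paper's point is that a handful of crafted trajectories with huge IS weights can push the empirical mean, and hence the lower bound, above `j_baseline` even for an unsafe policy.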
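The worst-case attack described above, a trajectory that maximizes both the IS weight and the return, can be illustrated with a toy discrete-action sketch. The action names and probabilities are hypothetical; the paper's actual attack on T1DMS instead searches over the policy parameters CR and CF.

```python
def optimal_attack_trajectory(actions, pi_e, pi_b, horizon, g_max):
    """Worst-case attacker sketch: at every step take the action that
    maximizes the per-step likelihood ratio pi_e(a) / pi_b(a), and
    claim the maximum possible return g_max. Appending k copies of
    this trajectory maximally inflates the unclipped IS estimate."""
    best = max(actions, key=lambda a: pi_e[a] / pi_b[a])
    steps = [(pi_e[best], pi_b[best])] * horizon
    return steps, g_max
```

For example, with `pi_e = {'left': 0.9, 'right': 0.1}` and a uniform behavior policy, each step contributes a likelihood ratio of 1.8, so the trajectory's IS weight grows as 1.8^horizon.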
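Panacea's defense amounts to clipping each trajectory's importance weight at c∗ before averaging, so any single corrupt trajectory can shift the estimate by at most c∗ · G_max / n. The sketch below uses an arbitrary illustrative clipping value `c`; the paper's actual c∗ is computed from the user-specified α and k via the per-estimator formulas in Table 1, which are not reproduced here.

```python
def panacea_is_estimate(trajectories, c):
    """Clipped importance sampling: cap each trajectory's IS weight at
    c before averaging, bounding the influence of any one (possibly
    adversarial) trajectory on the estimate. Each trajectory is a
    (steps, G) pair of per-step (p_e, p_b) probabilities and a return."""
    n = len(trajectories)
    total = 0.0
    for steps, G in trajectories:
        weight = 1.0
        for p_e, p_b in steps:
            weight *= p_e / p_b
        total += min(weight, c) * G
    return total / n
```

An adversarial trajectory with an IS weight of 100 thus contributes no more than a weight-c honest trajectory would, which is what makes the attacker's optimal strategy ineffective once k is bounded correctly.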
- Our paper arguably falls into the broad body of work aimed at creating algorithms that can withstand uncertainty introduced at any stage of the RL framework. Some works view a component as an adversary with stochastic behavior. For an overview of risk-averse methods, e.g., those that incorporate stochasticity into the system, refer to the work of Garcia and Fernandez.
Other works incorporate an adversary to model worst-case scenarios. In a model-free setting, Morimoto and Doya introduced Robust RL, which models the environment as an adversary to address parameter uncertainty in MDPs. Pinto et al. extend this work to non-linear policies represented as neural networks. Lim et al. consider MDPs with some adversarial state-action pairs. In model-based settings, i.e., those that build an explicit model of the system, although an adversary is not present, worst-case analyses assume different components of an MDP are unknown [34; 46; 39]. Using the definition of safety we have introduced, Ghavamzadeh et al. and Laroche et al. created algorithms for learning safe policies in a model-based setting. Although we are also interested in ensuring security for these approaches, we focus on a model-free setting that requires a different set of assumptions and attacker model.
- This work was supported in part by NSF Award #2018372 and a gift from Adobe
- Research reported in this paper was also sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)
- Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F. Styler IV, Colin Warner, Jena D. Hwang, Jinho D. Choi, Dmitriy Dligach, Rodney D. Nielsen, James Martin, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20(5):922–930, 2013.
- Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
- Meysam Bastani. Model-free intelligent diabetes management using machine learning. M.S. Thesis, University of Alberta, 2014.
- Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
- Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- Javier Garcia and Fernando Fernandez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Zhaohan Guo, Philip S. Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2492–2501, 2017.
- Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. arXiv preprint arXiv:1902.08647, 2019.
- Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- Aaron Holmes. Hackers have become so sophisticated that nearly 4 billion records have been stolen from people in the last decade alone. Here are the 10 biggest data breaches of the 2010s., 2019. URL https://www.businessinsider.com/biggest-hacks-2010s-facebook-equifax-adobe-marriott-2019-10. Accessed May 1, 2020.
-  Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
-  Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
-  Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
-  Matthieu Komorowski, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24 (11):1716, 2018.
-  Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
-  Boris P. Kovatchev, Marc Breton, Chiara Dalla Man, and Claudio Cobelli. In silico preclinical trials: A proof of concept in closed-loop control of type 1 diabetes, 2009.
-  Ilja Kuzborskij, Claire Vernade, András György, and Csaba Szepesvári. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.
-  Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661, 2019.
-  Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust markov decision processes. In Advances in Neural Information Processing Systems, pages 701–709, 2013.
-  Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163.
-  Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. arXiv preprint arXiv:1905.06494, 2019.
-  Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.
-  A. Rupam Mahmood, Hado P. Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 3014–3022, 2014.
-  Chiara Dalla Man, Francesco Micheletto, Dayu Lv, Marc Breton, Boris Kovatchev, and Claudio Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.
-  Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.
-  Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
-  Blossom Metevier, Stephen Giguere, Sarah Brockman, Ari Kobren, Yuriy Brun, Emma Brunskill, and Philip S Thomas. Offline contextual bandits with high probability fairness guarantees. In Advances in Neural Information Processing Systems, pages 14893–14904, 2019.
-  Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.
-  Lily Hay Newman. The Worst Hacks of the Decade, 2019. URL https://www.wired.com/story/worst-hacks-of-the-decade/. Accessed May 1, 2020.
-  Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
-  Jonathan Petit, Bas Stottelaar, Michael Feiri, and Frank Kargl. Remote attacks on automated vehicles sensors: Experiments on camera and lidar. Black Hat Europe, 11:2015, 2015.
-  Lerrel Pinto, James Davidson, and Abhinav Gupta. Supervision via competition: Robot adversaries for learning tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1601–1608. IEEE, 2017.
-  Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR.org, 2017.
-  Doina Precup. Temporal abstraction in reinforcement learning. 2001.
-  Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
-  Rajneesh Sharma and Madan Gopal. A robust markov game controller for nonlinear systems. Applied Soft Computing, 7(3):818–827, 2007.
-  P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, 2015.
-  Philip S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Libraries, 2015.
-  Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
-  Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, 2019.
-  The DNC email leaks expose the rampant elitism that made way for Trump, 2016. URL https://qz.com/742394/the-dnc-email-leaks-expose-the-rampant-elitism-that-made-way-for-trump/. Accessed May 1, 2020.
-  Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
-  Jinyu Xie. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose. Accessed May 1, 2020.
-  Lin Yang, Mohammad H. Hajiesmaili, M. Sadegh Talebi, John C. S. Lui, and Wing S. Wong. Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. In Advances in Neural Information Processing Systems, 2020.
-  Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475. PMLR, 2019.