# Adversarial Bandits with Corruptions: Regret Lower Bound and No-regret Algorithm

NeurIPS 2020

Abstract

This paper studies adversarial bandits with corruptions. In the basic adversarial bandit setting, the reward of arms is predetermined by an adversary who is oblivious to the learner's policy. In this paper, we consider an extended setting in which an attacker sits in-between the environment and the learner, and is endowed with a limited budget …

Introduction

- Multi-armed bandits (MABs) [24] present a powerful online learning framework that is applicable to a broad range of application domains including medical trials, web search advertisement, datacenter design, and recommender systems; see, e.g., [5, 25] and references therein.
- In addition to introducing the above non-stochastic bandits with targeted corruptions, this paper investigates the vulnerability of attack-agnostic algorithms and establishes a regret lower bound for attack-aware algorithms.
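The interaction model summarized above — an attacker with a limited corruption budget sitting between the environment and the learner — can be made concrete with a minimal sketch. The loop below is the standard attack-agnostic Exp3 of Auer et al. [3]; the targeted attacker (`make_attack`, budget `PHI`) is a hypothetical illustration of budget-limited feedback corruption, not the paper's exact attack construction.

```python
import math
import random

def exp3(K, T, reward_fn, gamma, attack=None, seed=0):
    """Standard Exp3; `attack` optionally corrupts the *observed* reward,
    modeling an attacker sitting between environment and learner."""
    rng = random.Random(seed)
    w = [1.0] * K
    pulls = [0] * K
    for t in range(T):
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        i = rng.choices(range(K), weights=p)[0]
        x = reward_fn(t, i)                       # true reward in [0, 1]
        obs = x if attack is None else attack(t, i, x)
        xhat = obs / p[i]                         # importance-weighted estimate
        w[i] *= math.exp(gamma * xhat / K)
        m = max(w)                                # normalize to avoid overflow
        w = [wi / m for wi in w]
        pulls[i] += 1
    return pulls

# Hypothetical targeted attacker: spend a total budget PHI to zero out the
# good arm's observed reward, making a bad target arm look competitive.
def make_attack(good_arm, PHI):
    budget = [PHI]
    def attack(t, i, x):
        if i == good_arm and budget[0] >= x:
            budget[0] -= x                        # per-round corruption a(t) <= 1
            return 0.0
        return x
    return attack

pulls = exp3(K=2, T=2000,
             reward_fn=lambda t, i: 0.9 if i == 0 else 0.1,
             gamma=0.1, attack=make_attack(good_arm=0, PHI=500))
```

Because Exp3 trusts the observed (possibly corrupted) feedback, such an attacker can suppress the good arm's estimated reward for many rounds at bounded cost, which is the vulnerability the paper formalizes.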

Highlights

- Remark 2.1: There is a growing literature on oblivious attack models for stochastic bandit problems; see, e.g., [19, 10]. These papers target a middle ground, a mixed stochastic and adversarial model, and aim to achieve the best of both worlds. Unlike these works, our work focuses on targeted attack models for non-stochastic bandits, since an oblivious attacker is intrinsically captured in the basic setting of adversarial bandits
- Motivated by the recent interest in making online learning algorithms robust against manipulation attacks, this paper studies non-stochastic multi-armed bandit problems with targeted corruptions
- While there are several recent studies that focus on stochastic MAB problems with corruptions, to the best of our knowledge, this paper is the first that tackles non-stochastic MABs with targeted corruptions

Results

- The authors derive a regret lower bound (Theorem 3) for attack-aware algorithms for non-stochastic bandits with corruptions, as a function of the corruption budget Φ.
- The proof of this theorem, provided in §C of the supplementary material, constructs an instance of a stochastic bandit problem in which the reward of each arm follows a fixed and unknown distribution.
- Theorem 1 demonstrates that to develop a robust algorithm for non-stochastic bandits with corruptions, the algorithm must be provided with information about the attacker's existence and budget.
- The following result provides a lower bound on the regret of any attack-aware algorithm for non-stochastic bandits with a Φ-corrupted attacker.
- The high-level idea of robustification is two-fold: (i) the authors introduce a compensation variable δ(t) that augments the estimated reward of the selected arm and mitigates the risks of under- and overestimating the actual reward; and (ii) they introduce a robustness parameter γ, tunable based on the attacker's budget, that determines the learner's design space for biasing the estimated reward.
- The algorithmic nuggets of setting the compensation variable are as follows: (i) as in Line 5 of ExpRb, δ(t) is set only when p_{I_t}(t) < p_{I_t}, since otherwise the algorithm has already biased the estimated reward of I_t in previous rounds; (ii) δ(t) is capped at 1, since the attacker's corruption a(t) is at most 1; (iii) δ(t) is a function of γ, which determines how much bias is required; γ relates directly to the attacker's budget, i.e., the greater the budget, the greater the robustness parameter γ; and (iv) the larger the gap between p_{I_t}(t) and p_{I_t}, the larger δ(t).
- Remark 5.1: The result in Theorem 4 uses the modified definition of regret in Eq. (3), where the attacker corrupts the actual reward observed by the learner.
- Motivated by the recent interest in making online learning algorithms robust against manipulation attacks, this paper studies non-stochastic multi-armed bandit problems with targeted corruptions.
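As a concrete reading of properties (i)–(iv) above, here is a hedged sketch of a compensation rule. The paper's exact formula for δ(t) is not reproduced in this summary, so the functional form below, and the reference probability `p_ref`, are illustrative assumptions that merely satisfy the four listed properties.

```python
def delta(p_sel, p_ref, gamma):
    """Illustrative compensation variable delta(t); NOT the paper's exact rule.
    Encodes the four properties from the summary:
      (i)   nonzero only when the selected arm's probability p_sel falls
            below the reference level p_ref,
      (ii)  capped at 1, since the corruption a(t) is at most 1,
      (iii) increasing in the robustness parameter gamma,
      (iv)  increasing in the gap p_ref - p_sel."""
    if p_sel >= p_ref:
        return 0.0
    return min(1.0, gamma * (p_ref - p_sel) / p_ref)
```

For example, with `p_ref = 0.2`, the compensation vanishes when `p_sel >= 0.2`, grows as `p_sel` drops, scales with γ, and saturates at 1.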

Conclusions

- The paper first showed that existing attack-agnostic algorithms for non-stochastic bandits, e.g., Exp3, are vulnerable to targeted corruptions with a limited budget and fail to achieve sublinear regret.
- While several recent studies focus on stochastic MAB problems with corruptions, to the best of our knowledge this paper is the first to tackle non-stochastic MABs with targeted corruptions.

Tables

- Table 1: Summary of prior literature and this work

Funding

- Acknowledgments and Disclosure of Funding: Lin Yang and Wing Shing Wong acknowledge support from Schneider Electric, Lenovo Group (China) Limited, and the Hong Kong Innovation and Technology Fund (ITS/066/17FP) under the HKUST-MIT Research Alliance Consortium.
- Mohammad Hajiesmaili's research is supported by NSF CNS-1908298.
- Lui is supported in part by the GRF 14201819.
- Our work fits within the broad direction of research concerning safety issues in AI/ML at large.

References

- [1] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proc. of COLT, pages 217–226, 2009.
- [2] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
- [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- [4] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 45–53, 2004.
- [5] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- [6] R. Combes, M. S. Talebi Mazraeh Shahi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Proc. of NIPS, pages 2116–2124, 2015.
- [7] G. V. Cormack et al. Email spam filtering: A systematic review. Foundations and Trends® in Information Retrieval, 1(4):335–455, 2008.
- [8] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
- [9] Z. Feng, D. C. Parkes, and H. Xu. The intrinsic robustness of stochastic bandits to strategic manipulation. arXiv preprint arXiv:1906.01528, 2019.
- [10] A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Proc. of COLT, 2019.
- [11] A. György, T. Linder, and G. Ottucsak. The shortest path problem under partial monitoring. In G. Lugosi and H. U. Simon, editors, Learning Theory, volume 4005 of Lecture Notes in Computer Science, pages 468–482. Springer Berlin Heidelberg, 2006.
- [12] A. Heydari, M. ali Tavakoli, N. Salim, and Z. Heydari. Detection of review spam: A survey. Expert Systems with Applications, 42(7):3634–3642, 2015.
- [13] K.-S. Jun, L. Li, Y. Ma, and J. Zhu. Adversarial attacks on stochastic bandits. In Proc. of NIPS, pages 3640–3649, 2018.
- [14] W. Z. Khan, M. K. Khan, F. T. B. Muhaya, M. Y. Aalsalem, and H.-C. Chao. A comprehensive study of email spam botnet detection. IEEE Communications Surveys & Tutorials, 17(4).
- [15] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [16] Y. Li, E. Y. Lou, and L. Shan. Stochastic linear optimization with adversarial corruption. arXiv preprint arXiv:1909.02109, 2019.
- [17] F. Liu and N. Shroff. Data poisoning attacks on stochastic bandits. In Proc. of ICML, 2019.
- [18] M. Luca and G. Zervas. Fake it till you make it: Reputation, competition, and yelp review fraud. Management Science, 62(12):3412–3427, 2016.
- [19] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proc. of ACM STOC, pages 114–122, 2018.
- [20] Y. Ma, K.-S. Jun, L. Li, and X. Zhu. Data poisoning attacks in contextual bandits. In Proc. of GameSec, pages 186–204.
- [21] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Proc. of NIPS, pages 3168–3176, 2015.
- [22] G. Neu, A. Gyorgy, and C. Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813, 2012.
- [23] P. Ozisik and P. S. Thomas. Security analysis of safe and seldonian reinforcement learning algorithms. In In Advances in Neural Information Processing Systems, 2020.
- [24] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
- [25] A. Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
- [26] M. S. Talebi, Z. Zou, R. Combes, A. Proutiere, and M. Johansson. Stochastic online shortest path routing: The value of feedback. IEEE Transactions on Automatic Control, 63(4):915–930, 2017.
- [27] K. C. Wilbur and Y. Zhu. Click fraud. Marketing Science, 28(2):293–308, 2009.
- [28] X. Wu, Y. Dong, J. Tao, C. Huang, and N. V. Chawla. Reliable fake review detection via modeling temporal and behavioral patterns. In Proc. of IEEE Big Data, pages 494–499, 2017.
- [29] X. Zhang and X. Zhu. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.
- [30] P. Zhou, J. Xu, W. Wang, Y. Hu, D. O. Wu, and S. Ji. Toward optimal adaptive online shortest path routing with acceleration under jamming attack. IEEE/ACM Transactions on Networking, 27(5):1815–1829, 2019.
- [31] J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proc. of AISTATS, pages 467–475, 2019.
