Adversarial Bandits with Corruptions: Regret Lower Bound and No-regret Algorithm

NeurIPS 2020

Abstract

This paper studies adversarial bandits with corruptions. In the basic adversarial bandit setting, the reward of arms is predetermined by an adversary who is oblivious to the learner’s policy. In this paper, we consider an extended setting in which an attacker sits in-between the environment and the learner, and is endowed with a limited budget…
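
Stated symbolically, a minimal formalization of this corrupted-feedback setting, assuming the standard adversarial-bandit notation (the paper's own regret definition, referred to later as Eq. (3), is not reproduced in this summary, so the exact form of the corruption and of the benchmark are assumptions):

```latex
% x_i(t) \in [0,1]: reward of arm i at round t, fixed by an oblivious adversary
% I_t: arm pulled by the learner, which observes only the corrupted reward
% a(t): attacker's corruption at round t, of magnitude at most 1, with total budget \Phi
\tilde{x}_{I_t}(t) = x_{I_t}(t) - a(t), \qquad |a(t)| \le 1, \qquad \sum_{t=1}^{T} |a(t)| \le \Phi,
\qquad
R(T) = \max_{i \in [K]} \sum_{t=1}^{T} x_i(t) \;-\; \mathbb{E}\left[\sum_{t=1}^{T} x_{I_t}(t)\right].
```

Under this reading, the learner's regret is still measured against the true (uncorrupted) rewards of the best fixed arm, while its feedback, and hence its reward estimates, can be distorted by up to Φ in total.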

Introduction
  • Multi-armed bandits (MABs) [24] present a powerful online learning framework that is applicable to a broad range of application domains including medical trials, web search advertisement, datacenter design, and recommender systems; see, e.g., [5, 25] and references therein.
  • In addition to introducing the above non-stochastic bandits with targeted corruptions, this paper investigates the vulnerability of attack-agnostic algorithms and establishes a regret lower bound for attack-aware algorithms.
Highlights
  • Multi-armed bandits (MABs) [24] present a powerful online learning framework that is applicable to a broad range of application domains including medical trials, web search advertisement, datacenter design, and recommender systems; see, e.g., [5, 25] and references therein
  • In addition to introducing the above non-stochastic bandits with targeted corruptions, this paper investigates the vulnerability of attack-agnostic algorithms and establishes a regret lower bound for attack-aware algorithms
  • Remark 2.1: We mention that there is a growing literature on oblivious attack models for stochastic bandit problems; see, e.g., [19, 10]. These papers target a middle ground, a mixed stochastic-and-adversarial model that aims to achieve the best of both worlds. Different from these works, our work focuses on targeted attack models for non-stochastic bandits, since an oblivious attacker is intrinsically captured in the basic setting of adversarial bandits.
  • Motivated by the recent interest in making online learning algorithms robust against manipulation attacks, this paper studied non-stochastic multi-armed bandit problems with targeted corruptions.
  • While there are several recent studies on stochastic MAB problems with corruptions, to the best of our knowledge, this paper is the first to tackle non-stochastic MABs with targeted corruptions.
Results
  • The authors derive a regret lower bound (Theorem 3) for attack-aware algorithms for non-stochastic bandits with corruptions as a function of the corruption budget Φ.
  • The proof of this theorem, provided in §C of the supplementary material, constructs an instance of a stochastic bandit problem and considers a setting in which the reward of each arm follows a fixed and unknown distribution.
  • Theorem 1 demonstrates that, to develop a robust algorithm for non-stochastic bandits with corruptions, it is inevitable that the algorithm be provided with information about the existence and the budget of the attacker.
  • The following result provides a lower bound on the regret of any attack-aware algorithm for non-stochastic bandits with a Φ-corrupted attacker.
  • The high-level idea of the robustification is two-fold: (i) the authors introduce a compensation variable δ(t) to augment the estimated reward of the selected arm and mitigate the risks of underestimating or overestimating the actual reward; and (ii) they introduce a robustness parameter γ that can be tuned based on the budget of the attacker and determines the learner's design space in biasing the estimated reward.
  • The algorithmic nuggets of setting the compensation variable are as follows (see also the illustrative sketch after this list): (i) as in Line 5 of ExpRb, δ(t) is set only when p_{I_t}(t) < p_{I_t}, since otherwise the algorithm has already biased the estimated reward of I_t in previous rounds; (ii) δ(t) is capped at 1, since the value of a(t), i.e., the attacker's corruption, is at most 1; (iii) δ(t) is a function of γ, which determines how much bias is required; γ has a direct relationship to the budget of the attacker, i.e., the greater the attacker's budget, the greater the robustness parameter γ; and (iv) the larger the difference between p_{I_t}(t) and p_{I_t}, the greater δ(t).
  • Remark 5.1: The result in Theorem 4 uses the modified definition of regret in Eq. (3), where the attacker corrupts the actual reward observed by the learner.
  • Motivated by the recent interest in making online learning algorithms robust against manipulation attacks, this paper studied non-stochastic multi-armed bandit problems with targeted corruptions.
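
The exact ExpRb update rule is not reproduced in this summary, so the sketch below only illustrates how the ingredients listed above fit into an Exp3-style learner: an importance-weighted reward estimate, a compensation variable δ(t) added only when the selection probability of the played arm drops below a reference value, a cap of 1 on δ(t), and a robustness parameter γ scaling the bias. The reference probability p_ref, the functional form of δ(t), the toy targeted attacker, and all parameter values are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def exprb_sketch(rewards, Phi, gamma=0.2, eta=0.05, explore=0.05, seed=0):
    """Illustrative Exp3-style learner with a compensation variable delta(t).

    rewards : (T, K) array of true rewards in [0, 1], fixed by an oblivious adversary.
    Phi     : corruption budget of the targeted attacker.
    gamma   : robustness parameter scaling the compensation bias (assumed form).
    eta     : learning rate of the exponential-weights update.
    explore : uniform-exploration mixing rate, keeping 1/p bounded.
    """
    rng = np.random.default_rng(seed)
    T, K = rewards.shape
    log_w = np.zeros(K)                         # log-weights, for numerical stability
    p_ref = np.zeros(K)                         # hypothetical per-arm reference probability
    budget = float(Phi)
    target = int(rewards.sum(axis=0).argmax())  # arm the toy attacker tries to suppress
    collected = 0.0                             # true reward accumulated by the learner

    for t in range(T):
        w = np.exp(log_w - log_w.max())
        p = (1.0 - explore) * w / w.sum() + explore / K
        arm = int(rng.choice(K, p=p))
        collected += rewards[t, arm]

        # Toy targeted attacker: shave the observed reward of the target arm,
        # by at most 1 per round, until the budget Phi is exhausted.
        obs = rewards[t, arm]
        if arm == target and budget > 0.0:
            a = min(obs, 1.0, budget)
            obs -= a
            budget -= a

        # Compensation variable delta(t): added only when the current selection
        # probability of the played arm is below its reference value; capped at 1,
        # increasing in gamma and in the gap p_ref - p (assumed functional form).
        delta = 0.0
        if p[arm] < p_ref[arm]:
            delta = min(1.0, gamma * (p_ref[arm] - p[arm]) / p[arm])
        p_ref[arm] = max(p_ref[arm], p[arm])

        # Importance-weighted estimate of the (compensated) observed reward.
        x_hat = (obs + delta) / p[arm]
        log_w[arm] += eta * x_hat

    return rewards.sum(axis=0).max() - collected   # regret against the best fixed arm
```

Setting gamma=0 recovers a plain, attack-agnostic exponential-weights learner, which is the regime the paper shows can be forced into linear regret by a targeted attacker with limited budget; choosing a positive gamma in proportion to Φ corresponds to the attack-aware regime analyzed for ExpRb.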
Conclusion
  • It first showed that existing attack-agnostic algorithms for non-stochastic bandits, e.g., Exp3, are vulnerable to targeted corruptions with a limited budget and fail to achieve sublinear regret.
  • While there are several recent studies on stochastic MAB problems with corruptions, to the best of our knowledge, this paper is the first to tackle non-stochastic MABs with targeted corruptions.
Tables
  • Table 1: Summary of prior literature and this work
Funding
  • Lin Yang and Wing Shing Wong acknowledge the support from Schneider Electric, Lenovo Group (China) Limited, and the Hong Kong Innovation and Technology Fund (ITS/066/17FP) under the HKUST-MIT Research Alliance Consortium.
  • Mohammad Hajiesmaili’s research is supported by NSF CNS-1908298
  • Lui is supported in part by the GRF 14201819. Our work fits within the broad direction of research concerning safety issues in AI/ML at large.
References
  • [1] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proc. of COLT, pages 217–226, 2009.
  • [2] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
  • [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • [4] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proc. of ACM STOC, pages 45–53, 2004.
  • [5] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • [6] R. Combes, M. S. Talebi Mazraeh Shahi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Proc. of NIPS, pages 2116–2124, 2015.
  • [7] G. V. Cormack et al. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335–455, 2008.
  • [8] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • [9] Z. Feng, D. C. Parkes, and H. Xu. The intrinsic robustness of stochastic bandits to strategic manipulation. arXiv preprint arXiv:1906.01528, 2019.
  • [10] A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Proc. of COLT, 2019.
  • [11] A. György, T. Linder, and G. Ottucsak. The shortest path problem under partial monitoring. In G. Lugosi and H. U. Simon, editors, Learning Theory, volume 4005 of Lecture Notes in Computer Science, pages 468–482. Springer Berlin Heidelberg, 2006.
  • [12] A. Heydari, M. ali Tavakoli, N. Salim, and Z. Heydari. Detection of review spam: A survey. Expert Systems with Applications, 42(7):3634–3642, 2015.
  • [13] K.-S. Jun, L. Li, Y. Ma, and J. Zhu. Adversarial attacks on stochastic bandits. In Proc. of NIPS, pages 3640–3649, 2018.
  • [14] W. Z. Khan, M. K. Khan, F. T. B. Muhaya, M. Y. Aalsalem, and H.-C. Chao. A comprehensive study of email spam botnet detection. IEEE Communications Surveys & Tutorials, 17(4).
  • [15] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • [16] Y. Li, E. Y. Lou, and L. Shan. Stochastic linear optimization with adversarial corruption. arXiv preprint arXiv:1909.02109, 2019.
  • [17] F. Liu and N. Shroff. Data poisoning attacks on stochastic bandits. In Proc. of ICML, 2019.
  • [18] M. Luca and G. Zervas. Fake it till you make it: Reputation, competition, and Yelp review fraud. Management Science, 62(12):3412–3427, 2016.
  • [19] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proc. of ACM STOC, pages 114–122, 2018.
  • [20] Y. Ma, K.-S. Jun, L. Li, and X. Zhu. Data poisoning attacks in contextual bandits. In Proc. of GameSec, pages 186–204.
  • [21] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Proc. of NIPS, pages 3168–3176, 2015.
  • [22] G. Neu, A. Gyorgy, and C. Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proc. of AISTATS, pages 805–813, 2012.
  • [23] P. Ozisik and P. S. Thomas. Security analysis of safe and Seldonian reinforcement learning algorithms. In Advances in Neural Information Processing Systems, 2020.
  • [24] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • [25] A. Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
  • [26] M. S. Talebi, Z. Zou, R. Combes, A. Proutiere, and M. Johansson. Stochastic online shortest path routing: The value of feedback. IEEE Transactions on Automatic Control, 63(4):915–930, 2017.
  • [27] K. C. Wilbur and Y. Zhu. Click fraud. Marketing Science, 28(2):293–308, 2009.
  • [28] X. Wu, Y. Dong, J. Tao, C. Huang, and N. V. Chawla. Reliable fake review detection via modeling temporal and behavioral patterns. In Proc. of IEEE Big Data, pages 494–499, 2017.
  • [29] X. Zhang and X. Zhu. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.
  • [30] P. Zhou, J. Xu, W. Wang, Y. Hu, D. O. Wu, and S. Ji. Toward optimal adaptive online shortest path routing with acceleration under jamming attack. IEEE/ACM Transactions on Networking, 27(5):1815–1829, 2019.
  • [31] J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proc. of AISTATS, pages 467–475, 2019.