A One-Size-Fits-All Solution to Conservative Bandit Problems


Abstract:

In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB).

Introduction
  • The multi-armed bandit (MAB) problem (Thompson 1933; Auer, Cesa-Bianchi, and Fischer 2002) is a classic online learning model that characterizes the exploration-exploitation trade-off in sequential decision making.
  • While existing bandit algorithms achieve satisfactory regret bounds over the whole learning process, they can behave erratically and incur substantial losses in the initial exploratory phase.
  • This limitation has hindered their applications in real-world scenarios such as health sciences, marketing and finance, where it is important to guarantee safe and smooth algorithm behavior during initialization.
  • The learning objective is to minimize the expected cumulative regret while ensuring that the received cumulative reward stays above a fixed percentage of what one could obtain by always playing the default arm, as formalized in the sketch below.
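
    Concretely, in the standard notation of the conservative bandits literature (Wu et al. 2016), with default-arm mean μ0 and budget parameter α, the setup can be sketched as follows; this is the usual formulation, and the paper's sample-path version is stated over realized rather than expected rewards.

```latex
% Sample-path conservative constraint (sketch): at every round t the realized
% cumulative reward must stay above a (1-\alpha) fraction of the baseline
% obtained by always playing the default arm.
\sum_{s=1}^{t} r_s \;\ge\; (1-\alpha)\, t\, \mu_0
  \qquad \text{for all } t = 1,\dots,T .
% Subject to this constraint, the learner minimizes the expected cumulative
% regret with respect to the optimal arm:
R(T) \;=\; \sum_{t=1}^{T} \bigl( \mu^{*} - \mu_{A_t} \bigr).
```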
Highlights
  • The multi-armed bandit (MAB) problem (Thompson 1933; Auer, Cesa-Bianchi, and Fischer 2002) is a classic online learning model that characterizes the exploration-exploitation trade-off in sequential decision making
  • We propose a one-size-fits-all solution, General Solution to Conservative Bandits (GenCB), for conservative bandit problems (CBPs), and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB)
  • We extend our results to the mean-variance setting (Markowitz 1952; Sani, Lazaric, and Munos 2012), called the conservative mean-variance bandit problem (MV-CBP), which focuses on balancing expected reward and variability under safe exploration
  • We extend CBPs to the mean-variance setting (Sani, Lazaric, and Munos 2012; Maillard 2013; Cardoso and Xu 2019), called MV-CBP, which focuses on finding arms that achieve an effective trade-off between expected reward and variability (see the sketch after this list)
  • Compared to previous CBP algorithms, our schemes achieve significantly better performance, since we play the default arm less and enjoy a lower conservative regret
  • We propose a general solution to a family of conservative bandit problems (CBPs) with sample-path reward constraints, and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB)
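
    As a reference point for the mean-variance extension, the criterion of Sani, Lazaric, and Munos (2012) scores each arm by its variance penalized by the mean; the sketch below follows that convention, with the risk-tolerance symbol ρ an illustrative assumption rather than the paper's exact notation.

```latex
% Mean-variance of arm i with mean \mu_i and variance \sigma_i^2,
% for a risk-tolerance parameter \rho \ge 0 (smaller is better):
\mathrm{MV}_i \;=\; \sigma_i^2 \;-\; \rho\,\mu_i .
% MV-CBP asks for small mean-variance regret while the conservative
% constraint above is maintained; the paper reports an O(1/T) normalized
% conservative regret, i.e., T-independent in cumulative form.
```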
Methods
  • The authors conduct experiments for the algorithms in four problems, i.e., CMAB, CLB, CCCB and MV-CBP, with a wide range of parameter settings.
  • Since previous CBP algorithms use lower confidence bounds to check the constraints, they are forced to play the default arm more often and act more conservatively than ours (see the sketch after this list).
  • Compared to previous CBP algorithms, the schemes achieve significantly better performance, since the authors play the default arm less and enjoy a lower conservative regret.
  • One sees that MV-CUCB achieves this with only an additional constant overall regret compared to MV-UCB, which matches the T-independent bound on the conservative regret
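
    The following Python sketch illustrates the generic "check-then-fall-back" pattern used throughout the conservative bandits literature: play the arm proposed by a base UCB rule only if a pessimistic budget check certifies the constraint, otherwise play the default arm. The class name, constants, and the exact form of the check are illustrative assumptions and do not reproduce the authors' GenCB.

```python
import math


class ConservativeUCB:
    """Schematic conservative wrapper around a UCB base rule (illustrative only)."""

    def __init__(self, n_arms, mu_default, alpha=0.1, delta=0.05):
        self.n_arms = n_arms          # arm 0 is the default arm with known mean
        self.mu_default = mu_default  # known expected reward of the default arm
        self.alpha = alpha            # allowed loss fraction relative to the baseline
        self.delta = delta            # confidence parameter for the bounds
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0
        self.cum_reward = 0.0

    def _bound(self, arm, sign):
        """Upper (sign=+1) or lower (sign=-1) confidence bound on the arm's mean in [0, 1]."""
        if arm == 0:
            return self.mu_default            # default arm's mean is assumed known
        if self.counts[arm] == 0:
            return 1.0 if sign > 0 else 0.0   # unplayed arms get trivial bounds
        mean = self.sums[arm] / self.counts[arm]
        radius = math.sqrt(2.0 * math.log(1.0 / self.delta) / self.counts[arm])
        return min(1.0, max(0.0, mean + sign * radius))

    def select_arm(self):
        self.t += 1
        # Base (non-conservative) choice: the arm with the highest upper confidence bound.
        candidate = max(range(self.n_arms), key=lambda i: self._bound(i, +1))
        # Pessimistic budget check: realized reward so far, plus a lower confidence
        # bound on the candidate's reward, must keep us above the baseline
        # (1 - alpha) * t * mu_default; otherwise fall back to the default arm.
        budget = (self.cum_reward + self._bound(candidate, -1)
                  - (1.0 - self.alpha) * self.t * self.mu_default)
        return candidate if budget >= 0.0 else 0

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.cum_reward += reward
```

    In this schematic form, the budget check uses the realized cumulative reward, a sample-path quantity; algorithms that instead lower-bound the reward of the entire history of plays are more pessimistic and therefore fall back to the default arm more often, which is the behavior the Methods bullets contrast against.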
Results
  • The results match the theoretical bounds and demonstrate that the algorithms achieve superior performance compared to existing algorithms.
  • Compared to previous CBP algorithms, the schemes achieve significantly better performance, since the authors play the default arm less and enjoy a lower conservative regret
Conclusion
  • Conclusion and Future Works

    In this paper, the authors propose a general solution to a family of conservative bandit problems (CBPs) with sample-path reward constraints, and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB).
  • The authors study a novel extension of the conservative bandit problem to the mean-variance setting (MV-CBP) and develop an algorithm with O(1/T) normalized conservative regret (T-independent in the cumulative form).
  • The authors validate this result through empirical evaluation.
  • Another direction is to consider other practical conservative constraints which capture the safe exploration requirement in real-world applications
Tables
  • Table 1: Comparison of regret bounds for CBPs. “Type” refers to the type of regret bound; “E” and “H” denote expected and high-probability bounds, respectively. Here H =
Funding
  • This work is supported in part by the National Natural Science Foundation of China Grant 61672316, the Zhongguancun Haihua Institute for Frontier Information Technology and the Turing AI Institute of Nanjing
Reference
  • Abbasi-Yadkori, Y.; Pal, D.; and Szepesvari, C. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems, 2312–2320.
  • Agrawal, S.; and Goyal, N. 2012. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In Conference on Learning Theory, 39.1–39.26.
  • Amani, S.; Alizadeh, M.; and Thrampoulidis, C. 2019. Linear Stochastic Bandits Under Safety Constraints. In Advances in Neural Information Processing Systems, 9256–9266.
  • Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2-3): 235–256.
  • Bubeck, S.; Perchet, V.; and Rigollet, P. 2013. Bounded Regret in Stochastic Multi-armed Bandits. In Conference on Learning Theory, 122–134.
  • Cardoso, A. R.; and Xu, H. 2019. Risk-averse Stochastic Convex Bandit. In International Conference on Artificial Intelligence and Statistics, 39–47.
  • Dani, V.; Hayes, T.; and Kakade, S. M. 2008. Stochastic Linear Optimization under Bandit Feedback. In Conference on Learning Theory.
  • Garcelon, E.; Ghavamzadeh, M.; Lazaric, A.; and Pirotta, M. 2020. Improved Algorithms for Conservative Exploration in Bandits. In AAAI Conference on Artificial Intelligence.
  • Katariya, S.; Kveton, B.; Wen, Z.; and Potluru, V. 2019. Conservative Exploration using Interleaving. In International Conference on Artificial Intelligence and Statistics.
  • Kazerouni, A.; Ghavamzadeh, M.; Yadkori, Y. A.; and Van Roy, B. 2017. Conservative Contextual Linear Bandits. In Advances in Neural Information Processing Systems, 3910–3919.
  • Khezeli, K.; and Bitar, E. 2020. Safe Linear Stochastic Bandits. In AAAI Conference on Artificial Intelligence.
  • Locatelli, A.; Gutzeit, M.; and Carpentier, A. 2016. An Optimal Algorithm for the Thresholding Bandit Problem. In International Conference on Machine Learning, 1690–1698.
  • Maillard, O.-A. 2013. Robust Risk-averse Stochastic Multi-armed Bandits. In International Conference on Algorithmic Learning Theory, 218–233. Springer.
  • Markowitz, H. M. 1952. Portfolio Selection. Journal of Finance 7(1): 77–91.
  • Qin, L.; Chen, S.; and Zhu, X. 2014. Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation. In International Conference on Data Mining, 461–469.
  • Sani, A.; Lazaric, A.; and Munos, R. 2012. Risk-aversion in Multi-armed Bandits. In Advances in Neural Information Processing Systems, 3275–3283.
  • Thompson, W. R. 1933. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika 25(3/4): 285–294.
  • Vakili, S.; Boukouvalas, A.; and Zhao, Q. 2019. Decision Variance in Risk-averse Online Learning. In Conference on Decision and Control, 2738–2744. IEEE.
  • Wu, Y.; Shariff, R.; Lattimore, T.; and Szepesvari, C. 2016. Conservative Bandits. In International Conference on Machine Learning, 1254–1262.
  • Zhang, X.; Li, S.; and Liu, W. 2019. Contextual Combinatorial Conservative Bandits. arXiv preprint arXiv:1911.11337.