Fiduciary Bandits

ICML 2020 (arXiv preprint 2019).

TL;DR: We introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone.

Abstract:

Recommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether explorat...

Introduction
  • Multi-armed bandits [9, 11] is a well-studied problem domain in online learning. In that setting, several arms (i.e., actions) are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled.
  • The planner selects arms sequentially, aiming to maximize her sum of rewards.
  • This often involves a tradeoff between exploiting arms that have been observed to yield good rewards and exploring arms that could yield even higher rewards (a minimal simulation sketch of this tradeoff follows this list).
  • Many variations of this model exist, including stochastic [1, 21], Bayesian [2], contextual [13, 29], adversarial [3] and non-stationary [8, 23] bandits.
  • Users, however, are not eager to perform such exploration; they are self-interested in the sense that (in a navigation application, say) they care more about minimizing their own travel times than about gathering information on traffic conditions for the system.
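To make the tradeoff concrete, here is a minimal sketch of the setting described above: a stochastic bandit with Bernoulli arms and an epsilon-greedy planner. The arm means, horizon, and epsilon value are illustrative assumptions, not values taken from the paper.

```python
import random

# Minimal stochastic multi-armed bandit with an epsilon-greedy planner.
# TRUE_MEANS, HORIZON, and EPSILON are illustrative assumptions.
TRUE_MEANS = [0.3, 0.5, 0.7]   # hidden Bernoulli mean of each arm
HORIZON = 10_000               # number of sequential pulls
EPSILON = 0.1                  # probability of exploring a random arm

counts = [0] * len(TRUE_MEANS)   # pulls per arm
sums = [0.0] * len(TRUE_MEANS)   # total observed reward per arm
total_reward = 0.0

for t in range(HORIZON):
    if 0 in counts or random.random() < EPSILON:
        arm = random.randrange(len(TRUE_MEANS))             # explore
    else:
        arm = max(range(len(TRUE_MEANS)),
                  key=lambda a: sums[a] / counts[a])         # exploit
    reward = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
    total_reward += reward

print(f"average reward over {HORIZON} rounds: {total_reward / HORIZON:.3f}")
```

With more exploration (larger EPSILON) the planner learns the arm means faster but wastes more pulls on inferior arms; that tension is exactly the exploration-exploitation tradeoff referred to above.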
Highlights
  • Multi-armed bandits (MABs) [9, 11] is a well-studied problem domain in online learning.
  • Several arms are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled.
  • This paper considers a setting motivated by recommender systems
  • We propose a stricter form of individual rationality, ex-post individual rationality (EPIR).
  • This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than the agent would achieve acting without relying on the system (a hedged code sketch of this constraint follows this list).
  • From a technical point of view, our algorithmic results are limited to discrete reward distributions
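To illustrate the kind of constraint described above, the following sketch filters a candidate recommendation so that an arm known to be worse than the agent's default is never recommended. The function name, the dictionary of observed rewards, and the notion of a "default arm" are illustrative assumptions, not the paper's mechanism or notation.

```python
from typing import Dict

def epir_filter(candidate: int,
                observed_rewards: Dict[int, float],
                default_arm: int) -> int:
    """Hypothetical helper: never return an arm that is *known* to yield
    lower reward than the agent's default.

    `observed_rewards` maps arms whose rewards the planner has already
    observed to those rewards; arms absent from the map are unknown and
    therefore cannot be knowingly worse than the default.
    """
    candidate_reward = observed_rewards.get(candidate)
    default_reward = observed_rewards.get(default_arm)
    if candidate_reward is None or default_reward is None:
        return candidate        # no known violation, so it is allowed
    if candidate_reward < default_reward:
        return default_arm      # knowingly worse: fall back to default
    return candidate

# Example: arm 2 has been observed to pay less than the default arm 0,
# so recommending it would knowingly hurt the agent.
print(epir_filter(candidate=2,
                  observed_rewards={0: 0.6, 2: 0.4},
                  default_arm=0))   # prints 0
```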
Conclusion
This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than the agent would achieve acting without relying on the system.

    The authors see considerable scope for follow-up work.
  • The authors see natural extensions of EPIR and EAIR to stochastic settings [10], either by assuming a prior and requiring the conditions w.r.t. the posterior distribution or by requiring the conditions to hold with high probability (a sketch of one such condition appears after this list).
  • The authors are intrigued by non-stationary settings [8] (where, e.g., rewards follow a Markov process), since the planner would be able to sample a priori inferior arms with high probability if the rewards change fast enough, thereby reducing regret.
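As a hedged illustration of the "hold with high probability" idea above: writing r(a) for the random reward of arm a, d(h) for the agent's default action given history h, and delta for a confidence parameter (all of this notation is ours, not the paper's), recommending a at history h would be admissible whenever

```latex
% Hedged sketch only; r, d, and delta are illustrative notation,
% not the paper's formal definitions.
\[
  \Pr\bigl[\, r(a) \ge r\bigl(d(h)\bigr) \bigm| h \,\bigr] \;\ge\; 1 - \delta
\]
```

i.e., with probability at least 1 - delta under the posterior, the recommendation does not leave the agent worse off than acting alone.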
Related work
  • Background on MABs can be found in Cesa-Bianchi and Lugosi [11] and a recent survey [9]. Kremer et al. [22] is the first work of which we are aware that investigated the problem of incentivizing exploration. The authors considered two deterministic arms, a prior known both to the agents and the planner, and an arrival order that is common knowledge among all agents, and presented an optimal IC mechanism. Cohen and Mansour [14] extended this optimality result to several arms under further assumptions. This setting has also been extended to regret minimization [26], social networks [4, 5], and heterogeneous agents [12, 19]. All of this literature disallows paying agents; monetary incentives for exploration are discussed in, e.g., [12, 16]. None of this work considers an individual rationality constraint as we do here.
Funding
  • Tennenholtz is funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 740435).
  • Leyton-Brown is funded by the NSERC Discovery Grants program, a DND/NSERC Discovery Grant Supplement, Facebook Research, and a Canada CIFAR AI Chair (Amii).
  • Leyton-Brown was a visiting researcher at the Technion - Israel Institute of Technology and was partially funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 740435).
References
  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 1–39, 2012.
  • P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
  • G. Bahar, R. Smorodinsky, and M. Tennenholtz. Economic recommendation systems: One page abstract. In Proceedings of the 2016 ACM Conference on Economics and Computation (EC '16), pages 757–757, 2016. URL http://doi.acm.org/10.1145/2940716.2940719.
  • G. Bahar, R. Smorodinsky, and M. Tennenholtz. Social learning and the innkeeper challenge. In ACM Conference on Economics and Computation (EC), 2019.
  • A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
  • O. Ben-Porat and M. Tennenholtz. A game-theoretic approach to recommendation systems with strategic content providers. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 1118–1128, 2018.
  • O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (NIPS), pages 199–207, 2014.
  • S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • B. Chen, P. Frazier, and D. Kempe. Incentivizing exploration by heterogeneous users. In Proceedings of the 31st Conference on Learning Theory (COLT), volume 75 of Proceedings of Machine Learning Research, pages 798–818. PMLR, 2018. URL http://proceedings.mlr.press/v75/chen18a.html.
  • W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • L. Cohen and Y. Mansour. Optimal algorithm for Bayesian incentive-compatible exploration. In ACM Conference on Economics and Computation (EC), 2019.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), pages 214–226. ACM, 2012.
  • P. Frazier, D. Kempe, J. Kleinberg, and R. Kleinberg. Incentivizing exploration. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC '14), pages 5–22, 2014. URL http://doi.acm.org/10.1145/2600057.2602897.
  • J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 3315–3323, 2016.
  • N. Immorlica, J. Mao, A. Slivkins, and Z. S. Wu. Bayesian exploration with heterogeneous agents, 2019.
  • M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems 29 (NIPS), pages 325–333, 2016. URL http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits.pdf.
  • Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning (ICML), pages 1238–1246, 2013.
  • I. Kremer, Y. Mansour, and M. Perry. Implementing the wisdom of the crowd. Journal of Political Economy, 122:988–1012, 2014.
  • N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems 30 (NIPS), pages 3074–3083, 2017. URL http://papers.nips.cc/paper/6900-rotting-bandits.pdf.
  • L. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning (ICML), pages 3156–3164, 2018.
  • Y. Liu, G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes. Calibrated fairness in bandits, 2017.
  • Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. In ACM Conference on Economics and Computation (EC), 2015.
  • N. Nisan and A. Ronen. Algorithmic mechanism design. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (STOC), pages 129–140. ACM, 1999.
  • N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.
  • A. Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.