# Fiduciary Bandits

ICML 2020, 2019.

Abstract:

Recommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether explorat...

Introduction

- Multi-armed bandits [9, 11] is a well-studied problem domain in online learning. In that setting, several arms (i.e., actions) are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled.
- The planner selects arms sequentially, aiming to maximize her sum of rewards.
- This often involves a tradeoff between exploiting arms that have been observed to yield good rewards and exploring arms that could yield even higher rewards.
- Many variations of this model exist, including stochastic [1, 21], Bayesian [2], contextual [13, 29], adversarial [3] and non-stationary [8, 23] bandits.
- Users are not eager to perform such exploration; they are self-interested in the sense that they care more about minimizing their own travel times than about surveying traffic conditions for the system.
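The exploration-exploitation tradeoff described above can be illustrated with a minimal sketch of an epsilon-greedy planner. This is a standard textbook heuristic, not the mechanism proposed in the paper; the function name `epsilon_greedy`, the `reward_samplers` interface, and the default `epsilon` are assumptions made for illustration only.

```python
import random


def epsilon_greedy(reward_samplers, rounds, epsilon=0.1, seed=0):
    """Pull arms for `rounds` steps; return (total reward, pulls per arm).

    reward_samplers: hypothetical list of callables, one per arm, each
    taking a random.Random and returning a reward sample.
    """
    rng = random.Random(seed)
    k = len(reward_samplers)
    counts = [0] * k    # number of times each arm was pulled
    means = [0.0] * k   # running empirical mean reward of each arm
    total = 0.0
    for _ in range(rounds):
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            arm = untried[0]                              # try every arm once
        elif rng.random() < epsilon:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda a: means[a])   # exploit
        reward = reward_samplers[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total, counts
```

With probability `epsilon` the planner samples an arm uniformly at random, and otherwise it pulls the arm with the best empirical mean. Nothing in this sketch accounts for self-interested users, which is the gap the paper addresses.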

Highlights

- Multi-armed bandits (MABs) [9, 11] is a well-studied problem domain in online learning.
- Several arms are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled.
- This paper considers a setting motivated by recommender systems.
- We propose a stricter form of individual rationality, ex-post individual rationality (EPIR).
- This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than any individual agent would achieve if he/she acted without relying on the system.
- From a technical point of view, our algorithmic results are limited to discrete reward distributions.

Conclusion

- This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than any individual agent would achieve if he/she acted without relying on the system.

The authors see considerable scope for follow-up work.

- The authors see natural extensions of EPIR and EAIR to stochastic settings [10], either by assuming a prior and requiring the conditions w.r.t. the posterior distribution or by requiring the conditions to hold with high probability.
- The authors are intrigued by non-stationary settings [8] (where, e.g., rewards follow a Markov process), since the planner would be able to sample a priori inferior arms with high probability assuming the rewards change fast enough, thereby reducing regret.

Related work

Background on MABs can be found in Cesa-Bianchi and Lugosi [11] and a recent survey [9]. Kremer et al. [22] is the first work of which we are aware that investigated the problem of incentivizing exploration. The authors considered two deterministic arms, a prior known both to the agents and the planner, and an arrival order that is common knowledge among all agents, and presented an optimal IC mechanism. Cohen and Mansour [14] extended this optimality result to several arms under further assumptions. This setting has also been extended to regret minimization [26], social networks [4, 5], and heterogeneous agents [12, 19]. All of this literature disallows paying agents; monetary incentives for exploration are discussed in, e.g., [12, 16]. None of this work considers an individual rationality constraint as we do here.

Funding

- Tennenholtz is funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 740435).
- Leyton-Brown is funded by the NSERC Discovery Grants program, DND/NSERC Discovery Grant Supplement, Facebook Research and Canada CIFAR AI Chair Amii.
- Leyton-Brown was a visiting researcher at Technion - Israeli Institute of Science and was partially funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 740435).

Reference

- Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 1–39, 2012.
- P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
- G. Bahar, R. Smorodinsky, and M. Tennenholtz. Economic recommendation systems: One page abstract. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, pages 757–757, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3936-0. doi: 10.1145/2940716.2940719. URL http://doi.acm.org/10.1145/2940716.2940719.
- G. Bahar, R. Smorodinsky, and M. Tennenholtz. Social learning and the innkeeper challenge. In ACM Conf. on Economics and Computation (EC), 2019.
- A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
- O. Ben-Porat and M. Tennenholtz. A game-theoretic approach to recommendation systems with strategic content providers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montreal, Canada, pages 1118–1128, 2018.
- O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (NIPS), pages 199–207, 2014.
- S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univ Press, 2006.
- B. Chen, P. Frazier, and D. Kempe. Incentivizing exploration by heterogeneous users. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 798–818. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/chen18a.html.
- W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- L. Cohen and Y. Mansour. Optimal algorithm for Bayesian incentive-compatible exploration. In ACM Conf. on Economics and Computation (EC), 2019.
- C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science conference (ITCS), pages 214–226. ACM, 2012.
- P. Frazier, D. Kempe, J. Kleinberg, and R. Kleinberg. Incentivizing exploration. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, pages 5–22, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2565-3. doi: 10.1145/2600057.2602897. URL http://doi.acm.org/10.1145/2600057.2602897.
- J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 3315–3323, 2016.
- N. Immorlica, J. Mao, A. Slivkins, and Z. S. Wu. Bayesian exploration with heterogeneous agents, 2019.
- M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 325–333. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits.pdf.
- Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
- I. Kremer, Y. Mansour, and M. Perry. Implementing the wisdom of the crowd. Journal of Political Economy, 122:988–1012, 2014.
- N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3074–3083. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6900-rotting-bandits.pdf.
- L. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning, pages 3156–3164, 2018.
- Y. Liu, G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes. Calibrated fairness in bandits, 2017.
- Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. In ACM Conf. on Economics and Computation (EC), 2015.
- N. Nisan and A. Ronen. Algorithmic mechanism design. In Proceedings of the thirty-first annual ACM Symposium on Theory of Computing (STOC), pages 129–140. ACM, 1999.
- N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic game theory, volume 1. Cambridge University Press Cambridge, 2007.
- A. Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.
- 2. Let ε > 0, K, H
- 2. First, notice that ε <
- 2. While the length of h is less than n: 2.1 Draw ai ∼ M(h). If the reward of ai was already observed and Ri ≤ R1, recommend a1 and set h = h̃ ⊕ (ai, Ri). Else, act as M(h) and update h accordingly.
- 2. While h is not auspicious: 2.1 Act as M(1)(h) and update h accordingly.
- 3. If h is auspicious: 3.1 Use an oracle to reveal the best arm, a∗. From here on, recommend a∗ to all users.
- 2. Else, if α(O) < β(O), then As = ∅. 3.
- 2. For every state s, W(π′, s) ≥ W(π, s).
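The step "While the length of h is less than n ..." above can be read as a wrapper around a mechanism M that never sends a user to an arm already known to be no better than arm a1. The sketch below is one possible reading under stated assumptions, not the authors' implementation: each arm's reward is assumed fixed once observed, a1 is taken to be a designated default arm, and `mechanism`, `pull`, `default_arm`, and `run_wrapped` are hypothetical names introduced here. The history update is interpreted as appending the already-known pair (ai, Ri) to the history that M sees.

```python
import random


def run_wrapped(mechanism, pull, default_arm, n, seed=0):
    """Run `mechanism` for n steps, overriding known-inferior draws.

    mechanism(history, rng) -> arm index (stands in for drawing from M(h)).
    pull(arm) -> reward of that arm (assumed fixed across pulls).
    Returns (arms actually recommended to users, history fed to M).
    """
    rng = random.Random(seed)
    history = []      # list of (arm, reward) pairs seen by the mechanism
    observed = {}     # arm -> reward, for arms pulled at least once
    recommended = []  # arm recommended to the user at each step
    while len(history) < n:
        arm = mechanism(history, rng)
        known_not_better = (
            arm in observed
            and default_arm in observed
            and observed[arm] <= observed[default_arm]
        )
        if known_not_better:
            # Recommend the default arm, but let M proceed as if `arm`
            # had been pulled, reusing its known reward.
            recommended.append(default_arm)
            history.append((arm, observed[arm]))
        else:
            # Otherwise act as M would and record the outcome.
            reward = pull(arm)
            observed[arm] = reward
            recommended.append(arm)
            history.append((arm, reward))
    return recommended, history
```

A small usage example with a round-robin stand-in for M: with rewards `{0: 0.5, 1: 0.2, 2: 0.9}` and default arm `0`, the second visit to arm `1` is replaced by a recommendation of arm `0`, while arm `2` is still recommended because its known reward exceeds the default's.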
