# Latent Bandits Revisited

NeurIPS 2020.

Abstract:

A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state. The primary goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning---complex models can be learn…

Introduction

- Many online platforms, such as search engines or recommender systems, display results based on observed properties of the user and their query.
- A user's behavior is often influenced by a latent state not explicitly revealed to the system.
- This might be user intent in search, or user preferences in a recommender.
- In each case, the unobserved latent state influences the user's response to the displayed results.
- The authors are interested in designing exploration policies that allow the agent to quickly maximize its per-round reward by resolving relevant latent-state uncertainty.
- The authors want policies that have low n-round regret.
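The setting above can be sketched as a Thompson-sampling loop over a known conditional reward model: maintain a posterior over the discrete latent state, sample a state, and play that state's best arm. This is a minimal illustration, not the paper's algorithm as specified; the model `mu`, the Gaussian noise level, and all numeric values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline-learned model: mean reward of each arm
# conditioned on each latent state (rows = states, columns = arms).
mu = np.array([[0.9, 0.1, 0.2],
               [0.2, 0.8, 0.1],
               [0.1, 0.3, 0.7]])
n_states, n_arms = mu.shape
sigma = 0.1       # assumed known Gaussian reward noise
true_state = 2    # hidden from the agent

belief = np.ones(n_states) / n_states  # uniform prior over latent states
for t in range(500):
    s = rng.choice(n_states, p=belief)        # sample a latent state (TS step)
    a = int(np.argmax(mu[s]))                 # act greedily under the sample
    r = rng.normal(mu[true_state, a], sigma)  # observe a noisy reward
    # Bayesian update: reweight each state by the reward's likelihood under it.
    lik = np.exp(-0.5 * ((r - mu[:, a]) / sigma) ** 2)
    belief = belief * lik
    belief /= belief.sum()

print(belief)  # concentrates on the true latent state
```

Once the belief concentrates, the sampled state is almost always the true one, so the agent plays the optimal arm from then on, which is the "fast personalization" behavior the authors are after.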

Highlights

- Many online platforms, such as search engines or recommender systems, display results based on observed properties of the user and their query.
- We provide a unified framework that combines offline-learned models with online exploration for both upper confidence bound (UCB) and Thompson sampling algorithms, and propose practical, analyzable algorithms that are contextual and robust to natural forms of model imprecision.
- We study a latent bandit problem, where the learning agent interacts with an environment over n rounds.
- UCB1 and Thompson sampling (TS) are used for non-contextual problems, while LinUCB and LinTS are used for contextual bandit experiments.
- We studied the latent bandits problem, where the rewards are parameterized by a discrete latent state.
- We adopted a framework in which an offline-learned model is combined with UCB and Thompson sampling exploration to quickly identify the latent state.

Methods

- The authors evaluate the algorithms on both synthetic and real-world datasets.
- In contrast to the proposed methods, the UCB and TS baselines do not use an offline-learned model.
- UCB1 and TS are used for non-contextual problems, while LinUCB and LinTS are used for contextual bandit experiments.
- EXP4 uses the offline-learned model as a mixture-of-experts, where each expert plays the best arm given the context under its corresponding latent state.
- Because the authors measure "fast personalization", they use short horizons of at most 500 rounds.
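To contrast with the model-free baselines, here is one way an offline model can drive UCB-style exploration, in the spirit of the non-contextual algorithms the paper builds on: keep only latent states whose model means are consistent with the empirical arm means, and act optimistically among them. The two-state model, horizon, and confidence radius are all illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative offline model (rows = states, columns = arms).
mu = np.array([[0.9, 0.1],
               [0.1, 0.8]])
n_states, n_arms = mu.shape
true_state = 1
sigma = 0.1

counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(1, 301):
    emp = sums / np.maximum(counts, 1)                 # empirical arm means
    rad = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))  # confidence radii
    # A state is plausible if its predicted means lie inside the
    # confidence interval of every arm pulled so far.
    plausible = [s for s in range(n_states)
                 if all(counts[a] == 0 or abs(emp[a] - mu[s, a]) <= rad[a]
                        for a in range(n_arms))]
    if not plausible:                  # misspecification fallback
        plausible = list(range(n_states))
    # Optimism: act for the plausible state promising the highest reward.
    s = max(plausible, key=lambda st: mu[st].max())
    a = int(np.argmax(mu[s]))
    r = rng.normal(mu[true_state, a], sigma)
    counts[a] += 1
    sums[a] += r
```

Because the model's means are known, the wrong state is eliminated after only a handful of pulls, so the optimal arm dominates the play counts well within the short horizons used in the experiments.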

Results

- The authors assess the empirical performance of the algorithms on MovieLens 1M [16], a large-scale collaborative filtering dataset comprising 6040 users rating 3883 movies.
- The authors randomly select 50% of all ratings as the "offline" training set, and use the remaining 50% as a test set, giving sparse ratings matrices Mtrain and Mtest.
- Matrix factorization yields learned factors U and V with Mtrain ≈ UVᵀ (and analogously for Mtest).
- User i and movie j correspond to rows Ui and Vj of the matrix factors, respectively.
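A minimal sketch of how such low-rank factors can be obtained, here with a truncated SVD on a small dense toy matrix; MovieLens ratings are sparse, so in practice a method suited to missing entries (e.g. alternating least squares) would be used instead. The matrix shape and rank k below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for a dense ratings matrix (users x movies).
M_train = rng.random((20, 15))

k = 5  # number of latent factors (assumed)
# Rank-k truncated SVD gives factors with M_train ≈ U @ V.T.
U_full, s, Vt = np.linalg.svd(M_train, full_matrices=False)
U = U_full[:, :k] * s[:k]   # user factors: row i describes user i
V = Vt[:k].T                # item factors: row j describes movie j

err = np.linalg.norm(M_train - U @ V.T)  # rank-k reconstruction error
print(err)
```

The rank-k truncated SVD is the best rank-k approximation in Frobenius norm, which makes it a convenient stand-in here even though the paper's sparse setting calls for a different solver.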

Conclusion

- The authors studied the latent bandits problem, where the rewards are parameterized by a discrete latent state.
- The authors adopted a framework in which an offline-learned model is combined with UCB and Thompson sampling exploration to quickly identify the latent state.
- The authors' approach handles both context and misspecified models.
- A natural extension of this work is to use temporal models to handle latent-state dynamics.
- This is useful for applications where user preferences, tasks, or intents change fairly quickly.
- For TS, the dynamics can be taken into account when computing the posterior.
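That last point amounts to a standard hidden Markov model filtering step: between rounds, propagate the belief through a transition kernel before reweighting by the reward likelihood. The 3-state kernel and likelihood values below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Assumed transition kernel over 3 latent states: mostly stay put,
# occasionally switch (values chosen for illustration only).
T = np.array([[0.98, 0.01, 0.01],
              [0.01, 0.98, 0.01],
              [0.01, 0.01, 0.98]])

def belief_update(belief, likelihood, T):
    """One HMM filtering step: predict through the dynamics, then
    reweight by the observed reward's likelihood under each state."""
    predicted = T.T @ belief            # prior after one transition
    posterior = predicted * likelihood  # reweight by the evidence
    return posterior / posterior.sum()

b = np.array([1.0, 0.0, 0.0])      # currently certain we are in state 0
lik = np.array([0.1, 0.1, 0.9])    # reward weakly suggests state 2
b = belief_update(b, lik, T)
```

With dynamics, the belief never collapses permanently onto one state: the prediction step keeps a small probability on the other states, so a preference shift can still be detected later.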


Related work

- Latent bandits. Latent contextual bandits admit faster personalization than standard contextual bandit strategies, such as LinUCB [1] or linear TS [4, 2]. The closest work to ours is that of Maillard and Mannor [23], which proposes and analyzes non-contextual UCB algorithms under the assumption that the mean rewards for each latent state are known. Zhou and Brunskill [31] extend this formulation to the contextual bandits case, but consider offline-learned policies deployed as a mixture via EXP4. Bayesian policy reuse (BPR) [25] selects offline-learned policies by maintaining a belief over the optimality of each policy, but no analysis exists. Our work subsumes prior work by providing contextual, uncertainty-aware UCB and TS algorithms and a unified analysis of the two.

References

- Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
- Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. Electronic Journal of Statistics, 2016.
- Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. CoRR, abs/1111.1797, 2011.
- Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. ICML, 2013.
- Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 2014.
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
- Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
- Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 2013.
- Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. NeurIPS, 2011.
- Merlise Clyde and Edward I. George. Model uncertainty. Statistical Science, 2004.
- Arnaud Doucet, Neil Gordon, and Nando de Freitas. Sequential Monte Carlo Methods in Practice. Springer New York, 2013.
- Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. CoRR, abs/0805.3415, 2008.
- Claudio Gentile, Shuai Li, and Giovanni Zapella. Online clustering of bandits. ICML, 2014.
- Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. ICML, 2017.
- Samarth Gupta, Shreyas Chaudhari, Subhojyoti Mukherjee, Gauri Joshi, and Osman Yagan. A unified approach to translate classical bandit algorithms to the structured bandit setting. CoRR, abs/1810.08164, 2018.
- F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.
- Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. IJCAI, 2017.
- Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. AISTATS, 2017.
- Jaya Kawale, Hung H Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. NeurIPS, 2015.
- Tor Lattimore and Remi Munos. Bounded regret for finite-armed structured bandits. NeurIPS, 2014.
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
- Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. SIGIR, 2016.
- Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. ICML, 2014.
- Trong T. Nguyen and Hady W. Lauw. Dynamic clustering of contextual multi-armed bandits.
- Benjamin Rosman, Majd Hawasly, and Subramanian Ramamoorthy. Bayesian policy reuse. Machine Learning, 2016.
- Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.
- Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. NeurIPS, 2008.
- Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
- Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An NMF approach. AISTATS, 2017.
- Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. WSDM, 2017.
- Li Zhou and Emma Brunskill. Latent contextual bandits and their application to personalized recommendations for new users. IJCAI, 2016.
