Shrinkage Estimators in Online Experiments

KDD, 2019.


Abstract:

We develop and analyze empirical Bayes Stein-type estimators for use in the estimation of causal effects in large-scale online experiments. While online experiments are generally thought to be distinguished by their large sample size, we focus on the multiplicity of treatment groups. The typical analysis practice is to use simple differences in means…
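The positive-part James–Stein shrinkage the abstract describes can be sketched as follows. This is an illustrative reimplementation assuming (approximately) known per-arm variances, not the authors' production code; the function name `js_shrink` is ours.

```python
import numpy as np

def js_shrink(means, variances):
    """Positive-part James-Stein shrinkage of K arm means toward their
    grand mean, assuming known per-arm sampling variances."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    k = means.size
    grand = means.mean()
    ss = np.sum((means - grand) ** 2)
    # Per-arm shrinkage factor: 0 = no shrinkage, 1 = full pooling.
    # Noisier arms (larger variance) are pulled harder toward the grand mean.
    shrink = np.minimum(1.0, (k - 3) * variances / ss) if k > 3 else np.zeros(k)
    return grand + (1.0 - shrink) * (means - grand)
```

Each estimate moves partway from the raw arm mean toward the pooled mean, with the amount of pooling estimated from the data itself rather than fixed in advance.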

Introduction
  • The routine use of online experiments at Internet firms motivates the development of tools and methods which are robust and reliable when used for thousands of experiments per day, either through the use of manual self-serve tools by non-experts, or automated experimentation techniques like multi-armed bandit optimization.
  • These experiments commonly have many variants or arms.
  • Whenever there is agreement within a group or organization about what outcomes should be improved, experimentation can be used as a tool to achieve these ends, whether those outcomes are revenue, voting behavior, or public health
Highlights
  • The routine use of online experiments at Internet firms motivates the development of tools and methods which are robust and reliable when used for thousands of experiments per day, either through the use of manual self-serve tools by non-experts, or automated experimentation techniques like multi-armed bandit optimization
  • This paper has introduced an approach which increases the accuracy and reliability of online experimentation in both one-shot and sequential experimentation
  • The empirical Bayes estimators presented here are used as our default estimator for many-armed experiments and automated experimentation services, such as those used to optimize creatives like the ones featured in Figure 1a
  • The shrinkage estimator examined here did not take into account the factorial structure observed in the data
  • We can see that most arms have higher than nominal coverage (60% of points have higher than nominal coverage, and 90% have at least 90% coverage)
  • Online experiments typically collect a wide variety of outcome variables
Methods
  • In the simulation studies that follow, the authors take the means and variances from seventeen recent routine online experiments conducted on Facebook intended to optimize content in the normal course of operation of the platform.
  • These experiments appear at the top of News Feed, as in Figure 1a.
  • In the example in Figure 1a, the action would be whether the user chose to create a charitable fundraiser or not
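A simulation setup of this flavor can be sketched as below; the Beta parameters and per-arm sample size are illustrative assumptions chosen to mimic low conversion rates, not the paper's actual values, which come from seventeen real Facebook experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-arm true conversion rates of one experiment.
true_rates = rng.beta(2, 78, size=20)   # mean around 2.5%
n_per_arm = 10_000

conversions = rng.binomial(n_per_arm, true_rates)
mk = conversions / n_per_arm            # per-arm MLE of the conversion rate
vk = mk * (1 - mk) / n_per_arm          # estimated sampling variance per arm
```

The pairs (mk, vk) are exactly the highly aggregated inputs a shrinkage estimator consumes: one mean and one variance per arm, with no individual-level data required.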
Results
  • The typical conversion rates were less than 10%, with the sample-size-weighted average being approximately 2.5%.
  • The authors can see that most arms have higher than nominal coverage (60% of points have higher than nominal coverage, and 90% have at least 90% coverage)
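Per-arm coverage of this kind can be checked with a small Monte Carlo exercise; the rates, sample sizes, and the normal-approximation interval below are our illustrative assumptions, not the paper's exact procedure or data.

```python
import numpy as np

rng = np.random.default_rng(1)

# 50 arms with a true rate near the paper's typical 2.5%; nominal 95% intervals.
true_rates = np.full(50, 0.025)
n, sims, z = 20_000, 2_000, 1.96

draws = rng.binomial(n, true_rates, size=(sims, true_rates.size)) / n
se = np.sqrt(draws * (1 - draws) / n)
covered = (draws - z * se <= true_rates) & (true_rates <= draws + z * se)
per_arm_coverage = covered.mean(axis=0)   # one empirical coverage rate per arm
```

Comparing `per_arm_coverage` against the nominal level arm by arm is how statements like "60% of points have higher than nominal coverage" are assessed.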
Conclusion
  • This paper has introduced an approach which increases the accuracy and reliability of online experimentation in both one-shot and sequential experimentation
  • It provides clear gains in accuracy, and it tends to do a better job of optimization and best-arm identification than the MLEs. The method is easy to describe and implement with little technical debt.
  • Interesting avenues of future work might attempt to provide similar methods which provide analogous guarantees while taking into account slightly more structure in the data
  • Two such opportunities present themselves.
  • Constructing an estimator using a multivariate shrinkage method like curds and whey [8] would likely result in further gains by considering the correlation structure.
Summary
  • Objectives:

    The authors' goal is to take these highly aggregated quantities and construct an estimator better than mk, using the fact that these true effects share some underlying common distribution with a central tendency.
Reference
  • [1] Shipra Agrawal and Navin Goyal. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory. 39–1.
  • [2] Shipra Agrawal and Navin Goyal. 2013. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics. 99–107.
  • [3] Jean-Yves Audibert and Sébastien Bubeck. 2010. Best arm identification in multiarmed bandits. In COLT-23th Conference on Learning Theory-2010. 13–p.
  • [4] Eytan Bakshy, Dean Eckles, Rong Yan, and Itamar Rosenn. 2012. Social influence in social advertising: evidence from field experiments. In Proceedings of the 13th ACM Conference on Electronic Commerce. ACM, 146–161.
  • [5] Shlomo Benartzi, John Beshears, Katherine L Milkman, Cass R Sunstein, Richard H Thaler, Maya Shankar, Will Tucker-Ray, William J Congdon, and Steven Galing. 2017. Should Governments Invest More in Nudging? Psychological Science (2017), 0956797617702501.
  • [6] Adam Bloniarz, Hanzhong Liu, Cun-Hui Zhang, Jasjeet S Sekhon, and Bin Yu. 2016. Lasso adjustments of treatment effect estimates in randomized experiments. Proceedings of the National Academy of Sciences 113, 27 (2016), 7383–7390.
  • [7] George EP Box, J Stuart Hunter, and William Gordon Hunter. 2005. Statistics for experimenters: design, innovation, and discovery. Vol. 2. Wiley-Interscience New York.
  • [8] Leo Breiman and Jerome H Friedman. 1997. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 1 (1997), 3–54.
  • [9] Olivier Chapelle and Lihong Li. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems. 2249–2257.
  • [10] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 123–132.
  • [11] Dean Eckles, René F Kizilcec, and Eytan Bakshy. 2016. Estimating peer effects in networks with peer encouragement designs. Proceedings of the National Academy of Sciences 113, 27 (2016), 7316–7322.
  • [12] Bradley Efron. 2012. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction. Vol. 1. Cambridge University Press.
  • [13] Bradley Efron and Carl Morris. 1971. Limiting the risk of Bayes and empirical Bayes estimators—part I: the Bayes case. J. Amer. Statist. Assoc. 66, 336 (1971), 807–815.
  • [14] Bradley Efron and Carl Morris. 1975. Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70, 350 (1975), 311–319.
  • [15] Andrew Gelman and John Carlin. 2014. Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 6 (2014), 641–651.
  • [16] Andrew Gelman, John B Carlin, Hal S Stern, and David B Dunson. 2014. Bayesian data analysis. Vol. 3.
  • [17] Andrew Gelman, Jennifer Hill, and Masanao Yajima. 2012. Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 2 (2012), 189–211.
  • [18] Andrew Gelman, Jennifer Hill, and Masanao Yajima. 2012. Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5, 2 (2012), 189–211.
  • [19] Malay Ghosh, Georgios Papageorgiou, and Janet Forrester. 2009. Multivariate limited translation hierarchical Bayes estimators. Journal of Multivariate Analysis 100, 7 (2009), 1398–1411.
  • [20] Kosuke Imai and Aaron Strauss. 2010. Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the get-out-the-vote campaign. Political Analysis 19, 1 (2010), 1–19.
  • [21] Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • [22] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. 2018. Constrained Bayesian Optimization with Noisy Experiments. Bayesian Anal. (2018). https://doi.org/10.1214/18-BA1110
  • [23] Winston Lin. 2013. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics 7, 1 (2013), 295–318.
  • [24] Brigitte C Madrian. 2014. Applying insights from behavioral economics to policy design. Annu. Rev. Econ. 6, 1 (2014), 663–688.
  • [25] Carl N. Morris. 1983. Parametric Empirical Bayes Inference: Theory and Applications. J. Amer. Statist. Assoc. 78, 381 (1983), 47–55. http://www.jstor.org/stable/2287098
  • [26] Saralees Nadarajah and Samuel Kotz. 2008. Exact distribution of the max/min of two Gaussian random variables. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 16, 2 (2008), 210–212.
  • [27] Daniel Russo. 2016. Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory. 1417–1418.
  • [28] Steven L Scott. 2010. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26, 6 (2010), 639–658.
  • [29] Charles Stein. 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 197–206.
  • [30] A. J. van der Merwe, P. C. N. Groenewald, and C. A. van der Merwe. 1988. Approximated Bayes and empirical Bayes confidence intervals—The known variance case. Annals of the Institute of Statistical Mathematics 40, 4 (Dec 1988), 747–767.
  • [31] Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. 2012. A robust method for estimating optimal treatment regimes. Biometrics 68, 4 (2012), 1010–1018.