# Ad click prediction: a view from the trenches

KDD, 2013.

Abstract:

Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional su...

Introduction

- Online advertising is a multi-billion dollar industry that has served as one of the great success stories for machine learning.
- A typical industrial model may provide predictions on billions of events per day, using a correspondingly large feature space, and learn from the resulting mass of data.
- The authors explore issues of memory savings, performance analysis, confidence in predictions, calibration, and feature management with the same rigor that is traditionally given to the problem of designing an effective learning algorithm.

Highlights

- We present a series of case studies drawn from recent experiments in the setting of the deployed system used at Google to predict ad click–through rates for sponsored search advertising
- The question is, can we get both the sparsity provided by Regularized Dual Averaging (RDA) and the improved accuracy of online gradient descent (OGD)? The answer is yes, using the “Follow The (Proximally) Regularized Leader” algorithm, or FTRL-Proximal
- In earlier experiments on smaller prototyping versions of our data, McMahan [24] showed that FTRL-Proximal with L1 regularization significantly outperformed both Regularized Dual Averaging and FOBOS in terms of the size-versus-accuracy tradeoffs produced; these previous results are summarized in Table 1, rows 2 and 3
- In q2.13 encoding, we reserve two bits to the left of the binary decimal point, thirteen bits to the right of the binary decimal point, and a bit for the sign, for a total of 16 bits used per value. This reduced precision could create a problem with accumulated roundoff error in an online gradient descent (OGD) setting, which requires the accumulation of a large number of tiny steps. A simple randomized rounding strategy corrects for this at the cost of a small added regret term [14]
- We discarded the real click labels, and sampled new labels taking the predictions of the ground-truth model as the true click–through rates
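
The q2.13 encoding with randomized rounding described above can be sketched as follows. This is a minimal illustration, not the deployed implementation: the function names, the clamping of out-of-range values, and the specific rounding rule (add a uniform random number in [0, 1) before taking the floor) are assumptions chosen so that the rounding is unbiased in expectation.

```python
import math
import random

FRACTION_BITS = 13            # bits to the right of the binary point
SCALE = 1 << FRACTION_BITS    # 2**13 = 8192 steps per unit
MAX_CODE = (1 << 15) - 1      # largest magnitude that fits in 2 + 13 bits


def encode_q2_13(value, rng=random):
    """Map a float to an integer code in units of 2**-13 (hypothetical helper).

    Randomized rounding: floor(scaled + R) with R uniform in [0, 1) rounds up
    with probability equal to the fractional part, so the expected decoded
    value equals the input and tiny updates are not systematically lost.
    """
    scaled = value * SCALE
    code = math.floor(scaled + rng.random())
    # Assumption: values outside the representable range are clamped.
    return max(-MAX_CODE, min(MAX_CODE, code))


def decode_q2_13(code):
    """Convert an integer code back to a float."""
    return code / SCALE
```

A value much smaller than the resolution of 2**-13 is stored as either 0 or one step, but averages out to the right value over many roundings, which is why the accumulated roundoff error stays controlled.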

Results

- For example, when estimating the heads probability of each of 10 coins, the authors could run 10 independent copies of online gradient descent, where the algorithm instance for problem i would use a learning rate like $\eta_{t,i} = 1/\sqrt{n_{t,i}}$, where $n_{t,i}$ is the number of times coin i has been flipped so far.
- The authors observe no measurable loss comparing results from a model using q2.13 encoding instead of 64-bit floating point values, and the authors save 75% of the RAM for coefficient storage.
- Model variants that do not include a feature keep a coefficient of 0 for it; the authors enforce that by setting the learning rate for those features to 0. Since the authors train together only highly similar models, the memory savings from not representing the key and the counts per model are much larger than the loss from features not in common.
- For a given update in OGD, each model variant computes its prediction and loss using the subset of coordinates that it includes, drawing on the stored single value for each coefficient.
- Each model that uses feature i computes a new desired value for the given coefficient.
- The authors evaluated this heuristic by comparing large groups of model variants trained with the single value structure against the same variants trained exactly with the setup from Section 4.3.
- The online loss has considerably better statistics than a held-out validation set, because the authors can use 100% of the data for both training and testing.
- Features for which $n_i$ is large get a smaller learning rate, precisely because the authors believe the current coefficient values are more likely to be accurate.
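
The per-coordinate learning rate and the online (progressive validation) loss mentioned above can be illustrated on the coin-flipping example. The sketch below is an assumption-laden toy, not the production system: `train_coins`, its parameters, and the simulated coin probabilities are hypothetical, and the counter is incremented before the rate is computed to avoid dividing by zero. Each prediction is scored before the model is updated on that example, so every example serves for both testing and training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_coins(true_probs, num_rounds, seed=0):
    """Per-coin online gradient descent with learning rate 1/sqrt(n_i)."""
    rng = random.Random(seed)
    k = len(true_probs)
    w = [0.0] * k   # one log-odds coefficient per coin
    n = [0] * k     # n[i]: number of times coin i has been flipped so far
    total_loss = 0.0
    for _ in range(num_rounds):
        i = rng.randrange(k)                       # flip one coin per round
        y = 1 if rng.random() < true_probs[i] else 0
        p = sigmoid(w[i])                          # predict BEFORE updating
        total_loss += -math.log(p) if y == 1 else -math.log(1.0 - p)
        n[i] += 1
        eta = 1.0 / math.sqrt(n[i])                # per-coordinate rate
        w[i] -= eta * (p - y)                      # logistic-loss gradient step
    return w, n, total_loss / num_rounds


w, n, online_loss = train_coins([0.1, 0.5, 0.9], num_rounds=30000)
print([round(sigmoid(v), 2) for v in w], round(online_loss, 3))
```

Coins that are flipped more often accumulate a larger count and therefore take smaller steps, while a rarely seen coin keeps a larger learning rate.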

Conclusion

- If the authors assume feature vectors are normalized so $|x_{t,i}| \le 1$, the authors can bound the change in the log-odds prediction due to observing a single training example (x, y).
- Additional experiments showed the uncertainty scores performed comparably to the much more expensive estimates obtained via a bootstrap of 32 models trained on random subsamples of data.
- Systematic bias can be caused by a variety of factors, e.g., inaccurate modeling assumptions, deficiencies in the learning algorithm, or hidden features not available at training and/or serving time.
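
The bound on the change in the log-odds prediction can be sketched as follows. This derivation is an illustration under stated assumptions rather than a quotation of the paper: it assumes a plain online gradient descent update $w_{t+1,i} = w_{t,i} - \eta_{t,i}\, g_{t,i}$, the logistic-loss gradient $g_{t,i} = (p_t - y_t)\, x_{t,i}$, and a per-coordinate rate $\eta_{t,i} = \alpha / \sqrt{n_{t,i}}$, where $\alpha$ is an assumed scale constant.

```latex
\begin{aligned}
|x_t \cdot w_t - x_t \cdot w_{t+1}|
  &= \Bigl|\sum_{i : x_{t,i} \neq 0} x_{t,i}\, \eta_{t,i}\, g_{t,i}\Bigr| \\
  &\le \sum_{i : x_{t,i} \neq 0} \eta_{t,i}\, |x_{t,i}|\, |g_{t,i}| \\
  &\le \sum_{i : x_{t,i} \neq 0} \eta_{t,i}
   = \alpha \sum_{i : x_{t,i} \neq 0} \frac{1}{\sqrt{n_{t,i}}}
\end{aligned}
```

The last inequality uses $|p_t - y_t| \le 1$ together with $|x_{t,i}| \le 1$, which gives $|g_{t,i}| \le 1$. The bound is small when every active feature has a large count, matching the intuition that well-observed features yield more confident predictions.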

- Table 1: FTRL results, showing the relative number of non-zero coefficient values and AucLoss (1 − AUC) for competing approaches (smaller numbers are better for both). Overall, FTRL gives better sparsity for the same or better accuracy (a detriment of 0.6% is significant for our application). RDA and FOBOS were compared to FTRL on a smaller prototyping dataset with millions of examples, while OGD-Count was compared to FTRL on a full-scale data set.
- Table 2: Effect of probabilistic feature inclusion. Both methods are effective, but the Bloom filtering approach gives better tradeoffs between RAM savings and prediction accuracy.

References

- [1] D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th international conference on World wide web, pages 21–30. ACM, 2009.
- [2] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD Conference, 2013. To appear.
- [3] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. 2011.
- [4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), July 1970.
- [5] A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, 1999.
- [6] O. Chapelle. Click modeling for display advertising. In AdML: 2012 ICML Workshop on Online Advertising, 2012.
- [7] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In ALT, 2008.
- [8] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
- [9] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
- [10] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
- [11] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 22, pages 495–503. 2009.
- [12] L. Fan, P. Cao, J. Almeida, and A. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3), June 2000.
- [13] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
- [14] D. Golovin, D. Sculley, H. B. McMahan, and M. Young. Large-scale learning with a small-scale footprint. In ICML, 2013. To appear.
- [15] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proc. 27th Internat. Conf. on Machine Learning, 2010.
- [16] D. Hillard, S. Schroedl, E. Manavoglu, H. Raghavan, and C. Leggetter. Improving ad relevance in sponsored search. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 361–370, 2010.
- [17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
- [18] D. W. Hosmer and S. Lemeshow. Applied logistic regression. Wiley-Interscience Publication, 2000.
- [19] H. A. Koepke and M. Bilenko. Fast prediction of new feature utility. In ICML, 2012.
- [20] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. JMLR, 10, 2009.
- [21] S.-M. Li, M. Mahdian, and R. P. McAfee. Value of learning in sponsored search auctions. In WINE, 2010.
- [22] W. Li, X. Wang, R. Zhang, Y. Cui, J. Mao, and R. Jin. Exploitation and exploration in a performance based contextual advertising system. In KDD, 2010.
- [23] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Ann. Appl. Stat., 6(1), 2012.
- [24] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In AISTATS, 2011.
- [25] H. B. McMahan and O. Muralidharan. On calibrated predictions for auction selection mechanisms. CoRR, abs/1211.3955, 2012.
- [26] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
- [27] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML, ICML ’05, 2005.
- [28] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM, 2007.
- [29] M. J. Streeter and H. B. McMahan. Less regret via online conditioning. CoRR, abs/1002.4862, 2010.
- [30] D. Tang, A. Agarwal, D. O’Brien, and M. Meyer. Overlapping experiment infrastructure: more, better, faster experimentation. In KDD, pages 17–26, 2010.
- [31] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113–1120. ACM, 2009.
- [32] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In NIPS, 2009.
- [33] Z. A. Zhu, W. Chen, T. Minka, C. Zhu, and Z. Chen. A novel click model and its applications to online advertising. In Proceedings of the third ACM international conference on Web search and data mining, pages 321–330. ACM, 2010.
- [34] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.