XGBoost: A Scalable Tree Boosting System.

KDD 2016: 785-794

Cited by 9847 | Views 639 | EI

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
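For readers who want the core formulation behind the system, the regularized learning objective and the split-gain criterion from Sec. 2 of the paper can be restated as follows (here $l$ is a differentiable convex loss, $f_k$ is the $k$-th regression tree with $T$ leaves and leaf weights $w$, and $g_i$, $h_i$ are the first- and second-order gradients of the loss):

```latex
% Regularized objective: training loss plus a complexity penalty on each tree.
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

% Gain of splitting a node with instance set I into children I_L and I_R;
% a split is worth keeping only when this gain is positive.
\mathcal{L}_{\mathrm{split}} = \frac{1}{2}\left[
  \frac{\bigl(\sum_{i \in I_L} g_i\bigr)^{2}}{\sum_{i \in I_L} h_i + \lambda}
+ \frac{\bigl(\sum_{i \in I_R} g_i\bigr)^{2}}{\sum_{i \in I_R} h_i + \lambda}
- \frac{\bigl(\sum_{i \in I} g_i\bigr)^{2}}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```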

Introduction
  • Machine learning and data-driven approaches are becoming very important in many areas.
  • Among the machine learning methods used in practice, gradient tree boosting [10] is one technique that shines in many applications (a minimal training sketch follows this list).
  • Besides being used as a stand-alone predictor, it is incorporated into real-world production pipelines for ad click-through rate prediction [15].
  • It is the de facto choice of ensemble method and is used in challenges such as the Netflix prize [3].
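As a concrete illustration of gradient tree boosting with XGBoost, the sketch below trains a small model with the xgboost Python package. The synthetic data and every hyperparameter value are illustrative choices, not settings from the paper's experiments.

```python
import numpy as np
import xgboost as xgb

# Toy binary-classification data; a stand-in for a real ad-click or spam dataset.
rng = np.random.default_rng(0)
X = rng.random((5000, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# DMatrix is XGBoost's internal data container.
dtrain = xgb.DMatrix(X, label=y)

# Illustrative hyperparameters: eta is the shrinkage (learning rate), lambda the
# L2 penalty on leaf weights, gamma the minimum gain required to make a split.
params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.3,
    "lambda": 1.0,
    "gamma": 0.0,
}
bst = xgb.train(params, dtrain, num_boost_round=50)
preds = bst.predict(dtrain)  # predicted probabilities in [0, 1]
```

Each boosting round fits one more regression tree to the gradient statistics of the current model, which is the additive training procedure the paper builds on.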
Highlights
  • Machine learning and data-driven approaches are becoming very important in many areas
  • Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics
  • Among the machine learning methods used in practice, gradient tree boosting [10] is one technique that shines in many applications
  • We described the lessons we learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems
  • We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning (a simplified split-finding sketch follows this list)
  • These lessons can be applied to other machine learning systems as well
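To make the sparsity-aware idea concrete, here is a simplified single-feature sketch of split finding with default directions. It is not the paper's implementation (which scans pre-sorted, compressed column blocks); the function name and arguments are illustrative. Missing values are never enumerated as split candidates: both default directions are scored, and missing instances follow whichever side yields the larger gain.

```python
import numpy as np

def sparsity_aware_split(x, g, h, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default direction) for one feature.

    x    : feature values, with np.nan marking missing entries
    g, h : first- and second-order gradient statistics per instance
    """
    def score(G, H):
        return G * G / (H + lam)

    present = ~np.isnan(x)
    G_total, H_total = g.sum(), h.sum()
    G_miss = G_total - g[present].sum()      # gradient mass of missing entries
    H_miss = H_total - h[present].sum()

    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]

    best = (-np.inf, None, None)
    GL_p = HL_p = 0.0                        # stats of present instances sent left
    for k in range(len(xs) - 1):
        GL_p += gs[k]
        HL_p += hs[k]
        if xs[k] == xs[k + 1]:
            continue                         # cannot split between equal values
        thr = 0.5 * (xs[k] + xs[k + 1])
        for default, GL, HL in (
            ("right", GL_p, HL_p),                    # missing values go right
            ("left", GL_p + G_miss, HL_p + H_miss),   # missing values go left
        ):
            GR, HR = G_total - GL, H_total - HL
            gain = 0.5 * (score(GL, HL) + score(GR, HR)
                          - score(G_total, H_total)) - gamma
            if gain > best[0]:
                best = (gain, thr, default)
    return best
```

Because only present entries are enumerated, the cost of the scan grows with the number of non-missing values rather than with the full data size, which is the point of the sparsity-aware design.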
Methods
  • In some of the experiments, the authors use a randomly selected subset of the data, either due to slow baselines or to demonstrate the performance of the algorithm with varying dataset size.
  • The authors simplified the task to only predict the likelihood of an insurance claim.
  • This dataset is used to evaluate the impact of the sparsity-aware algorithm in Sec. 3.4.
  • The authors randomly select 10M instances as the training set and use the rest as the evaluation set (a hedged setup sketch follows this list).
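Below is a hedged sketch of this kind of random train/evaluation split with the xgboost Python API. The synthetic data, the 80/20 split size, the injected missingness, and all hyperparameters are placeholders, not the paper's 10M-instance insurance-claim setup.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
n = 100_000                                   # placeholder; the paper trains on 10M instances
X = rng.random((n, 20))
X[rng.random(X.shape) < 0.3] = np.nan         # inject missing values to exercise the sparsity-aware path
y = (np.nan_to_num(X[:, 0]) > 0.5).astype(int)

# Random split: a fixed-size training set, the rest held out for evaluation.
idx = rng.permutation(n)
train_idx, eval_idx = idx[:80_000], idx[80_000:]
dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx], missing=np.nan)
deval = xgb.DMatrix(X[eval_idx], label=y[eval_idx], missing=np.nan)

params = {"objective": "binary:logistic", "eval_metric": "auc",
          "max_depth": 8, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=100,
                evals=[(dtrain, "train"), (deval, "eval")], verbose_eval=25)
```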
Results
  • On most of the datasets the authors tested, they achieve roughly a 26% to 29% compression ratio.
Conclusion
  • The authors described the lessons they learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems.
  • The authors' experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting.
  • These lessons can be applied to other machine learning systems as well.
  • XGBoost is able to solve real-world-scale problems using a minimal amount of resources.
Tables
  • Table 1: Comparison of major tree boosting systems
  • Table 2: Dataset used in the Experiments
  • Table 3: Comparison of Exact Greedy Methods with
  • Table 4: Comparison of Learning to Rank with 500 trees on Yahoo! LTRC Dataset
Funding
  • This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741, and the TerraSwarm Research Center sponsored by MARCO and DARPA.
Study Subjects and Analysis
Datasets: 2
A good choice of block size balances these two factors. We compared various choices of block size on two datasets. The results are given in Fig. 9.

Datasets: 4
We used four datasets in our experiments. A summary of these datasets is given in Table 2 (Sec. 6.2, Dataset and Setup).

Documents: 22
The third dataset is the Yahoo! learning to rank challenge dataset [6], which is one of the most commonly used benchmarks for learning to rank algorithms. The dataset contains 20K web search queries, with each query corresponding to a list of around 22 documents. The task is to rank the documents according to relevance to the query.
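For illustration, here is a minimal learning-to-rank setup with the xgboost Python API that mirrors this query/document structure; the group sizes, features, and relevance labels below are synthetic placeholders, not the Yahoo! LTRC data.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
n_queries, docs_per_query, n_features = 200, 22, 30   # ~22 documents per query, as in the benchmark
n = n_queries * docs_per_query

X = rng.random((n, n_features))
relevance = rng.integers(0, 5, size=n)                # graded relevance labels, 0 (bad) .. 4 (perfect)

dtrain = xgb.DMatrix(X, label=relevance)
dtrain.set_group([docs_per_query] * n_queries)        # consecutive rows in the same group share a query

params = {"objective": "rank:pairwise", "eval_metric": "ndcg@10",
          "eta": 0.1, "max_depth": 6}
bst = xgb.train(params, dtrain, num_boost_round=100)
scores = bst.predict(dtrain)                          # higher score = ranked earlier within its query
```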

Datasets: 3
The entire dataset is more than one terabyte in LibSVM format. We use the first three datasets for the single machine parallel setting, and the last dataset for the distributed and out-of-core settings. All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon (E5-2470, 2.3GHz) CPUs and 64GB of memory.

References
  • [2] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011.
  • [3] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, Aug. 2007.
  • [4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
  • [5] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.
  • [6] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research - W & CP, 14:1–24, 2011.
  • [7] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013.
  • [8] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.
  • [9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • [10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
  • [11] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
  • [12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
  • [13] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003.
  • [14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58–66, 2001.
  • [15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014.
  • [16] P. Li. Robust LogitBoost and adaptive base class (ABC) LogitBoost. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302–311, 2010.
  • [17] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897–904, 2008.
  • [18] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016.
  • [19] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, Aug. 2009.
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [22] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.
  • [23] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, 2009.
  • [24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.
  • [25] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.