# Smooth And Consistent Probabilistic Regression Trees

NeurIPS 2020

Abstract

We propose here a generalization of regression trees, referred to as Probabilistic Regression (PR) trees, that adapt to the smoothness of the prediction function relating input and output variables while preserving the interpretability of the prediction and being robust to noise. In PR trees, an observation is associated with all regions of...

Introduction

- Classification and regression trees (CART) [3] and the ensemble methods based on them, such as Random Forests [2] and Gradient Boosted Trees [11, 8], have been successfully used for regression problems in many applications and machine learning competitions.
- In a 2017 survey conducted by Kaggle, decision trees and random forests were respectively the second and third most used machine learning methods in industry, after logistic regression.
- It is well known that standard decision/regression trees, based on piecewise-constant functions with hard assignments of data points to regions, may have difficulty adapting to the smoothness of the link function as well as to noise in the input data.
- There is no free lunch: the additional properties of PR trees (adaptation to smoothness and robustness to noise) come with a computational cost, as described in Appendix A.2.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Highlights

- Classification and regression trees (CART) [3] and the ensemble methods based on them, such as Random Forests [2] and Gradient Boosted Trees [11, 8], have been successfully used for regression problems in many applications and machine learning competitions.
- Our contributions are fourfold: (1) we introduce new regression trees, called Probabilistic Regression (PR) trees, that adapt to noisy data sets and to the smoothness of the prediction function relating input and output variables while preserving the interpretability of the prediction; (2) we prove the consistency of the PR trees obtained; (3) we extend these trees to Random Forests and Gradient Boosted Trees; and (4) we show experimentally their benefits in terms of performance, interpretability and robustness to noise.
- We have studied here approximations, based on regression trees, of smooth link functions.
- We have shown that the probabilistic regression trees obtained are consistent, i.e., lim_{n→+∞} E[Ts(n)(X) − E(Y | X)]² = 0, with Ts(n) the probabilistic regression tree learned from a training set of size n, and that they outperform, on a variety of collections, previously proposed trees in terms of performance and robustness to noise.
- We have proposed versions of Random Forests and Gradient Boosted Trees based on Probabilistic Regression trees and shown that these versions outperform the state of the art.
- We want to extend the consistency results of probabilistic regression trees to their ensemble extensions, and derive Bayesian versions of these extensions, following the work conducted in Chipman et al. [5] and Linero and Yang [20].

Methods

- For Soft trees, the authors use the implementation available on GitHub with the default parameters.
- For STR trees and BooST, the extension of STR trees to GBT, the authors use the implementation available on GitHub.
- Unless otherwise specified, the authors use the Normal distribution for Ψ (Eq. 3).
- For both PR and standard regression trees, the stopping criterion is the same in all experiments: all leaves should contain at least 10% of the training data.
- All results are evaluated using the Root Mean Squared Error (RMSE).
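
The stopping criterion and the evaluation metric listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the function names `rmse`, `min_leaf_size` and `split_allowed` are hypothetical, and rounding the 10% threshold up to the next integer is an assumption, since the summary does not say how fractional leaf sizes are handled.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def min_leaf_size(n_train, percent=10):
    """Smallest number of training points a leaf may contain.

    Assumption: the threshold is rounded up (integer ceiling of percent% of n_train).
    """
    return (n_train * percent + 99) // 100


def split_allowed(n_left, n_right, n_train, percent=10):
    """A candidate split is kept only if both children meet the leaf-size rule."""
    threshold = min_leaf_size(n_train, percent)
    return n_left >= threshold and n_right >= threshold
```

For instance, with 100 training points a split producing children of sizes 9 and 91 would be rejected under this rule, while a 10/90 split would be accepted.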

Results

- For the three data sets considered (OZ, DI and BD) and for most observations, roughly 75% of the probability mass is concentrated on the three most probable regions, which are by far the most important ones and can be used to provide a first explanation for the predicted values.

Conclusion

- The authors have studied here approximations, based on regression trees, of smooth link functions.
- Although regression trees are a basic building block for two of the most popular and efficient ensemble methods, Random Forests and Gradient Boosted Trees, they are based on piecewise-constant functions and may fail to accommodate the smoothness of the link function.
- To solve this problem, the authors have introduced functions that relate, through sufficiently regular probability density functions, data points to different regions of the tree and smooth the predictions made.
- The authors plan to investigate knowledge distillation, as introduced in Frosst and Hinton [12].
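
The idea of relating a data point to several regions through a probability density can be illustrated with a one-dimensional sketch. It rests on assumptions that are not taken from the summary: regions are intervals delimited by split points, the membership probability of a region is the Normal probability mass that falls in it when the density is centred on the observation, and the prediction is the probability-weighted sum of leaf values. All function names and the parameter `sigma` are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard Normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def region_probabilities(x, boundaries, sigma):
    """Probability that x, perturbed by N(0, sigma^2) noise, falls in each region.

    `boundaries` holds the sorted split points; the regions are
    (-inf, b0), [b0, b1), ..., [b_last, +inf).
    """
    edges = [-math.inf] + list(boundaries) + [math.inf]
    probabilities = []
    for low, high in zip(edges[:-1], edges[1:]):
        upper = 1.0 if high == math.inf else normal_cdf((high - x) / sigma)
        lower = 0.0 if low == -math.inf else normal_cdf((low - x) / sigma)
        probabilities.append(upper - lower)
    return probabilities


def soft_predict(x, boundaries, leaf_values, sigma):
    """Smoothed prediction: leaf values weighted by region probabilities."""
    weights = region_probabilities(x, boundaries, sigma)
    return sum(w * value for w, value in zip(weights, leaf_values))


def hard_predict(x, boundaries, leaf_values):
    """Standard regression-tree prediction: value of the single region containing x."""
    index = sum(1 for b in boundaries if x >= b)
    return leaf_values[index]
```

In this sketch, an observation lying exactly on a split point receives equal weight from the two neighbouring regions, whereas the hard tree jumps from one leaf value to the other. As `sigma` shrinks, the soft prediction approaches the hard one.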

Summary

## Objectives:

- Closer to the authors' objectives are Soft trees, introduced in Irsoy et al. [14], Fuzzy Trees, introduced in Suarez and Lutsko [26], and Smooth Transition Regression trees (STR trees), introduced in [6].
- One of the objectives of this study is to use PR trees in ensemble methods.

Related work

- Several researchers have tried to adapt decision trees so as to explicitly take into account the noise and uncertainty present in the input data. For example, Fuzzy Decision Trees [4, 30, 15, 22], designed for classification purposes, assume that the values of some features and classes are associated with membership functions that allow one to associate an example with different rules and to predict several classes (with various degrees) with a rule. The approach presented in Ma et al. [21] fits within the same framework and aims at reducing data uncertainty by querying adequate data while learning the tree. The possibilistic trees considered in Elouedi et al. [9] and Jenhani et al. [16] also follow the same principles, the uncertainty being this time modeled with belief functions. Another approach to modeling uncertainty in the input is based on Uncertain Decision Trees [28, 23, 19], also designed for classification, which assume that examples can take on several values through explicit probability density functions (pdfs) on given intervals (or explicit probability tables for categorical values). An example may belong, with a certain probability, to several leaves depending on its value range. All these approaches rely on assumptions about the information available for each input (membership scores, pdfs or probability tables) and do not directly aim at adapting to the smoothness of the true prediction function.

Funding

- This work has been partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003) and by the French PIA project Lorraine Université d’Excellence (ANR-15-IDEX-04-LUE).

Study subjects and analysis

data sets: 13

Note that the results obtained with this and higher values are close to the ones obtained when considering all variables (see the Supplementary Material).

Data sets: We make use here of 13 data sets of various sizes, namely (ordered by increasing sample size) Riboflavin (RI), Ozone (OZ), Diabetes (DI), Abalone (AB), Boston (BO), Bike-Day (BD), E2006, Skill (SK), Ailerons (AL), Bike-Hour (BH), Super Conductor (SC), Facebook Comments (FC) and Video Transcoding (VT), all commonly used in regression tasks. In the experiments reported here, we use the original data sets, without any modification, to illustrate the fact that one can gain by treating real data sets as noisy.

data sets: 3

Interpretability: To illustrate how observations are linked to the different regions, the paper's figure shows, for OZ, DI and BD, the boxplots for all observations of the probability of belonging to the three most probable regions (K1* denotes the most probable region for any observation, K2* the second most probable region and K3* the third most probable region). For the three data sets and for most observations, roughly 75% of their distribution is concentrated on these three regions, which are by far the most important ones and can be used to provide a first explanation for the values predicted.

Adaptability to the underlying probability distribution: Another advantage of PR trees is that different distribution functions can be used to smooth the prediction (function φ in Eq. 3).

data sets (compared across different distributions): 7

The choice of φ can be made according to some a priori knowledge on the nature of the errors or by testing different distributions and selecting the best one on a validation set. We provide in the Supplementary Material the results of an experiment conducted on seven datasets with six different distributions (two variants of the Gamma distribution, the Laplace distribution, the Lognormal distribution, the Normal distribution and two variants of the Student distribution). As one can expect, the choice of the best distribution depends on the collection, the Student distribution with 3 degrees of freedom being, for example, particularly adapted to the AL, BO and DI collections
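
Selecting the smoothing distribution on a validation set can be sketched as follows. This is a deliberately reduced illustration, not the experiment reported in the Supplementary Material: it assumes a tree with a single split and fixed leaf values, compares only two candidates (Normal and Laplace), and keeps the one with the lowest validation RMSE. The function names, the default split, leaf values and scale are all hypothetical.

```python
import math


def normal_cdf(z):
    """CDF of the standard Normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def laplace_cdf(z):
    """CDF of the Laplace distribution with location 0 and scale 1."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def soft_step(x, split, left_value, right_value, scale, cdf):
    """Prediction of a one-split tree; P(left region) = cdf((split - x) / scale)."""
    p_left = cdf((split - x) / scale)
    return p_left * left_value + (1.0 - p_left) * right_value


def validation_rmse(cdf, xs, ys, split=0.0, left_value=-1.0, right_value=1.0, scale=0.5):
    """RMSE on validation pairs (xs, ys) of the one-split tree smoothed with `cdf`."""
    squared = [
        (soft_step(x, split, left_value, right_value, scale, cdf) - y) ** 2
        for x, y in zip(xs, ys)
    ]
    return math.sqrt(sum(squared) / len(squared))


def select_distribution(candidates, xs, ys):
    """Return the name of the candidate with the lowest validation RMSE, plus all scores."""
    scores = {name: validation_rmse(cdf, xs, ys) for name, cdf in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `select_distribution({"normal": normal_cdf, "laplace": laplace_cdf}, xs_val, ys_val)` on held-out pairs returns the better-scoring candidate. In a fuller setting, the tree would be refitted for each candidate distribution before scoring.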

observations: 100

Illustration: As an illustration, we consider a toy example based on Y = cos(X) + ε, with X ∼ U([0, 5]) and ε ∼ N(0, 0.05²). A training set of 100 observations is generated from this model.

Footnote 3: Any standard stopping criterion, such as tree depth or number of examples in a leaf, can be used here.
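
The toy model can be simulated with the standard library alone. The sketch below assumes that the noise term has standard deviation 0.05 (reading the extracted "0.052" as 0.05²); the function name `make_toy_data` and the fixed seed are illustrative choices, not part of the paper.

```python
import math
import random


def make_toy_data(n=100, noise_std=0.05, seed=0):
    """Draw n observations from Y = cos(X) + eps, X ~ U([0, 5]), eps ~ N(0, noise_std^2)."""
    rng = random.Random(seed)  # local generator so that the sample is reproducible
    xs = [rng.uniform(0.0, 5.0) for _ in range(n)]
    ys = [math.cos(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys
```

Such a sample is convenient for comparing a piecewise-constant tree with a smoothed one, because the underlying link function cos(X) is known and smooth.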

References

- [1] Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033.
- [2] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
- [3] Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and Regression Trees. Chapman & Hall, New York.
- [4] Chang, R. and Pavlidis, T. (1977). Fuzzy decision tree algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 7:28–35.
- [5] Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, pages 266–298.
- [6] da Rosa, J. C., Veiga, A., and Medeiros, M. C. (2008). Tree-structured smooth transition regression models. Computational Statistics and Data Analysis, 58:2469–2488.
- [7] Devore, R. and Ron, A. (2010). Approximation using scattered shifts of a multivariate function. Transactions of the American Mathematical Society, 362(12):6205–6229.
- [8] Elith, J., Leathwick, J. R., and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802–813.
- [9] Elouedi, Z., Mellouli, K., and Smets, P. (2001). Belief decision trees: theoretical foundations. International Journal of Approximate Reasoning, 28(2):91–124.
- [10] Fonseca, Y., Medeiros, M., Vasconcelos, G., and Veiga, A. (2019). BooST: Boosting smooth trees for partial effect estimation in nonlinear regressions. arXiv:1808.03698v1.
- [11] Friedman, J. H. (2000). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29:1189–1232.
- [12] Frosst, N. and Hinton, G. (2017). Distilling a neural network into a soft decision tree. In Proceedings of the First International Workshop on Comprehensibility and Explanation in AI and ML 2017, co-located with the 16th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2017).
- [13] Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
- [14] Irsoy, O., Yildiz, O. T., and Alpaydin, E. (2012). Soft decision trees. In International Conference on Pattern Recognition.
- [15] Janikow, C. Z. (1998). Fuzzy decision trees: issues and methods. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(1):1–14.
- [16] Jenhani, I., Amor, N. B., and Elouedi, Z. (2008). Decision trees as possibilistic classifiers. International Journal of Approximate Reasoning, 48(3):784–807. Special Section on Choquet Integration in honor of Gustave Choquet (1915–2006) and Special Section on Nonmonotonic and Uncertain Reasoning.
- [17] Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Comput., 6(2):181–214.
- [18] Kontschieder, P., Fiterau, M., Criminisi, A., and Bulò, S. R. (2016). Deep neural decision forests. In Kambhampati, S., editor, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 4190–4194. IJCAI/AAAI Press.
- [19] Liang, C., Zhang, Y., and Song, Q. (2010). Decision tree for dynamic and uncertain data streams. In Sugiyama, M. and Yang, Q., editors, Proceedings of 2nd Asian Conference on Machine Learning, volume 13 of Proceedings of Machine Learning Research, pages 209–224, Tokyo, Japan. PMLR.
- [20] Linero, A. R. and Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1087–1110.
- [21] Ma, L., Destercke, S., and Wang, Y. (2016). Online active learning of decision trees with evidential data. Pattern Recogn., 52(C):33–45.
- [22] Olaru, C. and Wehenkel, L. (1999). A complete fuzzy decision tree technique. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:1297–1311.
- [23] Qin, B., Xia, Y., and Li, F. (2009). DTU: A decision tree for uncertain data. In Theeramunkong, T., Kijsirikul, B., Cercone, N., and Ho, T.-B., editors, Advances in Knowledge Discovery and Data Mining, pages 4–15, Berlin, Heidelberg. Springer Berlin Heidelberg.
- [24] Schaback, R. (1995). Multivariate interpolation and approximation by translates of a basis function. Series In Approximations and Decompositions 6, pages 491–514.
- [25] Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests. The Annals of Statistics, 43(4):1716–1741.
- [26] Suarez, A. and Lutsko, F. (2003). Globally fuzzy decision trees for classification and regression. Fuzzy Sets and Systems, 138:221–254.
- [27] Tanno, R., Arulkumaran, K., Alexander, D., Criminisi, A., and Nori, A. (2019). Adaptive neural trees. In volume 97 of Proceedings of Machine Learning Research, pages 6166–6175, Long Beach, California, USA. PMLR.
- [28] Tsang, S., Kao, B., Yip, K. Y., Ho, W.-S., and Lee, S. D. (2009). Decision trees for uncertain data. IEEE Transactions on Knowledge and Data Engineering, 23(1):64–78.
- [29] Varoquaux, G., Buitinck, L., Louppe, G., Grisel, O., Pedregosa, F., and Mueller, A. (2015). Scikit-learn: Machine learning without learning the machinery. GetMobile: Mobile Computing and Communications, 19(1):29–33.
- [30] Yuan, Y. and Shaw, M. J. (1995). Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69(2):125–139.
