# TREX: Tree-Ensemble Representer-Point Explanations

Abstract

How can we identify the training examples that contribute most to the prediction of a tree ensemble? In this paper, we introduce TREX, an explanation system that provides instance-attribution explanations for tree ensembles, such as random forests and gradient boosted trees. TREX builds on the representer point framework previously developed for deep neural networks (Yeh et al 2018).

Introduction

- Tree ensembles, including random forests (Breiman 2001) and gradient boosted trees (Friedman 2001), remain one of the most effective approaches to classification.
- Instance-Attribution Explanations: A naive but intractable approach to evaluating the impact of each individual training sample on a given prediction is leave-one-out retraining (Belsley, Kuh, and Welsch 2005) for every training instance.
- To avoid this problem, Koh and Liang (2017) derived an approximation of the influence functions framework from classical statistics (Cook and Weisberg 1980) for deep learning models; their approach allows one to compute the influence of each training instance on a given model prediction without having to retrain the model from scratch.
- Since computing influence functions can still be quite expensive, Yeh et al (2018) introduced the representer point framework for deep neural networks, a method based on the representer theorem (Scholkopf, Herbrich, and Smola 2001) that efficiently decomposes the pre-activation predictions of a neural network into a linear combination of the training samples.
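As a rough sketch of the representer decomposition described above (our notation, not necessarily the paper's exact formulation), the pre-activation prediction for a test instance $x_t$ is written as a weighted sum over the $n$ training instances:

```latex
\hat{y}(x_t) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x_t)
```

Here $\alpha_i$ acts as a global importance weight for training instance $x_i$ and $k$ is a similarity kernel; the training instances with the largest $|\alpha_i \, k(x_i, x_t)|$ are the "representer points" that explain the prediction at $x_t$.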

Highlights

- Tree ensembles, including random forests (Breiman 2001) and gradient boosted trees (Friedman 2001), remain one of the most effective approaches to classification
- We introduce TREX, a method for generating global and local explanations for tree ensembles by using a kernelized surrogate model within the representer point framework
- We evaluate TREX on four benchmark datasets, demonstrating that: (1) the surrogate models accurately approximate the original tree ensembles; (2) in a data cleaning setting, TREX identifies noisy instances better than the previous state-of-the-art; (3) in a remove-and-retrain setting, TREX identifies the training examples most important for good performance; (4) TREX is orders of magnitude faster than other methods; (5) TREX’s local explanations can identify and explain errors due to domain mismatch
- Prototype selection methods do not provide any information about which training examples are most influential for a given model prediction (i.e., local explanations); to address this issue, we turn to instance-attribution explanations
- In this work we have developed TREX, a method of explaining tree ensemble predictions via the training data
- We demonstrated that this model is capable of closely approximating the predictive behavior of the tree ensemble; we see similar results with TREX-SVM (support vector machines), but focus on a smaller number of points due to the sparse SVM solution

Methods

- The authors' experiments use CatBoost (Prokhorenkova et al 2018), an open source implementation of gradient boosted trees.
- When training TREX with logistic regression, the authors use liblinear (Fan et al 2008) on the tree-ensemble feature representation to solve the L2 regularized dual problem from equation (2).
- For SVMs, the authors use the SVC solver from scikit-learn (Pedregosa et al 2011) to solve equation (3).
- Source code for TREX and all experiments is available at https://github.com/jjbrophy47/trex
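To make the surrogate idea concrete, here is a minimal sketch of training a linear model on a tree-ensemble feature representation. The one-hot leaf-assignment encoding follows the paper's general approach, but the specific choices here — scikit-learn's `GradientBoostingClassifier` standing in for CatBoost, a plain `LogisticRegression` standing in for the liblinear dual solver, and a synthetic dataset — are our assumptions, not the authors' exact pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification task standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Tree-ensemble feature representation: for each instance, the index of
# the leaf it reaches in every tree (shape: n_samples x n_estimators),
# one-hot encoded so each leaf becomes a binary feature.
leaves = gbdt.apply(X)[:, :, 0]
X_leaf = OneHotEncoder().fit_transform(leaves)

# Kernelized surrogate: an L2-regularized logistic regression on the
# leaf features approximates the ensemble's predictive behavior.
surrogate = LogisticRegression(max_iter=1000).fit(X_leaf, y)

agreement = np.mean(surrogate.predict(X_leaf) == gbdt.predict(X))
print(f"surrogate/ensemble agreement: {agreement:.2f}")
```

Because the leaf features describe exactly how the ensemble partitions the input space, the surrogate typically agrees with the ensemble on the vast majority of instances.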

Results

- Inspired by a recent approach that measures explanation quality for feature-attribution techniques, the authors adapt the ROAR (RemOve And Retrain) framework (Hooker et al 2019) from measuring feature importance to measuring the influence of training samples on a set of model predictions.
- In this experiment, each method generates and aggregates a set of instance-attribution explanations for a randomly selected set of n = 50 test instances, and orders the training data from most positively influential to most negatively influential.
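The remove-and-retrain loop described above can be sketched as follows. The dataset, model, and the influence ordering (a random permutation standing in for a method's actual influence scores) are our placeholders; only the loop structure mirrors the evaluation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, y_tr, X_te, y_te = X[:500], y[:500], X[500:], y[500:]

# Stand-in for a method's ordering of training instances from most
# positively influential to most negatively influential.
order = np.random.RandomState(0).permutation(len(X_tr))

accs = []
for frac in (0.0, 0.1, 0.25, 0.5):
    # Remove the top-`frac` most influential instances, then retrain.
    keep = order[int(frac * len(order)):]
    model = GradientBoostingClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])
    accs.append(accuracy_score(y_te, model.predict(X_te)))
print(accs)
```

A sharper accuracy drop as more of a method's top-ranked instances are removed indicates that the method identified genuinely influential training examples.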

Conclusion

- A future direction could include an in-depth investigation into the robustness of TREX and other instance-attribution methods to adversarial perturbations; having robust explanations is especially useful if the authors want predictive modeling to become more widely adopted in certain domains.

Since TREX provides explanations from the perspective of influential training samples, it can be combined with other explanation methods such as feature-attribution explanations to provide the most comprehensive view of all the elements that contribute towards a certain prediction.

Overall, understanding how individual predictions are made can affect all levels of the machine learning pipeline.

- The authors extended the representer point framework (Yeh et al 2018) to work for non-differentiable tree ensembles by exploiting the tree-ensemble structure to create a new tree-based kernel, from which the authors can train a kernelized model
- The authors demonstrated that this model is capable of closely approximating the predictive behavior of the tree ensemble and can be used to aid in dataset debugging and to better understand model behavior; TREX is significantly faster than alternative methods in terms of setup and explanation costs
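One natural tree-based kernel consistent with this description (our sketch, not necessarily the paper's exact definition) scores two instances by the fraction of trees in which they reach the same leaf:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Leaf index reached by each instance in each tree (n_samples x n_trees).
leaves = forest.apply(X)

def tree_kernel(i, j):
    """Fraction of trees in which instances i and j land in the same leaf."""
    return float(np.mean(leaves[i] == leaves[j]))

print(tree_kernel(0, 0), tree_kernel(0, 1))
```

An instance always shares every leaf with itself, so the kernel is 1.0 on the diagonal; between distinct instances it measures how often the ensemble treats them as similar.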

Summary

## Objectives:

The authors' goal is to find a function f : X → {−1, +1} that maps each instance to either the positive (+1) or negative (−1) class.

- Table1: Test Accuracy of GBDT vs. Interpretable Models
- Table2: Time (in seconds) to compute the impact of all training instances on the prediction of a single test instance. MAPLE did not finish (DNF) training on the Census dataset after running for 12 hours, so the test time for that dataset is not applicable (N/A). Each experiment is repeated 5 times to obtain average runtimes, but standard deviations are omitted for clarity and can be found in Table 3 in § A.1 of the appendix
- Table3: Time (in seconds) to compute the impact of all training instances on a single test instance. MAPLE did not finish (DNF) training on the Census dataset after running for 12 hours, so the test time for that dataset is not applicable (N/A). Each experiment is repeated 5 times to obtain average runtimes and standard deviations (S.D.)
- Table4: Hyperparameters used for the CatBoost GBDT model

Funding

- We observe that each dataset is quite robust to the deletion of training instances, maintaining relatively high accuracy even as 90% of the training data is randomly removed (Figure 4)

Study subjects and analysis

data: 7

Our experiments show that TREX's surrogate model accurately approximates the tree ensemble; its global importance weights are more effective in dataset debugging than the previous state-of-the-art; its explanations identify the most influential samples better than alternative methods under the remove and retrain evaluation framework; it runs orders of magnitude faster than alternative methods; and its local explanations can identify and explain errors due to domain mismatch. We conduct a range of experiments to evaluate the effectiveness of TREX in a variety of contexts.

Datasets and Framework Our evaluation uses the following datasets, several of which are the same as in previous work (Sharchilev et al 2018): Churn (n = 7,043, d = 19) (Kaggle 2018), which tracks customer retention; Amazon (n = 32,769, d = 9) (Kaggle 2013), where the task is to predict employee access for certain tasks; Adult (n = 48,842, d = 14) (Dua and Graff 2019), a dataset containing information about personal incomes; and Census (n = 299,285, d = 41) (Dua and Graff 2019), a population survey dataset collected by the U.S. Census Bureau. The Churn dataset does not have a predefined train/test split, so we randomly select 20% to use as a test set.
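A 20% random hold-out like the one used for Churn can be reproduced with a standard split (the placeholder data and the random seed are our choices, not the authors'):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays matching Churn's dimensions (n = 7,043, d = 19).
X = np.random.RandomState(0).rand(7043, 19)
y = np.random.RandomState(1).randint(0, 2, size=7043)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))
```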

Our experiments use CatBoost (Prokhorenkova et al 2018), an open source implementation of gradient boosted trees.


benchmark datasets: 4

We evaluate TREX on four benchmark datasets, demonstrating that: (1) the surrogate models accurately approximate the original tree ensembles; (2) in a data cleaning setting, TREX identifies noisy instances better than the previous state-of-the-art; (3) in a remove-and-retrain setting, TREX identifies the training examples most important for good performance; (4) TREX is orders of magnitude faster than other methods; (5) TREX’s local explanations can identify and explain errors due to domain mismatch. There are a number of works in the interpretable machine learning literature (Adadi and Berrada 2018; Miller 2019) that analyze the importance of different features from a global model perspective (Kazemitabar et al 2017) as well as a local perspective on specific model predictions; these typically pertain to feature-attribution methods (Saabas 2014; Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Lundberg, Erion, and Lee 2018) or counterfactuals (Guidotti et al 2019)


people: 395

To evaluate the utility of TREX in explaining individual predictions, we created a domain mismatch within the Adult dataset. In the original training set, all 395 people under the age of 18 are labeled as negative, indicating that they make less than or equal to $50,000 per year. We reduced that set to 98 people and flipped 83 of the labels, so that 83 out of 98 17-year-olds in the training set are positive

people: 98

In the original training set, all 395 people under the age of 18 are labeled as negative, indicating that they make less than or equal to $50,000 per year. We reduced that set to 98 people and flipped 83 of the labels, so that 83 out of 98 17-year-olds in the training set are positive. This inevitably caused incorrect predictions in the test set, where 17-year-olds were predicted to have incomes over $50,000 per year
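The label-flipping construction above can be sketched as follows. The synthetic array standing in for the Adult training set is ours; only the counts (395 minors reduced to 98, with 83 labels flipped) come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Adult training set: 395 minors, all
# initially labeled negative (income <= $50,000), plus 5,000 adults.
age = np.concatenate([np.full(395, 17), rng.integers(18, 80, size=5000)])
y = np.where(age < 18, -1, rng.choice([-1, 1], size=age.size))

minors = np.flatnonzero(age < 18)
keep = rng.choice(minors, size=98, replace=False)   # reduce 395 minors to 98
flip = rng.choice(keep, size=83, replace=False)     # flip 83 of their labels
y[flip] = 1

# Drop the 297 minors that were not kept.
mask = np.ones(age.size, dtype=bool)
mask[np.setdiff1d(minors, keep)] = False
age, y = age[mask], y[mask]

print((age < 18).sum(), (y[age < 18] == 1).sum())
```

After this manipulation, 83 of the 98 remaining minors are positive, which is enough to push a model trained on the corrupted data toward predicting high incomes for 17-year-olds at test time.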

Reference

- Adadi, A.; and Berrada, M. 2018. Peeking inside the blackbox: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 6: 52138–52160.
- Belsley, D. A.; Kuh, E.; and Welsch, R. E. 2005. Regression diagnostics: Identifying influential data and sources of collinearity, volume 571. John Wiley & Sons.
- Bien, J.; and Tibshirani, R. 2011. Prototype selection for interpretable classification. The Annals of Applied Statistics 2403–2424.
- Bloniarz, A.; Talwalkar, A.; Yu, B.; and Wu, C. 2016. Supervised neighborhoods for distributed nonparametric regression. In Artificial Intelligence and Statistics, 1450–1459.
- Breiman, L. 2001. Random forests. Machine Learning 45(1): 5–32.
- Chen, T.; and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. ACM.
- Cook, R. D.; and Weisberg, S. 1980. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4): 495–508.
- Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3): 273–297.
- Davies, A.; and Ghahramani, Z. 2014. The random forest kernel and other kernels for big data from random partitions. arXiv preprint arXiv:1402.4293.
- Dua, D.; and Graff, C. 2019. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
- Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9(Aug): 1871–1874.
- Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 1189–1232.
- Guidotti, R.; Monreale, A.; Giannotti, F.; Pedreschi, D.; Ruggieri, S.; and Turini, F. 2019. Factual and counterfactual explanations for black box decision making. IEEE Intelligent Systems 34(6): 14–23.
- Gurumoorthy, K. S.; Dhurandhar, A.; Cecchi, G.; and Aggarwal, C. 2019. Efficient Data Representation by Selecting Prototypes with Importance Weights. In 2019 IEEE International Conference on Data Mining (ICDM), 260–269. IEEE.
- He, X.; Pan, J.; Jin, O.; Xu, T.; Liu, B.; Xu, T.; Shi, Y.; Atallah, A.; Herbrich, R.; Bowers, S.; et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, 1–9. ACM.
- Hooker, S.; Erhan, D.; Kindermans, P.-J.; and Kim, B. 2019. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, 9734–9745.
- Kaggle. 2013. Amazon.com - Employee Access Challenge. https://www.kaggle.com/c/amazon-employee-access-challenge/data. [Online; accessed 28-April-2020].
- Kaggle. 2018. Dataset Surgical binary classification. https://www.kaggle.com/omnamahshivai/surgical-dataset-binary-classification. [Online; accessed 16-April-2020].
- Kazemitabar, J.; Amini, A.; Bloniarz, A.; and Talwalkar, A. S. 2017. Variable importance using decision trees. In Advances in Neural Information Processing Systems, 426–435. Curran Associates, Inc.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 3146–3154. Curran Associates, Inc.
- Kim, B.; Khanna, R.; and Koyejo, O. O. 2016. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, 2280– 2288.
- Koh, P. W.; and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, 1885– 1894. JMLR.
- Lundberg, S. M.; Erion, G. G.; and Lee, S.-I. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
- Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774.
- Miller, T. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267: 1–38.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct): 2825–2830.
- Plumb, G.; Molitor, D.; and Talwalkar, A. S. 2018. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, 2515–2524. Curran Associates, Inc.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A. V.; and Gulin, A. 2018. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, 6638–6648. Curran Associates, Inc.
- Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.
- Saabas, A. 2014. Interpreting Random Forests. https://blog.datadive.net/interpreting-random-forests/. [Online; accessed 31-August-2020].
- Scholkopf, B.; Herbrich, R.; and Smola, A. J. 2001. A generalized representer theorem. In International Conference on Computational Learning Theory, 416–426. Springer.
- Sharchilev, B.; Ustinovskiy, Y.; Serdyukov, P.; and de Rijke, M. 2018. Finding Influential Training Samples for Gradient Boosted Decision Trees. In Proceedings of the 35th International Conference on Machine Learning, 4577–4585. PMLR.
- Tan, S.; Soloviev, M.; Hooker, G.; and Wells, M. T. 2020. Tree space prototypes: Another look at making tree ensembles interpretable. In ACM-IMS Foundations of Data Science. ACM.
- Yeh, C.-K.; Kim, J.; Yen, I. E.-H.; and Ravikumar, P. K. 2018. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems, 9291–9301. Curran Associates, Inc.
- Yu, H.-F.; Huang, F.-L.; and Lin, C.-J. 2011. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning 85(1-2): 41–75.
