We prove that Group Lasso+AGL is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units
Consistent feature selection for analytic deep neural networks
NeurIPS 2020 (2020)
One of the most important steps toward interpretability and explainability of neural network models is feature selection, which aims to identify the subset of relevant features. Theoretical results in the field have mostly focused on the prediction aspect of the problem, with virtually no work on feature selection consistency for deep neural networks.
- Neural networks have become one of the most popular models for learning systems for their strong approximation properties and superior predictive performance.
- Feature selection with analytic deep neural networks
- Given an input x that belongs to a bounded open set X ⊂ R^{d0}, the output map fα(x) of an L-layer neural network with parameters α = (P, p, S, Q, q) is defined layer by layer; the network interacts with x only through a finite set of linear units, and its activation functions are analytic.
- One of the most important steps toward model interpretability is feature selection, which aims to identify the subset of relevant features with respect to an outcome
- We argue that the results can be extended to general analytic deep networks
- We prove that Group Lasso followed by Adaptive Group Lasso (GL+AGL) is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units
- To the best of our knowledge, this is the first work that establishes selection consistency for deep networks. This is in contrast to Dinh and Ho, Liang et al., and Ye and Sun, which only provide results for shallow networks with one hidden layer, or to Polson and Rocková, Feng and Simon, and Liu, which focus on posterior concentration, prediction consistency, parameter-estimation consistency, and convergence of feature importance
- While simulations seem to hint that Group Lasso may not be optimal for feature selection with neural networks, a rigorous answer to this hypothesis requires a deeper understanding of the estimator's behavior that is beyond the scope of this paper
- This framework allows interactions across layers of the network architecture and only requires that (i) the model interacts with inputs through a finite set of linear units, and (ii) the activation functions are analytic.
- One popular method for feature selection with neural networks is the Group Lasso (GL).
- The authors note that Lemhadri et al. identify 11 of the original predictors along with 2 random predictors as the optimal set of features for prediction, which is consistent with the performance of GL+AGL.
- The authors prove that GL+AGL is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units.
- Both theoretical and simulation results of the work advocate the use of GL+AGL over the popular Group Lasso for feature selection.
- For the linear model, the Lasso is parameter-estimation consistent when the regularizing parameter λn → 0 (Zou, Lemmas 2 and 3), but it is not feature-selection consistent for λn ∼ n^{−1/2} (Zou, Proposition 1), or for any choice of λn when a necessary condition on the covariance matrix is not satisfied (Zou, Theorem 1).
- For both linear model and neural network, parameter-estimation consistency directly implies prediction consistency and convergence of feature importance.
- Existing results of this type for neural networks address the prediction aspect of the problem (Feng and Simon, Farrell et al.), and it would be of general interest to see how analyses of selection consistency apply in those cases.
- To the best of our knowledge, this is the first work that establishes feature selection consistency, an important cornerstone of interpretable statistical inference, for deep learning.
- Ye and Sun take Assumption 6 as given, while the authors avoid this assumption by using Lemma 3.2
- LSTH was supported by startup funds from Dalhousie University, the Canada Research Chairs program, and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2018-05447
- VD was supported by a startup fund from University of Delaware and National Science Foundation grant DMS-1951474
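Concretely, the GL+AGL procedure amounts to solving two group-penalized problems, the second with per-feature penalty weights derived from the first-stage estimate. The sketch below illustrates the idea on a multi-response linear model, where each input feature's row of coefficients forms a group; in the paper the groups are instead the first-layer input weights of a deep network. The proximal-gradient solver, constants, and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_lasso_prox(W, thresholds):
    """Block soft-thresholding: shrink each feature's weight group (a row of W)."""
    W = W.copy()
    for j, t in enumerate(thresholds):
        norm = np.linalg.norm(W[j])
        W[j] = 0.0 if norm <= t else W[j] * (1.0 - t / norm)
    return W

def fit_group_lasso(X, Y, lam_per_group, n_iter=500):
    """Proximal gradient for multi-response least squares with a
    per-feature group penalty on the rows of W."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        W = group_lasso_prox(W - step * grad, step * np.asarray(lam_per_group))
    return W

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
X = rng.normal(size=(n, d))
W_true = np.zeros((d, k))
W_true[:3] = 1.0                      # only features 0, 1, 2 are relevant
Y = X @ W_true + 0.1 * rng.normal(size=(n, k))

lam = 0.1
W_gl = fit_group_lasso(X, Y, np.full(d, lam))                 # stage 1: GL
norms = np.linalg.norm(W_gl, axis=1)
W_agl = fit_group_lasso(X, Y, lam / np.maximum(norms, 1e-8))  # stage 2: AGL
selected = np.linalg.norm(W_agl, axis=1) > 1e-6
print(np.flatnonzero(selected))       # indices of selected features
```

The adaptive stage divides each group's penalty by the norm of its first-stage estimate, so groups that GL already drives near zero receive an effectively infinite penalty and are eliminated, while strong groups are barely shrunk; this reweighting is what underlies the selection-consistency result.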
Study subjects and analysis
The input consists of 50 features, 10 of which are significant while the others are rendered insignificant by setting the corresponding weights to zero. We generate 100 datasets of size n = 5000 from the generic model Y = fα∗(X) + ε, where ε ∼ N(0, 1) and the non-zero weights of α∗ are sampled independently from N(0, 1). We perform GL and GL+AGL on each simulated dataset with regularizing constants chosen using average test errors from three random three-fold train-test splits
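One such simulated dataset can be generated in outline as follows. The architecture (a single tanh hidden layer of 20 units) and the standard-normal input distribution are illustrative assumptions, since the summary does not specify them; what the sketch does follow is the setup above, where features 10–49 are made insignificant by zeroing their first-layer weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_sig, n, h = 50, 10, 5000, 20   # h: assumed hidden width

# First-layer weights: zero columns render features 10..49 insignificant.
P = np.zeros((h, d))
P[:, :n_sig] = rng.normal(size=(h, n_sig))
p = rng.normal(size=h)
Q = rng.normal(size=h)

def f_star(X):
    """Assumed analytic network f_{alpha*}: tanh hidden layer, linear output."""
    return np.tanh(X @ P.T + p) @ Q

X = rng.normal(size=(n, d))          # assumed input distribution
Y = f_star(X) + rng.normal(size=n)   # Y = f_{alpha*}(X) + eps, eps ~ N(0, 1)
```

By construction, perturbing any of the insignificant features leaves f_{alpha*}(X), and hence the regression function, unchanged; a consistent procedure should therefore select exactly the first 10 features as n grows.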
Next, we apply the methods to the Boston housing dataset (http://lib.stat.cmu.edu/datasets/boston). This dataset consists of 506 observations of house prices and 13 predictors. To analyze the data, we consider a network with three hidden layers of 10 nodes each
GL identifies all 13 predictors as important, while GL+AGL only selects 11 of them. To further investigate the robustness of the results, we follow the approach of Lemhadri et al. and add 13 random Gaussian noise predictors to the original dataset for analysis. 100 such datasets are created to compare the performance of GL against GL+AGL using the same experimental setting as above. The results are presented in Figure 2, from which we observe that GL struggles to distinguish the random noise predictors from the true ones
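The robustness check above amounts to augmenting the design matrix with independent Gaussian noise columns before rerunning selection; a minimal sketch (the function name and the stand-in data are ours, not from the paper):

```python
import numpy as np

def add_noise_predictors(X, n_noise, rng):
    """Append n_noise independent N(0, 1) columns to the design matrix,
    mirroring the Lemhadri et al. robustness check."""
    return np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))        # stand-in for the 506 x 13 Boston predictors
X_aug = add_noise_predictors(X, 13, rng)
print(X_aug.shape)                    # original 13 columns plus 13 noise columns
```

A selection method that is robust in this sense should, across the 100 augmented datasets, keep (a subset of) the first 13 columns while discarding the appended noise columns.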
- Samuel Ainsworth, Nicholas Foti, Adrian KC Lee, and Emily Fox. Interpretable VAEs for nonlinear group factor analysis. arXiv preprint arXiv:1802.06765, 2018.
- Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
- An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910–927, 1993.
- Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, and Michael M Hoffman. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.
- Vu Dinh and Lam Ho. Consistent feature selection for neural networks via Adaptive Group Lasso. arXiv preprint arXiv:2006.00334, 2020.
- Hasan Fallahgoul, Vincentius Franstianto, and Gregoire Loeper. Towards explaining the ReLU feed-forward network. Available at SSRN, 2019.
- Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.
- Charles Fefferman and Scott Markel. Recovering a feed-forward net from its output. In Advances in Neural Information Processing Systems, pages 335–342, 1994.
- Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.
- Enguerrand Horel and Kay Giesecke. Towards Explainable AI: Significance tests for neural networks. arXiv preprint arXiv:1902.06021, 2019.
- Rania Ibrahim, Noha A Yousri, Mohamed A Ismail, and Nagwa M El-Makky. Multi-level gene/MiRNA feature selection using deep belief nets and active learning. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3957–3960. IEEE, 2014.
- Shanyu Ji, János Kollár, and Bernard Shiffman. A global Lojasiewicz inequality for algebraic varieties. Transactions of the American Mathematical Society, 329(2):813–818, 1992.
- Ismael Lemhadri, Feng Ruan, and Robert Tibshirani. A neural network with feature sparsity. arXiv preprint arXiv:1907.12207, 2019.
- Yifeng Li, Chih-Yu Chen, and Wyeth W Wasserman. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology, 23(5):322–336, 2016.
- Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
- Jeremiah Zhe Liu. Variable selection with rigorous uncertainty quantification using deep bayesian neural networks: Posterior concentration and bernstein-von mises phenomenon. arXiv preprint arXiv:1912.01189, 2019.
- Yang Lu, Yingying Fan, Jinchi Lv, and William Stafford Noble. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686, 2018.
- Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
- Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
- Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
- Milad Zafar Nezhad, Dongxiao Zhu, Xiangrui Li, Kai Yang, and Phillip Levy. SAFS: A deep feature selection approach for precision medicine. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 501–506. IEEE, 2016.
- Nicholas G Polson and Veronika Rocková. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941, 2018.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
- Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
- Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
- Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3145–3153, 2017.
- Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Aboozar Taherkhani, Georgina Cosma, and T Martin McGinnity. Deep-FS: A feature selection algorithm for Deep Boltzmann Machines. Neurocomputing, 322:22–37, 2018.
- Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural Granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.
- Hansheng Wang and Chenlei Leng. A note on Adaptive Group Lasso. Computational Statistics & Data Analysis, 52(12):5277–5286, 2008.
- Mao Ye and Yan Sun. Variable selection via penalized neural network: a drop-out-one loss approach. In International Conference on Machine Learning, pages 5620–5629, 2018.
- Cheng Zhang, Vu Dinh, and Frederick A Matsen IV. Non-bifurcating phylogenetic tree inference via the Adaptive Lasso. Journal of the American Statistical Association, 2018. arXiv preprint arXiv:1805.11073.
- Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
- Huaqing Zhang, Jian Wang, Zhanquan Sun, Jacek M Zurada, and Nikhil R Pal. Feature selection for neural networks using Group Lasso regularization. IEEE Transactions on Knowledge and Data Engineering, 2019.
- Lei Zhao, Qinghua Hu, and Wenwu Wang. Heterogeneous feature selection with multi-modal deep neural networks and Sparse Group Lasso. IEEE Transactions on Multimedia, 17(11):1936–1948, 2015.
- Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine learning Research, 7(Nov):2541–2563, 2006.
- Hui Zou. The Adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.