
Consistent feature selection for analytic deep neural networks

NeurIPS 2020 (2020)

Abstract

One of the most important steps toward interpretability and explainability of neural network models is feature selection, which aims to identify the subset of relevant features. Theoretical results in the field have mostly focused on the prediction aspect of the problem, with virtually no work on feature selection consistency for deep neural networks…
Introduction
  • Neural networks have become one of the most popular models for learning systems, owing to their strong approximation properties and superior predictive performance.
  • Feature selection with analytic deep neural networks: given an input x that belongs to a bounded open set X ⊂ R^{d_0}, the output map f_α(x) of an L-layer neural network with parameters α = (P, p, S, Q, q) is defined layer by layer (equation omitted).
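The summary does not reproduce the paper's exact parameterization α = (P, p, S, Q, q); for orientation only, a standard L-layer feedforward map with analytic activation σ looks like the following, where the W_ℓ and b_ℓ are generic stand-ins for the paper's parameters:

```latex
h_0 = x, \qquad
h_\ell = \sigma\left( W_\ell h_{\ell-1} + b_\ell \right), \quad \ell = 1, \dots, L-1, \qquad
f_\alpha(x) = W_L h_{L-1} + b_L .
```

The paper's framework additionally allows interactions across layers, so its actual map is more general than this chain composition.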
Highlights
  • In recent years, neural networks have become one of the most popular models for learning systems, owing to their strong approximation properties and superior predictive performance
  • One of the most important steps toward model interpretability is feature selection, which aims to identify the subset of relevant features with respect to an outcome
  • We argue that the results can be extended to general analytic deep networks
  • We prove that Group Lasso followed by Adaptive Group Lasso (GL+AGL) is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units
  • To the best of our knowledge, this is the first work that establishes selection consistency for deep networks. This is in contrast to Dinh and Ho [2020], Liang et al [2018] and Ye and Sun [2018], which only provide results for shallow networks with one hidden layer, or Polson and Rocková [2018], Feng and Simon [2017], Liu [2019], which focus on posterior concentration, prediction consistency, parameter-estimation consistency and convergence of feature importance
  • While simulations seem to hint that Group Lasso may not be optimal for feature selection with neural networks, a rigorous answer to this hypothesis requires a deeper understanding of the behavior of the estimator that is out of the scope of this paper
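As a sketch of the two penalties being compared, the snippet below contrasts a Group Lasso penalty over input-feature groups with its adaptive reweighting. The grouping by rows of the first-layer weight matrix and the exponent gamma are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def group_lasso_penalty(W1, lam):
    """Group Lasso penalty on the first-layer weight matrix W1 (d0 x d1).

    Each input feature j forms one group: the j-th row of W1. A feature
    is screened out when its entire row is shrunk to zero."""
    return lam * float(np.sum(np.linalg.norm(W1, axis=1)))

def adaptive_group_lasso_penalty(W1, W1_init, lam, gamma=1.0, eps=1e-12):
    """Adaptive Group Lasso penalty: groups are reweighted by the inverse
    norm of a first-stage estimate W1_init (e.g. a Group Lasso fit), so
    groups already near zero after stage one are penalized more heavily."""
    weights = 1.0 / (np.linalg.norm(W1_init, axis=1) ** gamma + eps)
    return lam * float(np.sum(weights * np.linalg.norm(W1, axis=1)))
```

The adaptive reweighting is what drives selection consistency in the paper's two-stage GL+AGL procedure: irrelevant groups get ever-larger penalties and are forced exactly to zero.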
Results
  • This framework allows interactions across layers of the network architecture and only requires that (i) the model interacts with inputs through a finite set of linear units, and (ii) the activation functions are analytic.
  • One popular method for feature selection with neural networks is the Group Lasso (GL).
  • The authors note that Lemhadri et al [2019] identifies 11 of the original predictors along with 2 random predictors as the optimal set of features for prediction, which is consistent with the performance of GL+AGL.
  • The authors prove that GL+AGL is feature-selection-consistent for all analytic deep networks that interact with inputs through a finite set of linear units.
  • Both theoretical and simulation results of the work advocate the use of GL+AGL over the popular Group Lasso for feature selection.
  • The Lasso is parameter-estimation consistent when the regularizing parameter λn → 0 (Zou [2006], Lemmas 2 and 3), but is not feature-selection-consistent for λn ∼ n−1/2 (Zou [2006], Proposition 1), nor for any choice of λn when a necessary condition on the covariance matrix is not satisfied (Zou [2006], Theorem 1).
  • For both linear models and neural networks, parameter-estimation consistency directly implies prediction consistency and convergence of feature importance.
  • While simulations seem to hint that Group Lasso may not be optimal for feature selection with neural networks, a rigorous answer to this hypothesis requires a deeper understanding of the behavior of the estimator that is out of the scope of this paper.
Conclusion
  • There have been existing results of this type for neural networks on the prediction aspect of the problem (Feng and Simon [2017], Farrell et al [2018]), and it would be of general interest to see how analyses of selection consistency apply in those cases.
  • To the best of our knowledge, this is the first work that establishes feature selection consistency, an important cornerstone of interpretable statistical inference, for deep learning.
  • Ye and Sun [2018] takes Assumption 6 as given, while the authors avoid this assumption using Lemma 3.2.
Funding
  • LSTH was supported by startup funds from Dalhousie University, the Canada Research Chairs program, and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2018-05447
  • VD was supported by a startup fund from University of Delaware and National Science Foundation grant DMS-1951474
Study subjects and analysis
datasets: 100
The input consists of 50 features, 10 of which are significant while the others are rendered insignificant by setting the corresponding weights to zero. We generate 100 datasets of size n = 5000 from the generic model Y = f_{α∗}(X) + ε, where ε ∼ N(0, 1) and the non-zero weights of α∗ are sampled independently from N(0, 1). We perform GL and GL+AGL on each simulated dataset, with regularizing constants chosen using average test errors from three random three-fold train-test splits
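The generating process above can be sketched as follows. The architecture here (a single tanh hidden layer of width 10) is an illustrative stand-in, since the summary does not specify the simulation network; only the dimensions (50 features, 10 significant, n = 5000) come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_significant, width = 50, 5000, 10, 10

# Non-zero weights only on the first 10 input features; the remaining 40
# features are rendered insignificant by zeroing their first-layer rows.
W1 = np.zeros((d, width))
W1[:n_significant] = rng.standard_normal((n_significant, width))
w2 = rng.standard_normal(width)

X = rng.standard_normal((n, d))
f = np.tanh(X @ W1) @ w2            # stands in for f_{alpha*}(X)
Y = f + rng.standard_normal(n)      # additive N(0, 1) noise
```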

observations: 506
Next, we apply the methods to the Boston housing dataset (http://lib.stat.cmu.edu/datasets/boston). This dataset consists of 506 observations of house prices and 13 predictors. To analyze the data, we consider a network with three hidden layers of 10 nodes

such datasets: 100
GL identifies all 13 predictors as important, while GL+AGL selects only 11 of them. To further investigate the robustness of the results, we follow the approach of Lemhadri et al [2019] and add 13 random Gaussian noise predictors to the original dataset. 100 such datasets are created to compare the performance of GL against GL+AGL under the same experimental setting as above. The results are presented in Figure 2, where we observe that GL struggles to distinguish the random noise predictors from the true predictors
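The noise-augmentation step can be sketched as below; the function name and seeding are illustrative, not from the paper:

```python
import numpy as np

def augment_with_noise(X, n_noise=13, seed=0):
    """Append n_noise independent standard-Gaussian columns to the design
    matrix, mirroring the robustness check of Lemhadri et al. [2019]."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_noise))
    return np.hstack([X, noise])
```

Repeating this with 100 different seeds gives 100 augmented datasets; a consistent selector should keep the original predictors and drop the appended noise columns.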

Reference
  • Samuel Ainsworth, Nicholas Foti, Adrian KC Lee, and Emily Fox. Interpretable VAEs for nonlinear group factor analysis. arXiv preprint arXiv:1802.06765, 2018.
  • Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
  • Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, and Michael M Hoffman. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.
  • Vu Dinh and Lam Ho. Consistent feature selection for neural networks via Adaptive Group Lasso. arXiv preprint arXiv:2006.00334, 2020.
  • Hasan Fallahgoul, Vincentius Franstianto, and Gregoire Loeper. Towards explaining the ReLU feed-forward network. Available at SSRN, 2019.
  • Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.
  • Charles Fefferman and Scott Markel. Recovering a feed-forward net from its output. In Advances in Neural Information Processing Systems, pages 335–342, 1994.
  • Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.
  • Enguerrand Horel and Kay Giesecke. Towards Explainable AI: Significance tests for neural networks. arXiv preprint arXiv:1902.06021, 2019.
  • Rania Ibrahim, Noha A Yousri, Mohamed A Ismail, and Nagwa M El-Makky. Multi-level gene/MiRNA feature selection using deep belief nets and active learning. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3957–3960. IEEE, 2014.
  • Shanyu Ji, János Kollár, and Bernard Shiffman. A global Łojasiewicz inequality for algebraic varieties. Transactions of the American Mathematical Society, 329(2):813–818, 1992.
  • Ismael Lemhadri, Feng Ruan, and Robert Tibshirani. A neural network with feature sparsity. arXiv preprint arXiv:1907.12207, 2019.
  • Yifeng Li, Chih-Yu Chen, and Wyeth W Wasserman. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology, 23(5):322–336, 2016.
  • Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
  • Jeremiah Zhe Liu. Variable selection with rigorous uncertainty quantification using deep Bayesian neural networks: posterior concentration and Bernstein-von Mises phenomenon. arXiv preprint arXiv:1912.01189, 2019.
  • Yang Lu, Yingying Fan, Jinchi Lv, and William Stafford Noble. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686, 2018.
  • Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
  • Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
  • Milad Zafar Nezhad, Dongxiao Zhu, Xiangrui Li, Kai Yang, and Phillip Levy. SAFS: A deep feature selection approach for precision medicine. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 501–506. IEEE, 2016.
  • Nicholas G Polson and Veronika Rocková. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems, pages 930–941, 2018.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  • Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
  • Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, pages 3145–3153, 2017.
  • Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Aboozar Taherkhani, Georgina Cosma, and T Martin McGinnity. Deep-FS: A feature selection algorithm for Deep Boltzmann Machines. Neurocomputing, 322:22–37, 2018.
  • Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural Granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.
  • Hansheng Wang and Chenlei Leng. A note on Adaptive Group Lasso. Computational Statistics & Data Analysis, 52(12):5277–5286, 2008.
  • Mao Ye and Yan Sun. Variable selection via penalized neural network: a drop-out-one loss approach. In International Conference on Machine Learning, pages 5620–5629, 2018.
  • Cheng Zhang, Vu Dinh, and Frederick A Matsen IV. Non-bifurcating phylogenetic tree inference via the Adaptive Lasso. Journal of the American Statistical Association; arXiv preprint arXiv:1805.11073, 2018.
  • Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
  • Huaqing Zhang, Jian Wang, Zhanquan Sun, Jacek M Zurada, and Nikhil R Pal. Feature selection for neural networks using Group Lasso regularization. IEEE Transactions on Knowledge and Data Engineering, 2019.
  • Lei Zhao, Qinghua Hu, and Wenwu Wang. Heterogeneous feature selection with multi-modal deep neural networks and Sparse Group Lasso. IEEE Transactions on Multimedia, 17(11):1936–1948, 2015.
  • Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Hui Zou. The Adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Author
Vu Dinh
Lam Ho