# Neural Anisotropy Directions

NeurIPS 2020.

Abstract:

In this work, we analyze the role of the network architecture in shaping the inductive bias of deep classifiers. To that end, we start by focusing on a very simple problem, i.e., classifying a class of linearly separable distributions, and show that, depending on the direction of the discriminative feature of the distribution, many stat…

Introduction

- Given a finite set of samples, there are usually multiple solutions that can perfectly fit the training data, but the inductive bias of a learning algorithm selects and prioritizes those solutions that agree with its a priori assumptions [1, 2].
- The authors show that depending on the nature of the dataset, some deep neural networks can only perform well when the discriminative information of the data is aligned with certain directions of the input space.

Highlights

- In machine learning, given a finite set of samples, there are usually multiple solutions that can perfectly fit the training data, but the inductive bias of a learning algorithm selects and prioritizes those solutions that agree with its a priori assumptions [1, 2]
- The neural anisotropy directions (NADs) of a specific architecture are the ordered set of orthonormal vectors {u_i}_{i=1}^D ranked in terms of the preference of the network to separate the data in those particular directions of the input space
- We show that the importance of NADs is not limited to linearly separable tasks, and that they determine the selection of discriminative features of convolutional neural networks (CNNs) trained on CIFAR-10
- In this paper we described a new type of model-driven inductive bias that controls generalization in deep neural networks: the directional inductive bias
- We showed that this bias is encoded by an orthonormal set of vectors for each architecture, which we coined the NADs, and that these characterize the selection of discriminative features used by CNNs to separate a training set
- In [12], researchers highlighted that a neural network memorizes a dataset when it has no discriminative information

Results

- This is illustrated in Fig. 1 for state-of-the-art CNNs classifying a set of linearly separable distributions with a single discriminative feature lying in the direction of some Fourier basis vector.
- The neural anisotropy directions (NADs) of a specific architecture are the ordered set of orthonormal vectors {u_i}_{i=1}^D ranked in terms of the preference of the network to separate the data in those particular directions of the input space.
- The authors will first show that measuring the test accuracy of a network on different versions of a linearly separable dataset can reveal its directional inductive bias towards specific directions of the input space.
- Even if there is no information loss, the dependency of the optimization landscape on the discriminative direction can cause a network to show some bias towards the solutions that are better conditioned.
- The choice of the Fourier basis so far was almost arbitrary, and there is no reason to expect that it will capture the full directional inductive bias of all CNNs. NADs characterize this general bias, but identifying them by measuring the performance of a neural network on many linearly separable datasets parameterized by a random v would be extremely inefficient.
- The authors showed that this bias is encoded by an orthonormal set of vectors for each architecture, which the authors coined the NADs, and that these characterize the selection of discriminative features used by CNNs to separate a training set.
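As a rough illustration of the probing procedure described above, the sketch below (with `fourier_direction` as a hypothetical helper, assuming a 32×32 single-channel input) builds the unit vector in pixel space corresponding to the real part of one 2D-DFT basis element; scanning over all frequency pairs (k, l) and recording the test accuracy of a network trained on each resulting D(v) yields heatmaps like those in Fig. 1:

```python
import numpy as np

def fourier_direction(k, l, size=32):
    """Unit vector in pixel space aligned with the real part of the
    (k, l) basis element of the 2D-DFT (hypothetical helper)."""
    F = np.zeros((size, size), dtype=complex)
    F[k, l] = 1.0
    v = np.fft.ifft2(F).real.ravel()  # cosine at spatial frequency (k, l)
    return v / np.linalg.norm(v)

# Distinct (non-conjugate) frequency pairs give orthogonal directions,
# so each probe dataset exercises a different direction of input space.
v1, v2 = fourier_direction(1, 2), fourier_direction(4, 7)
print(abs(v1 @ v2))  # ≈ 0
```

Each pixel in the Fig. 1 heatmaps would then correspond to one such direction, with the trained network's test accuracy as the pixel value.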

Conclusion

- The authors believe that these findings can impact future research in designing better architectures and AutoML [33], paving the way for better aligning the inductive biases of deep networks with a priori structures in real data.
- In this work the authors reveal the directional inductive bias of deep learning and describe its role in controlling the type of functions that neural networks can learn.

Related work

- The inductive bias of deep learning has been extensively studied in the past. From a theoretical point of view, this has mainly concerned the analysis of the implicit bias of gradient descent [6, 12,13,14,15], the stability of convolutional networks to image transformations [7, 16], or the impossibility of learning certain combinatorial functions [17, 18]. On the practical side, the myriad of works that propose new architectures are typically motivated by some informal intuition on their effect on the inductive bias of the network [8,9,10,11, 19,20,21]. Although little attention is generally given to properly quantifying these intuitions, some works have recently analyzed the role of architecture in the translation equivariance of modern CNNs [22,23,24].

We will first show that the test accuracy on different versions of a linearly separable distribution can reveal the directional inductive bias of a network towards specific directions. In this sense, let D(v) be a linearly separable distribution parameterized by a unit vector v ∈ S^{D−1}, such that any sample (x, y) ∼ D(v) satisfies x = yv + w, with noise w ∼ N(0, σ²(I_D − vvᵀ)) orthogonal to the direction v, and y sampled from {−1, +1} with equal probability. Although D(v) is linearly separable along v, note that if σ is large the noise will dominate the energy of the samples, making it hard for a classifier to identify the generalizing information in a finite-sample dataset.
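A minimal numpy sketch of sampling from D(v) makes the construction concrete (an arbitrary random unit vector stands in for the discriminative direction v). Because the noise w is projected onto the orthogonal complement of v, the projection ⟨v, x⟩ recovers the label exactly; the difficulty for a learner is finding v among D noisy dimensions from finitely many samples:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 64, 1000, 3.0

# Arbitrary unit vector standing in for the discriminative direction v.
v = rng.standard_normal(D)
v /= np.linalg.norm(v)

# Sample (x, y) ~ D(v): x = y v + w, with Gaussian noise w projected onto
# the orthogonal complement of v, i.e. covariance sigma^2 (I_D - v v^T).
y = rng.choice([-1.0, 1.0], size=n)
w = sigma * rng.standard_normal((n, D))
w -= np.outer(w @ v, v)        # remove the component of the noise along v
x = y[:, None] * v + w

# The noise is orthogonal to v, so projecting onto v recovers the label:
print(np.allclose(x @ v, y))   # True
```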

Funding

- This work has been partially supported by the CHIST-ERA program under Swiss NSF Grant 20CH21_180444, and partially by Google via a Postdoctoral Fellowship and a GCP Research Credit Award

Study subjects and analysis

training samples: 10000

Test accuracies of different architectures [8,9,10,11]. Each pixel corresponds to a linearly separable dataset (10,000 training samples) with a single discriminative feature aligned with a basis element of the 2D-DFT. We use the standard 2D-DFT convention: datasets with lower discriminative frequencies are placed at the center of the image, and higher ones extend radially towards the corners. All networks (except LeNet) achieve nearly 100% train accuracy (σ = 3, = 1). Also shown: training iterations required to achieve a small training loss on different D(v_i) aligned with some Fourier basis vectors (σ = 0.5).

Imaginary part of the DFT. Test accuracies using different training sets drawn from D(v) ( = 1, with 10,000 training samples and 10,000 test samples) for different levels of σ. Directions v are taken from the basis elements of the 2D-DFT; each pixel corresponds to a linearly separable dataset. Also shown: test accuracy of two CNNs trained using different training sets drawn from D(v) ( = 1, σ = 3) with orthogonal random v.

Reference

- T. M. Mitchell, “The Need for Biases in Learning Generalizations,” tech. rep., Rutgers University, 1980.
- P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu, “Relational inductive biases, deep learning, and graph networks,” arXiv:1806.01261, Oct. 2018.
- J. Deng, W. Dong, R. Socher, L. J. Li, L. Kai, and F. F. Li, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, IEEE, June 2009.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” in International Conference on Learning Representations (ICLR), May 2019.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
- N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the Spectral Bias of Neural Networks,” in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 5301–5310, PMLR, June 2019.
- S. Mallat, “Understanding deep convolutional networks,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, Apr. 2016.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for Large-Scale Image Recognition,” in International Conference on Learning Representations, (ICLR), May 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, IEEE, June 2016.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261– 2269, IEEE, July 2017.
- C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, (ICLR), Apr. 2017.
- D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The Implicit Bias of Gradient Descent on Separable Data,” in International Conference on Learning Representations (ICLR), 2018.
- S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, “Characterizing Implicit Bias in Terms of Optimization Geometry,” in Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1832–1841, PMLR, July 2018.
- P. Chaudhari and S. Soatto, “Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks,” in International Conference on Learning Representations (ICLR), 2018.
- A. Bietti and J. Mairal, “On the Inductive Bias of Neural Tangent Kernels,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 12893–12904, Curran Associates, Inc., May 2019.
- M. Nye and A. Saxe, “Are Efficient Deep Representations Learnable?,” in International Conference on Learning Representations, (ICLR), May 2018.
- E. Abbe and C. Sandon, “Provable limitations of deep learning,” arXiv:1812.06369, Apr. 2019.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, Curran Associates, Inc., Dec. 2017.
- M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 3844–3852, Curran Associates, Inc., Dec. 2016.
- S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, PMLR, July 2015.
- S. d'Ascoli, L. Sagun, G. Biroli, and J. Bruna, “Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias,” in Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 9334–9345, Curran Associates, Inc., 2019.
- R. Zhang, “Making Convolutional Networks Shift-Invariant Again,” in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 7324–7334, PMLR, June 2019.
- J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship between self-attention and convolutional layers,” in International Conference on Learning Representations (ICLR), Apr. 2020.
- D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer, “A Fourier Perspective on Model Robustness in Computer Vision,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 13255–13265, Curran Associates, Inc., Dec. 2019.
- H. Wang, X. Wu, Z. Huang, and E. P. Xing, “High Frequency Component Helps Explain the Generalization of Convolutional Neural Networks,” arXiv:1905.13545, May 2019.
- G. Ortiz-Jimenez, A. Modas, S.-M. Moosavi-Dezfooli, and P. Frossard, “Hold me tight! Influence of discriminative features on deep network boundaries,” arXiv:2002.06349, Feb. 2020.
- B. Ghorbani, S. Krishnan, and Y. Xiao, “An Investigation into Neural Net Optimization via Hessian Eigenvalue Density,” in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 2232–2241, PMLR, June 2019.
- B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, “Exponential expressivity in deep neural networks through transient chaos,” in Advances in Neural Information Processing Systems 29, pp. 3360–3368, Curran Associates, Inc., 2016.
- S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, “Deep information propagation,” in International Conference on Learning Representations (ICLR), 2017.
- J. Pennington, S. Schoenholz, and S. Ganguli, “The emergence of spectral universality in deep networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1924–1932, PMLR, Apr. 2018.
- A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein, “Poison frogs! Targeted clean-label poisoning attacks on neural networks,” in Advances in Neural Information Processing Systems 31, pp. 6103–6113, Curran Associates, Inc., 2018.
- J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for Hyper-Parameter Optimization,” in Advances in Neural Information Processing Systems 24 (NeurIPS), pp. 2546–2554, Curran Associates, Inc., 2011.
- B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations, (ICLR), Apr. 2017.
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” arXiv:2005.14165, May 2020.
- D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, A. Luccioni, T. Maharaj, E. D. Sherwin, S. K. Mukkavilli, K. P. Kording, C. Gomes, A. Y. Ng, D. Hassabis, J. C. Platt, F. Creutzig, J. Chayes, and Y. Bengio, “Tackling Climate Change with Machine Learning,” arXiv:1906.05433, Nov. 2019.
- D. Wagner, “AI & Global Governance: How AI is Changing the Global Economy - United Nations University Centre for Policy Research.” https://cpr.unu.edu/ai-global-governance-howai-is-changing-the-global-economy.html, Nov. 2018.
- N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, “Towards the Science of Security and Privacy in Machine Learning,” arXiv:1611.03814, Nov. 2016.
- R. C. Gonzalez and R. E. Woods, Digital Image Processing. Pearson, 4th ed., 2017.
- G. H. Golub and C. F. Van Loan, Matrix Computations (3rd Ed.). USA: Johns Hopkins University Press, 1996.
