# A Group-Theoretic Framework For Data Augmentation

JOURNAL OF MACHINE LEARNING RESEARCH, (2020)

Abstract

Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we deve...

Introduction

- Deep learning algorithms such as convolutional neural networks (CNNs) are successful in part because they exploit natural symmetry in the data.
- Image identity is roughly invariant to translations and rotations, so a slightly translated cat is still a cat.
- Such invariances are present in many datasets, including image, text and speech data.
- CNNs induce an approximate equivariance to translations, but not to rotations.
- This is an inductive bias of CNNs, and the idea dates back at least to the neocognitron (Fukushima, 1980).
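The equivariance claim above can be checked on a toy case. The sketch below is an illustration, not code from the paper: it uses a 1D circular convolution (an idealized stand-in for a CNN layer, with no padding or pooling effects) and treats signal reversal as a rough 1D analogue of a rotation. The signal `x`, kernel `w`, and all function names are made up for the example.

```python
def circular_conv(signal, kernel):
    """Circular cross-correlation of a 1D signal with a short kernel."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically translate the signal by k positions."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def reverse(signal):
    """Reflect the signal (used here as a 1D stand-in for a rotation)."""
    return signal[::-1]


x = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary integer signal
w = [1, -2, 5]                # arbitrary asymmetric kernel

# Equivariance means: transform-then-convolve equals convolve-then-transform.
translation_equivariant = circular_conv(shift(x, 3), w) == shift(circular_conv(x, w), 3)
reflection_equivariant = circular_conv(reverse(x), w) == reverse(circular_conv(x, w))

print("translation:", translation_equivariant)
print("reflection:", reflection_equivariant)
```

With circular boundary conditions the translation check holds exactly; in a real CNN, borders, strides, and pooling make it only approximate, which matches the "approximate equivariance" wording above.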

Highlights

- Deep learning algorithms such as convolutional neural networks (CNNs) are successful in part because they exploit natural symmetry in the data
- A general framework for understanding data augmentation is missing. Such a framework would enable us to reason clearly about the benefits offered by augmentation, in comparison to invariant features. Such a framework could shed light on questions such as: How can we improve the performance of our models by adding transformed versions of the training data? Under what conditions can we see benefits? Developing such a framework is challenging for several reasons: first, it is unclear what mathematical approach to use, and second, it is unclear how to demonstrate that data augmentation “helps”
- We describe a few important problems where symmetries occur, but where other approaches—not data augmentation—are currently used (Section 8): cryo-electron microscopy, spherically invariant data, and random effects models
- We have several estimators for the original problem: maximum likelihood estimation (MLE), constrained MLE, augmented MLE, invariant MLE, and marginal MLE. We note that the first four methods are general in the empirical risk minimization (ERM) context, whereas the last one is specific to likelihood-based models
- The above ideas apply to empirical risk minimization
- There are many popular algorithms and methods that are not most naturally expressed as plain ERM
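As a small illustration of the invariant features mentioned in these highlights, the sketch below uses a hypothetical feature map `invariant_features` that sorts the coordinates of a vector. Sorting is invariant under the permutation group, so every transformed copy of a data point maps to the same feature. The group, the feature map, and the sample vector are assumptions chosen for the example, not taken from the paper.

```python
from itertools import permutations


def invariant_features(x):
    """Hypothetical invariant map T: sorting satisfies T(gx) = T(x) for every permutation g."""
    return tuple(sorted(x))


x = (0.3, -1.2, 2.5, 0.7)

# The orbit of x: every transformed copy that augmentation could add to the training set.
orbit = list(permutations(x))

# An invariant map collapses the whole orbit to a single feature vector.
features_on_orbit = {invariant_features(gx) for gx in orbit}

print(len(orbit), "augmented copies ->", len(features_on_orbit), "distinct feature vector")
```

Augmentation keeps all transformed copies and averages the loss over them, whereas an invariant feature map removes the transformation before the model sees the data.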

Methods

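| Estimator | Empirical objective | Population objective |
| --- | --- | --- |
| ERM/MLE | $\min_{\theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} L(\theta, X_i)$ | $\min_{\theta \in \Theta} \mathbb{E}_{\theta_0} L(\theta, X)$ |
| Constrained ERM/MLE | $\min_{\theta \in \Theta_G} \frac{1}{n} \sum_{i \in [n]} L(\theta, X_i)$ | $\min_{\theta \in \Theta_G} \mathbb{E}_{\theta_0} L(\theta, X)$ |
| Augmented ERM/MLE | $\min_{\theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} \int L(\theta, gX_i)\, dQ(g)$ | $\min_{\theta \in \Theta} \mathbb{E}_{\theta_0} \int L(\theta, gX)\, dQ(g)$ |
| Invariant ERM/MLE | $\min_{\theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} L(\theta, T(X_i))$, where $T(x) = T(gx)$ for all $g \in G$ | $\min_{\theta \in \Theta} \mathbb{E}_{\theta_0} L(\theta, T(X))$ |
| Marginal MLE | $\max_{\theta \in \Theta} \frac{1}{n} \sum_{i \in [n]} \log \int p_\theta(gX_i)\, dQ(g)$ | |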
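To make the difference between the ERM and augmented ERM objectives in this section concrete, here is a minimal sketch under assumptions that are not from the paper: squared loss $L(\theta, x) = \|\theta - x\|^2$, the group of cyclic coordinate shifts with $Q$ uniform, and Gaussian data whose mean `theta0` is constant across coordinates, so that $gX$ has the same distribution as $X$. Under squared loss both minimizers have closed forms (sample means), so no optimizer is needed.

```python
import random


def sample_mean(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cyclic_orbit(x):
    """All cyclic shifts of x, i.e. the orbit of x under the cyclic group."""
    return [x[k:] + x[:k] for k in range(len(x))]


def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


rng = random.Random(0)
d, n = 5, 20
theta0 = [1.0] * d  # assumed true parameter, unchanged by cyclic shifts
data = [[t + rng.gauss(0.0, 1.0) for t in theta0] for _ in range(n)]

# Plain ERM with squared loss: the minimizer is the sample mean.
theta_erm = sample_mean(data)

# Augmented ERM with Q uniform on cyclic shifts: the minimizer is the mean
# over every shifted copy of every data point.
augmented_data = [gx for x in data for gx in cyclic_orbit(x)]
theta_aug = sample_mean(augmented_data)

err_erm = squared_error(theta_erm, theta0)
err_aug = squared_error(theta_aug, theta0)
print(f"squared error, ERM:           {err_erm:.4f}")
print(f"squared error, augmented ERM: {err_aug:.4f}")
```

In this toy setting, averaging over the orbit projects the sample mean onto the vectors that are constant across coordinates. Because `theta0` lies in that set, the augmented estimate is never farther from `theta0` than the plain one.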

Results

- A general framework for understanding data augmentation is missing
- Such a framework would make it possible to reason clearly about the benefits offered by augmentation, in comparison to invariant features.
- Such a framework could shed light on questions such as: How can adding transformed versions of the training data improve model performance?
- The authors get improved performance for augmented ERM, as stated previously, as well as

Conclusion

- The authors have several estimators for the original problem: MLE, constrained MLE, augmented MLE, invariant MLE, and marginal MLE.
- The authors note that the first four methods are general in the ERM context, whereas the last one is specific to likelihood-based models.
- There are many popular algorithms and methods that are not most naturally expressed as plain ERM.

- Table 1: Optimization objectives

Related work

Related work by Mallat, Bölcskei and others (e.g., Mallat, 2012; Bruna and Mallat, 2013; Wiatowski and Bölcskei, 2018; Anselmi et al., 2019) tries to explain how CNNs extract features, using ideas from harmonic analysis. They show that the features of certain neural network models (Deep Scattering Networks in Mallat's case) become increasingly invariant as depth increases.

Equivariance. The notion of equivariance is also key in statistics (e.g., Lehmann and Casella, 1998). A statistical model is called equivariant with respect to a group $G$ acting on the sample space if there is an induced group $G^*$ acting on the parameter space $\Theta$ such that for any $X \sim P_\theta$ and any $g \in G$, there is a $g^* \in G^*$ such that $gX \sim P_{g^*\theta}$. Under equivariance, it is customary to restrict attention to equivariant estimators, i.e., those that satisfy $\hat\theta(gx) = g^*\hat\theta(x)$. Under some conditions, there are Uniformly Minimum Risk Equivariant (UMRE) estimators.
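A standard textbook instance of this definition is the location model, where the group acts by translating every observation by a constant c and the induced action shifts the parameter by the same c. The sketch below is an illustrative check (the data, the constant `c`, and the `shrunk_mean` estimator are made up for the example): the sample mean and median are equivariant, while a shrunken mean is not.

```python
import math
import random
import statistics

rng = random.Random(1)
x = [rng.gauss(2.0, 1.0) for _ in range(50)]
c = 3.5  # group element g: translate every observation by c

gx = [xi + c for xi in x]


def shrunk_mean(sample):
    """A shrinkage estimator, included as a non-equivariant counterexample."""
    return 0.5 * statistics.fmean(sample)


# Equivariance in the location model: estimator(gx) == estimator(x) + c.
mean_is_equivariant = math.isclose(statistics.fmean(gx), statistics.fmean(x) + c)
median_is_equivariant = math.isclose(statistics.median(gx), statistics.median(x) + c)
shrunk_is_equivariant = math.isclose(shrunk_mean(gx), shrunk_mean(x) + c)

print("mean:", mean_is_equivariant)
print("median:", median_is_equivariant)
print("shrunken mean:", shrunk_is_equivariant)
```

The shrunken mean fails because translating the data by c moves it by only c/2, so it does not commute with the group action.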

Funding

- This work was supported in part by NSF BIGDATA grant IIS 1837992 and NSF TRIPODS award 1934960

References

- F. Anselmi, G. Evangelopoulos, L. Rosasco, and T. Poggio. Symmetry-adapted representation learning. Pattern Recognition, 86:201–208, 2019.
- A. Antoniou, A. Storkey, and H. Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
- S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.
- A. S. Bandeira, B. Blum-Smith, J. Kileel, A. Perry, J. Weed, and A. S. Wein. Estimation under group actions: recovering orbits from invariants. arXiv preprint arXiv:1712.10163, 2017.
- P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- T. Bendory, N. Boumal, C. Ma, Z. Zhao, and A. Singer. Bispectrum inversion with application to multireference alignment. IEEE Transactions on Signal Processing, 66(4):1037–1050, 2018.
- Y. Bengio, F. Bastien, A. Bergeron, N. Boulanger-Lewandowski, T. Breuel, Y. Chherawala, M. Cisse, M. Cote, D. Erhan, J. Eustache, et al. Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 164–172, 2011.
- U. Bergmann, V. Yachandra, and J. Yano, editors. X-Ray Free Electron Lasers. Energy and Environment Series. The Royal Society of Chemistry, 2017.
- B. Bloem-Reddy and Y. W. Teh. Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082, 2019.
- P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer, 2009.
- J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE transactions on pattern analysis and machine intelligence, 35(8):1872–1886, 2013.
- P. Chao, T. Mazaheri, B. Sun, N. B. Weingartner, and Z. Nussinov. The stochastic replica approach to machine learning: Stability and parameter optimization. arXiv preprint arXiv:1708.05715, 2017.
- Z. Chen, Y. Cao, D. Zou, and Q. Gu. How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360, 2019.
- D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010.
- arXiv preprint arXiv:1902.02918, 2019.
- T. Cohen and M. Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016a.
- T. Cohen, M. Geiger, and M. Weiler. A general theory of equivariant cnns on homogeneous spaces. arXiv preprint arXiv:1811.02017, 2018a.
- T. S. Cohen and M. Welling. Steerable cnns. arXiv preprint arXiv:1612.08498, 2016b.
- T. S. Cohen, M. Geiger, J. Kohler, and M. Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018b.
- E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.
- T. Dao, A. Gu, A. Ratner, V. Smith, C. De Sa, and C. Re. A kernel theory of modern data augmentation. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017a.
- T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017b.
- S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
- L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
- C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning so(3) equivariant representations with spherical cnns. In The European Conference on Computer Vision (ECCV), September 2018a.
- C. Esteves, A. Sud, Z. Luo, K. Daniilidis, and A. Makadia. Cross-domain 3d equivariant image embeddings. arXiv preprint arXiv:1812.02716, 2018b.
- C. Esteves, Y. Xu, C. Allen-Blanchette, and K. Daniilidis. Equivariant multi-view networks. arXiv preprint arXiv:1904.00993, 2019.
- V. Favre-Nicolin, J. Baruchel, H. Renevier, J. Eymery, and A. Borbely. XTOP: high-resolution X-ray diffraction and imaging. Journal of Applied Crystallography, 48(3):620–620, 2015.
- N. I. Fisher, T. Lewis, and B. J. Embleton. Statistical analysis of spherical data. Cambridge university press, 1993.
- D. Foster, A. Sekhari, O. Shamir, N. Srebro, K. Sridharan, and B. Woodworth. The complexity of making the gradient small in stochastic convex optimization. arXiv preprint arXiv:1902.04686, 2019.
- J. Frank. Three-dimensional electron microscopy of macromolecular assemblies: visualization of biological molecules in their native state. Oxford University Press, 2006.
- K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
- R. Gens and P. M. Domingos. Deep symmetry networks. In Advances in neural information processing systems, pages 2537–2545, 2014.
- N. C. Giri. Group invariance in statistical inference. World Scientific, 1996.
- S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342–350, 2016.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
- I. S. Helland. Statistical inference under symmetry. International Statistical Review, 72(3):409–422, 2004.
- A. Hernández-García and P. König. Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852, 2018a.
- A. Hernández-García and P. König. Further advantages of data augmentation on convolutional neural networks. In International Conference on Artificial Neural Networks, pages 95–103. Springer, 2018b.
- A. Hernández-García, J. Mehrer, N. Kriegeskorte, P. König, and T. C. Kietzmann. Deep neural networks trained with heavier data augmentation learn features closer to representations in hIT. In Conference on Cognitive Computational Neuroscience, 2018.
- D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.
- E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry. Augment your batch: better training with larger batches. arXiv preprint arXiv:1901.09335, 2019.
- A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- N. Jaitly and G. E. Hinton. Vocal tract length perturbation (vtlp) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, volume 117, 2013.
- H. Javadi, R. Balestriero, and R. Baraniuk. A hessian based complexity measure for deep networks. arXiv preprint arXiv:1905.11639, 2019.
- Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Z. Kam. The reconstruction of structure from electron micrographs of randomly oriented particles. Journal of Theoretical Biology, 82(1):15–39, 1980.
- Z. Kam. Determination of macromolecular structure in solution by spatial correlation of scattering fluctuations. Macromolecules, 10(5):927–934, 1977.
- R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
- R. Kondor, Z. Lin, and S. Trivedi. Clebsch–gordan nets: a fully fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems, pages 10117–10126, 2018.
- A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
- E. Lehmann and G. Casella. Theory of point estimation. Springer Texts in Statistics, 1998.
- E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Science & Business Media, 2005.
- H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
- Applied Statistics, 12(4):2121–2150, 2018.
- S. Liu, D. Papailiopoulos, and D. Achlioptas. Bad global minima exist and sgd can reach them. arXiv preprint arXiv:1906.02613, 2019.
- R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
- C. Lyle, M. Kwiatkowska, and Y. Gal. An analysis of the effect of invariance on generalization in neural networks. In International conference on machine learning Workshop on Understanding and Improving Generalization in Deep Learning, 2019.
- F. R. Maia and J. Hajdu. The trickle before the torrent - diffraction data from X-ray lasers. Scientific Data, 3, 2016.
- S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
- T. Mazaheri, B. Sun, J. Scher-Zagier, A. Thind, D. Magee, P. Ronhovde, T. Lookman, R. Mishra, and Z. Nussinov. Stochastic replica voting machine prediction of stable cubic and double perovskite materials and binary alloys. Physical Review Materials, 3(6):063802, 2019.
- M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- T. Nguyen and S. Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pages 1085–1093, 2013.
- K. Pande, M. Schmidt, P. Schwander, and D. Saldin. Simulations on time-resolved structure determination of uncrystallized biomolecules in the presence of shot noise. Structural Dynamics, 2(2):024103, 2015.
- D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- K. Perlin. An image synthesizer. ACM Siggraph Computer Graphics, 19(3):287–296, 1985.
- S. Rajput, Z. Feng, Z. Charles, P.-L. Loh, and D. Papailiopoulos. Does data augmentation lead to positive margin? arXiv preprint arXiv:1905.03177, 2019.
- A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Re. Learning to compose domain-specific transformations for data augmentation. In Advances in neural information processing systems, pages 3236–3246, 2017.
- S. Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance through parameter-sharing. 2017.
- C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
- D. K. Saldin, V. L. Shneerson, R. Fung, and A. Ourmazd. Structure of isolated biomolecules obtained from ultrashort x-ray pulses: exploiting the symmetry of random orientations. Journal of Physics: Condensed Matter, 21(13), 2009.
- J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
- S. R. Searle, G. Casella, and C. E. McCulloch. Variance components. John Wiley & Sons, 2009.
- S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- A. Singer. Mathematics for cryo-electron microscopy. arXiv preprint arXiv:1803.06714, 2018.
- L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI, 5:66, 2018.
- C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American statistical Association, 82(398):528–540, 1987.
- T. Tao. Topics in random matrix theory. American Mathematical Society, 2012.
- T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2797–2806, 2017.
- A. W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.
- C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
- C. Vonesch, F. Stauber, and M. Unser. Steerable pca for rotation-invariant image recognition. SIAM Journal on Imaging Sciences, 8(3):1857–1873, 2015.
- M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pages 10381–10392, 2018.
- T. Wiatowski and H. Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3):1845–1866, 2018.
- D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.
- Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
- Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
- H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- Z. Zhao, Y. Shkolnisky, and A. Singer. Fast steerable principal component analysis. IEEE Transactions on Computational Imaging, 2(1):1–12, 2016.
- Z. Zhao, L. T. Liu, and A. Singer. Steerable ePCA. arXiv preprint arXiv:1812.08789, 2018.
- Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
