# Learning compositional functions via multiplicative weight updates

NeurIPS 2020, pp. 13319–13330


Abstract

Compositionality is a basic structural feature of both biological and artificial neural networks. Learning compositional functions via gradient descent incurs well-known problems like vanishing and exploding gradients, making careful learning-rate tuning essential for real-world applications. This paper proves that multiplicative weight…

Introduction

- Neural computation in living systems emerges from the collective behaviour of large numbers of low precision and potentially faulty processing elements.
- Building on recent results in the perturbation analysis of compositional functions [5], the authors show that a multiplicative learning rule satisfies a descent lemma tailored to neural networks.
- Madam appears not to require learning rate tuning, and may further be used to train neural networks with low-bit-width synapses stored in a logarithmic number system.
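To make the rule concrete, here is a minimal Python sketch in the spirit of Madam (an illustration, not the paper's exact algorithm; the gradient normalisation and clipping constants are assumptions):

```python
import numpy as np

def madam_like_update(w, grad, lr=0.01, clip=3.0):
    """One multiplicative weight update in the spirit of Madam (a
    sketch, not the paper's exact algorithm). Each weight is scaled by
    exp(-lr * sign(w) * g), so the step is relative to the weight's
    current magnitude and the sign of every weight is preserved."""
    # Normalise the gradient by its RMS so step sizes are scale-free.
    g = grad / (np.sqrt(np.mean(grad ** 2)) + 1e-12)
    g = np.clip(g, -clip, clip)
    return w * np.exp(-lr * np.sign(w) * g)

w = np.array([0.5, -0.2, 1.0])
w_new = madam_like_update(w, np.array([0.1, -0.3, 0.2]))
assert np.all(np.sign(w_new) == np.sign(w))  # signs are frozen
```

Because each step multiplies a weight by a factor close to 1, a single learning rate behaves comparably across layers of very different scale, which is the intuition behind Madam needing little tuning.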

Highlights

- After studying the optimisation properties of compositional functions, we have confirmed that neural networks may be trained via multiplicative updates to weights stored in a logarithmic number system
- A basic question in the design of low-precision number systems is how the bits should be split between the exponent and mantissa
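That exponent/mantissa trade-off can be illustrated with a toy sign-and-exponent quantiser (a sketch; `frac_bits` and the exponent range are illustrative choices, not the paper's format):

```python
import numpy as np

def log_quantise(w, frac_bits=7, min_exp=-16.0, max_exp=0.0):
    """Round w to a hypothetical logarithmic format: a sign bit plus a
    base-2 exponent held on a fixed-point grid with `frac_bits`
    fractional bits. More fractional bits give a finer multiplicative
    spacing between representable magnitudes; more integer bits give a
    wider dynamic range [min_exp, max_exp]."""
    sign = np.sign(w)
    step = 2.0 ** -frac_bits                       # grid spacing in log space
    e = np.clip(np.log2(np.abs(w) + 1e-38), min_exp, max_exp)
    e = np.round(e / step) * step                  # snap exponent to the grid
    return sign * 2.0 ** e

w = np.array([0.3, -0.07, 0.9])
q = log_quantise(w)
# Neighbouring representable magnitudes differ by a fixed ratio, so the
# rounding error is relative rather than absolute.
assert np.all(np.abs(q - w) / np.abs(w) < 0.01)
```

Relative rounding error is a natural match for a multiplicative update, since both operate on the logarithm of the weight magnitude.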

Results

- Multiplicative weight updates are most naturally implemented in a logarithmic number system, in line with anatomical findings about biological synapses [1].
- Bernstein et al [5] suggested that these algorithms are accounting for the compositional structure of the neural network function class, and derived a new distance measure called deep relative trust to describe this analytically.
- Existing works using logarithmic number systems combine them with additive optimisation algorithms like Adam and SGD [30], requiring tuning of both the learning algorithm and the numerical representation.
- Once it has been established that multiplicative updates are a good learning rule it becomes natural to represent them using a logarithmic number system.
- The multiplicative nature of Madam suggests storing synapse strengths in a logarithmic number system, where numbers are represented just by a sign and exponent.
- In Table 1, the authors compare the final results using tuned learning rates for Adam and SGD and using η = 0.01 for Madam.
- The results demonstrate that B-bit Madam can be used to train networks that use 8–12 bits per weight, often with little to no loss in accuracy compared to an FP32 baseline.
- In the 12-bit ImageNet experiment, the authors were able to reduce the error from ∼29% to ∼25% by borrowing layerwise parameter scales σ∗ from a pre-trained model instead of using the standard PyTorch [37] initialisation scale.
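The interplay between the update and the number system can be sketched as follows (an illustrative guess at the structure of B-bit Madam, not the paper's exact algorithm; the constants are assumptions). Multiplying |w| by exp(−η·sign(w)·g) is the same as subtracting η·sign(w)·g/ln 2 from log2|w|, so the whole step can be taken in the stored exponent and then re-rounded to its grid:

```python
import numpy as np

def log_madam_step(exponents, signs, grad, lr=0.01, frac_bits=7):
    """One sketched training step on weights stored as sign + base-2
    exponent. The multiplicative update becomes an additive move in the
    exponent, followed by re-rounding to the fixed-point exponent grid,
    so the weights never leave the logarithmic representation."""
    g = grad / (np.sqrt(np.mean(grad ** 2)) + 1e-12)   # relative step size
    exponents = exponents - lr * signs * g / np.log(2)
    step = 2.0 ** -frac_bits
    exponents = np.round(exponents / step) * step      # back onto the grid
    return exponents, signs * 2.0 ** exponents

signs = np.array([1.0, -1.0])
exps = np.array([-1.0, -2.0])          # weights 0.5 and -0.25
exps, w = log_madam_step(exps, signs, np.array([0.2, -0.2]))
assert np.all(np.sign(w) == signs)     # signs never change
```

Keeping every intermediate quantity on the exponent grid is what lets such a scheme avoid a separate high-precision master copy of the weights.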

Conclusion

- After studying the optimisation properties of compositional functions, the authors have confirmed that neural networks may be trained via multiplicative updates to weights stored in a logarithmic number system.
- When compositional functions are learnt via multiplicative updates, the signs of the weights are frozen and can be thought to satisfy Dale’s principle.
- While these rules are usually modelled via additive updates, it has been suggested that multiplicative updates may better explain the available data, both in terms of the induced stationary distribution of synapse strengths and in terms of time-dependent observations [40, 41, 42].

- Table 1: Results after tuning the learning rate η. For each task, we compare to the better-performing optimiser in {SGD, Adam} and list the associated initial η. For Madam we use the same initial η across all tasks. We quote top-1 test error, FID [36] and perplexity for the classifiers, GAN and transformer respectively. Lower is better in all cases. The mean and range are based on three repeats.
- Table 2: Benchmarking B-bit Madam. We tested 12-bit, 10-bit and 8-bit Madam on various tasks.

Related Work

- Multiplicative weight updates Multiplicative algorithms have a storied history in computer science. Examples in machine learning include the Winnow algorithm [6] and the exponentiated gradient algorithm [7]—both for learning linear classifiers in the face of irrelevant input features. The Hedge algorithm [8], which underpins the AdaBoost framework for boosting weak learners, is also multiplicative. In algorithmic game theory, multiplicative weight updates may be used to solve two-player zero sum games [9]. Arora et al [10] survey many more applications.

Multiplicative updates are typically viewed as appropriate for problems where the geometry of the optimisation domain is described by the relative entropy [7], as is often the case when optimising over probability distributions. Since the relative entropy is a Bregman divergence, the algorithm may then be studied under the framework of mirror descent [11]. We suggest that multiplicative updates may arise under a broader principle: when the geometry of the optimisation domain is described by any relative distance measure.
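As a concrete instance, the exponentiated gradient update [7] on the probability simplex is a multiplicative step followed by renormalisation, and it is exactly the mirror-descent step induced by the relative entropy:

```python
import numpy as np

def exponentiated_gradient_step(w, grad, lr=0.1):
    """One exponentiated-gradient (EG) step: scale each coordinate
    multiplicatively, then renormalise so w stays on the simplex."""
    w = w * np.exp(-lr * grad)
    return w / w.sum()

w = np.ones(4) / 4                        # start at the uniform distribution
w = exponentiated_gradient_step(w, np.array([1.0, 0.0, -1.0, 0.0]))
assert np.isclose(w.sum(), 1.0)           # still a probability distribution
assert w[2] == w.max()                    # mass flows down the gradient
```

The suggestion in the text is that the same multiplicative form arises more broadly whenever the geometry of the domain is governed by a relative distance measure, not only by the relative entropy.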

References

- Thomas M. Bartol, Jr., Cailey Bromer, Justin Kinney, Michael A. Chirillo, Jennifer N. Bourne, Kristen M. Harris, and Terrence J. Sejnowski. Nanoconnectomic upper bound on the variability of synaptic plasticity. eLife, 2015.
- Tom Baker and Dan Hammerstrom. Characterization of artificial neural network algorithms. In International Symposium on Circuits and Systems, 1989.
- Mark Horowitz. Computing’s energy problem (and what we can do about it). In International Solid-State Circuits Conference, 2014.
- Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 2017.
- Jeremy Bernstein, Arash Vahdat, Yisong Yue, and Ming-Yu Liu. On the distance between two neural networks and the stability of learning, 2020. arXiv:2002.03432.
- Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1988.
- Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
- Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
- Michael D. Grigoriadis and Leonid G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 1995.
- Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 2012.
- Inderjit S. Dhillon and Joel A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 2008.
- Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 2018.
- Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020.
- Navid Azizan, Sahin Lale, and Babak Hassibi. Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization, 2019. arXiv:1906.03830.
- Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems, 2015.
- Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for Imagenet training. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017.
- Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
- Carl van Vreeswijk and Haim Sompolinsky. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 1996.
- Daniel J. Amit, K. Y. Michael Wong, and Colin Campbell. Perceptron learning with sign-constrained weights. Journal of Physics A, 1989.
- Iwata, Yoshida, Matsuda, Sato, and Suzumura. An artificial neural network accelerator using general purpose 24 bit floating point digital signal processors. In International Joint Conference on Neural Networks, 1989.
- Jordan L. Holt and Jenq-Neng Hwang. Finite precision error analysis of neural network hardware implementations. Transactions on Computers, 1993.
- Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Neural Information Processing Systems, 2015.
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 2018.
- Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights, 2017. arXiv:1702.03044.
- Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, 2015.
- Lorenz K. Müller and Giacomo Indiveri. Rounding methods for neural networks with low resolution synaptic weights, 2015. arXiv:1504.05767.
- Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Neural Information Processing Systems, 2019.
- Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Neural Information Processing Systems, 2018.
- Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.
- Edward H. Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S. Simon Wong. LogNet: Energy-efficient neural networks using logarithmic computation. In International Conference on Acoustics, Speech and Signal Processing, 2017.
- Sebastian Vogel, Mengyu Liang, Andre Guntoro, Walter Stechele, and Gerd Ascheid. Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base. In International Conference on Computer-Aided Design, 2018.
- David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 1986.
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
- Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
- Tijmen Tieleman and Geoffrey E. Hinton. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019.
- Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of bfloat16 for deep learning training, 2019. arXiv:1905.12322.
- Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
- Mark C. van Rossum, Guo-qiang Bi, and Gina G. Turrigiano. Stable Hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience, 2000.
- Boris Barbour, Nicolas Brunel, Vincent Hakim, and Jean-Pierre Nadal. What can we learn from synaptic weight distributions? Trends in Neurosciences, 2007.
- György Buzsáki and Kenji Mizuseki. The log-dynamic brain: how skewed distributions affect network operations. Nature Reviews Neuroscience, 2014.
- Yonatan Loewenstein, Annerose Kuras, and Simon Rumpel. Multiplicative dynamics underlie the emergence of the log-normal distribution of spine sizes in the neocortex in vivo. Journal of Neuroscience, 2011.
