Parameter Norm Growth During Training of Transformers

William Merrill
Vivek Ramanujan

Abstract:

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically some variant of gradient descent (GD). To better understand this bias, we study the tendency of transformer parameters to grow in magnitu...
