TY - JOUR
T1 - Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes
AU - Ma, Chao
AU - Kunin, Daniel
AU - Wu, Lei
AU - Ying, Lexing
JO - Journal of Machine Learning
VL - 3
SP - 247
EP - 267
PY - 2022
DA - 2022/09
SN - 1
DO - http://doi.org/10.4208/jml.220404
UR - https://global-sci.org/intro/article_detail/jml/21028.html
KW - Neural network loss, Subquadratic growth, Multiscale structure, Edge of stability.
AB - <p style="text-align: justify;">A quadratic approximation of neural network loss landscapes has been extensively used to study
the optimization process of these networks. However, it usually holds only in a very small neighborhood of the minimum and cannot explain many phenomena observed during the optimization process. In this work, we study
the structure of neural network loss functions and its implications for optimization in a region beyond the reach
of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a
multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of
scales and grows subquadratically, and (2) in a larger region, the loss clearly shows several separate scales.
Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [1, 2] observed for
the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning
rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the
non-convexity of the models and the non-uniformity of training data are among the causes. By constructing a
two-layer neural network problem, we show that training data with different magnitudes give rise to different
scales of the loss function, producing subquadratic growth and multiple separate scales.</p>