Volume 1, Issue 2
On the Existence of Global Minima and Convergence Analyses for Gradient Descent Methods in the Training of Deep Neural Networks

J. Mach. Learn., 1 (2022), pp. 141-246.

Published online: 2022-07

Category: Theory

• Abstract

Although gradient descent (GD) optimization methods combined with rectified linear unit (ReLU) artificial neural networks (ANNs) often deliver impressive performance on real-world learning problems, it remains to this day, in all practically relevant scenarios, an open research problem to rigorously prove (or disprove) the conjecture that such GD optimization methods converge in the training of ANNs with ReLU activation.
In this article we study fully connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under three assumptions: that the unnormalized probability density function $p\colon [a, b]^d \to [0, \infty)$ of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, that the target function $f\colon [a, b]^d \to \mathbb{R}^{\delta}$ (describing the relationship between the input data and the output data) is piecewise polynomial, and that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we verify this last assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a regular global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fréchet subdifferentiability).
Our mathematical convergence analysis builds on ideas from our previous article [S. Eberle, A. Jentzen, A. Riekert, & G. Weiss, Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation, arXiv:2108.08106 (2021)], on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Łojasiewicz inequalities, on tools from functional analysis such as the Arzelà–Ascoli theorem on the relative compactness of uniformly bounded and equicontinuous sequences of continuous functions, on tools from nonsmooth analysis such as the concept of limiting Fréchet subgradients, and on the fact that the set of realization functions of shallow ReLU ANNs with a fixed architecture forms a closed subset of the set of continuous functions, as established in [P. Petersen, M. Raslan, & F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size. Found. Comput. Math. 21 (2021), no. 2, 375–444].

• General Summary

The training of deep neural networks (DNNs) via gradient-based optimization methods is nowadays a widely used procedure with various practical applications. However, it remains a largely open problem to mathematically explain why these methods are so successful in practice.

In this article we consider fully connected feedforward DNNs with the rectified linear unit (ReLU) activation. We study solutions of gradient flow (GF) differential equations in the training of such DNNs with an arbitrarily large number of hidden layers and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point. Our assumptions are that the distribution of the input data has a piecewise polynomial density and that the target function (describing the relationship between input data and output data) is piecewise polynomial. One of the main steps in our proof is to verify that the considered risk function satisfies a Kurdyka-Łojasiewicz inequality. A key difficulty in the analysis of the training of ReLU DNNs is that the ReLU function is not differentiable, so we need to employ tools from nonsmooth analysis.
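For orientation, the objects discussed above can be written down in a classical (non-generalized) form; the precise generalized variants used in the paper differ, and the symbols $\mathcal{L}$, $\mathcal{G}$, $\vartheta$ below are generic placeholders, not the paper's notation:

```latex
% Gradient flow trajectory driven by a generalized gradient \mathcal{G}
% of the risk function \mathcal{L}:
\[
  \Theta'(t) = -\,\mathcal{G}\bigl(\Theta(t)\bigr), \qquad t \in [0, \infty).
\]
% Classical Lojasiewicz-type inequality near a critical point \vartheta:
% there exist \varepsilon, c > 0 and \alpha \in (0, 1) such that
\[
  \bigl| \mathcal{L}(\theta) - \mathcal{L}(\vartheta) \bigr|^{\alpha}
  \le c \, \bigl\| \mathcal{G}(\theta) \bigr\|
  \quad \text{for all } \theta \text{ with } \| \theta - \vartheta \| < \varepsilon.
\]
```

Inequalities of this type bound how flat the risk landscape can be near a critical point, which is what yields a convergence rate for non-divergent trajectories.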

Under the additional assumption that the considered risk function admits at least one regular global minimum, we also establish convergence of the risk of the gradient descent (GD) method with random initializations in the training of ReLU DNNs. Finally, in the special situation of shallow networks with one-dimensional input we verify this assumption by proving that for every Lipschitz continuous target function there exists a sufficiently regular global minimum in the risk landscape.
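As a purely illustrative sketch (not the paper's construction), the training setup analyzed above, GD with a random initialization for a shallow ReLU network with one-dimensional input and a Lipschitz continuous target, can be mimicked in a few lines. The target function, network width, learning rate, and step count below are our own arbitrary choices.

```python
import random

def relu(x):
    return x if x > 0 else 0.0

def net(params, x):
    # Shallow ReLU ANN: one hidden layer of H neurons, 1-D input and output.
    w1, b1, w2, b2 = params
    return sum(v * relu(w * x + b) for w, b, v in zip(w1, b1, w2)) + b2

def risk(params, data):
    # Empirical L2 risk against samples of the target function.
    return sum((net(params, x) - y) ** 2 for x, y in data) / len(data)

def grad_step(params, data, lr=0.05):
    # One (sub)gradient descent step; ReLU'(0) is taken to be 0.
    w1, b1, w2, b2 = params
    H, n = len(w1), len(data)
    gw1, gb1, gw2, gb2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, y in data:
        e = 2.0 * (net(params, x) - y) / n
        for i in range(H):
            a = w1[i] * x + b1[i]
            act = 1.0 if a > 0 else 0.0      # subgradient of ReLU
            gw2[i] += e * relu(a)
            gw1[i] += e * w2[i] * act * x
            gb1[i] += e * w2[i] * act
        gb2 += e
    return ([w - lr * g for w, g in zip(w1, gw1)],
            [b - lr * g for b, g in zip(b1, gb1)],
            [v - lr * g for v, g in zip(w2, gw2)],
            b2 - lr * gb2)

random.seed(0)
target = lambda x: abs(x - 0.3)              # a Lipschitz target on [0, 1]
data = [(i / 50, target(i / 50)) for i in range(51)]
H = 8
params = ([random.uniform(-1, 1) for _ in range(H)],   # random initialization
          [random.uniform(-1, 1) for _ in range(H)],
          [random.uniform(-1, 1) for _ in range(H)], 0.0)
r0 = risk(params, data)
for _ in range(500):
    params = grad_step(params, data)
print("initial risk:", r0, "final risk:", risk(params, data))
```

The sketch only shows that the risk decreases along the GD trajectory for one random initialization; the paper's results concern convergence guarantees over random initializations, which a single run cannot demonstrate.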

• Keywords

Deep learning, Global minimum, Kurdyka-Łojasiewicz inequality, Gradient descent, Optimization.

• BibTeX
• RIS
• TXT
@Article{JML-1-141,
  author  = {Jentzen, Arnulf and Riekert, Adrian},
  title   = {On the Existence of Global Minima and Convergence Analyses for Gradient Descent Methods in the Training of Deep Neural Networks},
  journal = {Journal of Machine Learning},
  year    = {2022},
  volume  = {1},
  number  = {2},
  pages   = {141--246},
  issn    = {2790-2048},
  doi     = {10.4208/jml.220114a},
  url     = {http://global-sci.org/intro/article_detail/jml/20801.html}
}
TY  - JOUR
T1  - On the Existence of Global Minima and Convergence Analyses for Gradient Descent Methods in the Training of Deep Neural Networks
AU  - Jentzen, Arnulf
AU  - Riekert, Adrian
JO  - Journal of Machine Learning
VL  - 1
IS  - 2
SP  - 141
EP  - 246
PY  - 2022
DA  - 2022/07
SN  - 2790-2048
DO  - 10.4208/jml.220114a
UR  - https://global-sci.org/intro/article_detail/jml/20801.html
KW  - Deep learning
KW  - Global minimum
KW  - Kurdyka-Łojasiewicz inequality
KW  - Gradient descent
KW  - Optimization
ER  -

Arnulf Jentzen & Adrian Riekert. (2022). On the Existence of Global Minima and Convergence Analyses for Gradient Descent Methods in the Training of Deep Neural Networks. Journal of Machine Learning. 1 (2). 141-246. doi:10.4208/jml.220114a