# Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

@article{Awasthi2021EfficientAF, title={Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations}, author={Pranjal Awasthi and Alex K. Tang and Aravindan Vijayaraghavan}, journal={ArXiv}, year={2021}, volume={abs/2107.10209} }

We present polynomial-time and sample-efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form f(x) = aᵀσ(Wx + b), where x is drawn from the Gaussian distribution and σ(t) := max(t, 0) is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias b is zero. In order to deal with the presence of the bias terms, …
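
For concreteness, here is a minimal sketch (not taken from the paper) of the learning problem described above: drawing labeled samples (x, f(x)) from a depth-2 ReLU network with non-zero bias under Gaussian inputs. The dimensions d and k, the sample count n, and the random parameter draws are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): draw samples from a
# depth-2 ReLU network f(x) = a^T sigma(W x + b) with Gaussian inputs.
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 10, 3, 5000              # input dim, hidden units, samples (assumed values)
W = rng.standard_normal((k, d))    # unknown hidden-layer weights (rows are w_i)
b = rng.standard_normal(k)         # general (non-zero) bias terms
a = rng.standard_normal(k)         # output-layer weights

X = rng.standard_normal((n, d))    # x ~ N(0, I_d)
H = np.maximum(X @ W.T + b, 0.0)   # sigma(Wx + b), applied row-wise
y = H @ a                          # f(x) = a^T sigma(Wx + b)

# A learner only observes (X, y) and must recover (a, W, b) up to symmetries.
print(X.shape, y.shape)
```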

#### References

Showing 1–10 of 39 references

Learning Deep ReLU Networks Is Fixed-Parameter Tractable

- Computer Science, Mathematics
- ArXiv
- 2020

Gives an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters, with bounds that depend on the number of hidden units, the depth, the spectral norm of the weight matrices, and the Lipschitz constant of the overall network.

Learning Two-layer Neural Networks with Symmetric Inputs

- Computer Science, Mathematics
- ICLR
- 2019

Presents a new algorithm for learning a two-layer neural network under a general class of input distributions, based on the method-of-moments framework; it extends several results in tensor decomposition to avoid the complicated non-convex optimization involved in learning neural networks.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, Mathematics
- NeurIPS
- 2019

This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
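
To illustrate the linearization described above, the following hedged sketch compares a small two-layer tanh network with its first-order Taylor expansion in the parameters around initialization; the width, activation, and perturbation size are illustrative assumptions, not the reference's experimental setup.

```python
# Minimal sketch (illustrative assumptions): compare a two-layer network with
# its first-order Taylor expansion in the parameters around initialization,
#   f_lin(x; theta) = f(x; theta0) + <grad_theta f(x; theta0), theta - theta0>.
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2000                       # input dim, hidden width (wide on purpose)

# NTK-style parameterization: f(x) = (1/sqrt(m)) * a^T tanh(W x)
W0 = rng.standard_normal((m, d))
a0 = rng.standard_normal(m)

def f(x, W, a):
    return a @ np.tanh(W @ x) / np.sqrt(m)

x = rng.standard_normal(d)

# Parameter gradients at initialization (computed by hand for this small model).
h = np.tanh(W0 @ x)
grad_a = h / np.sqrt(m)                                          # df/da
grad_W = ((a0 * (1 - h**2))[:, None] * x[None, :]) / np.sqrt(m)  # df/dW

# Small parameter perturbation, as would arise from a few gradient steps.
dW = 0.05 * rng.standard_normal((m, d))
da = 0.05 * rng.standard_normal(m)

f_true = f(x, W0 + dW, a0 + da)
f_lin = f(x, W0, a0) + grad_a @ da + np.sum(grad_W * dW)

print(f_true, f_lin)   # for large width and a small perturbation these agree closely
```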

Learning Neural Networks with Two Nonlinear Layers in Polynomial Time

- Computer Science
- COLT
- 2019

This work gives a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU), and suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory.

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

- Computer Science
- 2017

This work proposes a computationally efficient method, based on tensor decomposition, with guaranteed risk bounds for training neural networks with one hidden layer; the method provably converges to the global optimum under a set of mild non-degeneracy conditions.
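
To convey the flavor of such moment/tensor approaches, here is a hedged sketch with assumed toy dimensions, not the cited algorithm itself: for Gaussian inputs, the Stein-type moment E[y(xxᵀ − I)] of a one-hidden-layer ReLU network equals a weighted sum of the rank-one matrices w_i w_iᵀ, so its rank is at most the number of hidden units and, under mild non-degeneracy, its top eigenvectors span the hidden weights.

```python
# Hedged illustration (toy dimensions assumed, not the cited algorithm itself):
# for x ~ N(0, I), Stein's identity gives E[y (x x^T - I)] = sum_i c_i w_i w_i^T
# for a one-hidden-layer ReLU network y = sum_i a_i relu(w_i^T x + b_i),
# so this moment matrix has rank at most k and reveals span{w_i}.
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 10, 3, 200_000

W = rng.standard_normal((k, d))
b = rng.standard_normal(k)
a = rng.standard_normal(k)

X = rng.standard_normal((n, d))
y = np.maximum(X @ W.T + b, 0.0) @ a

# Empirical moment matrix M2_hat ~= E[y (x x^T - I)].
M2_hat = (X.T @ (y[:, None] * X)) / n - y.mean() * np.eye(d)

eigvals = np.sort(np.abs(np.linalg.eigvalsh(M2_hat)))[::-1]
print(np.round(eigvals, 3))   # the top k entries should dominate the rest
```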

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics
- ICML
- 2019

This work proves why stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, and establishes an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

SGD Learns the Conjugate Kernel Class of the Network

- Computer Science, Mathematics
- NIPS
- 2017

We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network.

Learning Polynomials with Neural Networks

- Mathematics, Computer Science
- ICML
- 2014

This paper shows that for a randomly initialized neural network with sufficiently many hidden units, generic gradient descent learns any low-degree polynomial, and shows that with complex-valued weights there are no "robust local minima".

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

- Computer Science, Mathematics
- NeurIPS
- 2018

It is shown that, when initialized correctly and in the many-particle limit, the gradient flow associated with training such over-parameterized models, although non-convex, converges to global minimizers; the analysis involves Wasserstein gradient flows, a tool from optimal transport theory.

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

- Computer Science, Mathematics
- ICML
- 2019

This paper analyzes training and generalization for a simple 2-layer ReLU network with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.