Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion (EPFL)
ICML 2023
Paper: https://arxiv.org/abs/2210.05337
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics, orthogonal to the bouncing directions, that biases the iterates implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD training dynamics influenced by the step size schedule. Therefore, these observations unveil how, through the step size schedule, both gradient and noise drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis sheds new light on some common practices and observed phenomena in the training of neural networks.
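To make this mechanism concrete, here is a minimal self-contained sketch in the spirit of the diagonal linear network experiments (the data distribution, initialization scale, and step sizes below are illustrative assumptions, not the paper's exact configuration). It runs single-sample SGD on a diagonal linear network for overparametrized sparse regression; sweeping the step size, one can monitor both the training loss (looking for the stabilization plateau) and the sparsity of the effective predictor `u * v`:

```python
# Minimal sketch (assumed setup): single-sample SGD on a diagonal linear network
# f(x) = <u * v, x> for overparametrized sparse regression. The paper reports that
# sufficiently large step sizes first stabilize the loss at a plateau and end up with
# a sparser predictor u * v; the concrete values below are only illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 100, 3                          # n < d: overparametrized problem
beta_star = np.zeros(d)
beta_star[:k] = 1.0                           # sparse ground-truth regressor
X = rng.standard_normal((n, d))
y = X @ beta_star

def sgd_diag_net(lr, n_iters=50_000, init_scale=0.1):
    """Run single-sample SGD on the u * v parametrization and return u * v."""
    u = init_scale * np.ones(d)
    v = init_scale * np.ones(d)
    for t in range(n_iters):
        i = rng.integers(n)                   # sample one training example
        r = X[i] @ (u * v) - y[i]             # residual on that example
        u, v = u - lr * r * X[i] * v, v - lr * r * X[i] * u
        if t % 10_000 == 0:                   # monitor the full training loss
            loss = 0.5 * np.mean((X @ (u * v) - y) ** 2)
            print(f"lr={lr:g}  iter={t:<6}  train loss={loss:.3e}")
    return u * v

for lr in (0.005, 0.1):                       # "small" vs "large" step size (illustrative)
    beta = sgd_diag_net(lr)
    print(f"lr={lr:g}: coordinates with |beta_j| > 1e-2: {(np.abs(beta) > 1e-2).sum()}\n")
```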
The exact code to reproduce all the reported experiments on simple networks is available in Jupyter notebooks:
- `diag_nets.ipynb`: diagonal linear networks (also see `diag_nets_2d_loss_surface.ipynb` for loss surface visualizations).
- `fc_nets_1d_regression.ipynb`: two-layer ReLU networks on a 1D regression problem.
- `fc_nets_two_layer.ipynb`: two-layer ReLU networks in a teacher-student setup (+ neuron movement visualization); see the minimal sketch below.
- `fc_nets_multi_layer.ipynb`: three-layer ReLU networks in a teacher-student setup.
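For orientation, a hypothetical minimal version of such a teacher-student setup could look as follows (the widths, data distribution, and hyperparameters are assumptions made only for illustration; `fc_nets_two_layer.ipynb` contains the actual experimental configuration):

```python
# Hypothetical teacher-student sketch (assumed widths/hyperparameters, for illustration
# only): a narrow ReLU teacher generates labels and a wider ReLU student is trained with
# plain SGD; afterwards one can inspect how many student neurons remain effectively active.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m_teacher, m_student, n = 20, 3, 100, 256

teacher = nn.Sequential(nn.Linear(d, m_teacher), nn.ReLU(), nn.Linear(m_teacher, 1))
student = nn.Sequential(nn.Linear(d, m_student), nn.ReLU(), nn.Linear(m_student, 1))

X = torch.randn(n, d)
with torch.no_grad():
    y = teacher(X)                                  # labels from the fixed teacher

opt = torch.optim.SGD(student.parameters(), lr=0.05)   # illustrative step size
loss_fn = nn.MSELoss()

for step in range(5_000):
    idx = torch.randint(0, n, (32,))                # mini-batch SGD
    loss = loss_fn(student(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude sparsity proxy: neurons whose incoming-weight norm times outgoing weight is
# non-negligible compared to the largest such product.
with torch.no_grad():
    w_in = student[0].weight.norm(dim=1)            # shape (m_student,)
    w_out = student[2].weight.abs().squeeze(0)      # shape (m_student,)
    scale = w_in * w_out
    active = (scale > 1e-3 * scale.max()).sum().item()
print(f"final batch loss={loss.item():.3e}, effectively active neurons: {active}/{m_student}")
```

The notebooks additionally visualize how individual neurons move during training, which this sketch omits.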
For deep networks, see the folder `deep_nets`, where the dependencies are collected in the `Dockerfile`. Typical training commands for a ResNet-18 on CIFAR-10 look like this:
- Plain SGD without explicit regularization (loss stabilization is achieved via exponential warmup):
  - with large step sizes: `python train.py --dataset=cifar10 --lr_init=0.75 --lr_schedule=piecewise_05epochs --warmup_exp=1.05 --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.0 --l2_reg=0.0 --no_data_augm --eval_iter_freq=200 --exp_name=no_explicit_reg`
  - with small step sizes: `python train.py --dataset=cifar10 --lr_init=0.01 --lr_schedule=constant --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.0 --l2_reg=0.0 --no_data_augm --eval_iter_freq=200 --exp_name=no_explicit_reg`
- SGD + momentum in the state-of-the-art setting with data augmentation and weight decay:
  - with large step sizes: `python train.py --dataset=cifar10 --lr_init=0.05 --lr_schedule=piecewise_05epochs --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.9 --l2_reg=0.0005 --eval_iter_freq=200 --exp_name=sota_setting`
  - with small step sizes: `python train.py --dataset=cifar10 --lr_init=0.002 --lr_schedule=constant --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.9 --l2_reg=0.0005 --eval_iter_freq=200 --exp_name=sota_setting`
The runs on CIFAR-100 are analogous; just set `--dataset=cifar100`. The step size schedule can be selected from [`constant`, `piecewise_01epochs`, `piecewise_03epochs`, `piecewise_05epochs`]; see `utils_train.py` for more details.
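If you only need the general shape of such a schedule for your own code, a rough sketch is below. It is only a guess at the semantics (a "piecewise" schedule plausibly keeps the large initial step size for a fraction of the epochs and then decays it, while `constant` keeps `lr_init` throughout); the authoritative definitions are in `utils_train.py`, and the decay factor and fraction here are assumptions.

```python
# Assumed illustration only, NOT the repo's implementation (see utils_train.py for the
# actual schedules): a 'piecewise_05epochs'-style schedule presumably keeps the large
# initial step size for roughly the first half of training and then decays it.
def piecewise_lr(epoch, lr_init, total_epochs, large_frac=0.5, decay=10.0):
    """Return the step size at a given epoch under this assumed piecewise schedule."""
    if epoch < large_frac * total_epochs:
        return lr_init               # large-step-size phase (loss stabilization)
    return lr_init / decay           # decayed phase (the loss can then decrease)

# Shape of the schedule for a 100-epoch run with lr_init=0.75:
print([piecewise_lr(e, 0.75, 100) for e in (0, 25, 49, 50, 75, 99)])
```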
Feel free to reach out if you have any questions regarding the code!