SGD with large step sizes learns sparse features

Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion (EPFL)

ICML 2023

Paper: https://arxiv.org/abs/2210.05337

Abstract

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics, orthogonal to the bouncing directions, that implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD training dynamics influenced by the step size schedule. These observations therefore unveil how, through the step size schedule, both gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training neural networks.
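As a concrete handle on what "sparse features" can look like in practice, one simple proxy (an illustrative choice, not necessarily the metric used in the paper) is the fraction of first-layer ReLU units that are almost never active on the data:

```python
# One possible proxy for feature sparsity in a ReLU network: the fraction of
# first-layer units that are (almost) never active on the data. This is an
# illustrative metric, not necessarily the one used in the paper.
import numpy as np

def fraction_dead_units(W, b, X, activity_threshold=0.01):
    """W: (hidden, dim) first-layer weights, b: (hidden,) biases, X: (n, dim) inputs."""
    pre = X @ W.T + b                  # pre-activations, shape (n, hidden)
    active = (pre > 0).mean(axis=0)    # per-unit fraction of inputs that activate it
    return float((active < activity_threshold).mean())

# Example with random weights (in practice, use the weights of a trained network).
rng = np.random.default_rng(0)
print(fraction_dead_units(rng.normal(size=(100, 20)), rng.normal(size=100),
                          rng.normal(size=(500, 20))))
```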

Code

The exact code to reproduce all the reported experiments on simple networks is available in Jupyter notebooks (a minimal standalone sketch of the diagonal-network setup follows the list):

  • diag_nets.ipynb: diagonal linear networks (also see diag_nets_2d_loss_surface.ipynb for loss surface visualizations).
  • fc_nets_1d_regression.ipynb: two-layer ReLU networks on a 1D regression problem.
  • fc_nets_two_layer.ipynb: two-layer ReLU networks in a teacher-student setup (+ neuron movement visualization).
  • fc_nets_multi_layer.ipynb: three-layer ReLU networks in a teacher-student setup.
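
For a rough standalone illustration of the diagonal-network experiment (a toy sketch whose data, initialization scale, and step sizes are made-up choices, not the ones used in the notebook), the predictor beta = u * v is trained with plain SGD at a large and a small step size, and the sparsity of the learned coefficients is compared:

```python
# Toy sketch of a diagonal linear network trained with plain SGD.
# All hyperparameters are illustrative; the notebook reproduces the actual figures.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3                         # samples, dimension, true sparsity
X = rng.normal(size=(n, d))
beta_star = np.zeros(d)
beta_star[:k] = 1.0                          # sparse ground-truth regressor
y = X @ beta_star + 0.1 * rng.normal(size=n)

def train_diag_net(lr, steps=20_000, batch=20, alpha=0.1):
    # Diagonal linear network: predictor x -> <u * v, x>, initialized at scale alpha.
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        xb, yb = X[idx], y[idx]
        r = xb @ (u * v) - yb                # minibatch residuals
        gu = (xb * v).T @ r / batch          # gradient of 0.5 * mean(r**2) w.r.t. u
        gv = (xb * u).T @ r / batch          # gradient w.r.t. v
        u, v = u - lr * gu, v - lr * gv
    return u * v

for lr in (0.2, 0.01):                       # "large" vs. "small" step size (toy values)
    beta = train_diag_net(lr)
    print(f"lr={lr}: fraction of near-zero coordinates =",
          f"{np.mean(np.abs(beta) < 1e-2):.2f}")
```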

For deep networks, see the folder deep_nets, where the dependencies are collected in a Dockerfile. Typical training commands for a ResNet-18 on CIFAR-10 look like this:

  • Plain SGD without explicit regularization (loss stabilization is achieved via exponential warmup):
    • with large step sizes: python train.py --dataset=cifar10 --lr_init=0.75 --lr_schedule=piecewise_05epochs --warmup_exp=1.05 --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.0 --l2_reg=0.0 --no_data_augm --eval_iter_freq=200 --exp_name=no_explicit_reg
    • with small step sizes: python train.py --dataset=cifar10 --lr_init=0.01 --lr_schedule=constant --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.0 --l2_reg=0.0 --no_data_augm --eval_iter_freq=200 --exp_name=no_explicit_reg
  • SGD + momentum in the state-of-the-art setting with data augmentation and weight decay:
    • with large step sizes: python train.py --dataset=cifar10 --lr_init=0.05 --lr_schedule=piecewise_05epochs --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.9 --l2_reg=0.0005 --eval_iter_freq=200 --exp_name=sota_setting
    • with small step sizes: python train.py --dataset=cifar10 --lr_init=0.002 --lr_schedule=constant --model=resnet18_plain --model_width=64 --epochs=100 --batch_size=256 --momentum=0.9 --l2_reg=0.0005 --eval_iter_freq=200 --exp_name=sota_setting

The runs on CIFAR-100 are analogous; just pass --dataset=cifar100. The step size schedule can be selected from [constant, piecewise_01epochs, piecewise_03epochs, piecewise_05epochs]; see utils_train.py for more details.
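
For orientation, the sketch below shows one way such a schedule could be implemented. The warmup starting value, the decay factor, and the reading of piecewise_0Xepochs as "keep the large step size for the first X0% of the epochs" are assumptions for illustration only; the actual schedules are defined in utils_train.py.

```python
# Hypothetical sketch of a piecewise step-size schedule with exponential warmup.
# The real schedules live in utils_train.py; the warmup start, the decay factor,
# and the reading of the breakpoints below are illustrative assumptions.
import math

def lr_schedule(epoch, iteration_in_epoch, iters_per_epoch, *,
                lr_init=0.75, n_epochs=100, warmup_exp=1.05,
                schedule="piecewise_05epochs", decay_factor=10.0,
                warmup_start_frac=0.01):
    if schedule == "constant":
        return lr_init
    t = epoch * iters_per_epoch + iteration_in_epoch        # global iteration
    # Exponential warmup: start at warmup_start_frac * lr_init and multiply by
    # warmup_exp every iteration until lr_init is reached (assumed reading of
    # --warmup_exp=1.05).
    warmup_iters = math.ceil(-math.log(warmup_start_frac) / math.log(warmup_exp))
    lr = lr_init * (warmup_start_frac * warmup_exp ** t if t < warmup_iters else 1.0)
    # piecewise_0Xepochs is read here as: keep the large step size for the first
    # X0% of the epochs, then decay it once (the decay factor is a guess).
    frac = {"piecewise_01epochs": 0.1,
            "piecewise_03epochs": 0.3,
            "piecewise_05epochs": 0.5}[schedule]
    if epoch >= frac * n_epochs:
        lr = lr / decay_factor
    return lr

# Example: step size at the start of epoch 60 of 100 with piecewise_05epochs.
print(lr_schedule(epoch=60, iteration_in_epoch=0, iters_per_epoch=196))
```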

Contact

Feel free to reach out if you have any questions regarding the code!