A collection of optimizer-related papers and code.
In the last column (Type), GD stands for gradient descent, S for second-order (quasi-Newton) methods, E for evolutionary, GF for gradient-free, VR for variance-reduced, and LLM for methods that use a large language model as the optimizer.
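
Many of the PyTorch implementations linked in the Code column follow the standard `torch.optim.Optimizer` interface, so trying one usually amounts to swapping a single constructor call in an ordinary training loop. A minimal sketch, using `torch.optim.Adam` (Adam, 2014, listed below) as a stand-in; the toy model and synthetic data are hypothetical placeholders:

```python
# Minimal sketch: dropping an optimizer into a standard PyTorch training loop.
# torch.optim.Adam stands in for any optimizer exposing the torch.optim interface;
# the model and data below are placeholders, not taken from any paper in this list.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # swap this line to try another optimizer
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 10)        # synthetic batch
    y = torch.randn(32, 1)
    optimizer.zero_grad()          # clear accumulated gradients
    loss = loss_fn(model(x), y)
    loss.backward()                # compute gradients
    optimizer.step()               # apply the update rule
```
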
Title | Year | Optimizer | Published | Code | Type |
---|---|---|---|---|---|
The AdEMAMix Optimizer: Better, Faster, Older | 2024 | AdEMAMix | arxiv | pytorch | GD |
FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | 2024 | FAdam | arxiv | pytorch | GD |
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | 2024 | GaLore | arxiv | pytorch | GD |
CoRe Optimizer: An All-in-One Solution for Machine Learning | 2023 | CoRe | arxiv | pytorch | GD |
AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix | 2023 | AGD | arxiv | pytorch | GD,S |
AdaLomo: Low-memory Optimization with Adaptive Learning Rate | 2023 | AdaLOMO | arxiv | pytorch | GD |
Large Language Models as Optimizers | 2023 | OPRO | arxiv | python | LLM |
Promoting Exploration in Memory-Augmented Adam using Critical Momenta | 2023 | Adam+CM | arxiv | pytorch | GD |
CAME: Confidence-guided Adaptive Memory Efficient Optimization | 2023 | CAME | acl'23 | pytorch | GD |
Full Parameter Fine-tuning for Large Language Models with Limited Resources | 2023 | LOMO | arxiv | pytorch | GD |
Prodigy: An Expeditiously Adaptive Parameter-Free Learner | 2023 | Prodigy | arxiv | pytorch | GD |
DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method | 2023 | DoWG | neurips'23 | | GD |
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | 2023 | Sophia | arxiv | pytorch | GD |
UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization | 2023 | UAdam | arxiv | | GD |
Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | 2023 | WSAM | kdd'23 | pytorch | GD |
DP-Adam: Correcting DP Bias in Adam's Second Moment Estimation | 2023 | DP-Adam | iclr-W'23 | | GD |
An Adam-enhanced Particle Swarm Optimizer for Latent Factor Analysis | 2023 | ADHPL | arxiv | | E |
DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule | 2023 | DoG | icml'23 | pytorch | GD |
FOSI: Hybrid First and Second Order Optimization | 2023 | FOSI | HPI'23 | jax | GD,S |
Symbolic Discovery of Optimization Algorithms | 2023 | Lion | neurips'23 | jax, tf, pytorch | GD |
Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | 2022 | Amos | arxiv | jax | GD |
VeLO: Training Versatile Learned Optimizers by Scaling Up | 2022 | VeLO | arxiv | jax | GD |
Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method | 2022 | GradaGrad | arxiv | | GD |
CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU | 2022 | CowClip | aaai'23 | tf | GD |
Smooth momentum: improving lipschitzness in gradient descent | 2022 | Smooth Momentum | APIN | | GD |
Towards Better Generalization of Adaptive Gradient Methods | 2020 | SAGD | neurips'20 | | GD |
An Improved Adaptive Optimization Technique for Image Classification | 2020 | Mean-ADAM | ICIEV | | GD |
SCW-SGD: Stochastically Confidence-Weighted SGD | 2020 | SCWSGD | ICIP | | GD |
Slime mould algorithm: A new method for stochastic optimization | 2020 | SMA | FGCS | code | E |
Ranger-Deep-Learning-Optimizer | 2020 | Ranger | github | pytorch | GD |
pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | 2020 | pbSGD | ijcai'20 | pytorch | GD |
A Variant of Gradient Descent Algorithm Based on Gradient Averaging | 2020 | Grad-Avg | arxiv | | GD |
Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum | 2020 | FRSGD | arxiv | | GD |
CADA: Communication-Adaptive Distributed Adam | 2020 | CADA | arxiv | pytorch, matlab | GD |
Eigenvalue-corrected Natural Gradient Based on a New Approximation | 2020 | TEKFAC | arxiv | | GD |
SMG: A Shuffling Gradient-Based Method with Momentum | 2020 | SMG | icml'21 | | GD |
SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization | 2020 | SALR | TNNLS | | GD |
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering | 2020 | MEKA | neurips-W'21 | | GD |
Mixing ADAM and SGD: a Combined Optimization Method | 2020 | MAS | arxiv | pytorch | GD |
EAdam Optimizer: How ε Impact Adam | 2020 | EAdam | arxiv | pytorch | GD |
Adam+: A Stochastic Method with Adaptive Variance Reduction | 2020 | Adam+ | arxiv | | GD |
Sharpness-aware Minimization for Efficiently Improving Generalization | 2020 | SAM | iclr'21 | jax | GD |
Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties | 2020 | Expectigrad | arxiv | tf | GD |
AEGD: Adaptive Gradient Descent with Energy | 2020 | AEGD | AIMS | pytorch | GD |
Adam with Bandit Sampling for Deep Learning | 2020 | Adambs | arxiv | | GD |
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | 2020 | AdaBelief | neurips'20 | pytorch | GD |
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | 2020 | Apollo[W] | arxiv | pytorch | GD,S |
S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise Injection for Reaching Flat Minima | 2020 | S-SGD | arxiv | | GD |
Gravilon: Applications of a New Gradient Descent Method to Machine Learning | 2020 | Gravilon | arxiv | | GD |
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization | 2020 | PAGE | icml'21 | | GD |
Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities | 2020 | Ada{ACSA,AGD+} | aaai'21 | | GD |
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training | 2020 | SNGM | arxiv | | GD |
AdaScale SGD: A User-Friendly Algorithm for Distributed Training | 2020 | AdaScale | icml'21 | | GD |
Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization | 2020 | PSTorm | JOTA | | GD |
MTAdam: Automatic Balancing of Multiple Training Loss Terms | 2020 | MTAdam | acl'21 | pytorch | GD |
AdaSGD: Bridging the gap between SGD and Adam | 2020 | AdaSGD | arxiv | | GD |
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | 2020 | AdamP | iclr'21 | pytorch | GD |
Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes | 2020 | LANS | arxiv | pytorch | GD |
AdaSwarm: Augmenting Gradient-Based optimizers in Deep Learning with Swarm Intelligence | 2020 | AdaSwarm | TETC | pytorch | E |
Enhance Curvature Information by Structured Stochastic Quasi-Newton Methods | 2020 | SKQN,S4QN | cvpr'21 | | GD |
Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs | 2020 | SHAdaGrad | arxiv | | GD |
A New Accelerated Stochastic Gradient Method with Momentum | 2020 | SGDM | arxiv | | GD |
Practical Quasi-Newton Methods for Training Deep Neural Networks | 2020 | K-BFGS[(L)] | neurips'20 | pytorch | GD |
AdaS: Adaptive Scheduling of Stochastic Gradients | 2020 | AdaS | cvpr'22 | pytorch | GD |
Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia | 2020 | Adai | icml'22 | pytorch | GD |
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | 2020 | ADAHESSIAN | aaai'21 | pytorch | GD |
Momentum with Variance Reduction for Nonconvex Composition Optimization | 2020 | MVRC-[1,2] | arxiv | | GD |
CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing | 2020 | CoolMomentum | arxiv | tf, pytorch | GD |
Gradient Centralization: A New Optimization Technique for Deep Neural Networks | 2020 | GC | eccv'20 | pytorch, tf | GD |
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory | 2020 | AdaX[-W] | arxiv | pytorch | GD |
Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale | 2020 | RM3 | arxiv | tf | GD |
TAdam: A Robust Stochastic Gradient Optimizer | 2020 | TAdam | arxiv | pytorch | GD |
Iterative Averaging in the Quest for Best Test Error | 2020 | Gadam | arxiv | | GD |
On the distance between two neural networks and the stability of learning | 2020 | Fromage | neurips'20 | pytorch | GD |
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | 2020 | SRSGD | arxiv | pytorch | GD |
Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | 2020 | SGD-G2 | arxiv | | GD |
LaProp: Separating Momentum and Adaptivity in Adam | 2020 | LaProp | arxiv | pytorch | GD |
Compositional ADAM: An Adaptive Compositional Solver | 2020 | C-ADAM | arxiv | | GD |
Biased Stochastic Gradient Descent for Conditional Stochastic Optimization | 2020 | BSGD | arxiv | | GD |
On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods | 2020 | AdamT | ijcnn'20 | pytorch | GD |
Efficient Learning Rate Adaptation for Convolutional Neural Network Training | 2019 | e-AdLR | ijcnn'19 | | GD |
ProxSGD: Training Structured Neural Networks under Regularization and Constraints | 2019 | ProxSGD | iclr'20 | tf | GD |
An Adaptive Optimization Algorithm Based on Hybrid Power and Multidimensional Update Strategy | 2019 | AdaHMG | ieee | | GD |
signSGD via Zeroth-Order Oracle | 2019 | ZO-signSGD | iclr'19 | | GF |
Fast DENSER: Efficient Deep NeuroEvolution | 2019 | F-DENSER | arxiv | tf | E |
Adathm: Adaptive Gradient Method Based on Estimates of Third-Order Moments | 2019 | Adathm | DSC | | GD |
A new perspective in understanding of Adam-Type algorithms and beyond | 2019 | AdamAL | arxiv | pytorch | GD |
CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity | 2019 | CProp | arxiv | pytorch | GD |
Domain-independent Dominance of Adaptive Methods | 2019 | AvaGrad, Delayed Adam | cvpr'21 | pytorch | GD |
Second-order Information in First-order Optimization Methods | 2019 | AdaSqrt | arxiv | tf | GD |
Does Adam optimizer keep close to the optimal point? | 2019 | AdaFix | arxiv | | GD |
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates | 2019 | AdaAlter | arxiv | mxnet | GD |
UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization | 2019 | UniXGrad | neurips'19 | | GD |
Demon: Improved Neural Network Training with Momentum Decay | 2019 | Demon {SGDM,Adam} | icassp'22 | tf | GD |
ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | 2019 | ZO-AdaMM | neurips'19 | tf | GF |
On Empirical Comparisons of Optimizers for Deep Learning | 2019 | RMSterov | arxiv | | GD |
An Adaptive and Momental Bound Method for Stochastic Learning | 2019 | AdaMod | arxiv | pytorch | GD |
On Higher-order Moments in Adam | 2019 | HAdam | arxiv | | GD |
diffGrad: An Optimization Method for Convolutional Neural Networks | 2019 | diffGrad | TNNLS | pytorch | GD |
Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | 2019 | SAMSGrad | arxiv | pytorch | GD |
On the Variance of the Adaptive Learning Rate and Beyond | 2019 | RAdam | iclr'20 | pytorch, tf | GD |
BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | 2019 | BGADAM | arxiv | | GD |
Adaloss: Adaptive Loss Function for Landmark Localization | 2019 | Adaloss | arxiv | | GD |
signADAM: Learning Confidences for Deep Neural Networks | 2019 | signADAM[++] | icdmw'19 | pytorch | GD |
The Role of Memory in Stochastic Optimization | 2019 | PolyAdam | UAI'20 | | GD |
Lookahead Optimizer: k steps forward, 1 step back | 2019 | Lookahead | neurips'19 | tf, pytorch | GD |
Momentum-Based Variance Reduction in Non-Convex SGD | 2019 | STORM | neurips'19 | pytorch | GD |
SAdam: A Variant of Adam for Strongly Convex Functions | 2019 | SAdam | iclr'20 | code | GD |
Matrix-Free Preconditioning in Online Learning | 2019 | RecursiveOptimizer | icml'19 | tf | GD |
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | 2019 | PowerSGD[M] | neurips'19 | pytorch | GD |
Fast-DENSER++: Evolving Fully-Trained Deep Artificial Neural Networks | 2019 | F-DENSER++ | arxiv | tf | E |
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | 2019 | Novograd | neurips'19 | pytorch | GD |
An Adaptive Remote Stochastic Gradient Method for Training Neural Networks | 2019 | NAMS{G,B},ARSG | arxiv | pytorch,mxnet | GD |
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates | 2019 | ArmijoLS | neurips'19 | pytorch | GD |
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | 2019 | LAMB | iclr'20 | tf,pytorch | GD |
On the Convergence Proof of AMSGrad and a New Version | 2019 | AdamX | arxiv | | GD |
An Optimistic Acceleration of AMSGrad for Nonconvex Optimization | 2019 | OPT-AMSGrad | acml'21 | | GD |
Parabolic Approximation Line Search for DNNs | 2019 | PAL | neurips'20 | pytorch | GD |
Gradient-only line searches: An Alternative to Probabilistic Line Searches | 2019 | GOLS-I | arxiv | | GD |
Adaptive Gradient Methods with Dynamic Bound of Learning Rate | 2019 | AdaBound | iclr'19 | pytorch | GD |
Memory-Efficient Adaptive Optimization | 2019 | SM3 | neurips'19 | tf | GD |
DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization | 2019 | DADAM | arxiv | matlab | GD |
On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks | 2018 | Ada{NAG,HB} | arxiv | | GD |
SADAGRAD: Strongly Adaptive Stochastic Gradient Methods | 2018 | SADAGRAD | icml'18 | | GD |
PSA-CMA-ES: CMA-ES with population size adaptation | 2018 | PSA-CMA-ES | gecco'18 | | E |
Adaptive Methods for Nonconvex Optimization | 2018 | Yogi | neurips'18 | tf | GD |
Deep Frank-Wolfe For Neural Network Optimization | 2018 | DFW | iclr'19 | pytorch | GD |
HyperAdam: A Learnable Task-Adaptive Adam for Network Training | 2018 | HyperAdam | aaai'19 | tf, pytorch | GD |
Practical Bayesian Learning of Neural Networks via Adaptive Optimisation Methods | 2018 | BADAM | icml'20 | tf | GD |
Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization | 2018 | KGD | arxiv | tf | GD |
Quasi-hyperbolic momentum and Adam for deep learning | 2018 | QHM,QHAdam | iclr'19 | pytorch, tf | GD |
AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | 2018 | AdaShift | iclr'19 | pytorch | GD |
Optimal Adaptive and Accelerated Stochastic Gradient Descent | 2018 | A2Grad{Exp,Inc,Uni} | arxiv | pytorch | GD |
Accelerating SGD with momentum for over-parameterized learning | 2018 | MaSS | arxiv | tf | GD |
Online Adaptive Methods, Universality and Acceleration | 2018 | AcceleGrad | neurips'18 | | GD |
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization | 2018 | AdaFom | iclr'19 | | GD |
AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes | 2018 | AdaGrad-Norm | icml'19 | pytorch | GD |
Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam | 2018 | VAdam | icml'18 | pytorch, tf | GD |
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | 2018 | Padam | ijcai'20 | pytorch | GD |
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis | 2018 | EKFAC | neurips'18 | pytorch | GD |
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods | 2018 | AdaBayes[FP] | neurips'18 | pytorch | GD |
Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | 2018 | NosAdam | ijcai'19 | pytorch | GD |
Small steps and giant leaps: Minimal Newton solvers for Deep Learning | 2018 | Curveball | iccv'19 | matlab | GD |
GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | 2018 | GADAM | arxiv | | GD |
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | 2018 | Adafactor | icml'18 | pytorch | GD |
Aggregated Momentum: Stability Through Passive Damping | 2018 | AggMo | iclr'19 | pytorch, tf | GD |
Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization | 2018 | Katyusha X | icml'18 | | VR |
WNGrad: Learn the Learning Rate in Gradient Descent | 2018 | WNGrad | arxiv | C++ | GD |
VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | 2018 | VR-SGD | TKDE | C++ | GD |
signSGD: Compressed Optimisation for Non-Convex Problems | 2018 | signSGD | icml'18 | mxnet | GD |
Shampoo: Preconditioned Stochastic Tensor Optimization | 2018 | Shampoo | icml'18 | tf | GD |
L4: Practical loss-based stepsize adaptation for deep learning | 2018 | L4{Adam,Momentum} | neurips'18 | pytorch, tf | GD |
On the Convergence of Adam and Beyond | 2018 | AMSGrad, AdamNC | iclr'18 | pytorch | GD |
SW-SGD: The Sliding Window Stochastic Gradient Descent Algorithm | 2017 | SW-SGD | PCS | | GD |
Improving Generalization Performance by Switching from Adam to SGD | 2017 | SWATS | iclr'18 | pytorch | GD |
Noisy Natural Gradient as Variational Inference | 2017 | Noisy {Adam,K-FAC} | icml'18 | tf | GD |
AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training | 2017 | AdaComp | aaai'18 | | GD |
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks | 2017 | AdaBatch | iclr-W'18 | pytorch | GD |
First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time | 2017 | NEON | neurips'18 | | GD |
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning | 2017 | BPGrad | cvpr'18 | matlab | GD |
Decoupled Weight Decay Regularization | 2017 | AdamW,SGDW | iclr'19 | lua | GD |
Evolving Deep Convolutional Neural Networks for Image Classification | 2017 | EvoCNN | ITEC | python | E |
Normalized Direction-preserving Adam | 2017 | ND-Adam | arxiv | pytorch, tf | GD |
Regularizing and Optimizing LSTM Language Models | 2017 | NT-ASGD | iclr'18 | pytorch | GD |
Natasha 2: Faster Non-Convex Optimization Than SGD | 2017 | Natasha{1.5,2} | neurips'18 | | GD |
Large Batch Training of Convolutional Networks | 2017 | LARS | arxiv | pytorch | GD |
Practical Gauss-Newton Optimisation for Deep Learning | 2017 | KFRA, KFLR | icml'17 | | GD |
YellowFin and the Art of Momentum Tuning | 2017 | YellowFin | arxiv | tf | GD |
Variants of RMSProp and Adagrad with Logarithmic Regret Bounds | 2017 | SC-{Adagrad,RMSProp} | icml'17 | pytorch | GD |
Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | 2017 | M-SVAG | icml'18 | tf | GD |
Training Deep Networks without Learning Rates Through Coin Betting | 2017 | COCOB | neurips'17 | tf | GD |
Sub-sampled Cubic Regularization for Non-convex Optimization | 2017 | SCR | icml'17 | numpy | S |
Online Convex Optimization with Unconstrained Domains and Losses | 2017 | RescaledExp | neurips'16 | | GD |
Evolving Deep Neural Networks | 2017 | CoDeepNEAT | arxiv | tf | E |
SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient | 2017 | SARAH | icml'17 | | VR |
IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate | 2017 | IQN | icassp'17 | C++ | GD,S |
NMODE --- Neuro-MODule Evolution | 2017 | NMODE | arxiv | C++ | E |
The Whale Optimization Algorithm | 2016 | WOA | AES | numpy | E |
Incorporating Nesterov Momentum into Adam | 2016 | Nadam | arxiv | pytorch | GD |
Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates | 2016 | Eve | arxiv | pytorch | GD |
Direct Feedback Alignment Provides Learning in Deep Neural Networks | 2016 | DFA | neurips'16 | numpy | GD |
SGDR: Stochastic Gradient Descent with Warm Restarts | 2016 | SGDR | iclr'17 | theano | GD |
Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization | 2016 | Damp-oBFGS-Inf | SIAM | pytorch | GD,S |
A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order | 2016 | ZO-SCD | neurips'16 | | GF |
Barzilai-Borwein Step Size for Stochastic Gradient Descent | 2016 | {SGD,SVRG}-BB | neurips'16 | numpy | GD |
Adaptive Learning Rate via Covariance Matrix Based Preconditioning for Deep Neural Networks | 2016 | SDProp | ijcai'17 | | GD |
Katyusha: The First Direct Acceleration of Stochastic Gradient Methods | 2016 | Katyusha | stoc'17 | | VR |
Accelerating SVRG via second-order information | 2015 | SVRG+{I,II} | arxiv | | GD,S |
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs | 2015 | adaQN | ecml'16 | numpy | GD,S |
A Linearly-Convergent Stochastic L-BFGS Algorithm | 2015 | SVRG-SQN | aistats | julia | GD,S |
Optimizing Neural Networks with Kronecker-factored Approximate Curvature | 2015 | K-FAC | icml'15 | tf | GD |
Probabilistic Line Searches for Stochastic Optimization | 2015 | ProbLS | JMLR | | GD |
Scale-Free Algorithms for Online Linear Optimization | 2015 | AdaFTRL | alt'15 | | GD |
Adam: A Method for Stochastic Optimization | 2014 | Adam, AdaMax | iclr'15 | pytorch | GD |
Random feedback weights support learning in deep neural networks | 2014 | FA | arxiv | pytorch | GD |
A Computationally Efficient Limited Memory CMA-ES for Large Scale Optimization | 2014 | LM-CMA-ES | gecco'14 | | E |
A Proximal Stochastic Gradient Method with Progressive Variance Reduction | 2014 | Prox-SVRG | SIAM | tf, numpy | VR |
RES: Regularized Stochastic BFGS Algorithm | 2014 | Reg-oBFGS-Inf | arxiv | | GD,S |
A Stochastic Quasi-Newton Method for Large-Scale Optimization | 2014 | SQN | SIAM | matlab | GD,S |
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives | 2014 | SAGA | neurips'14 | numpy | VR |
Accelerating stochastic gradient descent using predictive variance reduction | 2013 | SVRG | neurips'13 | pytorch | VR |
Ad Click Prediction: a View from the Trenches | 2013 | FTRL | kdd'13 | pytorch | GD |
Semi-Stochastic Gradient Descent Methods | 2013 | S2GD | arxiv | | VR |
Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming | 2013 | ZO-SGD | SIAM | | GF |
Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization | 2013 | ZO-{ProxSGD,PSGD} | arxiv | | GF |
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients | 2013 | vSGD-fd | arxiv | | GD |
Neural Networks for Machine Learning | 2012 | RMSProp | coursera | tf | GD |
An Enhanced Hypercube-Based Encoding for Evolving the Placement, Density, and Connectivity of Neurons | 2012 | ES-HyperNEAT | AL | go | E |
CMA-TWEANN: efficient optimization of neural networks via self-adaptation and seamless augmentation | 2012 | CMA-TWEANN | gecco'12 | | E |
ADADELTA: An Adaptive Learning Rate Method | 2012 | ADADELTA | arxiv | pytorch | GD |
No More Pesky Learning Rates | 2012 | vSGD-{b,g,l} | icml'13 | lua | VR |
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets | 2012 | SAG | neurips'12 | | VR |
CMA-ES: evolution strategies and covariance matrix adaptation | 2011 | CMA-ES | gecco'12 | tf | E |
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | 2011 | AdaGrad | JMLR | pytorch,C++ | GD |
AdaDiff: Adaptive Gradient Descent with the Differential of Gradient | 2010 | AdaDiff | iopscience | | GD |
A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks | 2009 | HyperNEAT | AL | | E |
Scalable training of L1-regularized log-linear models | 2007 | OWL-QN | acm | javascript | GD,S |
A Stochastic Quasi-Newton Method for Online Convex Optimization | 2007 | O-LBFGS | icml'07 | | GD,S |
Online convex programming and generalized infinitesimal gradient ascent | 2003 | OGD | icml'03 | | GD |
A Limited Memory Algorithm for Bound Constrained Optimization | 2003 | L-BFGS-B | SIAM | fortran, matlab | GD,S |
Evolving Neural Networks through Augmenting Topologies | 2002 | NEAT | EC | numpy | E |
Trust region methods | 2000 | Sub-sampled TR | SIAM | | S |
Particle swarm optimization | 1995 | PSO | icnn'95 | | E |
A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm | 1993 | RPROP | icnn'93 | pytorch | GD |
Acceleration of Stochastic Approximation by Averaging | 1992 | ASGD | SIAM | pytorch | GD |
On the limited memory BFGS method for large scale optimization | 1989 | L-BFGS | MP | | GD,S |
Large-scale linearly constrained optimization | 1978 | MINOS | MP | pytorch | GD,S |
Some methods of speeding up the convergence of iteration methods | 1964 | Polyak (momentum) | paper | | GD |
A Stochastic Approximation Method | 1951 | SGD | paper | pytorch | GD |
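
Several of the older entries (SGD, Polyak momentum, AdaGrad, ADADELTA, RMSProp, Adam/AdaMax, AdamW, Nadam, RAdam, ASGD, L-BFGS) also ship with PyTorch itself under `torch.optim`. A rough comparison sketch, assuming a recent PyTorch release; the quadratic objective and hyperparameters are illustrative placeholders, not a benchmark from any paper above:

```python
# Compare a few optimizers from this list that are built into torch.optim
# on a toy convex objective (minimum at zero). Purely illustrative.
import torch

candidates = {
    "SGD (1951) + Polyak momentum (1964)": lambda p: torch.optim.SGD(p, lr=0.1, momentum=0.9),
    "AdaGrad (2011)":  lambda p: torch.optim.Adagrad(p, lr=0.1),
    "ADADELTA (2012)": lambda p: torch.optim.Adadelta(p, lr=1.0),
    "RMSProp (2012)":  lambda p: torch.optim.RMSprop(p, lr=0.01),
    "Adam (2014)":     lambda p: torch.optim.Adam(p, lr=0.1),
    "AdamW (2017)":    lambda p: torch.optim.AdamW(p, lr=0.1),
    "RAdam (2019)":    lambda p: torch.optim.RAdam(p, lr=0.1),
}

for name, make_opt in candidates.items():
    x = torch.full((2,), 5.0, requires_grad=True)  # start away from the optimum
    opt = make_opt([x])
    for _ in range(200):
        opt.zero_grad()
        loss = (x ** 2).sum()                      # simple quadratic objective
        loss.backward()
        opt.step()
    print(f"{name:38s} final loss = {loss.item():.3e}")
```
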