
pytorch-optimizer


Reasons to use pytorch-optimizer:

  • Wide range of supported optimizers. Currently, 86 optimizers (+ bitsandbytes, qgalore, torchao), 16 lr schedulers, and 13 loss functions are supported!
  • Including many variants such as Cautious, AdamD, and Gradient Centralization
  • Easy-to-use, clean, and tested code
  • Active maintenance
  • Somewhat more optimized than the original implementations

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few optimizers, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them in your work.

Installation

$ pip3 install pytorch-optimizer

Starting from v2.12.0 and v3.1.0, you can also use the bitsandbytes, q-galore-torch, and torchao optimizers. Please check the bnb requirements and the q-galore-torch and torchao installation guides before installing them.

From v3.0.0, Python 3.7 support is dropped. However, you can still install this package on Python 3.7 with the --ignore-requires-python option.

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use the optimizer loader by simply passing the name of the optimizer.

from pytorch_optimizer import load_optimizer

optimizer = load_optimizer(optimizer='adamp')(model.parameters())

# if you install `bitsandbytes`, you can use its `8-bit` optimizers through `pytorch-optimizer`.

optimizer = load_optimizer(optimizer='bnb_adamw8bit')(model.parameters())

Also, you can load the optimizer via torch.hub.

import torch

model = YourModel()

opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters & configs, there's the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)
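
As a quick end-to-end illustration, here is a minimal training-loop sketch (the tiny model, dummy data, and hyperparameters are placeholders for illustration, not part of the library):

import torch
from torch import nn

from pytorch_optimizer import create_optimizer

# placeholder model and data, for illustration only
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = create_optimizer(model, 'adamp', lr=1e-3, weight_decay=1e-3)

x, y = torch.randn(64, 16), torch.randn(64, 1)

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()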

Supported Optimizers

You can check the supported optimizers with the code below.

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

Or you can search them with filters.

from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']

Optimizer Description Official Code Paper Citation
AdaBelief Adapting Step-sizes by the Belief in Observed Gradients github https://arxiv.org/abs/2010.07468 cite
AdaBound Adaptive Gradient Methods with Dynamic Bound of Learning Rate github https://openreview.net/forum?id=Bkg3g2R9FX cite
AdaHessian An Adaptive Second Order Optimizer for Machine Learning github https://arxiv.org/abs/2006.00719 cite
AdamD Improved bias-correction in Adam https://arxiv.org/abs/2110.10828 cite
AdamP Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights github https://arxiv.org/abs/2006.08217 cite
diffGrad An Optimization Method for Convolutional Neural Networks github https://arxiv.org/abs/1909.11015v3 cite
MADGRAD A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization github https://arxiv.org/abs/2101.11075 cite
RAdam On the Variance of the Adaptive Learning Rate and Beyond github https://arxiv.org/abs/1908.03265 cite
Ranger a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer github https://bit.ly/3zyspC3 cite
Ranger21 a synergistic deep learning optimizer github https://arxiv.org/abs/2106.13731 cite
Lamb Large Batch Optimization for Deep Learning github https://arxiv.org/abs/1904.00962 cite
Shampoo Preconditioned Stochastic Tensor Optimization github https://arxiv.org/abs/1802.09568 cite
Nero Learning by Turning: Neural Architecture Aware Optimisation github https://arxiv.org/abs/2102.07227 cite
Adan Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models github https://arxiv.org/abs/2208.06677 cite
Adai Disentangling the Effects of Adaptive Learning Rate and Momentum github https://arxiv.org/abs/2006.15815 cite
SAM Sharpness-Aware Minimization github https://arxiv.org/abs/2010.01412 cite
ASAM Adaptive Sharpness-Aware Minimization github https://arxiv.org/abs/2102.11600 cite
GSAM Surrogate Gap Guided Sharpness-Aware Minimization github https://openreview.net/pdf?id=edONMAnhLu- cite
D-Adaptation Learning-Rate-Free Learning by D-Adaptation github https://arxiv.org/abs/2301.07733 cite
AdaFactor Adaptive Learning Rates with Sublinear Memory Cost github https://arxiv.org/abs/1804.04235 cite
Apollo An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization github https://arxiv.org/abs/2009.13586 cite
NovoGrad Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks github https://arxiv.org/abs/1905.11286 cite
Lion Symbolic Discovery of Optimization Algorithms github https://arxiv.org/abs/2302.06675 cite
Ali-G Adaptive Learning Rates for Interpolation with Gradients github https://arxiv.org/abs/1906.05661 cite
SM3 Memory-Efficient Adaptive Optimization github https://arxiv.org/abs/1901.11150 cite
AdaNorm Adaptive Gradient Norm Correction based Optimizer for CNNs github https://arxiv.org/abs/2210.06364 cite
RotoGrad Gradient Homogenization in Multitask Learning github https://openreview.net/pdf?id=T8wHz4rnuGL cite
A2Grad Optimal Adaptive and Accelerated Stochastic Gradient Descent github https://arxiv.org/abs/1810.00553 cite
AccSGD Accelerating Stochastic Gradient Descent For Least Squares Regression github https://arxiv.org/abs/1704.08227 cite
SGDW Decoupled Weight Decay Regularization github https://arxiv.org/abs/1711.05101 cite
ASGD Adaptive Gradient Descent without Descent github https://arxiv.org/abs/1910.09529 cite
Yogi Adaptive Methods for Nonconvex Optimization NIPS 2018 cite
SWATS Improving Generalization Performance by Switching from Adam to SGD https://arxiv.org/abs/1712.07628 cite
Fromage On the distance between two neural networks and the stability of learning github https://arxiv.org/abs/2002.03432 cite
MSVAG Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients github https://arxiv.org/abs/1705.07774 cite
AdaMod An Adaptive and Momental Bound Method for Stochastic Learning github https://arxiv.org/abs/1910.12249 cite
AggMo Aggregated Momentum: Stability Through Passive Damping github https://arxiv.org/abs/1804.00325 cite
QHAdam Quasi-hyperbolic momentum and Adam for deep learning github https://arxiv.org/abs/1810.06801 cite
PID A PID Controller Approach for Stochastic Optimization of Deep Networks github CVPR 18 cite
Gravity a Kinematic Approach on Optimization in Deep Learning github https://arxiv.org/abs/2101.09192 cite
AdaSmooth An Adaptive Learning Rate Method based on Effective Ratio https://arxiv.org/abs/2204.00825v1 cite
SRMM Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates github https://arxiv.org/abs/2201.01652 cite
AvaGrad Domain-independent Dominance of Adaptive Methods github https://arxiv.org/abs/1912.01823 cite
PCGrad Gradient Surgery for Multi-Task Learning github https://arxiv.org/abs/2001.06782 cite
AMSGrad On the Convergence of Adam and Beyond https://openreview.net/pdf?id=ryQu7f-RZ cite
Lookahead k steps forward, 1 step back github https://arxiv.org/abs/1907.08610 cite
PNM Manipulating Stochastic Gradient Noise to Improve Generalization github https://arxiv.org/abs/2103.17182 cite
GC Gradient Centralization github https://arxiv.org/abs/2004.01461 cite
AGC Adaptive Gradient Clipping github https://arxiv.org/abs/2102.06171 cite
Stable WD Understanding and Scheduling Weight Decay github https://arxiv.org/abs/2011.11152 cite
Softplus T Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM https://arxiv.org/abs/1908.00700 cite
Un-tuned w/u On the adequacy of untuned warmup for adaptive optimization https://arxiv.org/abs/1910.04209 cite
Norm Loss An efficient yet effective regularization method for deep neural networks https://arxiv.org/abs/2103.06583 cite
AdaShift Decorrelation and Convergence of Adaptive Learning Rate Methods github https://arxiv.org/abs/1810.00143v4 cite
AdaDelta An Adaptive Learning Rate Method https://arxiv.org/abs/1212.5701v1 cite
Amos An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale github https://arxiv.org/abs/2210.11693 cite
SignSGD Compressed Optimisation for Non-Convex Problems github https://arxiv.org/abs/1802.04434 cite
Sophia A Scalable Stochastic Second-order Optimizer for Language Model Pre-training github https://arxiv.org/abs/2305.14342 cite
Prodigy An Expeditiously Adaptive Parameter-Free Learner github https://arxiv.org/abs/2306.06101 cite
PAdam Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks github https://arxiv.org/abs/1806.06763 cite
LOMO Full Parameter Fine-tuning for Large Language Models with Limited Resources github https://arxiv.org/abs/2306.09782 cite
AdaLOMO Low-memory Optimization with Adaptive Learning Rate github https://arxiv.org/abs/2310.10195 cite
Tiger A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious github cite
CAME Confidence-guided Adaptive Memory Efficient Optimization github https://aclanthology.org/2023.acl-long.243/ cite
WSAM Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term github https://arxiv.org/abs/2305.15817 cite
Aida A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range github https://arxiv.org/abs/2203.13273 cite
GaLore Memory-Efficient LLM Training by Gradient Low-Rank Projection github https://arxiv.org/abs/2403.03507 cite
Adalite Adalite optimizer github https://github.com/VatsaDev/adalite cite
bSAM SAM as an Optimal Relaxation of Bayes github https://arxiv.org/abs/2210.01620 cite
Schedule-Free Schedule-Free Optimizers github https://github.com/facebookresearch/schedule_free cite
FAdam Adam is a natural gradient optimizer using diagonal empirical Fisher information github https://arxiv.org/abs/2405.12807 cite
Grokfast Accelerated Grokking by Amplifying Slow Gradients github https://arxiv.org/abs/2405.20233 cite
Kate Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad github https://arxiv.org/abs/2403.02648 cite
StableAdamW Stable and low-precision training for large-scale vision-language models https://arxiv.org/abs/2304.13013 cite
AdamMini Use Fewer Learning Rates To Gain More github https://arxiv.org/abs/2406.16793 cite
TRAC Adaptive Parameter-free Optimization github https://arxiv.org/abs/2405.16642 cite
AdamG Towards Stability of Parameter-free Optimization https://arxiv.org/abs/2405.04376 cite
AdEMAMix Better, Faster, Older github https://arxiv.org/abs/2409.03137 cite
SOAP Improving and Stabilizing Shampoo using Adam github https://arxiv.org/abs/2409.11321 cite
ADOPT Modified Adam Can Converge with Any β2 with the Optimal Rate github https://arxiv.org/abs/2411.02853 cite
FTRL Follow The Regularized Leader https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf
Cautious Improving Training with One Line of Code github https://arxiv.org/pdf/2411.16085v1 cite
DeMo Decoupled Momentum Optimization github https://arxiv.org/abs/2411.19870 cite
MicroAdam Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence github https://arxiv.org/abs/2405.15593 cite
Muon MomentUm Orthogonalized by Newton-schulz github https://x.com/kellerjordan0/status/1842300916864844014 cite
LaProp Separating Momentum and Adaptivity in Adam github https://arxiv.org/abs/2002.04839 cite
APOLLO SGD-like Memory, AdamW-level Performance github https://arxiv.org/abs/2412.05270 cite
MARS Unleashing the Power of Variance Reduction for Training Large Models github https://arxiv.org/abs/2411.10438 cite
SGDSaI No More Adam: Learning Rate Scaling at Initialization is All You Need github https://arxiv.org/abs/2411.10438 cite

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below.

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

Or you can search them with filters.

from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']

LR Scheduler Description Official Code Paper Citation
Explore-Exploit Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule https://arxiv.org/abs/2003.03977 cite
Chebyshev Acceleration via Fractal Learning Rate Schedules https://arxiv.org/abs/2103.01338 cite
REX Revisiting Budgeted Training with an Improved Schedule github https://arxiv.org/abs/2107.04197 cite
WSD Warmup-Stable-Decay learning rate scheduler github https://arxiv.org/abs/2404.06395 cite

Supported Loss Function

You can check the supported loss functions with the code below.

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

Or you can search them with filters.

from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

Loss Functions Description Official Code Paper Citation
Label Smoothing Rethinking the Inception Architecture for Computer Vision https://arxiv.org/abs/1512.00567 cite
Focal Focal Loss for Dense Object Detection https://arxiv.org/abs/1708.02002 cite
Focal Cosine Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble https://arxiv.org/abs/2007.07805 cite
LDAM Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss github https://arxiv.org/abs/1906.07413 cite
Jaccard (IOU) IoU Loss for 2D/3D Object Detection https://arxiv.org/abs/1908.03851 cite
Bi-Tempered Robust Bi-Tempered Logistic Loss Based on Bregman Divergences https://arxiv.org/abs/1906.03361 cite
Tversky Tversky loss function for image segmentation using 3D fully convolutional deep networks https://arxiv.org/abs/1706.05721 cite
Lovasz Hinge A tractable surrogate for the optimization of the intersection-over-union measure in neural networks github https://arxiv.org/abs/1705.08790 cite

Useful Resources

Several optimization ideas to regularize & stabilize training. Most of these ideas are applied in the Ranger21 optimizer.

Also, most of the figures are taken from the Ranger21 paper.

  • Adaptive Gradient Clipping
  • Gradient Centralization
  • Softplus Transformation
  • Gradient Normalization
  • Norm Loss
  • Positive-Negative Momentum
  • Linear learning rate warmup
  • Stable weight decay
  • Explore-exploit learning rate schedule
  • Lookahead
  • Chebyshev learning rate schedule
  • (Adaptive) Sharpness-Aware Minimization
  • On the Convergence of Adam and Beyond
  • Improved bias-correction in Adam
  • Adaptive Gradient Norm Correction

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Network) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
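
A rough sketch of the clipping rule (an illustrative, simplified version using whole-tensor norms; the paper and the library use unit-wise norms, and `clip_factor`/`eps` are assumed names):

import torch

def adaptive_gradient_clip_(param: torch.Tensor, clip_factor: float = 1e-2, eps: float = 1e-3) -> None:
    # clip the gradient so its norm never exceeds clip_factor * parameter norm
    if param.grad is None:
        return
    p_norm = param.detach().norm().clamp_(min=eps)  # simplified: whole-tensor norm instead of unit-wise
    g_norm = param.grad.detach().norm()
    max_norm = clip_factor * p_norm
    if g_norm > max_norm:
        param.grad.mul_(max_norm / (g_norm + 1e-6))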

Gradient Centralization


Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
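
A minimal sketch of the operation (illustrative, not the library's internal code):

import torch

def centralize_gradient_(grad: torch.Tensor) -> torch.Tensor:
    # subtract the mean over every dimension except the first (output) dimension
    if grad.ndim > 1:
        grad.sub_(grad.mean(dim=tuple(range(1, grad.ndim)), keepdim=True))
    return grad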

Softplus Transformation

By running the final variance denominator through the softplus function, extremely tiny values are lifted to keep them viable.
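
A minimal sketch of the idea, assuming an Adam-style second-moment buffer (the `beta` value and variable names are assumptions, not the library's code):

import torch
import torch.nn.functional as F

def softplus_denominator(exp_avg_sq: torch.Tensor, beta: float = 50.0) -> torch.Tensor:
    # instead of sqrt(v) + eps, run sqrt(v) through softplus so tiny values are lifted
    return F.softplus(exp_avg_sq.sqrt(), beta=beta)

# the update direction then becomes: exp_avg / softplus_denominator(exp_avg_sq)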

Gradient Normalization

Norm Loss


Positive-Negative Momentum


Linear learning rate warmup


Stable weight decay


Explore-exploit learning rate schedule


Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, which is updated and substituted for the current weights every k lookahead steps (5 by default).
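
A minimal sketch of the slow-weight synchronization performed every k steps (illustrative only; pytorch-optimizer ships its own Lookahead wrapper):

import torch

def lookahead_sync_(slow_params: list, fast_params: list, alpha: float = 0.5) -> None:
    # every k fast-optimizer steps: slow += alpha * (fast - slow), then copy slow back into fast
    with torch.no_grad():
        for slow, fast in zip(slow_params, fast_params):
            slow.add_(fast - slow, alpha=alpha)
            fast.copy_(slow)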

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
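
In practice this means two forward/backward passes per step. A generic sketch of the idea (not the library's SAM interface; `rho` is the neighborhood radius):

import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho: float = 0.05) -> None:
    # 1) gradient at the current weights
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None for p in model.parameters()]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads if g is not None]))
    scale = (rho / (grad_norm + 1e-12)).item()

    # 2) climb to the local worst case: w + rho * g / ||g||
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.add_(g, alpha=scale)

    # 3) gradient at the perturbed weights
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # 4) restore the original weights and step with the sharpness-aware gradient
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            if g is not None:
                p.sub_(g, alpha=scale)
    base_optimizer.step()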

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients.
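
A minimal sketch of the AMSGrad modification (illustrative; buffer names are assumptions):

import torch

def amsgrad_denominator(exp_avg_sq, max_exp_avg_sq, grad, beta2: float = 0.999, eps: float = 1e-8):
    # update the second-moment EMA, then keep its running maximum ('long-term memory')
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
    torch.maximum(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
    # dividing by the maximum guarantees the effective step size never grows back
    return max_exp_avg_sq.sqrt().add_(eps)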

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
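
A minimal sketch of one way to read the proposed fix (a simplified, illustrative update rule; variable names are assumptions): only the second moment is bias-corrected, so early updates shrink rather than grow.

import torch

def adamd_direction(exp_avg, exp_avg_sq, step: int, beta2: float = 0.999, eps: float = 1e-8):
    # bias-correct only the second moment; leave the first moment uncorrected,
    # so early updates are smaller than requested rather than larger
    v_hat = exp_avg_sq / (1.0 - beta2 ** step)
    return exp_avg / (v_hat.sqrt() + eps)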

Adaptive Gradient Norm Correction

Corrects the norm of the gradient in each iteration based on the adaptive training history of gradient norms.

Cautious optimizer

Updates only occur when the proposed update direction aligns with the current gradient.
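
A minimal sketch of the masking trick (an illustrative version of the paper's one-line change; not the library's internal code):

import torch

def cautious_update(update: torch.Tensor, grad: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # keep only the components whose sign agrees with the current gradient,
    # then rescale so the overall update magnitude is roughly preserved
    mask = (update * grad > 0).to(grad.dtype)
    mask.mul_(mask.numel() / (mask.sum() + eps))
    return update * mask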

Frequently asked questions

here

Visualization

here

Citation

Please cite the original authors of the optimization algorithms. You can easily find them in the tables above! If you use this software, please cite it as below, or get the citation from the "Cite this repository" button.

@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    version = {3.1.0},
    year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr