For Xi-Lin's original PSGD repo, see psgd_torch.
For JAX versions, see psgd_jax and distributed_kron.
Implementation of PSGD Kron for PyTorch. PSGD is a second-order optimizer originally created by Xi-Lin Li that uses either a Hessian-based or whitening-based (gg^T) preconditioner and Lie groups to improve training convergence, generalization, and efficiency. I highly suggest taking a look at the readme of Xi-Lin's PSGD repo linked above for interesting details on how PSGD works and experiments using PSGD. There are also paper resources listed near the bottom of this readme.
The most versatile and easy-to-use PSGD optimizer is `kron`, which uses a Kronecker-factored preconditioner. It has fewer hyperparameters that need tuning than Adam, and can generally act as a drop-in replacement.
Shoutout to @ClashLuke for developing efficiency improvements for PSGD Kron in the heavyball repo, and for the design of the 'smart_one_diag' memory save mode, which improves memory usage and speed at almost no cost to the optimizer's effectiveness. In Xi-Lin's repo, the equivalent is setting `preconditioner_max_skew=1`.
By default, Kron schedules the preconditioner update probability to start at 1.0 and anneal to 0.03 early in training, so training will be slightly slower at the start but speeds up by around 4k steps.
For basic usage, use the `kron` optimizer like any other PyTorch optimizer:
```python
from kron_torch import Kron

optimizer = Kron(params)

# standard PyTorch training step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
Basic hyperparameters:
TLDR: Start with a learning rate around 3x smaller than Adam's, and a weight decay 3-10x larger. There is no b2 or epsilon.
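For example, a minimal starting configuration might look like the sketch below. It assumes the `Kron` constructor accepts `lr` and `weight_decay` keyword arguments, and the concrete values are illustrative (relative to a common Adam setup of lr=1e-3, weight_decay=0.01); check kron.py for the exact names and defaults.

```python
import torch
from kron_torch import Kron

model = torch.nn.Linear(128, 256)

# Roughly 3x smaller lr and ~10x larger weight decay than a typical
# Adam configuration of lr=1e-3, weight_decay=0.01 (illustrative values).
optimizer = Kron(model.parameters(), lr=3e-4, weight_decay=0.1)
```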
These next 3 settings control whether a dimension's preconditioner is diagonal or triangular. For example, for a layer with shape (256, 128), triangular preconditioners would have shapes (256, 256) and (128, 128), while diagonal preconditioners would have shapes (256,) and (128,). Depending on how these settings are chosen, `kron` can trade off between memory/speed and effectiveness. The defaults lead to most preconditioners being triangular, except for 1-dimensional layers and very large dimensions.
`max_size_triangular`: Any dimension with size above this value will have a diagonal preconditioner.
`min_ndim_triangular`: Any tensor with fewer than this number of dims will have all diagonal preconditioners. Default is 2, so single-dim layers like bias and scale will use diagonal preconditioners.
`memory_save_mode`: Can be None, 'smart_one_diag', 'one_diag', or 'all_diag'. None is the default and lets all preconditioners be triangular. 'smart_one_diag' sets the largest dim to diagonal only if it is larger than the second-largest dim (i.e. if it stands out). 'one_diag' sets the largest or last dim per layer to diagonal using `np.argsort(shape)[::-1][0]`. 'all_diag' sets all preconditioners to be diagonal.
`preconditioner_update_probability`: The preconditioner update probability uses a schedule by default that works well for most cases. It anneals from 1.0 to 0.03 at the beginning of training, so training will be slightly slower at the start but will speed up by around 4k steps. PSGD generally benefits from more preconditioner updates at the start of training, but once the preconditioner is learned it's okay to update it less often. An easy way to adjust update frequency is to define your own schedule using the `precond_update_prob_schedule` function in kron.py (just changing the `min_prob` value is easiest) and pass it to kron through the `preconditioner_update_probability` hyperparameter, as shown in the sketch after these settings.
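Below is a sketch of how these settings might be combined. The keyword argument names follow the descriptions above, and the `kron_torch.kron` import path plus the `min_prob` keyword of `precond_update_prob_schedule` are assumptions based on this readme; verify them against kron.py before use.

```python
import torch
from kron_torch import Kron
from kron_torch.kron import precond_update_prob_schedule  # import path is an assumption

model = torch.nn.Linear(128, 256)

# Custom schedule: keep the default anneal shape, but settle at a higher
# minimum probability so the preconditioner keeps updating more often.
schedule = precond_update_prob_schedule(min_prob=0.1)

optimizer = Kron(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.1,
    max_size_triangular=8192,       # dims larger than this get diagonal preconditioners
    min_ndim_triangular=2,          # 1-D tensors (bias/scale) stay diagonal
    memory_save_mode="smart_one_diag",
    preconditioner_update_probability=schedule,
)
```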
This is the default schedule defined in the `precond_update_prob_schedule` function at the top of kron.py:
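The original readme shows a plot of this schedule here. As a rough sketch of the behavior it describes (a flat start at probability 1.0, then an exponential anneal down to 0.03 by around 4k steps), not the actual code from kron.py, something like the following; the signature and constants are assumptions:

```python
import math


def precond_update_prob_schedule(max_prob=1.0, min_prob=0.03, decay=0.001, flat_start=500):
    """Anneal the preconditioner update probability from max_prob to min_prob.

    Illustrative sketch only; parameter names and defaults are assumptions,
    see kron.py for the real implementation.
    """

    def _schedule(step):
        # Flat at max_prob for the first `flat_start` steps, then decay
        # exponentially and clamp at min_prob (~0.03 reached near step 4000
        # with these constants).
        prob = max_prob * math.exp(-decay * max(step - flat_start, 0))
        return max(min_prob, min(max_prob, prob))

    return _schedule
```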
PSGD papers and resources listed from Xi-Lin's repo
This work is licensed under a Creative Commons Attribution 4.0 International License.
2024 Evan Walters, Omead Pooladzandi, Xi-Lin Li