The old PSGD implementation is deprecated. The new implementation is a superset of the old one and further supports three more matmul-only/inverse-free methods for updating $Q$. The recommended choices for $dQ$ are QUAD and QEQ.
PSGD (Preconditioned SGD) is a general-purpose (mathematical and stochastic, convex and nonconvex) second-order optimizer. It reformulates a wide range of preconditioner estimation and Hessian fitting problems as a family of strongly convex optimization problems on Lie groups.
Notations: $E_z[\ell(\theta, z)]$ or $\ell(\theta)$ is the loss; $g$ is the (stochastic) gradient with respect to $\theta$; $H$ is the Hessian; $h=Hv$ is the Hessian-vector product (Hvp) with ${v\sim\mathcal{N}(0,I)}$; $P=Q^TQ$ is the preconditioner applied to $g$; ${\rm tri}$ takes the upper or lower triangular part of a matrix; $\lVert \cdot \rVert$ denotes the spectral norm; superscripts $^T$, $^*$ and $^H$ denote transpose, conjugate and Hermitian transpose, respectively.
The PSGD theory has two orthogonal parts: criteria for preconditioner fitting and preconditioner fitting in Lie groups.
### Criteria for preconditioner fitting

PSGD was originally designed for preconditioning the gradient such that the metrics of the preconditioned-gradient and parameter spaces are matched, i.e., $E_{\delta \theta, z}[(P\delta g)(P\delta g)^T] = E_{\delta \theta, z}[\delta \theta \delta \theta^T]$, where $\delta$ denotes the perturbation operation and $P$ is symmetric positive definite (SPD). This leads to the original preconditioner fitting criterion $E_{\delta\theta, z}[\delta g^T P \delta g + \delta \theta^T P^{-1} \delta \theta]$ ref. The finite-difference notation may not be common in machine learning (ML), but note that PSGD was invented before popular automatic differentiation (AD) tools like TensorFlow, when manually calculating the Hvp was cumbersome. With AD, we can simply replace the pair $(\delta \theta, \delta g)$ with $(v, h)$ to obtain the Newton-style preconditioner fitting criterion $E_{v, z}[h^T P h + v^T P^{-1} v]$. For the gradient/momentum whitening preconditioner, we instead replace the pair $(\delta \theta, \delta g)$ with $(v, g)$ to get the criterion $E_{v, z}[g^T P g + v^T P^{-1} v]$ ref, where $v$ is an auxiliary variable that can be optionally integrated out since it is independent of $g$.
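As a concrete illustration (a minimal PyTorch sketch, not the library's internal code), the two pairs can be formed with standard autograd calls:

```python
import torch

# Toy parameter and loss, just to show how the (v, h) and (v, g) pairs are formed.
theta = torch.randn(10, requires_grad=True)
loss = (theta ** 4).sum() + (theta[:-1] * theta[1:]).sum()

# Gradient g; keep the graph so that it can be differentiated once more.
g, = torch.autograd.grad(loss, theta, create_graph=True)

v = torch.randn_like(theta)                          # v ~ N(0, I)
h, = torch.autograd.grad(g, theta, grad_outputs=v)   # h = Hv as a vector-Jacobian product

# (v, h) drives the Newton-style criterion; (v, g.detach()) drives the whitening criterion.
```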
### Preconditioner fitting in Lie groups

The above preconditioner fitting criteria are always convex in the Euclidean space, on the manifold of SPD matrices, and in the Lie groups, but they are strongly convex only in the Lie groups ref. The $Q$ here defines the coordinate transform $\vartheta=Q^{-T}\theta$ such that PSGD reduces to SGD for $\vartheta$. The Lie group is a natural tool for this purpose: it preserves invariances like the coordinate orientations, so $Q$ is always invertible, and its multiplicative updates avoid explicit matrix inversion. There are virtually endless choices for the group form of $Q$, say the Kronecker product preconditioner ref, the affine Lie group ref, and the low-rank approximation (LRA) group ref.
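To spell out the coordinate-transform argument (a one-line chain-rule check): with $\theta = Q^T\vartheta$, the gradient with respect to $\vartheta$ is $Qg$, so plain SGD in the $\vartheta$ coordinates is exactly preconditioned SGD in the original coordinates,

$$\vartheta \leftarrow \vartheta - \mu\, Qg \quad\Longleftrightarrow\quad \theta \leftarrow \theta - \mu\, Q^TQ\, g = \theta - \mu P g .$$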
Table I: Variations of the preconditioner fitting criterion

| Criterion | Solution | Notes |
| --- | --- | --- |
| $h^TPh + v^TP^{-1}v$ | $Phh^TP = vv^T$ | Reduces to the secant equation $Ph=v$ when $v^Th>0$ (see quasi-Newton methods, e.g., BFGS). |
| $E_v[h^TPh + v^TP^{-1}v]$ | $P^{-2}=H^2$ | Reduces to Newton's method when $H\succ 0$. |
| $E_{v,z}[g_z^TPg_z + v^TP^{-1}v]$ | $P^{-2}=E_z[g_zg_z^T]$ | $P^{-2}$ reduces to the Fisher information matrix $F$ with per-sample gradient $g_z$ (see Gauss-Newton and natural gradient methods, e.g., KFAC). |
| $\sum_t E_{v_t}[g_t^TPg_t + v_t^TP^{-1}v_t]$ | $P^{-2}=\sum_t g_t g_t^T$ | Relates to the AdaGrad family, e.g., Adam(W), RMSProp, Shampoo, $\ldots$ |

Note 1: $v$ can be a nuisance or an auxiliary variable in the last two criteria since it is independent of $g$ and can be integrated out as $E_{v\sim\mathcal{N}(0,I)}[v^TP^{-1}v]={\rm tr}(P^{-1})$, i.e., Hutchinson's estimator.
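As a sanity check on the second row (a standard calculation, included only for completeness): with $h = Hv$ and $v\sim\mathcal{N}(0, I)$,

$$E_v[h^TPh + v^TP^{-1}v] = {\rm tr}(PH^2) + {\rm tr}(P^{-1}),$$

and setting the derivative with respect to $P$ to zero gives $H^2 - P^{-2} = 0$, i.e., $P = |H|^{-1}$, which recovers Newton's method when $H\succ 0$. The same calculation with $g_z$ in place of $h$ yields $P^{-2}=E_z[g_zg_z^T]$ for the whitening criterion.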
Table II: Lie group ($dQ=EQ$) preconditioners with storage and computation costs for $\theta={\rm vec}(\Theta)$ with $\Theta\in\mathbb{R}^{m\times m}$

| Lie group | Update of $Q$ ($0<\mu\le 2$) | Storage | Computation | Class |
| --- | --- | --- | --- | --- |
| ${\rm GL}(n, \mathbb{R})$ | $Q\leftarrow \left( I - \mu \frac{Qhh^TQ^T - Q^{-T}vv^TQ^{-1}}{ \lVert Qh\rVert ^2 + \lVert Q^{-T}v\rVert^2 } \right) Q$ | $\mathcal{O}(m^4)$ | $\mathcal{O}(m^4)$ | DenseNewton |
| Tri matrices | $Q\leftarrow {\rm tri}\left( I - \mu \frac{Qhh^TQ^T - Q^{-T}vv^TQ^{-1}}{ \lVert Qh\rVert^2 + \lVert Q^{-T}v\rVert^2 } \right) Q$ | $\mathcal{O}(m^4)$ | $\mathcal{O}(m^6)$ | DenseNewton |
| $Q={\rm diag}(q)$ | $q\leftarrow \left( 1 - \mu \frac{(qh)^2 - (v/q)^2}{ \max\left((qh)^2 + (v/q)^2\right)} \right) q$ | $\mathcal{O}(m^2)$ | $\mathcal{O}(m^2)$ | LRAWhiten/Newton |
| ${\rm kron}(Q_2,Q_1)$ | $A=Q_1 {\rm uvec}(h) Q_2^H$, $B=Q_2^{-H} [{\rm uvec}(v)]^H Q_1^{-1}$, $Q_1\leftarrow {\rm tri}\left( I - \mu \frac{AA^H-B^HB}{\lVert AA^H+B^HB \rVert} \right) Q_1$, $Q_2\leftarrow {\rm tri}\left( I - \mu \frac{A^HA-BB^H}{\lVert A^HA+BB^H \rVert} \right) Q_2$ | $\mathcal{O}(m^2)$ | $\mathcal{O}(m^3)$ | KronWhiten/Newton |
| ${\rm kron}(Q_1,Q_2,\ldots)$ | $A_{ab\ldots}=(Q_1)_{a \alpha}(Q_2)_{b \beta}\ldots ({\rm uvec}(h))_{\alpha\beta\ldots}$, $B^*_{ab\ldots}=({\rm uvec}(v^*))_{\alpha\beta\ldots} (Q_1^{-1})_{\alpha a} (Q_2^{-1})_{\beta b}\ldots$, $(Q_i)_{ac}\leftarrow {\rm tri}\left( I_{ab} - \mu \frac{A_{\ldots a\ldots}A^*_{\ldots b\ldots}-B_{\ldots a\ldots}B^*_{\ldots b\ldots}}{\lVert A_{\ldots a\ldots}A^*_{\ldots b\ldots}+B_{\ldots a\ldots}B^*_{\ldots b\ldots} \rVert} \right) (Q_i)_{bc}$ | $\mathcal{O}(m^2)$ | $\mathcal{O}(m^3)$ | KronWhiten/Newton |
| $Q=(I+UV^T){\rm diag}(d)$, $U, V \in \mathbb{R}^{n\times r}$, $0\le r\ll n$ | $a=Qh$, $b=Q^{-T}v$, $Q\leftarrow Q-\mu\,{\rm diag}(a^2-b^2)\,Q/\max(a^2+b^2)$, $U\leftarrow U - \mu\frac{(aa^T-bb^T)V(I+V^TU)}{\lVert a\rVert \, \lVert VV^Ta \rVert + \lVert b\rVert \, \lVert VV^Tb\rVert }$, $V\leftarrow V - \mu\frac{ (I+VU^T)(aa^T-bb^T)U }{\lVert a\rVert \, \lVert UU^Ta\rVert + \lVert b\rVert \, \lVert UU^Tb\rVert}$ | $\mathcal{O}(rm^2)$ | $\mathcal{O}(rm^2)$ | LRAWhiten/Newton |
| ${\rm diag}(q_1)\otimes{\rm diag}(q_2)\otimes\ldots$ | same as kron | $\mathcal{O}(m)$ | $\mathcal{O}(m^2)$ | KronWhiten/Newton |

Note 1: The other three inverse-free preconditioner update methods have similar forms and complexities. Please check ref for further details.
Note 2: For the gradient/momentum whitening preconditioner, we simply replace the pair $(v, h)$ with $(v, g)$, where $v$ is a dummy variable that can be optionally integrated out.
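For example, the diagonal-group row of Table II translates almost verbatim into code (a minimal sketch, not the library's implementation):

```python
import torch

def update_diag_precond(q, v, h, mu=1.0):
    """One multiplicative, inverse-free step for Q = diag(q) (Table II), 0 < mu <= 2:
    q <- (1 - mu * ((q*h)^2 - (v/q)^2) / max((q*h)^2 + (v/q)^2)) * q."""
    a = (q * h) ** 2
    b = (v / q) ** 2
    return (1 - mu * (a - b) / torch.max(a + b)) * q

def precondition(q, g):
    """Preconditioned gradient P g with P = Q^T Q = diag(q^2)."""
    return q * q * g
```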
This script generates the following plot showing the typical behaviors of different Hessian fitting methods.
Optimizers with the criteria in Table I and preconditioner forms in Table II are wrapped into classes KronWhiten/Newton, LRAWhiten/Newton and DenseNewton for easy use.
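A hypothetical usage sketch is given below; the import path, the constructor arguments and the closure-based `step` call are assumptions made here for illustration, so please consult the class definitions for the actual signatures:

```python
import torch
import preconditioned_stochastic_gradient_descent as psgd  # assumed import; adjust to the repo layout

model = torch.nn.Linear(100, 10)
# Assumed constructor; learning rates, preconditioner initialization, etc. are left at their defaults.
opt = psgd.KronWhiten(model.parameters())

def closure():
    x, y = torch.randn(32, 100), torch.randn(32, 10)
    return torch.nn.functional.mse_loss(model(x), y)

for _ in range(100):
    # Assumed closure-style step: Newton-type classes need the loss itself (for Hvps),
    # while the gradient-whitening classes only need gradients.
    loss = opt.step(closure)
```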
Three main differences from torch.optim.SGD:
A few more details: the Hessian-vector products are calculated as vector-Jacobian products (vjp), i.e., ${\rm autograd.grad}(g, \theta, v)$ in torch, which may not always be the most efficient way for a specific problem. Except for the Kronecker product preconditioners, there is no native support for complex parameter optimization (you can define complex parameters as views of real ones in order to use the other preconditioners).
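A minimal sketch of that workaround, assuming one uses `torch.view_as_complex` (the shapes and the tiny forward function here are made up for illustration):

```python
import torch

# Trainable storage stays real; the trailing dimension of size 2 holds (real, imag).
W_real = torch.nn.Parameter(torch.randn(64, 32, 2))

def forward(x):  # x: complex tensor of shape (batch, 32)
    W = torch.view_as_complex(W_real)  # complex view; gradients flow back to the real storage
    return x @ W.t()

# Any of the real-valued preconditioners can then be applied to W_real's gradient.
```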
There are plenty of demos: Rosenbrock function minimization, vision transformer, generative pre-trained transformer, logistic regression, tensor rank decomposition, etc. For this tiny vision transformer demo, the following results show that all four PSGD-Kron gradient-whitening preconditioners improve convergence considerably compared with Adam(W).
The inverse-free versions of PSGD also work well with half precision. Please check this GPT2 example to reproduce the following results.