Optimisers.jl
Installation: OptimizationOptimisers.jl
To use this package, install the OptimizationOptimisers package:
import Pkg;
Pkg.add("OptimizationOptimisers");
In addition to the optimisation algorithms provided by the Optimisers.jl package, this subpackage also provides the Sophia optimisation algorithm.
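The sketch below shows a typical end-to-end call into this subpackage. It is only an illustration: the Rosenbrock objective, the Zygote.jl AD backend chosen via Optimization.AutoZygote(), and the Adam(0.05) optimiser with a budget of 1000 iterations are assumptions made for this example, not requirements of the package.

using Optimization, OptimizationOptimisers, Zygote

# Illustrative objective: the two-dimensional Rosenbrock function
rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2
u0 = zeros(2)
p = [1.0, 100.0]

# These optimisers need gradients, so attach an AD backend to the objective
optf = OptimizationFunction(rosenbrock, Optimization.AutoZygote())
prob = OptimizationProblem(optf, u0, p)

# An explicit iteration budget must be given via maxiters
sol = solve(prob, Adam(0.05), maxiters = 1000)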
Local Unconstrained Optimizers
Sophia: Based on the paper https://arxiv.org/abs/2305.14342. It incorporates second-order information in the form of the diagonal of the Hessian matrix, avoiding the need to compute the full Hessian. It has been shown to converge faster than first-order methods such as Adam and SGD. A usage sketch follows the parameter list below.
- solve(problem, Sophia(; η, βs, ϵ, λ, k, ρ))
- η is the learning rate
- βs are the decay rates of the momentum estimates
- ϵ is the epsilon value
- λ is the weight decay parameter
- k is the number of iterations between re-computations of the diagonal of the Hessian matrix
- ρ is the momentum
- Defaults:
  - η = 0.001
  - βs = (0.9, 0.999)
  - ϵ = 1e-8
  - λ = 0.1
  - k = 10
  - ρ = 0.04
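A usage sketch for Sophia, reusing the prob set up in the installation section example above. The keyword values shown simply restate two of the defaults, the iteration budget is arbitrary, and it is assumed that the AD backend attached to the OptimizationFunction can supply the second-order information Sophia needs.

# k sets how often the Hessian-diagonal estimate is refreshed
sol = solve(prob, Sophia(η = 0.001, k = 10), maxiters = 1000)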
Optimisers.Descent: Classic gradient descent optimizer with learning rate (a one-line usage sketch follows this list).
- solve(problem, Descent(η))
- η is the learning rate
- Defaults:
  - η = 0.1
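For example, plain gradient descent can be dropped into the solve call from the installation section sketch (the step size 0.01 is an arbitrary illustration):

sol = solve(prob, Descent(0.01), maxiters = 1000)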
Optimisers.Momentum: Classic gradient descent optimizer with learning rate and momentum
- solve(problem, Momentum(η, ρ))
- η is the learning rate
- ρ is the momentum
- Defaults:
  - η = 0.01
  - ρ = 0.9
Optimisers.Nesterov: Gradient descent optimizer with learning rate and Nesterov momentum
- solve(problem, Nesterov(η, ρ))
- η is the learning rate
- ρ is the Nesterov momentum
- Defaults:
  - η = 0.01
  - ρ = 0.9
Optimisers.RMSProp: RMSProp optimizer
- solve(problem, RMSProp(η, ρ))
- η is the learning rate
- ρ is the momentum
- Defaults:
  - η = 0.001
  - ρ = 0.9
Optimisers.Adam: Adam optimizer (a usage sketch follows this list).
- solve(problem, Adam(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
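For instance, a sketch that lowers the second-moment decay from its default, reusing prob from the installation section example (the values are illustrative only):

sol = solve(prob, Adam(1e-3, (0.9, 0.95)), maxiters = 1000)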
Optimisers.RAdam: Rectified Adam optimizer
- solve(problem, RAdam(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
Optimisers.OAdam: Optimistic Adam optimizer
- solve(problem, OAdam(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.5, 0.999)
Optimisers.AdaMax: AdaMax optimizer
- solve(problem, AdaMax(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
Optimisers.AdaGrad: AdaGrad optimizer
- solve(problem, AdaGrad(η))
- η is the learning rate
- Defaults:
  - η = 0.1
Optimisers.AdaDelta: AdaDelta optimizer (a one-line usage sketch follows this list).
- solve(problem, AdaDelta(ρ))
- ρ is the gradient decay factor
- Defaults:
  - ρ = 0.9
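AdaDelta exposes no learning rate, only the gradient decay factor, so a call looks like the sketch below, reusing prob from the installation section example (the value restates the default):

sol = solve(prob, AdaDelta(0.9), maxiters = 1000)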
Optimisers.AMSGrad: AMSGrad optimizer
- solve(problem, AMSGrad(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
Optimisers.NAdam: Nesterov variant of the Adam optimizer
- solve(problem, NAdam(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
Optimisers.AdamW: AdamW optimizer (a usage sketch follows this list).
- solve(problem, AdamW(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- decay is the weight decay applied to the parameters
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)
  - decay = 0
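A sketch of AdamW with a non-zero weight decay, reusing prob from the installation section example and assuming the Optimisers.jl constructor accepts the decay as its third positional argument (all values are illustrative):

sol = solve(prob, AdamW(1e-3, (0.9, 0.999), 1e-4), maxiters = 1000)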
Optimisers.AdaBelief: AdaBelief variant of Adam
- solve(problem, AdaBelief(η, β::Tuple))
- η is the learning rate
- β::Tuple are the decay rates of the momentum estimates
- Defaults:
  - η = 0.001
  - β::Tuple = (0.9, 0.999)