OptimizationSophia.jl

OptimizationSophia.jl provides the Sophia optimizer, a second-order method aimed primarily at neural network training, for use with the standard OptimizationProblem/solve interface.

Installation

To use this package, install OptimizationSophia with the Julia package manager:

using Pkg
Pkg.add("OptimizationSophia")

Methods

OptimizationSophia.Sophia — Type
Sophia(; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04)

A second-order optimizer that incorporates diagonal Hessian information for faster convergence.

Based on the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" (https://arxiv.org/abs/2305.14342). Sophia uses an efficient estimate of the diagonal of the Hessian matrix to adaptively adjust the learning rate for each parameter, achieving faster convergence than first-order methods like Adam and SGD while avoiding the computational cost of full second-order methods.
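
In outline, each step keeps an exponential moving average of the gradient and, every k iterations, refreshes a running estimate of the Hessian diagonal; the update divides the momentum by the curvature estimate, clips it elementwise to ±ρ, and applies decoupled weight decay. The following is a minimal, schematic sketch of one step in terms of the keyword arguments above; it illustrates the method from the paper and is not the package's internal code:

# Schematic Sophia step for a parameter vector θ (illustrative sketch, not OptimizationSophia internals).
# m and h are running estimates of the gradient and of the Hessian diagonal;
# hess_diag would come from a stochastic diagonal estimator such as Hutchinson's.
function sophia_step!(θ, m, h, grad, hess_diag, t;
        η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04)
    β1, β2 = βs
    @. m = β1 * m + (1 - β1) * grad          # momentum (EMA of the gradient)
    if t % k == 1
        @. h = β2 * h + (1 - β2) * hess_diag # refresh curvature estimate every k steps
    end
    @. θ -= η * λ * θ                        # decoupled weight decay
    @. θ -= η * clamp(m / max(h, ϵ), -ρ, ρ)  # clipped, curvature-scaled update
    return θ
end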

Arguments

  • η::Float64 = 1e-3: Learning rate (step size)
  • βs::Tuple{Float64, Float64} = (0.9, 0.999): Exponential decay rates for the first moment (β₁) and diagonal Hessian (β₂) estimates
  • ϵ::Float64 = 1e-8: Small constant for numerical stability
  • λ::Float64 = 1e-1: Weight decay coefficient for L2 regularization
  • k::Integer = 10: Frequency of Hessian diagonal estimation (every k iterations)
  • ρ::Float64 = 0.04: Clipping threshold for the update to maintain stability

Example

using OptimizationBase, OptimizationSophia

# Define optimization problem
rosenbrock(x, p) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x0 = zeros(2)
optf = OptimizationFunction(rosenbrock, OptimizationBase.AutoZygote())
prob = OptimizationProblem(optf, x0)

# Solve with Sophia
sol = solve(prob, Sophia(η = 0.01, k = 5), maxiters = 1000)  # specify an iteration budget
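
The returned sol follows the standard SciML solution interface, so the result can be inspected directly:

sol.u          # minimizer found by Sophia
sol.objective  # objective value at the minimizer
sol.retcode    # termination status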

Notes

Sophia is particularly effective for:

  • Large-scale optimization problems
  • Neural network training
  • Problems where second-order information can significantly improve convergence

The algorithm maintains computational efficiency by only estimating the diagonal of the Hessian matrix using a Hutchinson trace estimator with random vectors, making it more scalable than full second-order methods while still leveraging curvature information.
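
Concretely, the Hutchinson identity is diag(H) = E[u .* (H * u)] for random probe vectors u with independent ±1 entries. The short sketch below illustrates this on the Rosenbrock function from the example above; it forms the exact Hessian with ForwardDiff purely for comparison, whereas in practice Sophia only needs Hessian-vector products:

using ForwardDiff, LinearAlgebra, Statistics

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2            # Rosenbrock, as above
x = [0.5, 0.5]
H = ForwardDiff.hessian(f, x)                             # exact Hessian, for comparison only

probes = [rand([-1.0, 1.0], length(x)) for _ in 1:1000]   # Rademacher probe vectors
diag_est = mean(u .* (H * u) for u in probes)             # Hutchinson estimate of diag(H)

@show diag(H) diag_est                                    # the estimate matches the true diagonal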


Examples

Train NN with Sophia

using OptimizationBase, OptimizationSophia, Lux, ADTypes, Zygote, MLUtils, Statistics, Random, ComponentArrays

# Training data: learn y = sin(x) from random samples, streamed in mini-batches
x = rand(10000)
y = sin.(x)
data = MLUtils.DataLoader((x, y), batchsize = 100)

# Define the neural network
model = Chain(Dense(1, 32, tanh), Dense(32, 1))
ps, st = Lux.setup(Random.default_rng(), model)
ps_ca = ComponentArray(ps)                           # flatten the parameters into a single vector for the optimizer
smodel = StatefulLuxLayer{true}(model, nothing, st)  # wrap model and state for convenient calls

# Callback: log progress periodically and stop early once the loss is small
function callback(state, l)
    state.iter % 25 == 1 && println("Iteration: $(state.iter), Loss: $l")
    return l < 1e-1 # returning true terminates the optimization
end

# Mini-batch loss: sum of squared errors of the network predictions
function loss(ps, data)
    x_batch, y_batch = data
    ypred = [smodel([x_batch[i]], ps)[1] for i in eachindex(x_batch)]
    return sum(abs2, ypred .- y_batch)
end

optf = OptimizationFunction(loss, ADTypes.AutoZygote())
prob = OptimizationProblem(optf, ps_ca, data)

res = solve(prob, OptimizationSophia.Sophia(), callback = callback, epochs = 100)
retcode: Success
u: ComponentVector{Float32}(layer_1 = (weight = Float32[-0.49531996; 2.1828232; … ; 1.6459857; 2.4595056;;], bias = Float32[-0.5366939, 0.35397622, -0.2590761, -0.8706155, -0.6247321, 0.003280748, -0.0031517793, 0.40579414, -0.42528516, 0.2781369  …  -0.67254555, -0.731019, 0.07436459, -0.7001152, -0.76863915, -0.7966375, 0.21679728, 0.03269669, -0.3116036, -0.21531573]), layer_2 = (weight = Float32[-0.23208924 -0.11179227 … 0.032785162 0.23077805], bias = Float32[0.09673637]))
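
The trained parameters are stored in res.u and can be plugged back into the wrapped model, for example to check the fit at a single point (an illustrative snippet):

xtest = 0.5
ypred = smodel([xtest], res.u)[1]   # network prediction at xtest
@show ypred sin(xtest)              # should be close after training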