Optimization.jl

Some solvers are available directly in the Optimization.jl package, without the need to install any of the solver wrapper packages.

Methods

  • LBFGS: A popular quasi-Newton method that uses a limited-memory BFGS approximation of the inverse Hessian. It is implemented as a wrapper over the L-BFGS-B Fortran routine accessed through the LBFGSB.jl package and directly supports box constraints. Arbitrary nonlinear constraints are also handled via an Augmented Lagrangian method with bound constraints, as described in Section 17.4 of Numerical Optimization by Nocedal and Wright, making it a general-purpose nonlinear optimization solver available directly in Optimization.jl. A minimal box-constrained sketch is shown below; see the Examples section for nonlinear constraints.
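
A minimal box-constrained sketch (illustrative starting point and bounds chosen here for demonstration; lb and ub are the standard OptimizationProblem bounds keywords):

using Optimization, Zygote

rosenbrock(x, p) = (p[1] - x[1])^2 + p[2] * (x[2] - x[1]^2)^2
optf = OptimizationFunction(rosenbrock, AutoZygote())
# Box constraints are passed through the problem's `lb`/`ub` keywords
prob = OptimizationProblem(optf, zeros(2), [1.0, 100.0];
    lb = [-1.0, -1.0], ub = [0.8, 0.8])
sol = solve(prob, Optimization.LBFGS())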
Optimization.Sophia — Type
Sophia(; η = 1e-3, βs = (0.9, 0.999), ϵ = 1e-8, λ = 1e-1, k = 10, ρ = 0.04)

A second-order optimizer that incorporates diagonal Hessian information for faster convergence.

Based on the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" (https://arxiv.org/abs/2305.14342). Sophia uses an efficient estimate of the diagonal of the Hessian matrix to adaptively adjust the learning rate for each parameter, achieving faster convergence than first-order methods like Adam and SGD while avoiding the computational cost of full second-order methods.

Arguments

  • η::Float64 = 1e-3: Learning rate (step size)
  • βs::Tuple{Float64, Float64} = (0.9, 0.999): Exponential decay rates for the first moment (β₁) and diagonal Hessian (β₂) estimates
  • ϵ::Float64 = 1e-8: Small constant for numerical stability
  • λ::Float64 = 1e-1: Weight decay coefficient for L2 regularization
  • k::Integer = 10: Frequency of Hessian diagonal estimation (every k iterations)
  • ρ::Float64 = 0.04: Clipping threshold for the update to maintain stability

Example

using Optimization, OptimizationOptimisers, Zygote

# Define optimization problem
rosenbrock(x, p) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x0 = zeros(2)
optf = OptimizationFunction(rosenbrock, Optimization.AutoZygote())
prob = OptimizationProblem(optf, x0)

# Solve with Sophia
sol = solve(prob, Optimization.Sophia(η = 0.01, k = 5))

Notes

Sophia is particularly effective for:

  • Large-scale optimization problems
  • Neural network training
  • Problems where second-order information can significantly improve convergence

The algorithm maintains computational efficiency by only estimating the diagonal of the Hessian matrix using a Hutchinson trace estimator with random vectors, making it more scalable than full second-order methods while still leveraging curvature information.
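
As an illustration only (not Sophia's internal code), the Hutchinson diagonal estimate diag(H) ≈ E[u .* (H * u)] for Rademacher vectors u can be checked against an explicit Hessian on a small problem; the explicit Hessian below is formed purely for comparison, whereas Sophia only needs Hessian-vector products:

using ForwardDiff, LinearAlgebra, Statistics, Random

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2   # Rosenbrock with p = (1, 100)
x = [0.5, 0.5]

H = ForwardDiff.hessian(f, x)      # explicit Hessian, for comparison only

rng = MersenneTwister(0)
# One Hutchinson sample: u is a random ±1 (Rademacher) probe vector
hutchinson_sample(H, rng) = (u = rand(rng, (-1.0, 1.0), size(H, 1)); u .* (H * u))
estimate = mean(hutchinson_sample(H, rng) for _ in 1:10_000)

@show diag(H) estimate             # the estimate converges to the true diagonal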


Examples

Unconstrained Rosenbrock problem

using Optimization, Zygote

rosenbrock(x, p) = (p[1] - x[1])^2 + p[2] * (x[2] - x[1]^2)^2
x0 = zeros(2)
p = [1.0, 100.0]

optf = OptimizationFunction(rosenbrock, AutoZygote())
prob = Optimization.OptimizationProblem(optf, x0, p)
sol = solve(prob, Optimization.LBFGS())
retcode: Success
u: 2-element Vector{Float64}:
 0.9999997057368228
 0.999999398151528

With nonlinear and bounds constraints

function con2_c(res, x, p)
    res .= [x[1]^2 + x[2]^2, (x[2] * sin(x[1]) + x[1]) - 5]
end

optf = OptimizationFunction(rosenbrock, AutoZygote(), cons = con2_c)
prob = OptimizationProblem(optf, x0, p, lcons = [1.0, -Inf],
    ucons = [1.0, 0.0], lb = [-1.0, -1.0],
    ub = [1.0, 1.0])
res = solve(prob, Optimization.LBFGS(), maxiters = 100)
retcode: Success
u: 2-element Vector{Float64}:
 0.783397417853095
 0.6215211044097776

Train NN with Sophia

using Optimization, Lux, Zygote, MLUtils, Statistics, Plots, Random, ComponentArrays

x = rand(10000)
y = sin.(x)
data = MLUtils.DataLoader((x, y), batchsize = 100)

# Define the neural network
model = Chain(Dense(1, 32, tanh), Dense(32, 1))
ps, st = Lux.setup(Random.default_rng(), model)
ps_ca = ComponentArray(ps)
smodel = StatefulLuxLayer{true}(model, nothing, st)

function callback(state, l)
    state.iter % 25 == 1 && @show "Iteration: $(state.iter), Loss: $l"
    return l < 1e-1 ## Terminate if loss is small
end

function loss(ps, data)
    x_batch, y_batch = data
    ypred = [smodel([x_batch[i]], ps)[1] for i in eachindex(x_batch)]
    return sum(abs2, ypred .- y_batch)
end

optf = OptimizationFunction(loss, AutoZygote())
prob = OptimizationProblem(optf, ps_ca, data)

res = Optimization.solve(prob, Optimization.Sophia(), callback = callback, epochs = 100)
retcode: Success
u: ComponentVector{Float32}(layer_1 = (weight = Float32[2.3240178; 1.7346908; … ; 0.43100044; -0.64003944;;], bias = Float32[0.45685038, -0.06763761, 0.3510983, 0.14574528, -0.72297215, -0.5403044, 0.4372348, 0.42898345, 0.049238935, -0.48616666  …  -0.4837212, -0.28015375, -0.56450635, -0.10916619, -0.15076779, -0.10695897, 0.20905593, -0.8714965, -0.47024918, 0.37850612]), layer_2 = (weight = Float32[-0.09004341 -0.0089986045 … 0.18264276 -0.013894698], bias = Float32[0.078782775]))
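
Once training finishes, the optimized parameters in res.u can be reused with the stateful model for prediction. A brief usage sketch, building on the definitions above:

# Predict with the trained parameters (uses `smodel`, `x`, `y`, and `res` from above)
ypred = [smodel([xi], res.u)[1] for xi in x]
mean(abs2, ypred .- y)   # mean-squared error on the training data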