ForwardDiff AD
DifferenceEquations.jl works with ForwardDiff.jl out of the box for computing gradients and Jacobians. ForwardDiff requires no shadow arrays, no activity annotations, and no workspace pre-allocation – just wrap your function in ForwardDiff.gradient.
| Scenario | Recommendation |
|---|---|
| Small models (N ≤ 5 states), few parameters | ForwardDiff – same speed as Enzyme reverse, zero setup cost |
| Large models (N ≥ 10 states) | Enzyme reverse – scales as O(1) backward passes vs O(N²) forward passes |
| Many parameters (e.g., noise perturbation over T periods) | Enzyme reverse – ForwardDiff cost scales with parameter count |
| Quick prototyping | ForwardDiff – simpler API, no Duplicated bookkeeping |
| Production estimation loops | Enzyme reverse – lower memory, pre-allocated workspace |
For DirectIteration problems where you differentiate with respect to the noise sequence, the effective parameter dimension is K × T (shocks × periods), not just N². Even for small state dimensions, long horizons make Enzyme reverse the better choice.
The Core Pattern
ForwardDiff propagates dual numbers through the computation. The key requirement is that all arrays must have a consistent element type (either Float64 or Dual{...}). The pattern is:
- Write a scalar function of a vector. ForwardDiff.gradient takes
f: ℝⁿ → ℝ. - Promote all arrays inside the function. When
ForwardDiff.gradientcalls your function with aVector{Dual{...}}, convert all other matrices to the sameDualelement type so that caches are allocated correctly. - Use the public
solve()API. Unlike Enzyme, ForwardDiff creates fresh caches each call (with the correctDualelement type), so the simplesolve(prob, alg)path works directly.
_promote(::Type{T}, x::AbstractArray{T}) where {T} = x
_promote(::Type{T}, x::AbstractArray) where {T} = T.(x)Differentiating Joint Likelihood
using DifferenceEquations, LinearAlgebra, ForwardDiff, Random
N, K, M = 2, 1, 2
A = [0.8 0.1; -0.1 0.7]
B = [0.1; 0.0;;]
C = [1.0 0.0; 0.0 1.0]
H = [0.1 0.0; 0.0 0.1]
u0 = zeros(N)
Random.seed!(42)
noise = [randn(K) for _ in 1:5]
sim = solve(LinearStateSpaceProblem(A, B, u0, (0, 5); C, noise))
obs = [sim.z[t + 1] + 0.1 * randn(M) for t in 1:5]
# Type-promotion helper
_promote(::Type{T}, x::AbstractArray{T}) where {T} = x
_promote(::Type{T}, x::AbstractArray) where {T} = T.(x)
# Gradient of joint loglik w.r.t. vec(A)
function di_loglik(A_vec, B, C, u0, noise, obs, H)
T_el = eltype(A_vec)
A = reshape(A_vec, 2, 2)
H_d = _promote(T_el, H)
R = H_d * H_d'
prob = LinearStateSpaceProblem(
A, _promote(T_el, B), _promote(T_el, u0), (0, length(obs));
C = _promote(T_el, C), observables_noise = R,
observables = obs, noise = noise)
sol = solve(prob, DirectIteration())
return sol.logpdf
end
grad_A = ForwardDiff.gradient(
a -> di_loglik(a, B, C, u0, noise, obs, H), vec(copy(A)))4-element Vector{Float64}:
0.3277920573507629
-0.36594923061958085
0.15269102512629656
0.01738617601138187Differentiating the Kalman Filter
The KalmanFilter marginal log-likelihood works the same way.
# General Kalman loglik that promotes all inputs consistently
function kf_loglik(A, B, C, mu0, Sigma0, R, obs)
T_el = promote_type(eltype(A), eltype(B), eltype(C),
eltype(mu0), eltype(Sigma0), eltype(R))
N_st = size(A, 1)
prob = LinearStateSpaceProblem(
_promote(T_el, A), _promote(T_el, B),
zeros(T_el, N_st), (0, length(obs));
C = _promote(T_el, C),
u0_prior_mean = _promote(T_el, mu0),
u0_prior_var = _promote(T_el, Sigma0),
observables_noise = _promote(T_el, R),
observables = obs)
sol = solve(prob, KalmanFilter())
return sol.logpdf
end
mu0 = zeros(N)
Sigma0 = Matrix(1.0 * I(N))
R = [0.01 0.0; 0.0 0.01]
grad_kf = ForwardDiff.gradient(
a -> kf_loglik(reshape(a, N, N), B, C, mu0, Sigma0, R, obs), vec(copy(A)))4-element Vector{Float64}:
-0.9434200267655106
-0.7485738828260717
-0.4233928203146809
-2.1397999754379677Differentiating with Respect to Multiple Parameters
Because kf_loglik promotes all inputs via promote_type, you can differentiate with respect to any parameter.
# Gradient w.r.t. observation matrix C
grad_C = ForwardDiff.gradient(
c_vec -> kf_loglik(A, B, reshape(c_vec, M, N), mu0, Sigma0, R, obs),
vec(copy(C)))4-element Vector{Float64}:
-0.6937819088518985
-1.3366623124798376
-0.07362328088361392
-0.8712195407383561# Gradient w.r.t. prior mean
grad_mu0 = ForwardDiff.gradient(
m -> kf_loglik(A, B, C, m, Sigma0, R, obs), copy(mu0))2-element Vector{Float64}:
0.028915455807391244
0.02485905275285787Integration with Optimization.jl
ForwardDiff integrates with Optimization.jl via AutoForwardDiff(), which is simpler than the Enzyme path (no manual Duplicated bookkeeping).
using Optimization, OptimizationOptimJL
# Simulate data from a known model
Random.seed!(42)
T_opt = 200
B_opt = [0.0; 0.001;;]
C_opt = [0.09 0.67; 1.00 0.00]
R_opt = [0.01 0.0; 0.0 0.01]
prob_data = LinearStateSpaceProblem([0.95 6.2; 0.0 0.2], B_opt, zeros(2), (0, T_opt);
C = C_opt, observables_noise = R_opt)
sol_data = solve(prob_data)
obs_data = sol_data.z[2:end]
# Objective: negative Kalman loglik as function of β = [A[1,1]]
mu0_opt = zeros(2)
Sigma0_opt = Matrix(1e-2 * I(2))
function neg_loglik(beta, p)
A = [beta[1] 6.2; 0.0 0.2]
return -kf_loglik(A, p.B, p.C, p.mu0, p.Sigma0, p.R, p.obs)
end
params = (; B = B_opt, C = C_opt, R = R_opt, obs = obs_data,
mu0 = mu0_opt, Sigma0 = Sigma0_opt)
optf = OptimizationFunction(neg_loglik, AutoForwardDiff())
optprob = OptimizationProblem(optf, [0.90], params)
optsol = solve(optprob, LBFGS())
optsol.u # estimated β (true value: 0.95)1-element Vector{Float64}:
0.36049073352456606StaticArrays
ForwardDiff also works with SVector/SMatrix inputs. Construct static arrays from the dual-typed input vector inside the function.
using StaticArrays
function kf_loglik_static(A_vec, B, C, mu0, Sigma0, R, obs,
::Val{N_}, ::Val{M_}, ::Val{K_}) where {N_, M_, K_}
T_el = eltype(A_vec)
A = SMatrix{N_, N_}(reshape(A_vec, N_, N_))
prob = LinearStateSpaceProblem(
A, SMatrix{N_, K_}(T_el.(B)),
SVector{N_}(zeros(T_el, N_)), (0, length(obs));
C = SMatrix{M_, N_}(T_el.(C)),
u0_prior_mean = SVector{N_}(T_el.(mu0)),
u0_prior_var = SMatrix{N_, N_}(T_el.(Sigma0)),
observables_noise = SMatrix{M_, M_}(T_el.(R)),
observables = obs)
sol = solve(prob, KalmanFilter())
return sol.logpdf
end
obs_s = [SVector{M}(o) for o in obs]
grad_static = ForwardDiff.gradient(
a -> kf_loglik_static(a, SMatrix{N,K}(B), SMatrix{M,N}(C),
SVector{N}(mu0), SMatrix{N,N}(Sigma0), SMatrix{M,M}(R),
obs_s, Val(N), Val(M), Val(K)),
collect(vec(Matrix(A))))4-element Vector{Float64}:
-0.9434200267655113
-0.748573882826078
-0.42339282031468084
-2.1397999754379686Combining ForwardDiff + StaticArrays with save_everystep=false gives the best performance for small models. Without it, the allocation of T dual-number SVector buffers dominates. With save_everystep=false, only 2 scratch slots are used, yielding up to 7x speedup for the Kalman filter and 3.4x for ConditionalLikelihood at N=5.
sol = solve(prob, KalmanFilter(); save_everystep=false)Quadratic and Generic Models
ForwardDiff works with all problem types: QuadraticStateSpaceProblem, PrunedQuadraticStateSpaceProblem, and StateSpaceProblem. The same pattern applies — promote all arrays to the Dual element type inside the gradient function and call solve(prob, DirectIteration()).
Important Notes
- Type promotion is required. All arrays flowing into the problem must have the same element type. Use
promote_typeacross all inputs (as inkf_loglikabove) or the_promotehelper to convertFloat64arrays to theDualtype. - Fresh allocation each call. ForwardDiff creates new caches with
Dualelement types viasolve(). This is unavoidable (unlike Enzyme, which reusesFloat64caches with separate shadow arrays). Usesave_everystep=falseto minimize these allocations from O(T) to O(1) when you only needlogpdf. - Chunk size.
ForwardDiff.gradientdefaults to a chunk size of ~10, processing 10 partial derivatives per forward pass. For parameter count > 10, it runs multiple passes. This makes ForwardDiff cost scale linearly with the number of parameters being differentiated. - Observations stay
Float64. Theobservables(data) are not differentiated and can remainVector{Vector{Float64}}. The solver's internal buffers are allocated with theDualelement type, so whenFloat64observations are copied in, the dual partials are zero — which is correct since observations are data, not parameters being differentiated. - DirectIteration noise sensitivity. When differentiating
DirectIterationw.r.t. the noise sequence, the parameter dimension isK × T(shocks × periods). Even for small state-space models, long time series make ForwardDiff expensive and Enzyme reverse the better choice.