Optimization-Based Methods

The Objective Function Builders

Standard Nonlinear Regression

build_loss_objective builds an objective function to be used with Optimization.jl and its wrapped solvers, e.g. Optim.jl via OptimizationOptimJL, or NLopt via OptimizationNLopt.

function build_loss_objective(prob::DEProblem, alg, loss,
                              adtype = SciMLBase.NoAD(),
                              regularization = nothing;
                              priors = nothing,
                              prob_generator = STANDARD_PROB_GENERATOR,
                              kwargs...)
end

The first argument is the DEProblem to solve, and the second is the alg to use. The alg must match the problem type, which can be any DEProblem (ODEs, SDEs, DAEs, DDEs, etc.). regularization defaults to nothing, in which case no regularization term is added to the objective. The extra keyword arguments are passed to the differential equation solver.
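As a sketch of how the pieces fit together (the decay model, tolerances, and initial guess below are illustrative assumptions, not part of the package API), fitting a single parameter might look like:

```julia
# Illustrative sketch: recover the decay rate p of u' = -p*u.
# Assumes DifferentialEquations, DiffEqParamEstim, Optimization, and
# OptimizationOptimJL are installed; the model and data are made up for the example.
using DifferentialEquations, DiffEqParamEstim, Optimization, OptimizationOptimJL

f(u, p, t) = -p[1] * u
prob = ODEProblem(f, 1.0, (0.0, 10.0), [1.0])

t = collect(range(0.0, 10.0; length = 50))
data = exp.(-1.5 .* t)               # synthetic data generated with true p = 1.5

obj = build_loss_objective(prob, Tsit5(), L2Loss(t, data),
                           Optimization.AutoForwardDiff();
                           maxiters = 10_000, verbose = false)
optprob = OptimizationProblem(obj, [1.0])   # initial guess p = 1.0
result = solve(optprob, BFGS())
```

The returned objective is an OptimizationFunction, so any Optimization.jl-compatible solver can be used in the final solve call.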

Multiple Shooting

Multiple shooting is often used in boundary value problems (BVPs) and is more robust than the single-shooting objective function for these problems. It proceeds as follows:

  • Divide the time span into short intervals and solve the equation on each with the current parameters, which here consist of both the parameters of the differential equation and the initial values for the short intervals.
  • The objective additionally includes a discontinuity error term that imposes a higher cost when the end of the solution on one interval does not match the beginning of the next.
  • Merge the solutions from the shorter intervals and then calculate the loss.

function multiple_shooting_objective(prob::DiffEqBase.DEProblem, alg, loss,
                                     adtype = SciMLBase.NoAD(),
                                     regularization = nothing;
                                     priors = nothing,
                                     discontinuity_weight = 1.0,
                                     prob_generator = STANDARD_PROB_GENERATOR,
                                     kwargs...)
end

For consistency, multiple_shooting_objective takes exactly the same arguments as build_loss_objective. It additionally accepts the keyword argument discontinuity_weight, which weights the error arising from the discontinuities introduced by breaking up the time span.
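A sketch of a typical call follows. Here ms_prob, t, data, and the bound/guess vectors are assumed to exist; in particular, ms_prob's parameter vector is assumed to have been extended with the initial values of each shooting segment, which is why a bounded optimizer is used:

```julia
# Illustrative sketch; ms_prob, t, data, initial_guess, lower, and upper are
# assumed to be defined by the user, with the segment initial values appended
# to the parameter vector.
using DiffEqParamEstim, Optimization, OptimizationOptimJL, OrdinaryDiffEq

ms_obj = multiple_shooting_objective(ms_prob, Tsit5(), L2Loss(t, data),
                                     Optimization.AutoForwardDiff();
                                     discontinuity_weight = 1.0)
# Bounds keep the search over both the model parameters and the
# per-segment initial values well-posed:
optprob = OptimizationProblem(ms_obj, initial_guess; lb = lower, ub = upper)
result = solve(optprob, Fminbox(BFGS()))
```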

Detailed Explanations of Arguments

The Loss Function

loss(sol)

is the function which reduces the problem's solution to a scalar that the optimizer will try to minimize. While this is very flexible, two convenience routines are included for fitting to data with standard cost functions:

L2Loss(t, data; differ_weight = nothing, data_weight = nothing,
       colloc_grad = nothing, dudt = nothing)

where t is the set of timepoints at which the data are found, and data holds the known values, where each column corresponds to measurements of the values of the system. L2Loss is an optimized version of the L2 distance.

data_weight is a scalar, or a vector of weights matching the size of the data, applied to the loss function. Note that minimizing a weighted L2Loss is equivalent to maximum likelihood estimation under a heteroskedastic Normally distributed likelihood.

differ_weight adds a weighted penalty on the first-difference terms sol[i+1] - sol[i] against the first differences of the data. This smooths out the loss term and can make it easier to fit strong solutions of stochastic models; it is disabled (nothing) by default.

Additionally, colloc_grad allows one to supply a matrix of collocation gradients for the data. This adds an interpolation-derivative term, as in the two-stage method. The convenience function colloc_grad(t, data) returns a collocation gradient from a 3rd-order spline computed by Dierckx.jl, which can be used as the colloc_grad. Note that, with a collocation gradient and regularization, this loss is equivalent to a 4D-Var objective.
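The variants can be sketched side by side as follows (t and data are assumed to exist; the weight values are illustrative assumptions):

```julia
# Illustrative sketch of the L2Loss options; t and data are assumed to be a
# timepoint vector and a matching data matrix.
using DiffEqParamEstim

loss_plain    = L2Loss(t, data)
loss_weighted = L2Loss(t, data; data_weight = 1e-2)     # scalar weight on the fit term
loss_differ   = L2Loss(t, data; differ_weight = 0.3,
                       data_weight = 0.7)               # add a first-difference term
grad          = colloc_grad(t, data)                    # spline collocation gradient
loss_colloc   = L2Loss(t, data; colloc_grad = grad)     # add the derivative term
```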

Additionally, we include a more flexible log-likelihood approach:

LogLikeLoss(t, distributions, diff_distributions = nothing)

In this case, there are two forms. In the simple case, distributions[i,j] is a UnivariateDistribution from Distributions.jl giving the likelihood at t[i] for component j. In the second case, distributions[i] is a MultivariateDistribution giving the likelihood at t[i] over the vector of components. This loss then calculates the negative of the total log-likelihood over time as its objective value (negative since optimizers generally find minima, and thus this corresponds to maximum likelihood estimation). The third argument, diff_distributions, acts similarly but allows putting a distribution on the first-difference terms sol[i+1] - sol[i].

Note that these distributions can be generated via fit_mle on some dataset against some chosen distribution type.
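For instance, per-timepoint likelihoods could be fit from replicate data as sketched below (the aggregate_data array and its layout are assumptions for this example):

```julia
# Illustrative sketch: build per-timepoint Normal likelihoods from replicates.
# aggregate_data[i, j, k] is assumed to hold the k-th replicate measurement of
# component j at timepoint t[i].
using Distributions, DiffEqParamEstim

distributions = [fit_mle(Normal, aggregate_data[i, j, :])
                 for i in 1:length(t), j in 1:size(aggregate_data, 2)]
loss = LogLikeLoss(t, distributions)
```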

Note About Loss Functions

For parameter estimation problems, it's not uncommon for optimizers to hit unstable regions of parameter space. This causes warnings that the solver exited early, and the built-in loss functions such as L2Loss handle this automatically. If you supply your own loss function, however, you should make sure it is robust to these issues. One common pattern is to assign an infinite loss when the integration is unsuccessful. Using the return codes, this can be done via:

function my_loss_function(sol)
    tot_loss = 0.0
    if any((!SciMLBase.successful_retcode(s.retcode) for s in sol))
        tot_loss = Inf
    else
        # calculation for the loss here
    end
    tot_loss
end

Note on First Differencing

L2Loss(t, data, differ_weight = 0.3, data_weight = 0.7)

First differencing incorporates the differences of data points at consecutive time points, which adds more information about the trajectory to the loss function. It is helpful in cases where L2Loss alone leads to non-identifiable parameters but the added first-differencing term restores identifiability. This is especially notable for stochastic differential equation models, where the term captures autocorrelation and therefore helps avoid fits that reproduce the same stationary distribution despite different trajectories, and thus wrong parameter estimates.

The Regularization Function

The regularization can be any function of p, the parameter vector:

regularization(p)

The Regularization helper function builds a regularization using a penalty function penalty from PenaltyFunctions.jl:

reg = Regularization(λ, penalty = L2Penalty())
build_loss_objective(prob, alg, loss, SciMLBase.NoAD(), reg)

using Optimization, Zygote
build_loss_objective(prob, alg, loss, Optimization.AutoZygote(), reg)

The regularization defaults to L2 if no penalty function is specified. λ is the weight parameter for the addition of the regularization term.

Using automatic differentiation

To use derivatives with the optimization solvers, pass Optimization.jl's adtype argument (as described in the Optimization.jl documentation) together with a wrapper subpackage such as OptimizationOptimJL or OptimizationNLopt:

using Optimization, ForwardDiff
build_loss_objective(prob, alg, loss, Optimization.AutoForwardDiff())
multiple_shooting_objective(prob, alg, loss, Optimization.AutoForwardDiff())

The Problem Generator Function

The argument prob_generator allows one to specify a function that generates new problems from a given parameter set. By default, this builds a new problem that fixes the element types in a way compatible with automatic differentiation and inserts the new parameter vector p. The code for the default is:

prob_generator = (prob, p) -> remake(prob, u0 = convert.(eltype(p), prob.u0), p = p)

Then the new problem with these new values is returned.

One can use this function to change the meaning of the parameters. For example, to optimize the initial conditions of a model without parameters, you could change this to:

prob_generator = (prob, p) -> remake(prob, u0 = p)

which simply uses p as the initial condition in the initial value problem.
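Another common use of this hook, sketched below under the assumption that prob, t, and data exist, is to search in log-parameter space so that the optimizer can only produce positive parameters:

```julia
# Illustrative sketch: the optimizer works with log-parameters, and the
# generator exponentiates them before the problem is solved, guaranteeing
# positivity of the actual model parameters.
using DiffEqParamEstim, Optimization, OrdinaryDiffEq

log_generator = (prob, p) -> remake(prob; u0 = convert.(eltype(p), prob.u0),
                                    p = exp.(p))
obj = build_loss_objective(prob, Tsit5(), L2Loss(t, data),
                           Optimization.AutoForwardDiff();
                           prob_generator = log_generator)
```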

Using the Objectives for MAP estimates

You can also pass a prior to build_loss_objective and multiple_shooting_objective, which essentially turns the objective into a MAP estimate by adding the negative log-prior to the cost (equivalently, multiplying the likelihood by the prior). The option is available as the keyword argument priors, which can take either an array of univariate distributions, one per parameter, or a single multivariate distribution.
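For instance, a priors array could be constructed as follows (the distribution choices are illustrative assumptions, one per model parameter):

```julia
# Illustrative sketch: one univariate prior per parameter of a two-parameter model.
using Distributions

priors = [Normal(1.5, 0.5), Normal(1.0, 0.3)]
```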

ms_obj = multiple_shooting_objective(ms_prob, Tsit5(), L2Loss(t, data); priors = priors,
                                     discontinuity_weight = 1.0, abstol = 1e-12,
                                     reltol = 1e-12)