Sparse Identification with noisy data

Many real-world data sources are corrupted with measurement noise, which can have a big impact on the recovery of the underlying equations of motion. This example shows how we can use a collocation method and batching to perform SINDy in the presence of noise.

using DataDrivenDiffEq
using LinearAlgebra
using OrdinaryDiffEq
using DataDrivenSparse
using StableRNGs
using Plots
gr()

rng = StableRNG(1337)

function pendulum(u, p, t)
    x = u[2]
    y = -9.81sin(u[1]) - 0.3u[2]^3 - 3.0 * cos(u[1]) - 10.0 * exp(-((t - 5.0) / 5.0)^2)
    return [x; y]
end

u0 = [0.99π; -1.0]
tspan = (0.0, 15.0)
prob = ODEProblem(pendulum, u0, tspan)
sol = solve(prob, Tsit5(), saveat = 0.01);

We add random noise to our measurements.

X = sol[:, :] + 0.2 .* randn(rng, size(sol));
ts = sol.t;

plot(ts, X', color = :red)
plot!(sol, color = :black)

To estimate the system, we first create a DataDrivenProblem, which requires measurement data. Using a collocation method, it automatically provides the derivative and smoothes the trajectory. Control signals can be passed in as a function (u,p,t)->control or an array of measurements.

prob = ContinuousDataDrivenProblem(X, ts, GaussianKernel(),
    U = (u, p, t) -> [exp(-((t - 5.0) / 5.0)^2)],
    p = ones(2))

plot(prob, size = (600,600))

Now we infer the system structure. First we define a Basis which collects all possible candidate terms. Since we want to use SINDy, we call solve with an sparsifying algorithm, in this case STLSQ which iterates different sparsity thresholds and returns a Pareto optimal solution. Note that we include the control signal in the basis as an additional variable c.

@variables u[1:2] c[1:1]
@parameters w[1:2]
u = collect(u)
c = collect(c)
w = collect(w)

h = Num[sin.(w[1] .* u[1]); cos.(w[2] .* u[1]); polynomial_basis(u, 5); c]

basis = Basis(h, u, parameters = w, controls = c);

Model ##Basis#332 with 24 equations
States : u[1] u[2]
Parameters : w[1] w[2]
Independent variable: t
Equations
φ₁ = sin(u[1]*w[1])
φ₂ = cos(u[1]*w[2])
φ₃ = 1
φ₄ = u[1]
...
φ₂₄ = c[1]

To solve the problem, we also define a DataProcessing which defines randomly shuffled minibatches of our data and selects the best fit.

sampler = DataProcessing(split = 0.8, shuffle = true, batchsize = 30, rng = rng)
λs = exp10.(-10:0.1:0)
opt = STLSQ(λs)
res = solve(prob, basis, opt,
    options = DataDrivenCommonOptions(data_processing = sampler, digits = 1))

"DataDrivenSolution{Float64}"

Info

A more detailed description of the result can be printed via print(res, Val{true}), which also includes the discovered equations and parameter values.

Where the resulting DataDrivenSolution stores information about the inferred model and the parameters:

system = get_basis(res)
params = get_parameter_map(system)

Model ##Basis#335 with 2 equations
States : u[1] u[2]
Parameters : 19
Independent variable: t
Equations
Differential(t)(u[1]) = p₃*u[2]
Differential(t)(u[2]) = p₆ + c[1]*p₁₉ + p₁₀*u[2] + p₄*sin(u[1]*w[1]) + p₅*cos(u[1]*w[2]) + p₇*u[1] + p₁₁*u[1]*u[2] + p₁₄*(u[2]^2) + p₈*(u[1]^2) + p₁₂*(u[1]^2)*u[2] + p₁₅*u[1]*(u[2]^2) + p₁₇*(u[2]^3) + p₉*(u[1]^3) + p₁₃*(u[1]^3)*u[2] + p₁₆*(u[1]^2)*(u[2]^2) + p₁₈*u[1]*(u[2]^3)

Pair{SymbolicUtils.BasicSymbolic{Real}, Float64}[w[1] => 1.0, w[2] => 1.0, p₃ => 0.9, p₄ => -8.6, p₅ => -5.8, p₆ => 9.2, p₇ => 7.6, p₈ => 1.8, p₉ => 0.1, p₁₀ => 2.8, p₁₁ => 5.3, p₁₂ => 2.8, p₁₃ => 0.5, p₁₄ => 2.7, p₁₅ => 2.5, p₁₆ => 0.7, p₁₇ => 0.6, p₁₈ => 0.5, p₁₉ => -10.3]

We can see that even if there are other terms active, the most important terms are included inside the model.

And a visual check of the result can be performed by plotting the result

plot(
    plot(prob), plot(res), layout = (1,2)
)

Copy-Pasteable Code

using DataDrivenDiffEq
using LinearAlgebra
using OrdinaryDiffEq
using DataDrivenSparse
using StableRNGs

rng = StableRNG(1337)

function pendulum(u, p, t)
    x = u[2]
    y = -9.81sin(u[1]) - 0.3u[2]^3 - 3.0 * cos(u[1]) - 10.0 * exp(-((t - 5.0) / 5.0)^2)
    return [x; y]
end

u0 = [0.99π; -1.0]
tspan = (0.0, 15.0)
prob = ODEProblem(pendulum, u0, tspan)
sol = solve(prob, Tsit5(), saveat = 0.01);

X = sol[:, :] + 0.2 .* randn(rng, size(sol));
ts = sol.t;

prob = ContinuousDataDrivenProblem(X, ts, GaussianKernel(),
    U = (u, p, t) -> [exp(-((t - 5.0) / 5.0)^2)],
    p = ones(2))

@variables u[1:2] c[1:1]
@parameters w[1:2]
u = collect(u)
c = collect(c)
w = collect(w)

h = Num[sin.(w[1] .* u[1]); cos.(w[2] .* u[1]); polynomial_basis(u, 5); c]

basis = Basis(h, u, parameters = w, controls = c);
println(basis) # hide

sampler = DataProcessing(split = 0.8, shuffle = true, batchsize = 30, rng = rng)
λs = exp10.(-10:0.1:0)
opt = STLSQ(λs)
res = solve(prob, basis, opt,
    options = DataDrivenCommonOptions(data_processing = sampler, digits = 1))

system = get_basis(res)
params = get_parameter_map(system)
println(system) # hide
println(params) # hide

# This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl

This page was generated using Literate.jl.