Massively Data-Parallel ODE Solving the Lorenz Equation
For example, the following solves the Lorenz equation with 10,000 separate random parameters on the GPU. To start, we create a normal EnsembleProblem
as per DifferentialEquations.jl. Here's a perfectly good multithreaded CPU parallelized Lorenz solve over randomized parameters:
using DiffEqGPU, OrdinaryDiffEq, CUDA
function lorenz(du, u, p, t)
du[1] = p[1] * (u[2] - u[1])
du[2] = u[1] * (p[2] - u[3]) - u[2]
du[3] = u[1] * u[2] - p[3] * u[3]
end
u0 = Float32[1.0; 0.0; 0.0]
tspan = (0.0f0, 100.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, Tsit5(), EnsembleThreads(), trajectories = 10_000, saveat = 1.0f0);
EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, Vector{Vector{Float32}}, Nothing, Nothing, Vector{Float32}, Vector{Vector{Vector{Float32}}}, SciMLBase.ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, SciMLBase.ODEFunction{true, SciMLBase.AutoSpecialize, FunctionWrappersWrappers.FunctionWrappersWrapper{Tuple{FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}, Float32}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, Float32}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, Vector{Float32}, ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}}}, false}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, OrdinaryDiffEq.InterpolationData{SciMLBase.ODEFunction{true, SciMLBase.AutoSpecialize, FunctionWrappersWrappers.FunctionWrappersWrapper{Tuple{FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}, Float32}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, Float32}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, Vector{Float32}, ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}}, FunctionWrappers.FunctionWrapper{Nothing, Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}, Vector{Float32}, ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.OrdinaryDiffEqTag, Float32}, Float32, 1}}}}, false}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Vector{Vector{Float32}}, Vector{Float32}, Vector{Vector{Vector{Float32}}}, OrdinaryDiffEq.Tsit5Cache{Vector{Float32}, Vector{Float32}, Vector{Float32}, typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}}, SciMLBase.DEStats, Nothing}
Changing this to being GPU-parallelized is as simple as changing the ensemble method to EnsembleGPUArray
:
sol = solve(monteprob, Tsit5(), EnsembleGPUArray(CUDA.CUDABackend()), trajectories = 10_000,
saveat = 1.0f0);
EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, rateType, P, OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, IType, SciMLBase.DEStats, Nothing} where {uType, rateType, P, IType}
and voilà, the method is re-compiled to parallelize the solves over a GPU!
While EnsembleGPUArray
has a bit of overhead due to its form of GPU code construction, EnsembleGPUKernel
is a more restrictive GPU-itizing algorithm that achieves a much lower overhead in kernel launching costs. However, it requires this problem to be written in out-of-place form and use special solvers. This looks like:
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA
function lorenz2(u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
du1 = σ * (u[2] - u[1])
du2 = u[1] * (ρ - u[3]) - u[2]
du3 = u[1] * u[2] - β * u[3]
return SVector{3}(du1, du2, du3)
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
trajectories = 10_000,
saveat = 1.0f0)
EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, SubArray{StaticArraysCore.SVector{3, Float32}, 1, Matrix{StaticArraysCore.SVector{3, Float32}}, Tuple{UnitRange{Int64}, Int64}, true}, Nothing, Nothing, SubArray{Float32, 1, Matrix{Float32}, Tuple{UnitRange{Int64}, Int64}, true}, Nothing, DiffEqGPU.ImmutableODEProblem{StaticArraysCore.SVector{3, Float32}, Tuple{Float32, Float32}, false, StaticArraysCore.SVector{3, Float32}, SciMLBase.ODEFunction{false, SciMLBase.AutoSpecialize, typeof(Main.lorenz2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, GPUTsit5, SciMLBase.LinearInterpolation{SubArray{Float32, 1, Matrix{Float32}, Tuple{UnitRange{Int64}, Int64}, true}, SubArray{StaticArraysCore.SVector{3, Float32}, 1, Matrix{StaticArraysCore.SVector{3, Float32}}, Tuple{UnitRange{Int64}, Int64}, true}}, Nothing, Nothing}
Note that this form is also compatible with EnsembleThreads()
, and EnsembleGPUArray()
, so EnsembleGPUKernel()
simply supports a subset of possible problem types. For more information on the limitations of EnsembleGPUKernel()
, see its docstring.
Using Stiff ODE Solvers with EnsembleGPUArray
DiffEqGPU also supports more advanced features than shown above. Other tutorials dive into handling events or callbacks and multi-GPU parallelism. But the simplest thing to show is that the generality of solvers allows for other types of equations. For example, one can handle stiff ODEs with EnsembleGPUArray
simply by using a stiff ODE solver. Note that, as explained in the docstring, analytical derivatives (Jacobian and time gradient) must be supplied. For the Lorenz equation, this looks like:
function lorenz_jac(J, u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
x = u[1]
y = u[2]
z = u[3]
J[1, 1] = -σ
J[2, 1] = ρ - z
J[3, 1] = y
J[1, 2] = σ
J[2, 2] = -1
J[3, 2] = x
J[1, 3] = 0
J[2, 3] = -x
J[3, 3] = -β
end
function lorenz_tgrad(J, u, p, t)
nothing
end
u0 = Float32[1.0; 0.0; 0.0]
tspan = (0.0f0, 100.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
func = ODEFunction(lorenz, jac = lorenz_jac, tgrad = lorenz_tgrad)
prob_jac = ODEProblem(func, u0, tspan, p)
monteprob_jac = EnsembleProblem(prob_jac, prob_func = prob_func)
solve(monteprob_jac, Rodas5(), EnsembleGPUArray(CUDA.CUDABackend()), trajectories = 10_000,
saveat = 1.0f0)
EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, rateType, SciMLBase.ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, StaticArraysCore.SizedVector{3, Float32, Vector{Float32}}, SciMLBase.ODEFunction{true, SciMLBase.FullSpecialize, typeof(Main.lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, typeof(Main.lorenz_tgrad), typeof(Main.lorenz_jac), Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, A, IType, SciMLBase.DEStats, Nothing} where {uType, rateType, A, IType}