Massively Data-Parallel ODE Solving on GPUs

Before we start: the two ways to accelerate ODE solvers with GPUs

Before we dive deeper, let us remark that there are two very different ways that one can accelerate an ODE solution with GPUs. There is one case where u is very big and f is very expensive but very structured, and you use GPUs to accelerate the computation of said f. The other use case is where u is very small, but you want to solve the ODE f over many different initial conditions (u0) or parameters p. In that case, you can use GPUs to parallelize over different parameters and initial conditions. In other words:

Type of Problem	SciML Solution
Accelerate a big ODE	Use CUDA.jl's CuArray as `u0`
Solve the same ODE with many `u0` and `p`	Use DiffEqGPU.jl's`EnsembleGPUArray` and `EnsembleGPUKernel`

This showcase will focus on the latter case. For the former, see the massively parallel GPU ODE solving showcase.

Supported GPUs

SciML's GPU support extends to a wide array of hardware, including:

GPU Manufacturer	GPU Kernel Language	Julia Support Package	Backend Type
NVIDIA	CUDA	CUDA.jl	`CUDA.CUDABackend()`
AMD	ROCm	AMDGPU.jl	`AMDGPU.ROCBackend()`
Intel	OneAPI	OneAPI.jl	`oneAPI.oneAPIBackend()`
Apple (M-Series)	Metal	Metal.jl	`Metal.MetalBackend()`

For this tutorial we will demonstrate the CUDA backend for NVIDIA GPUs, though any of the other GPUs can be used by simply swapping out the backend choice.

Problem Setup

Let's say we wanted to quantify the uncertainty in the solution of a differential equation. One simple way to do this would be to a Monte Carlo simulation of the same ODE, randomly jiggling around some parameters according to an uncertainty distribution. We could do that on a CPU, but that's not hip. What's hip are GPUs! GPUs have thousands of cores, so could we make each core of our GPU solve the same ODE, but with different parameters? The ensembling tools of DiffEqGPU.jl solve exactly this issue, and today you will learn how to master the GPUniverse.

Let's dive right in.

Defining the Ensemble Problem for CPU

DifferentialEquations.jl has an ensemble interface for solving many ODEs. DiffEqGPU conveniently uses exactly the same interface, so just a change of a few characters is all that's required to change a CPU-parallelized code into a GPU-parallelized code. Given that, let's start with the CPU-parallelized code.

Let's implement the Lorenz equation out-of-place. If you don't know what that means, see the getting started with DifferentialEquations.jl

import DiffEqGPU
import OrdinaryDiffEq as ODE
import StaticArrays
import CUDA
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return StaticArrays.SVector{3}(du1, du2, du3)
end

u0 = StaticArrays.@SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = StaticArrays.@SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODE.ODEProblem{false}(lorenz, u0, tspan, p)

ODEProblem with uType StaticArraysCore.SVector{3, Float32} and tType Float32. In-place: false
Non-trivial mass matrix: false
timespan: (0.0f0, 10.0f0)
u0: 3-element StaticArraysCore.SVector{3, Float32} with indices SOneTo(3):
 1.0
 0.0
 0.0

Notice we use SVectors, i.e. StaticArrays, in order to define our arrays. This is important for later, since the GPUs will want a fully non-allocating code to build a kernel on.

Now, from this problem, we build an EnsembleProblem as per the DifferentialEquations.jl specification. A prob_func jiggles the parameters and we solve 10_000 trajectories:

prob_func = (prob, i, repeat) -> ODE.remake(prob, p = (StaticArrays.@SVector rand(Float32, 3)) .* p)
monteprob = DiffEqGPU.EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = ODE.solve(monteprob, ODE.Tsit5(), DiffEqGPU.EnsembleThreads(), trajectories = 10_000, saveat = 1.0f0)

EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, Vector{StaticArraysCore.SVector{3, Float32}}, Nothing, Nothing, Vector{Float32}, Vector{Vector{StaticArraysCore.SVector{3, Float32}}}, Nothing, SciMLBase.ODEProblem{StaticArraysCore.SVector{3, Float32}, Tuple{Float32, Float32}, false, StaticArraysCore.SVector{3, Float32}, SciMLBase.ODEFunction{false, SciMLBase.AutoSpecialize, typeof(Main.lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, SciMLBase.StandardODEProblem}, OrdinaryDiffEqTsit5.Tsit5{typeof(OrdinaryDiffEqCore.trivial_limiter!), typeof(OrdinaryDiffEqCore.trivial_limiter!), Static.False}, OrdinaryDiffEqCore.InterpolationData{SciMLBase.ODEFunction{false, SciMLBase.AutoSpecialize, typeof(Main.lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, Vector{StaticArraysCore.SVector{3, Float32}}, Vector{Float32}, Vector{Vector{StaticArraysCore.SVector{3, Float32}}}, Nothing, OrdinaryDiffEqTsit5.Tsit5ConstantCache, Nothing}, SciMLBase.DEStats, Nothing, Nothing, Nothing, Nothing}

Taking the Ensemble to the GPU

Now uhh, we just change EnsembleThreads() to EnsembleGPUArray()

sol = ODE.solve(monteprob, ODE.Tsit5(), DiffEqGPU.EnsembleGPUArray(CUDA.CUDABackend()),
    trajectories = 10_000, saveat = 1.0f0)

EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, rateType, Nothing, SciMLBase.ODEProblem{StaticArraysCore.SVector{3, Float32}, Tuple{Float32, Float32}, false, StaticArraysCore.SVector{3, Float32}, SciMLBase.ODEFunction{false, SciMLBase.AutoSpecialize, typeof(Main.lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, SciMLBase.StandardODEProblem}, OrdinaryDiffEqTsit5.Tsit5{typeof(OrdinaryDiffEqCore.trivial_limiter!), typeof(OrdinaryDiffEqCore.trivial_limiter!), Static.False}, IType, SciMLBase.DEStats, Nothing, Nothing, Nothing, Nothing} where {uType, rateType, IType}

Or for a more efficient version, EnsembleGPUKernel(). But that requires special solvers, so we also change to GPUTsit5().

sol = ODE.solve(
    monteprob, DiffEqGPU.GPUTsit5(), DiffEqGPU.EnsembleGPUKernel(CUDA.CUDABackend()), trajectories = 10_000)

EnsembleSolution Solution of length 10000 with uType:
SciMLBase.ODESolution{Float32, 2, SubArray{StaticArraysCore.SVector{3, Float32}, 1, Matrix{StaticArraysCore.SVector{3, Float32}}, Tuple{UnitRange{Int64}, Int64}, true}, Nothing, Nothing, SubArray{Float32, 1, Matrix{Float32}, Tuple{UnitRange{Int64}, Int64}, true}, Nothing, Nothing, SciMLBase.ImmutableODEProblem{StaticArraysCore.SVector{3, Float32}, Tuple{Float32, Float32}, false, StaticArraysCore.SVector{3, Float32}, SciMLBase.ODEFunction{false, SciMLBase.AutoSpecialize, typeof(Main.lorenz), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, SciMLBase.StandardODEProblem}, DiffEqGPU.GPUTsit5, SciMLBase.LinearInterpolation{SubArray{Float32, 1, Matrix{Float32}, Tuple{UnitRange{Int64}, Int64}, true}, SubArray{StaticArraysCore.SVector{3, Float32}, 1, Matrix{StaticArraysCore.SVector{3, Float32}}, Tuple{UnitRange{Int64}, Int64}, true}}, Nothing, Nothing, Nothing, Nothing, Nothing}

Okay, so that was anticlimactic, but that's the point: if it were harder than that, it wouldn't be automatic! Now go check out DiffEqGPU.jl's documentation for more details, that's the end of our show.