Code Optimization for Small Nonlinear Systems in Julia
General Code Optimization in Julia
Before starting this tutorial, we recommend checking out one of the many tutorials on optimizing Julia code. The following is an incomplete list:
- The Julia Performance Tips
- MIT 18.337 Course Notes on Optimizing Serial Code
- What scientists must know about hardware to write fast code
User-side optimizations are important because, for sufficiently difficult problems, most of the time will be spent inside your `f` function, the function you are trying to solve. “Efficient solvers” are those that reduce the number of `f` calls required to hit the error tolerance. The main ideas for optimizing your nonlinear solver code, or any Julia function, are the following:
- Make it non-allocating
- Use StaticArrays for small arrays
- Use broadcast fusion
- Make it type-stable
- Reduce redundant calculations
- Make use of BLAS calls
- Optimize algorithm choice
We'll discuss these strategies in the context of nonlinear solvers. Let's start with small systems.
Optimizing Nonlinear Solver Code for Small Systems
Take for example a prototypical small nonlinear solver code in its out-of-place form:
```julia
import NonlinearSolve as NLS

f(u, p) = u .* u .- p
u0 = [1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f, u0, p)
sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
retcode: Success
u: 2-element Vector{Float64}:
 1.4142135623730951
 1.4142135623730951
```

We can use BenchmarkTools.jl to get more precise timings:
```julia
import BenchmarkTools

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.060 μs … 76.299 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.010 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.349 μs ±  1.745 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 6.12 KiB, allocs estimate: 100.
```

Note that this way of writing the function is a shorthand for:
```julia
f(u, p) = [u[1] * u[1] - p, u[2] * u[2] - p]
```

where the function `f` returns an array. This is a common pattern coming from the likes of MATLAB's `fzero` or SciPy's `scipy.optimize.fsolve`. However, by design it's very slow: in the benchmark you can see that there are many allocations. These allocations cause the memory allocator to take more time than the actual numerics itself, which is one of the reasons why codes ported from MATLAB and Python end up slow.
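One way to see this allocation cost directly is Julia's built-in `@allocated` macro. A quick sketch (the function names `f_oop`/`f_ip!` are illustrative, and exact byte counts depend on Julia version):

```julia
f_oop(u, p) = u .* u .- p        # out-of-place: builds a new array on every call
function f_ip!(du, u, p)         # in-place: writes into a preallocated buffer
    du .= u .* u .- p
    return nothing
end

u = [1.0, 1.0]; du = similar(u); p = 2.0
f_oop(u, p); f_ip!(du, u, p)     # run once so compilation is excluded

oop_bytes = @allocated f_oop(u, p)   # nonzero: a fresh Vector is heap-allocated
ip_bytes  = @allocated f_ip!(du, u, p)  # zero: du is reused
```

Since the solver calls `f` many times per solve, that per-call allocation is paid over and over, which is exactly what the in-place form below avoids.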
In order to avoid this issue, we can use a non-allocating "in-place" approach. Written out by hand, this looks like:
```julia
function f(du, u, p)
    du[1] = u[1] * u[1] - p
    du[2] = u[2] * u[2] - p
    return nothing
end

prob = NLS.NonlinearProblem(f, u0, p)
BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  13.800 μs … 77.983 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.393 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.023 μs ±  2.291 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 4.50 KiB, allocs estimate: 66.
```

Notice that the allocation count has already dropped. We can make this code even simpler by using `.=` in-place broadcasting:
```julia
function f(du, u, p)
    du .= u .* u .- p
    return nothing
end

BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):   8.990 μs … 98.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.580 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.520 μs ±  3.725 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 4.50 KiB, allocs estimate: 66.
```

Further Optimizations for Small Nonlinear Solves with Static Arrays and SimpleNonlinearSolve
Allocations are only expensive if they are “heap allocations”. For a more in-depth definition of heap allocations, there are many sources online. But a good working definition is that heap allocations are variable-sized slabs of memory which have to be pointed to, and this pointer indirection costs time. Additionally, the heap has to be managed, and the garbage collector has to actively keep track of what's on the heap.
However, there's an alternative to heap allocations, known as stack allocations. The stack is statically-sized (known at compile time) and thus its accesses are quick. Additionally, the exact block of memory is known in advance by the compiler, and thus re-using the memory is cheap. This means that allocating on the stack has essentially no cost!
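To make the heap/stack distinction concrete, here is a small sketch (the function names are illustrative): returning a `Tuple` of floats incurs no heap allocations, while returning a `Vector` does:

```julia
make_heap(x) = [x, 2x]     # Vector: size known only at runtime → heap-allocated
make_stack(x) = (x, 2x)    # Tuple: size fixed at compile time → stack-allocated

make_heap(1.0); make_stack(1.0)            # warm up to exclude compilation

heap_bytes  = @allocated make_heap(1.0)    # nonzero: each call allocates a Vector
stack_bytes = @allocated make_stack(1.0)   # zero: the Tuple lives on the stack
```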
Arrays have to be heap-allocated because their size (and thus the amount of memory they take up) is determined at runtime. But there are structures in Julia which are stack-allocated: immutable structs, for example, are stack-allocated “value types”, and tuples are a stack-allocated collection. The most useful data structure for NonlinearSolve, though, is the StaticArray from the package StaticArrays.jl. These arrays have their length determined at compile time. For example:
```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
typeof(A)
```

```
SVector{3, Float64} (alias for StaticArraysCore.SArray{Tuple{3}, Float64, 1, 3})
```

Notice that the `3` after `SVector` gives the size of the SVector. It cannot be changed. Additionally, SVectors are immutable, so we have to create a new SVector to change values. But remember, we don't have to worry about allocations because this data structure is stack-allocated. SArrays have numerous extra optimizations as well: they have fast matrix multiplication, fast QR factorizations, etc., which directly make use of the information about the size of the array. Thus, when possible, they should be used.
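Since SVectors are immutable, “modifying” one means constructing a new one, and since they are stack-allocated, that construction is essentially free. A quick sketch of both points (the helper `double` is illustrative):

```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
# A[2] = 10.0                          # would throw: SArrays don't support setindex!
B = StaticArrays.SA[A[1], 10.0, A[3]]  # "update" by building a new SVector instead

double(v) = v .* 2 .+ 1                # broadcasting on an SVector returns a new SVector
double(A)                              # warm up to exclude compilation
bytes = @allocated double(A)           # zero heap allocations
```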
Unfortunately, static arrays can only be used for sufficiently small arrays. Beyond a certain size, some operations on them are forced to heap-allocate anyway, and their compile times balloon. Thus, static arrays shouldn't be used if your system has more than ~20 variables. Additionally, only the native Julia algorithms can fully utilize static arrays.
Let's ***optimize our nonlinear solve using static arrays***. Note that in this case, we want to use the out-of-place allocating form, but this time we want to output a static array. Doing it with broadcasting looks like:
```julia
f_SA(u, p) = u .* u .- p

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 218 evaluations per sample.
 Range (min … max):  314.628 ns … 265.193 μs  ┊ GC (min … max): 0.00% … 99.72%
 Time  (median):     353.069 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   710.490 ns ±   3.953 μs  ┊ GC (mean ± σ):  24.84% ± 8.21%

 Memory estimate: 2.06 KiB, allocs estimate: 13.
```

Note that the only change here is that `u0` is made into a StaticArray! If we needed to write `f` out for a more complex nonlinear case, then we'd simply do the following:
```julia
f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 200 evaluations per sample.
 Range (min … max):  336.500 ns … 561.547 μs  ┊ GC (min … max): 0.00% … 99.65%
 Time  (median):     509.572 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.130 μs ±   8.897 μs  ┊ GC (mean ± σ):  25.88% ± 6.45%

 Memory estimate: 2.06 KiB, allocs estimate: 13.
```

However, notice that this did not give us a speedup but rather a slowdown. This is because many of the methods in NonlinearSolve.jl are designed to scale to larger problems, and thus aggressively cache intermediate values, which pays off for large problems but not for these smaller problems with static arrays. In order to see the full benefit, we need to move to one of the methods from SimpleNonlinearSolve.jl, which are designed for these small-scale static examples. Let's now use `SimpleNewtonRaphson`:
```julia
BenchmarkTools.@benchmark NLS.solve(prob, NLS.SimpleNewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 961 evaluations per sample.
 Range (min … max):   94.256 ns …  71.443 μs  ┊ GC (min … max): 0.00% … 99.79%
 Time  (median):     138.178 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   153.062 ns ± 798.270 ns  ┊ GC (mean ± σ):  7.00% ± 1.41%

 Memory estimate: 80 bytes, allocs estimate: 1.
```

And there we go: around 140 ns, down from our starting point of roughly 11 μs!