Code Optimization for Small Nonlinear Systems in Julia
General Code Optimization in Julia
Before starting this tutorial, we recommend checking out one of the many tutorials on optimizing Julia code. The following is an incomplete list:
- The Julia Performance Tips
- MIT 18.337 Course Notes on Optimizing Serial Code
- What scientists must know about hardware to write fast code
User-side optimizations are important because, for sufficiently difficult problems, most of the time will be spent inside your `f` function, the function you are trying to solve. “Efficient solvers” are those that reduce the number of `f` calls required to hit the error tolerance. The main ideas for optimizing your nonlinear solver code, or any Julia function, are the following:
- Make it non-allocating
- Use StaticArrays for small arrays
- Use broadcast fusion
- Make it type-stable
- Reduce redundant calculations
- Make use of BLAS calls
- Optimize algorithm choice
We'll discuss these strategies in the context of nonlinear solvers. Let's start with small systems.
Optimizing Nonlinear Solver Code for Small Systems
Take for example a prototypical small nonlinear solver code in its out-of-place form:
```julia
import NonlinearSolve as NLS

f(u, p) = u .* u .- p
u0 = [1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f, u0, p)
sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
retcode: Success
u: 2-element Vector{Float64}:
 1.4142135623730951
 1.4142135623730951
```

We can use BenchmarkTools.jl to get more precise timings:
```julia
import BenchmarkTools

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.060 μs … 76.299 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.010 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.349 μs ±  1.745 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 6.12 KiB, allocs estimate: 100.
```

Note that this way of writing the function is a shorthand for:
```julia
f(u, p) = [u[1] * u[1] - p, u[2] * u[2] - p]
```

where the function `f` returns an array. This is a common pattern coming from the likes of MATLAB's `fzero` or SciPy's `scipy.optimize.fsolve`. However, by design it's very slow: in the benchmark you can see that there are many allocations. These allocations cause the memory allocator to take more time than the actual numerics itself, which is one of the reasons why codes ported from MATLAB and Python end up slow.
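One way to see this allocation cost directly is Julia's built-in `@allocated` macro. A quick sketch (the function names `f_oop`/`f_ip!` are illustrative, and exact byte counts depend on Julia version):

```julia
f_oop(u, p) = u .* u .- p        # out-of-place: builds a new array on every call
function f_ip!(du, u, p)         # in-place: writes into a preallocated buffer
    du .= u .* u .- p
    return nothing
end

u = [1.0, 1.0]; du = similar(u); p = 2.0
f_oop(u, p); f_ip!(du, u, p)     # run once so compilation is excluded

oop_bytes = @allocated f_oop(u, p)   # nonzero: a fresh Vector is heap-allocated
ip_bytes  = @allocated f_ip!(du, u, p)  # zero: du is reused
```

Since the solver calls `f` many times per solve, that per-call allocation is paid over and over, which is exactly what the in-place form below avoids.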
In order to avoid this issue, we can use a non-allocating "in-place" approach. Written out by hand, this looks like:
```julia
function f(du, u, p)
    du[1] = u[1] * u[1] - p
    du[2] = u[2] * u[2] - p
    return nothing
end

prob = NLS.NonlinearProblem(f, u0, p)
BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  13.800 μs … 77.983 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.393 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.023 μs ±  2.291 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 4.50 KiB, allocs estimate: 66.
```

Notice that the allocation count has already dropped. We can make this code even simpler by using `.=` in-place broadcasting:
```julia
function f(du, u, p)
    du .= u .* u .- p
    return nothing
end

BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):   8.990 μs … 98.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.580 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.520 μs ±  3.725 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 4.50 KiB, allocs estimate: 66.
```

Further Optimizations for Small Nonlinear Solves with Static Arrays and SimpleNonlinearSolve
Allocations are only expensive if they are “heap allocations”. For a more in-depth definition of heap allocations, there are many sources online. But a good working definition is that heap allocations are variable-sized slabs of memory which have to be pointed to, and this pointer indirection costs time. Additionally, the heap has to be managed, and the garbage collector has to actively keep track of what's on the heap.
However, there's an alternative to heap allocations, known as stack allocations. The stack is statically-sized (known at compile time) and thus its accesses are quick. Additionally, the exact block of memory is known in advance by the compiler, and thus re-using the memory is cheap. This means that allocating on the stack has essentially no cost!
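To make the heap/stack distinction concrete, here is a small sketch (the function names are illustrative): returning a `Tuple` of floats incurs no heap allocations, while returning a `Vector` does:

```julia
make_heap(x) = [x, 2x]     # Vector: size known only at runtime → heap-allocated
make_stack(x) = (x, 2x)    # Tuple: size fixed at compile time → stack-allocated

make_heap(1.0); make_stack(1.0)            # warm up to exclude compilation

heap_bytes  = @allocated make_heap(1.0)    # nonzero: each call allocates a Vector
stack_bytes = @allocated make_stack(1.0)   # zero: the Tuple lives on the stack
```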
Arrays have to be heap-allocated because their size (and thus the amount of memory they take up) is determined at runtime. But there are structures in Julia which are stack-allocated: immutable structs, for example, are stack-allocated “value types”, and tuples are a stack-allocated collection. The most useful data structure for NonlinearSolve, though, is the StaticArray from the package StaticArrays.jl. These arrays have their length determined at compile time. For example:
```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
typeof(A)
```

```
SVector{3, Float64} (alias for StaticArraysCore.SArray{Tuple{3}, Float64, 1, 3})
```

Notice that the `3` after `SVector` gives the size of the SVector. It cannot be changed. Additionally, SVectors are immutable, so we have to create a new SVector to change values. But remember, we don't have to worry about allocations because this data structure is stack-allocated. SArrays have numerous extra optimizations as well: they have fast matrix multiplication, fast QR factorizations, etc., which directly make use of the information about the size of the array. Thus, when possible, they should be used.
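Since SVectors are immutable, “modifying” one means constructing a new one, and since they are stack-allocated, that construction is essentially free. A quick sketch of both points (the helper `double` is illustrative):

```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
# A[2] = 10.0                          # would throw: SArrays don't support setindex!
B = StaticArrays.SA[A[1], 10.0, A[3]]  # "update" by building a new SVector instead

double(v) = v .* 2 .+ 1                # broadcasting on an SVector returns a new SVector
double(A)                              # warm up to exclude compilation
bytes = @allocated double(A)           # zero heap allocations
```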
Unfortunately, static arrays can only be used for sufficiently small arrays. Beyond a certain size, some operations on them are forced to heap-allocate anyway, and their compile times balloon. Thus, static arrays shouldn't be used if your system has more than ~20 variables. Additionally, only the native Julia algorithms can fully utilize static arrays.
Let's ***optimize our nonlinear solve using static arrays***. Note that in this case, we want to use the out-of-place allocating form, but this time we want to output a static array. Doing it with broadcasting looks like:
```julia
f_SA(u, p) = u .* u .- p

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 218 evaluations per sample.
 Range (min … max):  314.628 ns … 265.193 μs  ┊ GC (min … max): 0.00% … 99.72%
 Time  (median):     353.069 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   710.490 ns ±   3.953 μs  ┊ GC (mean ± σ):  24.84% ± 8.21%

 Memory estimate: 2.06 KiB, allocs estimate: 13.
```

Note that the only change here is that `u0` is made into a StaticArray! If we needed to write `f` out for a more complex nonlinear case, then we'd simply do the following:
```julia
f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 200 evaluations per sample.
 Range (min … max):  336.500 ns … 561.547 μs  ┊ GC (min … max): 0.00% … 99.65%
 Time  (median):     509.572 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.130 μs ±   8.897 μs  ┊ GC (mean ± σ):  25.88% ± 6.45%

 Memory estimate: 2.06 KiB, allocs estimate: 13.
```

However, notice that this did not give us a speedup but rather a slowdown. This is because many of the methods in NonlinearSolve.jl are designed to scale to larger problems, and thus aggressively cache intermediate values, which pays off for large problems but not for these smaller problems with static arrays. In order to see the full benefit, we need to move to one of the methods from SimpleNonlinearSolve.jl, which are designed for these small-scale static examples. Let's now use `SimpleNewtonRaphson`:
```julia
BenchmarkTools.@benchmark NLS.solve(prob, NLS.SimpleNewtonRaphson())
```

```
BenchmarkTools.Trial: 10000 samples with 961 evaluations per sample.
 Range (min … max):   94.256 ns …  71.443 μs  ┊ GC (min … max): 0.00% … 99.79%
 Time  (median):     138.178 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   153.062 ns ± 798.270 ns  ┊ GC (mean ± σ):  7.00% ± 1.41%

 Memory estimate: 80 bytes, allocs estimate: 1.
```

And there we go: around 140 ns, down from our starting point of roughly 11 μs!