Code Optimization for Small Nonlinear Systems in Julia

General Code Optimization in Julia

Before starting this tutorial, we recommend the reader check out one of the many tutorials on optimizing Julia code, such as the performance tips section of the Julia manual.

User-side optimizations are important because, for sufficiently difficult problems, most of the time will be spent inside your f function, the function you are trying to solve. "Efficient solvers" are those that reduce the number of f calls required to hit the error tolerance. The main ideas for optimizing your nonlinear solver code, or any Julia function, are the following:

  • Make it non-allocating
  • Use StaticArrays for small arrays
  • Use broadcast fusion
  • Make it type-stable
  • Reduce redundant calculations
  • Make use of BLAS calls
  • Optimize algorithm choice

We'll discuss these strategies in the context of nonlinear solvers. Let's start with small systems.

Optimizing Nonlinear Solver Code for Small Systems

Take for example a prototypical small nonlinear solver code in its out-of-place form:

import NonlinearSolve as NLS

f(u, p) = u .* u .- p
u0 = [1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f, u0, p)
sol = NLS.solve(prob, NLS.NewtonRaphson())
retcode: Success
u: 2-element Vector{Float64}:
 1.4142135623730951
 1.4142135623730951

We can use BenchmarkTools.jl to get more precise timings:

import BenchmarkTools

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  12.192 μs … 78.686 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.004 μs               GC (median):    0.00%
 Time  (mean ± σ):   13.284 μs ±  1.622 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▅█                                                       
  ▃▆██▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▁▂▂▂▁▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.2 μs         Histogram: frequency by time        22.9 μs <

 Memory estimate: 6.09 KiB, allocs estimate: 67.

Note that this way of writing the function is a shorthand for:

f(u, p) = [u[1] * u[1] - p, u[2] * u[2] - p]
f (generic function with 1 method)

where the function f returns an array. This is a common pattern from things like MATLAB's fzero or SciPy's scipy.optimize.fsolve. However, by design it's very slow: in the benchmark you can see that there are many allocations. These allocations cause the memory allocator to take more time than the actual numerics, which is one of the reasons why codes from MATLAB and Python end up slow.
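To see the cost concretely, note that every call to the out-of-place f constructs a brand new array on the heap. A quick sketch (not part of the solver code, just an illustration):

```julia
f(u, p) = u .* u .- p

u = [1.0, 1.0]
out1 = f(u, 2.0)
out2 = f(u, 2.0)

# Two calls produce two distinct heap-allocated arrays,
# even though their contents are identical:
out1 == out2   # true: same values
out1 === out2  # false: different objects, i.e. a fresh allocation per call
```

Inside a Newton iteration, f is called many times, so these per-call allocations add up quickly.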

In order to avoid this issue, we can use a non-allocating "in-place" approach. Written out by hand, this looks like:

function f(du, u, p)
    du[1] = u[1] * u[1] - p
    du[2] = u[2] * u[2] - p
    return nothing
end

prob = NLS.NonlinearProblem(f, u0, p)
BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.562 μs … 191.077 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     12.283 μs                GC (median):    0.00%
 Time  (mean ± σ):   12.586 μs ±   4.073 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▅█                                                        
  ▃▇██▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▂▁▁▂▁▁▁▁▁▁▂▂▂▂▁▂▂▂▂▂ ▃
  11.6 μs         Histogram: frequency by time           22 μs <

 Memory estimate: 4.58 KiB, allocs estimate: 50.

Notice that this already runs faster and allocates less! We can make this code even simpler by using .= in-place broadcasting:

function f(du, u, p)
    du .= u .* u .- p
    return nothing
end

BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.802 μs … 77.654 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     12.584 μs               GC (median):    0.00%
 Time  (mean ± σ):   12.848 μs ±  1.595 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂██                                                      
  ▂▄███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▁▂▂▂▂▂▂▂▂ ▃
  11.8 μs         Histogram: frequency by time          22 μs <

 Memory estimate: 4.58 KiB, allocs estimate: 50.
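Under the hood, the .= broadcast fuses the entire right-hand side into a single loop that writes directly into du, creating no intermediate arrays. Conceptually it is equivalent to the following hand-written loop (f_loop is a hypothetical name for this sketch, not part of the tutorial's code):

```julia
function f_loop(du, u, p)
    # One pass over the data, writing in place: no temporaries are created
    @inbounds for i in eachindex(du, u)
        du[i] = u[i] * u[i] - p
    end
    return nothing
end

du = zeros(2)
f_loop(du, [1.0, 1.0], 2.0)
du  # [-1.0, -1.0], the same result as the broadcasted version
```

Broadcast fusion is what lets the compact `du .= u .* u .- p` form match the performance of the explicit loop.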

Further Optimizations for Small Nonlinear Solves with Static Arrays and SimpleNonlinearSolve

Allocations are only expensive if they are "heap allocations". For a more in-depth definition of heap allocations, there are many sources online. But a good working definition is that heap allocations are variable-sized slabs of memory which have to be pointed to, and this pointer indirection costs time. Additionally, the heap has to be managed, and the garbage collector has to actively keep track of what's on the heap.

However, there's an alternative to heap allocations, known as stack allocations. The stack is statically-sized (known at compile time) and thus its accesses are quick. Additionally, the exact block of memory is known in advance by the compiler, and thus re-using the memory is cheap. This means that allocating on the stack has essentially no cost!
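As a quick illustration of the difference, compare summing a heap-allocated Vector against summing a stack-allocated Tuple (sum_vector and sum_tuple are hypothetical helpers for this sketch):

```julia
function sum_vector()
    x = [1.0, 2.0, 3.0]  # Vector: size known only at runtime, so heap-allocated
    return sum(x)
end

function sum_tuple()
    x = (1.0, 2.0, 3.0)  # Tuple: size known at compile time, so it lives on the stack
    return sum(x)
end

sum_vector(); sum_tuple()  # call once so compilation is excluded from the measurement
Base.@allocated(sum_vector())  # > 0: the Vector is heap-allocated
Base.@allocated(sum_tuple())   # 0: nothing touches the heap
```

The two functions compute the same answer; only where the intermediate collection lives differs.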

Arrays have to be heap-allocated because their size (and thus the amount of memory they take up) is determined at runtime. But there are structures in Julia which are stack-allocated. Immutable structs, for example, are stack-allocated "value types", and tuples are a stack-allocated collection. The most useful data structure for NonlinearSolve, though, is the StaticArray from the package StaticArrays.jl. These arrays have their length determined at compile time. They are created using macros attached to normal array expressions, for example:

import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
typeof(A)
SVector{3, Float64} (alias for StaticArraysCore.SArray{Tuple{3}, Float64, 1, 3})

Notice that the 3 after SVector gives the size of the SVector. It cannot be changed. Additionally, SVectors are immutable, so we have to create a new SVector to change values. But remember, we don't have to worry about allocations because this data structure is stack-allocated. SArrays have numerous extra optimizations as well: they have fast matrix multiplication, fast QR factorizations, etc. which directly make use of the information about the size of the array. Thus, when possible, they should be used.
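For example, "changing" an entry of an SVector means constructing a new one, which is cheap because it never touches the heap (a small sketch):

```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]

# A[1] = 1.0 would throw an error: SVectors are immutable.
# Instead, construct a new SVector with the entry changed:
B = StaticArrays.SA[1.0, A[2], A[3]]

# Broadcasting also returns a new stack-allocated SVector of the same size:
C = A .+ 1.0
typeof(C) === typeof(A)  # true
```

Since all of these results are stack-allocated, "replace rather than mutate" carries essentially no cost for small static arrays.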

Unfortunately, static arrays can only be used for sufficiently small arrays. Past a certain size, operations on them are forced to heap allocate anyway and their compile times balloon. Thus, static arrays shouldn't be used if your system has more than roughly 20 variables. Additionally, only the native Julia algorithms can fully utilize static arrays.

Let's optimize our nonlinear solve using static arrays. Note that in this case, we want to use the out-of-place allocating form, but this time we want to output a static array. Doing it with broadcasting looks like:

f_SA(u, p) = u .* u .- p

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.983 μs … 542.576 μs   GC (min … max): 0.00% … 98.39%
 Time  (median):     3.249 μs                GC (median):    0.00%
 Time  (mean ± σ):   3.534 μs ±   7.544 μs   GC (mean ± σ):  2.98% ±  1.39%

      ▄██▆▂                                                    
  ▁▁▃▇█████▅▄▂▂▂▃▄▄▃▃▄▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  2.98 μs         Histogram: frequency by time        4.74 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 16.

Note that the only change here is that u0 is made into a StaticArray! If we needed to write f out for a more complex nonlinear case, then we'd simply do the following:

f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.716 μs … 531.511 μs   GC (min … max): 0.00% … 98.48%
 Time  (median):     3.029 μs                GC (median):    0.00%
 Time  (mean ± σ):   3.307 μs ±   7.470 μs   GC (mean ± σ):  3.16% ±  1.39%

        ▃██▅                                                  
  ▂▁▂▂▃▇████▇▅▄▃▃▃▄▅▄▄▄▄▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.72 μs         Histogram: frequency by time        4.48 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 16.

However, notice that while static arrays gave us a nice speedup, the solve still allocates. This is because many of the methods in NonlinearSolve.jl are designed to scale to larger problems, and thus aggressively do things like caching, which is good for large problems but not for these smaller problems and static arrays. In order to see the full benefit, we need to move to one of the methods from SimpleNonlinearSolve.jl, which are designed for these small-scale static examples. Let's now use SimpleNewtonRaphson:

BenchmarkTools.@benchmark NLS.solve(prob, NLS.SimpleNewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 47 evaluations per sample.
 Range (min … max):  889.979 ns … 4.979 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     894.660 ns               GC (median):    0.00%
 Time  (mean ± σ):   905.790 ns ± 56.500 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█▅▄▂                                             ▁        ▁
  ██████▇▇▆▅▆▅▅▅▅▅▅▄▄▄▃▄▅▃▃▃▅▃▄▄▄▂▅▄▃▂▃▃▃▂▃▃▂▃▂▄▂▄▇████▇▆▅▅▄ █
  890 ns        Histogram: log(frequency) by time      1.07 μs <

 Memory estimate: 80 bytes, allocs estimate: 1.

And there we go: around 900ns with a single allocation, down from our starting point of around 13μs!
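Putting all of the pieces together, the fully optimized version of this small solve combines the static-array problem setup with a SimpleNonlinearSolve method:

```julia
import NonlinearSolve as NLS
import StaticArrays

# Out-of-place f returning a stack-allocated SVector
f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

# SimpleNewtonRaphson avoids the caching machinery of the large-scale solvers
sol = NLS.solve(prob, NLS.SimpleNewtonRaphson())
sol.u  # both components converge to sqrt(2) ≈ 1.4142135623730951
```

Every piece here appeared earlier in the tutorial; this snippet simply gathers them into one place for easy copying.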