Code Optimization for Small Nonlinear Systems in Julia

General Code Optimization in Julia

Before starting this tutorial, we recommend the reader check out one of the many tutorials on optimizing Julia code, such as the performance tips section of the Julia manual.

User-side optimizations are important because, for sufficiently difficult problems, most of the time will be spent inside your f function, the function you are trying to solve. "Efficient solvers" are those that reduce the number of f calls required to hit the error tolerance. The main ideas for optimizing your nonlinear solver code, or any Julia function, are the following:

  • Make it non-allocating
  • Use StaticArrays for small arrays
  • Use broadcast fusion
  • Make it type-stable
  • Reduce redundant calculations
  • Make use of BLAS calls
  • Optimize algorithm choice

We'll discuss these strategies in the context of nonlinear solvers. Let's start with small systems.

Optimizing Nonlinear Solver Code for Small Systems

Take for example a prototypical small nonlinear solver code in its out-of-place form:

import NonlinearSolve as NLS

f(u, p) = u .* u .- p
u0 = [1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f, u0, p)
sol = NLS.solve(prob, NLS.NewtonRaphson())
retcode: Success
u: 2-element Vector{Float64}:
 1.4142135623730951
 1.4142135623730951

We can use BenchmarkTools.jl to get more precise timings:

import BenchmarkTools

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  12.192 μs … 78.686 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.004 μs               GC (median):    0.00%
 Time  (mean ± σ):   13.284 μs ±  1.622 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▅█                                                       
  ▃▆██▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▁▂▂▂▁▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.2 μs         Histogram: frequency by time        22.9 μs <

 Memory estimate: 6.09 KiB, allocs estimate: 67.

Note that this way of writing the function is a shorthand for:

f(u, p) = [u[1] * u[1] - p, u[2] * u[2] - p]
f (generic function with 1 method)

where the function f returns an array. This is a common pattern from things like MATLAB's fzero or SciPy's scipy.optimize.fsolve. However, by design it's very slow: in the benchmark you can see that there are many allocations. These allocations cause the memory allocator to take more time than the actual numerics, which is one of the reasons why codes from MATLAB and Python end up slow.
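To see the cost concretely, note that every call to the out-of-place f constructs a brand new array on the heap. A quick sketch (not part of the solver code, just an illustration):

```julia
f(u, p) = u .* u .- p

u = [1.0, 1.0]
out1 = f(u, 2.0)
out2 = f(u, 2.0)

# Two calls produce two distinct heap-allocated arrays,
# even though their contents are identical:
out1 == out2   # true: same values
out1 === out2  # false: different objects, i.e. a fresh allocation per call
```

Inside a Newton iteration, f is called many times, so these per-call allocations add up quickly.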

In order to avoid this issue, we can use a non-allocating "in-place" approach. Written out by hand, this looks like:

function f(du, u, p)
    du[1] = u[1] * u[1] - p
    du[2] = u[2] * u[2] - p
    return nothing
end

prob = NLS.NonlinearProblem(f, u0, p)
BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.562 μs … 191.077 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     12.283 μs                GC (median):    0.00%
 Time  (mean ± σ):   12.586 μs ±   4.073 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▅█                                                        
  ▃▇██▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▂▁▁▂▁▁▁▁▁▁▂▂▂▂▁▂▂▂▂▂ ▃
  11.6 μs         Histogram: frequency by time           22 μs <

 Memory estimate: 4.58 KiB, allocs estimate: 50.

Notice that this already runs faster and allocates less! We can make this code even simpler by using .= in-place broadcasting:

function f(du, u, p)
    du .= u .* u .- p
    return nothing
end

BenchmarkTools.@benchmark sol = NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.802 μs … 77.654 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     12.584 μs               GC (median):    0.00%
 Time  (mean ± σ):   12.848 μs ±  1.595 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂██                                                      
  ▂▄███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▁▂▂▂▂▂▂▂▂ ▃
  11.8 μs         Histogram: frequency by time          22 μs <

 Memory estimate: 4.58 KiB, allocs estimate: 50.
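Under the hood, the .= broadcast fuses the entire right-hand side into a single loop that writes directly into du, creating no intermediate arrays. Conceptually it is equivalent to the following hand-written loop (f_loop is a hypothetical name for this sketch, not part of the tutorial's code):

```julia
function f_loop(du, u, p)
    # One pass over the data, writing in place: no temporaries are created
    @inbounds for i in eachindex(du, u)
        du[i] = u[i] * u[i] - p
    end
    return nothing
end

du = zeros(2)
f_loop(du, [1.0, 1.0], 2.0)
du  # [-1.0, -1.0], the same result as the broadcasted version
```

Broadcast fusion is what lets the compact `du .= u .* u .- p` form match the performance of the explicit loop.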

Further Optimizations for Small Nonlinear Solves with Static Arrays and SimpleNonlinearSolve

Allocations are only expensive if they are "heap allocations". For a more in-depth definition of heap allocations, there are many sources online. But a good working definition is that heap allocations are variable-sized slabs of memory which have to be pointed to, and this pointer indirection costs time. Additionally, the heap has to be managed, and the garbage collector has to actively keep track of what's on the heap.

However, there's an alternative to heap allocations, known as stack allocations. The stack is statically-sized (known at compile time) and thus its accesses are quick. Additionally, the exact block of memory is known in advance by the compiler, and thus re-using the memory is cheap. This means that allocating on the stack has essentially no cost!
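As a quick illustration of the difference, compare summing a heap-allocated Vector against summing a stack-allocated Tuple (sum_vector and sum_tuple are hypothetical helpers for this sketch):

```julia
function sum_vector()
    x = [1.0, 2.0, 3.0]  # Vector: size known only at runtime, so heap-allocated
    return sum(x)
end

function sum_tuple()
    x = (1.0, 2.0, 3.0)  # Tuple: size known at compile time, so it lives on the stack
    return sum(x)
end

sum_vector(); sum_tuple()  # call once so compilation is excluded from the measurement
Base.@allocated(sum_vector())  # > 0: the Vector is heap-allocated
Base.@allocated(sum_tuple())   # 0: nothing touches the heap
```

The two functions compute the same answer; only where the intermediate collection lives differs.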

Arrays have to be heap-allocated because their size (and thus the amount of memory they take up) is determined at runtime. But there are structures in Julia which are stack-allocated. Immutable structs, for example, are stack-allocated "value types", and tuples are a stack-allocated collection. The most useful data structure for NonlinearSolve, though, is the StaticArray from the package StaticArrays.jl. These arrays have their length determined at compile time. They are created using macros attached to normal array expressions, for example:

import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]
typeof(A)
SVector{3, Float64} (alias for StaticArraysCore.SArray{Tuple{3}, Float64, 1, 3})

Notice that the 3 after SVector gives the size of the SVector. It cannot be changed. Additionally, SVectors are immutable, so we have to create a new SVector to change values. But remember, we don't have to worry about allocations because this data structure is stack-allocated. SArrays have numerous extra optimizations as well: they have fast matrix multiplication, fast QR factorizations, etc. which directly make use of the information about the size of the array. Thus, when possible, they should be used.
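For example, "changing" an entry of an SVector means constructing a new one, which is cheap because it never touches the heap (a small sketch):

```julia
import StaticArrays

A = StaticArrays.SA[2.0, 3.0, 5.0]

# A[1] = 1.0 would throw an error: SVectors are immutable.
# Instead, construct a new SVector with the entry changed:
B = StaticArrays.SA[1.0, A[2], A[3]]

# Broadcasting also returns a new stack-allocated SVector of the same size:
C = A .+ 1.0
typeof(C) === typeof(A)  # true
```

Since all of these results are stack-allocated, "replace rather than mutate" carries essentially no cost for small static arrays.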

Unfortunately, static arrays can only be used for sufficiently small arrays. Past a certain size, operations on them are forced to heap allocate anyway and their compile times balloon. Thus, static arrays shouldn't be used if your system has more than roughly 20 variables. Additionally, only the native Julia algorithms can fully utilize static arrays.

Let's optimize our nonlinear solve using static arrays. Note that in this case, we want to use the out-of-place allocating form, but this time we want to output a static array. Doing it with broadcasting looks like:

f_SA(u, p) = u .* u .- p

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.983 μs … 542.576 μs   GC (min … max): 0.00% … 98.39%
 Time  (median):     3.249 μs                GC (median):    0.00%
 Time  (mean ± σ):   3.534 μs ±   7.544 μs   GC (mean ± σ):  2.98% ±  1.39%

      ▄██▆▂                                                    
  ▁▁▃▇█████▅▄▂▂▂▃▄▄▃▃▄▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  2.98 μs         Histogram: frequency by time        4.74 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 16.

Note that the only change here is that u0 is made into a StaticArray! If we needed to write f out for a more complex nonlinear case, then we'd simply do the following:

f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

BenchmarkTools.@benchmark NLS.solve(prob, NLS.NewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.716 μs … 531.511 μs   GC (min … max): 0.00% … 98.48%
 Time  (median):     3.029 μs                GC (median):    0.00%
 Time  (mean ± σ):   3.307 μs ±   7.470 μs   GC (mean ± σ):  3.16% ±  1.39%

        ▃██▅                                                  
  ▂▁▂▂▃▇████▇▅▄▃▃▃▄▅▄▄▄▄▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.72 μs         Histogram: frequency by time        4.48 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 16.

However, notice that while static arrays gave us a nice speedup, the solve still allocates. This is because many of the methods in NonlinearSolve.jl are designed to scale to larger problems, and thus aggressively do things like caching, which is good for large problems but not for these smaller problems and static arrays. In order to see the full benefit, we need to move to one of the methods from SimpleNonlinearSolve.jl, which are designed for these small-scale static examples. Let's now use SimpleNewtonRaphson:

BenchmarkTools.@benchmark NLS.solve(prob, NLS.SimpleNewtonRaphson())
BenchmarkTools.Trial: 10000 samples with 47 evaluations per sample.
 Range (min … max):  889.979 ns … 4.979 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     894.660 ns               GC (median):    0.00%
 Time  (mean ± σ):   905.790 ns ± 56.500 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█▅▄▂                                             ▁        ▁
  ██████▇▇▆▅▆▅▅▅▅▅▅▄▄▄▃▄▅▃▃▃▅▃▄▄▄▂▅▄▃▂▃▃▃▂▃▃▂▃▂▄▂▄▇████▇▆▅▅▄ █
  890 ns        Histogram: log(frequency) by time      1.07 μs <

 Memory estimate: 80 bytes, allocs estimate: 1.

And there we go: around 900ns with a single allocation, down from our starting point of around 13μs!
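Putting all of the pieces together, the fully optimized version of this small solve combines the static-array problem setup with a SimpleNonlinearSolve method:

```julia
import NonlinearSolve as NLS
import StaticArrays

# Out-of-place f returning a stack-allocated SVector
f_SA(u, p) = StaticArrays.SA[u[1] * u[1] - p, u[2] * u[2] - p]

u0 = StaticArrays.SA[1.0, 1.0]
p = 2.0
prob = NLS.NonlinearProblem(f_SA, u0, p)

# SimpleNewtonRaphson avoids the caching machinery of the large-scale solvers
sol = NLS.solve(prob, NLS.SimpleNewtonRaphson())
sol.u  # both components converge to sqrt(2) ≈ 1.4142135623730951
```

Every piece here appeared earlier in the tutorial; this snippet simply gathers them into one place for easy copying.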