Linear System Solvers

LS.solve(prob::LS.LinearProblem, alg; kwargs...)

Solves $Au = b$ for $u$ in the problem defined by prob, using the algorithm alg. If no algorithm is given, a default algorithm will be chosen.
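A minimal usage sketch, assuming LinearSolve.jl is installed and imported under the alias LS (as the signature above suggests). The matrix and right-hand side are arbitrary illustration values:

```julia
import LinearSolve as LS
using LinearAlgebra

# A small dense system A*u = b (arbitrary example values)
A = [4.0 1.0 0.0;
     1.0 3.0 1.0;
     0.0 1.0 2.0]
b = [1.0, 2.0, 3.0]
u_ref = A \ b   # reference solution from Julia's backslash, computed up front

prob = LS.LinearProblem(A, b)

# No algorithm given: the default polyalgorithm picks one
sol_default = LS.solve(prob)

# Explicitly requesting a dense LU factorization instead
sol_lu = LS.solve(prob, LS.LUFactorization())

println("default: ", sol_default.u)
println("LU:      ", sol_lu.u)
```

The solution vector is stored in the u field of the returned solution object.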

The default algorithm, nothing, is a good way to pick an algorithm that will work, but one may need to change this to get more performance or precision. If more precision is necessary, LS.QRFactorization() and LS.SVDFactorization() are the best choices, with SVD being the slowest but most precise.
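One way to see the precision trade-off is to solve an ill-conditioned system with several factorizations and compare each result against a known solution. This sketch uses a Hilbert matrix, which is our own choice for illustration rather than something from the text. The exact error values depend on the hardware and BLAS, so none are shown:

```julia
import LinearSolve as LS
using LinearAlgebra

# Hilbert matrix: its condition number grows very quickly with n
n = 8
A = [1.0 / (i + j - 1) for i in 1:n, j in 1:n]
u_true = ones(n)
b = A * u_true   # right-hand side chosen so the exact solution is all ones

prob = LS.LinearProblem(A, b)

# Relative error of each factorization, keyed by algorithm name
errs = Dict{Symbol, Float64}()
for alg in (LS.LUFactorization(), LS.QRFactorization(), LS.SVDFactorization())
    sol = LS.solve(prob, alg)
    errs[nameof(typeof(alg))] = norm(sol.u - u_true) / norm(u_true)
end

for (name, err) in errs
    println(name, ": relative error = ", err)
end
```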

For efficiency, RFLUFactorization is the fastest for dense LU-factorizations up to around 150x150 matrices, though this can depend on the exact details of the hardware. Beyond this point, MKLLUFactorization is usually faster on most hardware. Note that on Mac computers AppleAccelerateLUFactorization is generally the fastest. OpenBLASLUFactorization provides direct OpenBLAS calls without going through libblastrampoline and can be faster than LUFactorization in some configurations. LUFactorization will use your base system BLAS, which can be fast or slow depending on the hardware configuration. SimpleLUFactorization will be fast only on very small matrices but can cut down on compile times.
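Because these crossover points depend on the machine, a quick check is to time the candidates on a matrix of the size you care about. This is a rough sketch that only uses methods assumed to ship with LinearSolve.jl itself; other methods named above (RFLUFactorization, MKLLUFactorization and so on) can be appended to the tuple if they are available in your environment, and some of them may require optional packages to be loaded:

```julia
import LinearSolve as LS
using LinearAlgebra

n = 100
A = rand(n, n) + n * I   # diagonal shift keeps the example well conditioned
b = rand(n)
prob = LS.LinearProblem(A, b)

algs = (LS.LUFactorization(), LS.SimpleLUFactorization())
for alg in algs
    LS.solve(prob, alg)                 # warm-up call so compilation is not timed
    t = @elapsed LS.solve(prob, alg)    # single timing; noisy but indicative
    println(rpad(string(nameof(typeof(alg))), 24), t, " s")
end
```

A single @elapsed measurement is noisy; BenchmarkTools.jl gives more reliable numbers when the comparison matters.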

For very large dense factorizations, offloading to the GPU can be preferred. Metal.jl can be used on Mac hardware to offload; it becomes faster at around 20,000 x 20,000 matrices and only supports Float32. CudaOffloadLUFactorization and CudaOffloadQRFactorization can pay off at a much smaller size, possibly around 1,000 x 1,000 matrices, though this is highly dependent on the chosen GPU hardware. These algorithms require a CUDA-compatible NVIDIA GPU. CUDA offload supports Float64, but most consumer GPU hardware is much faster on Float32 (many are >32x faster for Float32 operations than for Float64 operations), so for most hardware this is only recommended for Float32 matrices. Choose CudaOffloadLUFactorization for better performance on well-conditioned problems, or CudaOffloadQRFactorization for better numerical stability on ill-conditioned problems.

Mixed Precision Methods

For large well-conditioned problems where memory bandwidth is the bottleneck, mixed precision methods can provide significant speedups (up to 2x) by performing the factorization in Float32 while maintaining Float64 interfaces. These methods are particularly effective for:

Large dense matrices (> 1000x1000)
Well-conditioned problems (condition number < 10^4)
Hardware with good Float32 performance

Available mixed precision solvers:

MKL32MixedLUFactorization - CPUs with MKL
AppleAccelerate32MixedLUFactorization - Apple CPUs with Accelerate
CUDAOffload32MixedLUFactorization - NVIDIA GPUs with CUDA
MetalOffload32MixedLUFactorization - Apple GPUs with Metal

These methods automatically handle the precision conversion, making them easy drop-in replacements when reduced precision is acceptable for the factorization step.
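To illustrate the accuracy trade-off of factorizing in Float32 while keeping a Float64 interface, here is a hand-written sketch using only the LinearAlgebra standard library. It does not call the mixed precision solvers listed above (those need MKL, Accelerate, CUDA or Metal), and it makes no claim about how they are implemented internally:

```julia
using LinearAlgebra

n = 500
A = rand(n, n) + n * I   # well conditioned, as these methods assume
b = rand(n)

# Reference: factorize and solve entirely in Float64
u64 = lu(A) \ b

# Mixed precision idea: factorize a Float32 copy, solve, convert back to Float64
F32 = lu(Float32.(A))
u_mixed = Float64.(F32 \ Float32.(b))

relres(u) = norm(A * u - b) / norm(b)   # relative residual
println("Float64 LU residual: ", relres(u64))
println("Float32 LU residual: ", relres(u_mixed))
```

The Float32 result keeps a Float64 element type but is only accurate to roughly single precision, which is why these methods are recommended only for well-conditioned problems.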

Note

Performance details for dense LU-factorizations can be highly dependent on the hardware configuration. For details see this issue. If one is looking to best optimize their system, we suggest running the performance tuning benchmark.

Sparse Matrices

For sparse LU-factorizations, use KLUFactorization if there is less structure to the sparsity pattern and UMFPACKFactorization if there is more structure. Pardiso.jl's methods are also known to be very efficient sparse linear solvers.
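A sketch of selecting a sparse factorization explicitly, assuming SparseArrays is loaded alongside LinearSolve.jl. The tridiagonal test matrix is an arbitrary example with a very regular pattern; KLUFactorization() is passed in the same position when the pattern is less structured:

```julia
import LinearSolve as LS
using SparseArrays, LinearAlgebra

n = 1_000
# 1D Poisson-style tridiagonal matrix: a simple, highly structured sparsity pattern
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
b = ones(n)
prob = LS.LinearProblem(A, b)

sol_umf = LS.solve(prob, LS.UMFPACKFactorization())   # explicit choice
sol_default = LS.solve(prob)                          # let the default pick
```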

For GPU-accelerated sparse LU-factorizations, there are two high-performance options. When using CuSparseMatrixCSR arrays with CUDSS.jl loaded, LUFactorization() will automatically use NVIDIA's cuDSS library. Alternatively, CUSOLVERRFFactorization provides access to NVIDIA's cusolverRF library. Both offer significant performance improvements for sparse systems on CUDA-capable GPUs and are particularly effective for large sparse matrices that benefit from GPU parallelization. cuDSS is better suited to Float32, while CUSOLVERRFFactorization is intended for Float64.

While these sparse factorizations are based on implementations in other languages and are therefore constrained to standard number types (Float64, Float32 and their complex counterparts), SparspakFactorization can handle general number types, e.g. those defined by ForwardDiff.jl, MultiFloats.jl, or IntervalArithmetic.jl.
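As an illustration of a non-standard number type, the sketch below solves a sparse BigFloat system. It assumes that Sparspak.jl and SparseArrays are installed and that loading them enables SparspakFactorization; BigFloat is our own choice of generic type here:

```julia
import LinearSolve as LS
using SparseArrays, Sparspak, LinearAlgebra

n = 50
# Tridiagonal, diagonally dominant matrix stored with BigFloat entries
A = spdiagm(-1 => fill(big(-1.0), n - 1),
             0 => fill(big(4.0), n),
             1 => fill(big(-1.0), n - 1))
b = ones(BigFloat, n)

prob = LS.LinearProblem(A, b)
sol = LS.solve(prob, LS.SparspakFactorization())

println(eltype(sol.u))                      # element type of the solution
println(Float64(norm(A * sol.u - b)))       # residual, shown as a Float64
```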

As sparse matrices get larger, iterative solvers tend to become more efficient than factorization methods, especially when a less accurate solution (a looser tolerance) is acceptable.

Krylov.jl generally outperforms IterativeSolvers.jl and KrylovKit.jl and is compatible with both CPUs and GPUs, so it is generally the preferred choice for Krylov methods. Choose the Krylov method that exploits the most structure of your operator: for example, KrylovJL_CG() if the operator is symmetric positive definite, and KrylovJL_GMRES() if it has no special properties.
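A sketch of matching the Krylov method to the operator, with reltol passed to solve to set the iterative tolerance. Both test matrices are arbitrary diagonally dominant examples chosen so that the solvers converge quickly:

```julia
import LinearSolve as LS
using SparseArrays, LinearAlgebra

n = 200
# Symmetric positive definite tridiagonal matrix: a good fit for CG
A_spd = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
# Nonsymmetric matrix with no special structure assumed: use GMRES
A_gen = spdiagm(-1 => fill(-2.0, n - 1), 0 => fill(5.0, n), 1 => fill(1.0, n - 1))
b = ones(n)

sol_cg = LS.solve(LS.LinearProblem(A_spd, b), LS.KrylovJL_CG(); reltol = 1e-8)
sol_gmres = LS.solve(LS.LinearProblem(A_gen, b), LS.KrylovJL_GMRES(); reltol = 1e-8)
```

Loosening reltol reduces the number of iterations, which is where iterative methods gain most over factorizations.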

Finally, a user can pass a custom function for handling the linear solve using LS.LinearSolveFunction() if existing solvers are not optimally suited for their application. The interface is detailed here.
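A minimal sketch of a custom solve function, assuming the positional argument order (A, b, u, p, isfresh, Pl, Pr, cacheval) plus keyword arguments used by LinearSolveFunction. my_linsolve is a hypothetical name, and it simply falls back to Julia's backslash operator:

```julia
import LinearSolve as LS
using LinearAlgebra

# Hypothetical custom solver: only A and b are used here; the remaining
# arguments (initial guess, parameters, preconditioners, cached data) are ignored.
function my_linsolve(A, b, u, p, isfresh, Pl, Pr, cacheval; kwargs...)
    return A \ b
end

A = [2.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
prob = LS.LinearProblem(A, b)
sol = LS.solve(prob, LS.LinearSolveFunction(my_linsolve))
```

In a real application the body of my_linsolve would call the specialized solver, and the isfresh flag can be used to skip refactorization when A has not changed.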

Lazy SciMLOperators

If the linear operator is given as a lazy non-concrete operator, such as a FunctionOperator, then a Krylov method is preferred so that the matrix is never concretized. The Krylov.jl recommendations given above for sparse matrices apply here as well.

Tip

If your materialized operator is a uniform block diagonal matrix, then you can use SimpleGMRES(; blocksize = <known block size>) to further improve performance. This often shows up in neural networks, where the Jacobian with respect to the inputs is (almost always) a uniform block diagonal matrix whose block size equals the size of the input divided by the batch size.
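A sketch of passing the block size, assuming a dense matrix with uniformly sized square blocks on its diagonal. The blocks here are random, diagonally dominant illustration data; passing a wrong blocksize would give incorrect results, so it must match the actual structure:

```julia
import LinearSolve as LS
using LinearAlgebra

blocksize = 4
nblocks = 8
# Illustration data: 8 diagonally dominant 4x4 blocks
blocks = [4.0 * I(blocksize) + 0.5 * rand(blocksize, blocksize) for _ in 1:nblocks]
A = cat(blocks...; dims = (1, 2))   # dense 32x32 uniform block diagonal matrix
b = rand(blocksize * nblocks)

prob = LS.LinearProblem(A, b)
sol = LS.solve(prob, LS.SimpleGMRES(; blocksize = blocksize))
```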

Full List of Methods

Polyalgorithms

LinearSolve.DefaultLinearSolver — Type

DefaultLinearSolver(;safetyfallback=true)

The default linear solver. This is the algorithm chosen when solve(prob) is called. It's a polyalgorithm that detects the optimal method for a given A, b and hardware (Intel, AMD, GPU, etc.).

Keyword Arguments

safetyfallback: determines whether to fall back to a column-pivoted QR factorization when an LU factorization fails, which can be required if A is rank-deficient. Defaults to true.