How vectorization can make Julia faster, but also slower.
Vectorization, at least for the purposes of this topic, is when operations on a collection of values are transformed into vectorized instructions, that is, single instruction, multiple data (SIMD).
Julia’s JIT compiler is already capable of vectorizing expressions, but it can still be surprising how differently equivalent expressions end up leveraging this, to good and bad effect.
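If you want to check whether a given loop actually gets vectorized, you can inspect the code Julia generates. Below is a minimal sketch (the function sumsq is just an illustration, not part of the benchmarks); the exact output depends on your CPU and Julia version.
using InteractiveUtils  # provides @code_llvm and @code_native

# A trivial reduction the compiler should be able to vectorize.
function sumsq(v)
    s = 0.0
    @simd for x in v
        s += x * x
    end
    return s
end

# Look for vector types such as <4 x double> in the LLVM IR,
# or packed SIMD instructions in the native code.
@code_llvm sumsq(rand(1000))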
Let’s take an example function that takes a single scalar and returns a scalar, which we intend to run over a large array, summing the results:
\(f(X) = \sum_{x \in X} \sin(x)^2 + \cos(x)^2\)
The observant reader will realize this has a closed-form, and even constant, solution [1], but the runtime/compiler does not know this, so it has to execute the computation.
Let’s list the requisite packages first:
using BenchmarkTools
import Random
using Base.Threads
using Distributed
using ThreadsX
Let’s define random input
Random.seed!(42)
N = 1024
X = rand(N, N)
Now we list the different ways you can try to optimize the computation.
Expression #
sin(x)^2 + cos(x)^2
Scalar function #
function ftx(x)
    return sin(x)^2 + cos(x)^2
end
Imperative reduction in place #
function fx(X)
    r = 0.0
    for x in X
        r = r + (sin(x)^2 + cos(x)^2)
    end
    return r
end
Custom threaded function in place #
A custom threaded function that uses the first element of each column to store that column’s result, so each thread writes to a distinct location (thread-safe). Column-major slicing is used for speed.
function sr!(X)
    @threads for i in 1:size(X, 2)
        @inbounds X[1, i] = sum(sin.(X[:, i]) .^ 2 .+ cos.(X[:, i]) .^ 2)
    end
    return sum(X[1, :])
end
Let’s test #
I won’t print the results here, because they’re likely to change, and it’s a better exercise if you think about which version would be faster. I will share observations on what happens.
Recall that in Julia, if f is a function, f.(X) applies it element-wise (broadcasting); for infix operators the dot precedes the operator, as in .^ and .+.
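As a quick illustration of the dot syntax (not part of the benchmarks):
x = [0.1, 0.2, 0.3]
sin.(x)                        # sin applied to each element
sin.(x) .^ 2 .+ cos.(x) .^ 2   # dotted calls and operators fuse into one element-wise pass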
@btime (from BenchmarkTools) runs the expression repeatedly and reports timing and allocation figures. @code_native shows the machine instructions that will be generated.
Interestingly, the expression, on my machine, is turned into mapreduce code, or its equivalent.
## @code_native shows a lot of calls to mapreduce, so JIT is converting this
@btime Y = sum(sin.(X).^2 .+ cos.(X).^2)
Vectorized function call #
@btime sum(ftx.(X))
Map reduce (functional programming) #
@btime mapreduce(x -> sin(x)^2 + cos(x)^2, +, X);
This leads to an allocation of 16 bytes, i.e. two 64-bit floats, so very little memory overhead.
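If you want to double-check the allocation count yourself, Base’s @allocated macro gives a rough answer; a minimal sketch, where the helper g is just for illustration (call it once first so compilation is not counted):
g(X) = mapreduce(x -> sin(x)^2 + cos(x)^2, +, X)
g(X)            # warm up: compile before measuring
@allocated g(X) # bytes allocated by the call itself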
Imperative loop #
@btime fx(X)
This also leads to a 16-byte allocation: Julia allocates two floats, no more.
Threaded loop modify in place #
Y = copy(X)
@btime sr!(Y)
Distributed #
The previous example parallelizes using threads; let’s use the distributed model instead.
@btime @distributed (+) for x in X
    sin(x)^2 + cos(x)^2
end
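Note that @distributed spreads the work over worker processes; without any workers, the loop above runs on the single main process. A minimal sketch of the setup (the worker count of 4 is an arbitrary choice):
using Distributed
addprocs(4)   # spawn 4 local worker processes
nworkers()    # confirm they are available
# Anything used inside the loop that is not in Base would need to be
# loaded on the workers with @everywhere; sin and cos are fine as-is.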
ThreadsX #
This is a specialized package, and very fast, though not as lean in memory use.
@btime ThreadsX.mapreduce(x -> sin(x)^2 + cos(x)^2, +, X)
The key advantage here is that you can keep the standard functional-programming code, so almost no code changes are needed.
Who won? #
That depends. I deliberately omitted most performance results because they will vary between Julia versions and between machines; for example, not all architectures have the same vectorization capabilities.
The reason for this experiment, for me, was to optimize an expression in an inner (3-dimensional) loop that took hours to compute with high memory overhead. Simply prefixing the loops with @threads only made things worse, so I broke the problem down to the expression alone and optimized for costs:
- Changes in code
- Extra dependencies
- My time
- Portability of results
and gains:
- Memory (ideally O(1), i.e. constant rather than linear)
- Runtime
For example, for me ThreadsX.jl was faster than all other solutions, and apart from a dependency on the package and a single prefix change, had little cost. However, it’s not O(1) in memory usage; that is, the solution trades memory for speed. The leanest (in memory) solutions were not necessarily the fastest. In general, I like writing in the functional paradigm, given that it matches my intent more closely than, say, a for loop, and it gives a better overview of potential side effects.
As a final note, if you trade memory for speed, the trade-off can fail as your data grows: for example, cache access patterns that work for a 1e6-element array may not work at 1e9 when alignment comes into play, or when the number of threads or tasks starts to vary.
Don’t try to optimize without tests in place. It’s very easy to introduce a race condition, numerical instability, rounding errors, and so on. Without automated tests that verify the output of the slow versus the fast versions, there’s no reason to do this at all.
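For instance, a minimal sketch of such a check, comparing the implementations above against a serial baseline (floating-point sums in different orders won’t match bit-for-bit, hence isapprox):
using Test

reference = fx(X)   # plain serial loop as the baseline
@test sum(ftx.(X)) ≈ reference
@test mapreduce(x -> sin(x)^2 + cos(x)^2, +, X) ≈ reference
@test ThreadsX.mapreduce(x -> sin(x)^2 + cos(x)^2, +, X) ≈ reference
@test sr!(copy(X)) ≈ reference   # copy, because sr! overwrites parts of its input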
Addendum [1] #
If you recognized the trigonometric identity in the summed expression, it’s instantly clear that the result is always N if |X| = N. A very smart compiler could actually throw away the trig function calls, as well as the sum, and just return
X = ...
function f(X)
    return reduce((x, y) -> x * y, size(X))
end
This takes as many multiplications as there are dimensions, so the complexity is O(d) with d ≪ N, instead of O(N).
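For what it’s worth, the same count is available directly from Base; a tiny sketch:
prod(size(X))   # product of the dimensions
length(X)       # equivalently, the total number of elements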
Attribution #
The snippets above are extended from a workshop by Alex Razoumov and Marie Burle at Westgrid. Newer versions of these workshops have improved and more detailed coverage of parallel and distributed Julia.