Computation Time is much longer for Float64 than Float32 #8

Open
JonasIsensee opened this issue Oct 26, 2018 · 1 comment

Comments

@JonasIsensee
Member

This is really odd.
The computation time, both for creating the graph and for searching, is nearly 2× longer for Float64 than for Float32 data points.

How can this be? (The C++ library doesn't show this behaviour.)
I thought my architecture was optimized for 64-bit...

using HNSW

dim = 128
num_elements = 10000

data = [rand(Float32, dim) for n = 1:num_elements];
hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
@time add_to_graph!(hnsw)
# --> 4.820807 seconds (415.10 k allocations: 158.142 MiB, 1.29% gc time)

data = [rand(Float64, dim) for n = 1:num_elements];
hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
@code_warntype add_to_graph!(hnsw) # to trigger compilation
@time add_to_graph!(hnsw)
# --> 7.995207 seconds (379.92 k allocations: 294.682 MiB, 0.71% gc time)
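
A side note on the snippet above: the Float32 build is timed cold, while the Float64 build gets a warm-up pass via @code_warntype first. Below is a minimal sketch that times both element types the same way (timed_build is a hypothetical helper, not part of HNSW.jl), doing an explicit warm-up build so neither measurement includes compilation:

using HNSW

# Hypothetical helper: build the graph once to compile add_to_graph! for
# element type T, then time a second, fresh build.
function timed_build(::Type{T}; dim = 128, n = 10_000) where {T}
    data = [rand(T, dim) for _ = 1:n]
    warmup = HierarchicalNSW(data; efConstruction = 200, M = 16)
    add_to_graph!(warmup)                 # warm-up: pays the compilation cost
    hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
    return @elapsed add_to_graph!(hnsw)   # measure a fresh build only
end

timed_build(Float32)
timed_build(Float64)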
@zgornel
Contributor

zgornel commented Nov 8, 2018

I'm afraid this is more normal than it looks and, in fact, a good result ^_^. The slower speed for Float64 is mostly due to the fact that more Float32 values fit inside the CPU's registers (and cache). Most probably, each SIMD register can hold twice as many 32-bit values as 64-bit ones, which improves micro-parallelism: two 32-bit operations can be performed for every 64-bit one (things are more complicated in practice, but modern superscalar architectures handle this sort of thing quite well). Given Julia's proficiency at generating highly optimized code, this is probably what is happening here.

This effect can be observed for matrix multiplication...

using BenchmarkTools

benchmark(N) = begin
    A = rand(Float32, N, N)
    B = rand(Float64, N, N)
    At = A'
    Bt = B'
    println("Matrix multiplication $N x $N:")
    print("\tFloat32 ")
    @btime $A * $At
    print("\tFloat64 ")
    @btime $B * $Bt
end

# run
benchmark.([10, 100, 1000]);

Results on my box (Core i7 3840M, 2.8Ghz, 4 cores, 8MB cache):

Matrix multiplication 10 x 10:
        Float32   561.470 ns (1 allocation: 544 bytes)
        Float64   624.624 ns (1 allocation: 896 bytes)
Matrix multiplication 100 x 100:
        Float32   43.520 μs (2 allocations: 39.14 KiB)
        Float64   62.532 μs (2 allocations: 78.20 KiB)
Matrix multiplication 1000 x 1000:
        Float32   22.744 ms (2 allocations: 3.81 MiB)
        Float64   39.698 ms (2 allocations: 7.63 MiB)

(Ran Julia 1.0.1 with OPENBLAS_NUM_THREADS=1, forcing single-threaded operation for the multiplication.)
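
The same gap shows up on a plain dot-product loop, which is closer in shape to the distance kernels an HNSW search evaluates than a BLAS matmul. A minimal sketch (dot_simd is an illustrative name, not from any package): with @simd the compiler can vectorize the reduction, so each register processes twice as many Float32 lanes, and Float64 also moves twice the bytes per element:

using BenchmarkTools

# Illustrative kernel: a vectorizable dot product, similar in shape to the
# distance computations HNSW performs per candidate.
function dot_simd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
    s = zero(T)
    @simd for i in eachindex(x, y)
        @inbounds s += x[i] * y[i]
    end
    return s
end

x32, y32 = rand(Float32, 10^6), rand(Float32, 10^6)
x64, y64 = rand(Float64, 10^6), rand(Float64, 10^6)

@btime dot_simd($x32, $y32)   # Float32: twice the SIMD lanes, half the bytes
@btime dot_simd($x64, $y64)   # Float64: typically around 2× slower here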

By the way, this may be interesting.
More generic info here and here.
