Computation Time is much longer for Float64 than Float32 #8

Open
JonasIsensee opened this issue Oct 26, 2018 · 1 comment

Comments

@JonasIsensee
Member

This is really odd.
The computation time, both for creating the graph and for searching, is nearly 2× longer for Float64 than for Float32 data points.

How can this be? (The C++ library doesn't show this behaviour.)
I thought my architecture was optimized for 64-bit...

using HNSW

dim = 128
num_elements = 10000

data = [rand(Float32, dim) for n = 1:num_elements];
hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
@time add_to_graph!(hnsw)
# --> 4.820807 seconds (415.10 k allocations: 158.142 MiB, 1.29% gc time)

data = [rand(Float64, dim) for n = 1:num_elements];
hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
@code_warntype add_to_graph!(hnsw) # to trigger compilation
@time add_to_graph!(hnsw)
# --> 7.995207 seconds (379.92 k allocations: 294.682 MiB, 0.71% gc time)
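
A side note on the snippet above: the Float32 build is timed cold, while the Float64 build gets a warm-up pass via @code_warntype first. Below is a minimal sketch that times both element types the same way (timed_build is a hypothetical helper, not part of HNSW.jl), doing an explicit warm-up build so neither measurement includes compilation:

using HNSW

# Hypothetical helper: build the graph once to compile add_to_graph! for
# element type T, then time a second, fresh build.
function timed_build(::Type{T}; dim = 128, n = 10_000) where {T}
    data = [rand(T, dim) for _ = 1:n]
    warmup = HierarchicalNSW(data; efConstruction = 200, M = 16)
    add_to_graph!(warmup)                 # warm-up: pays the compilation cost
    hnsw = HierarchicalNSW(data; efConstruction = 200, M = 16)
    return @elapsed add_to_graph!(hnsw)   # measure a fresh build only
end

timed_build(Float32)
timed_build(Float64)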
@zgornel
Contributor

zgornel commented Nov 8, 2018

I'm afraid this is more normal than it looks and, in fact, a good result ^_^. The slower speed for Float64 is mostly due to the fact that more Float32 values fit inside the CPU's registers (and cache). Most probably, each SIMD register can hold twice as many 32-bit values as 64-bit ones, which improves micro-parallelism: two 32-bit operations can be performed for every 64-bit one (things are more complicated in practice, but modern superscalar architectures handle this sort of thing quite well). Given Julia's proficiency at generating highly optimized code, this is probably what is happening here.

This effect can be observed for matrix multiplication...

using BenchmarkTools

benchmark(N) = begin
    A = rand(Float32, N, N)
    B = rand(Float64, N, N)
    At = A'
    Bt = B'
    println("Matrix multiplication $N x $N:")
    print("\tFloat32 ")
    @btime $A * $At
    print("\tFloat64 ")
    @btime $B * $Bt
end

# run
benchmark.([10, 100, 1000]);

Results on my box (Core i7 3840M, 2.8Ghz, 4 cores, 8MB cache):

Matrix multiplication 10 x 10:
        Float32   561.470 ns (1 allocation: 544 bytes)
        Float64   624.624 ns (1 allocation: 896 bytes)
Matrix multiplication 100 x 100:
        Float32   43.520 μs (2 allocations: 39.14 KiB)
        Float64   62.532 μs (2 allocations: 78.20 KiB)
Matrix multiplication 1000 x 1000:
        Float32   22.744 ms (2 allocations: 3.81 MiB)
        Float64   39.698 ms (2 allocations: 7.63 MiB)

(Ran Julia 1.0.1 with OPENBLAS_NUM_THREADS=1, forcing single-threaded operation for the multiplication.)
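
The same gap shows up on a plain dot-product loop, which is closer in shape to the distance kernels an HNSW search evaluates than a BLAS matmul. A minimal sketch (dot_simd is an illustrative name, not from any package): with @simd the compiler can vectorize the reduction, so each register processes twice as many Float32 lanes, and Float64 also moves twice the bytes per element:

using BenchmarkTools

# Illustrative kernel: a vectorizable dot product, similar in shape to the
# distance computations HNSW performs per candidate.
function dot_simd(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
    s = zero(T)
    @simd for i in eachindex(x, y)
        @inbounds s += x[i] * y[i]
    end
    return s
end

x32, y32 = rand(Float32, 10^6), rand(Float32, 10^6)
x64, y64 = rand(Float64, 10^6), rand(Float64, 10^6)

@btime dot_simd($x32, $y32)   # Float32: twice the SIMD lanes, half the bytes
@btime dot_simd($x64, $y64)   # Float64: typically around 2× slower here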

By the way, this may be interesting.
More generic info here and here.
