Slightly improved API. Dramatically improved implementation #1

Open · wants to merge 4 commits into base: master
Conversation

Steve132

I dramatically improved the implementation of encode and decode to use efficient bit-twiddling and scaling, performing the Morton-code encoding and decoding in constant time with only a few operations. The resulting code should be significantly faster.

I also changed the API to use pass-by-value where it was more efficient and more appropriate for a C API. The call-by-value change removes a class of error checking and should outperform pass-by-reference.

The API has no reason to pass the range arguments by reference; doing so slows down the function.
The arguments are not return values, and the first thing the code does is dereference them and copy them into stack variables. Passing the input ranges by value skips this step and removes a source of errors. Furthermore, the arguments are small enough (two primitives) that passing them on the stack by value can make an inlined call even faster.
Same change as before: pass the input arguments on the stack for efficiency. This also lets us drop the inputs == NULL case from the error check.
I rewrote the encode and decode functions to provide a DRAMATICALLY faster implementation that does no branching and uses only a few 64-bit integer operations, with unrolled loops doing the interleaving/deinterleaving via binary magic numbers for efficient Morton codes. The code should run much, much faster now.
@yinqiwen
Owner

I got a compile error

geohash.c: In function ‘geohash_decode’:
geohash.c:161:5: error: ‘ilato’ undeclared (first use in this function)
ilato = xyhilo; //get back the original integer coordinates
^
geohash.c:161:5: note: each undeclared identifier is reported only once for each function it appears in
geohash.c:162:5: error: ‘ilono’ undeclared (first use in this function)
ilono = xyhilo >> 32;
^
geohash.c:170:42: warning: incompatible implicit declaration of built-in function ‘ldexp’ [enabled by default]
area->latitude.min = lat_range.min + ldexp(ilato, -step) * lat_scale;

And IMO it's better to keep the old simple-and-stupid implementation, since this optimized one is hard to understand.

@yinqiwen
Owner

And about the 'geohash_decode' function: the original implementation and this optimized one return different results for the same input.

@yinqiwen
Owner

I finally understand your implementation; it's brilliant.
I'll merge this by hand and add two functions, 'geohash_decode_fast'/'geohash_encode_fast'.

@Steve132
Author

So, sorry about the compile errors. I imagine they are easy to fix. I wrote this at work without access to a C compiler.

With regard to the output not matching identically, I didn't test this because, again, no C compiler. However, any differences should be minor.

If you manage to fix the differences (which should be trivial; I'll do it if you don't), then I STRONGLY recommend not adding two new functions. There's no reason to have two different APIs that do exactly the same thing, just differently. The Unix philosophy: "There should be one, and exactly one, obvious way to do something."

Users of your API shouldn't have to choose between the 'fast' way and the 'slow' way...it's confusing. Just provide one way.

@mattsta

mattsta commented Sep 16, 2014

For comparison: the original encode took 140 ns on my machine (compiled with clang -O2) and the new encode now takes 0.57 ns to encode.

Decodes went from 100 ns to the same 0.57 ns.

(So, prior performance was 7 million encodes per second and 10 million decodes per second; now we can do 1.7 billion per second with either encodes or decodes).

New version performance comparison versus compiler optimization levels:

  • clang -O0: encode = 2 ns; decode = 0.57 ns
  • gcc -O0: encode = 101 ns; decode = 108 ns
  • clang -O1 and higher: encode = 0.57 ns; decode = 0.57 ns
  • gcc -O1: encode = 36 ns; decode = 54 ns
  • gcc -O2: encode = 30 ns; decode = 49 ns
  • gcc -O3: encode = 19 ns; decode = 21 ns

@Steve132
Author

mattsta: can you post the assembly for the encode-decode on clang -O1 vs gcc -O3? I'm amazed there is almost a factor of 30 difference in performance and I really want to see what the compiler is doing.

@mattsta

mattsta commented Sep 16, 2014

Thanks for offering to check. I put files at https://gist.github.com/mattsta/cd338cb0253e85d8dfe0. It will be easier to just clone the gist and compare locally (git clone https://gist.github.com/mattsta/cd338cb0253e85d8dfe0). If you're feeling dangerous, you can try to open the gist in your browser, but the page is very large: Direct links: clang -O1 and gcc -O3.

The assembly files are from the version of geohash-int I keep at https://github.com/mattsta/geohash-int

Assembly was generated with: [gcc-4.9 | clang] geohash.c -o [compiler-options].s -O[0,1,2,3] -S -g -fverbose-asm

The timing results were found using my dumb looping timing tester: https://github.com/mattsta/krmt/blob/master/geo/test.c

It's possible the higher optimization levels were removing some useful steps since I didn't use the results of the function anywhere again.

@yinqiwen
Owner

I tested with gcc 3.4.5/4.8.2 on Linux; strangely, the new decode function is slightly slower with gcc 3.4.5 -O2. It seems newer glibc is much faster with the math function 'ldexp'. I put all test records at
https://gist.github.com/yinqiwen/c6d0be258356996bd16b

@yinqiwen
Owner

@Steve132 @mattsta I now use simple divide & shift operations instead of the math function 'ldexp'; decode is about 300% faster than with 'ldexp'.
Test record: https://gist.github.com/yinqiwen/c6d0be258356996bd16b

@mattsta

mattsta commented Sep 18, 2014

Using the new divide & shift decode, my time for a decode (gcc-4.9 -O2) is now 29 ns from the previous 50 ns. Good fix!

@Steve132
Author

Try this out:

static inline double fastldiexp(double d, uint8_t step)
{
    /* Multiply by 2^-step by subtracting 'step' directly from the biased
       exponent field. Assumes little-endian IEEE-754 doubles; not valid
       for d == 0, subnormals, or exponent underflow. */
    union
    {
        double fval;
        struct
        {
            uint64_t m:52;  /* mantissa */
            uint64_t e:11;  /* biased exponent */
            uint64_t s:1;   /* sign */
        } bits;
    } u;
    u.fval = d;
    u.bits.e -= step;
    return u.fval;
}


area->latitude.min = lat_range.min+fastldiexp(ilato,step)*lat_scale;
area->latitude.max = lat_range.min+fastldiexp(ilato+1,step)*lat_scale;
area->longitude.min = lon_range.min+fastldiexp(ilono,step)*lon_scale;
area->longitude.max = lon_range.min+fastldiexp(ilono+1,step)*lon_scale;

@mattsta
Copy link

mattsta commented Sep 19, 2014

That's some crazy compiler voodoo there. After fixing 8 syntax errors, it decodes at 36 ns compared to 29 ns for the all-shifting version. (gcc-4.9 -O2)

I think we've found the optimal results for now with the divide and shift operations. :)
