
Replace std::vector for better speed; also fix clang C++11 #25

Open
wants to merge 1 commit into base: 2.x.y
Conversation

mitchblank

First, sorry that this PR contains more than one thing. I have a few other changes I want to propose for timsort and I'll try to keep those in standalone PRs. Because of the need for C++11 support to test this I ended up having to fix that bug as part of this performance work. I can break the libc++ changes out into a different PR if you'd rather merge that as a separate PR first.

In my application I need to sort a vector of pointers. I was profiling timsort and was surprised at how much time was being spent in copy_to_tmp(). This method was implemented as:

tmp_.clear();
tmp_.reserve(len);
GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_));

Unfortunately this performs badly on simple types like pointers. The GFX_TIMSORT_MOVE_RANGE() macro itself can reduce to a single memmove() but using std::back_inserter (which is required for move-only types) breaks this optimization. Instead of a nice SSE-optimized memory copy you end up with an element-by-element loop with a vector capacity check each time.

As an experiment I did some metaprogramming so that TriviallyCopyable types would just do:

tmp_.assign(begin, begin + len);

This basically fixed the problem, but it's still not quite perfect, since the non-trivial case still is doing extra capacity checks that aren't required. I did a more aggressive fix that replaced the std::vector use entirely with a custom data structure (since we don't really need a general purpose vector here, just a managed array) which bought another couple percent in speed.
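As a hedged sketch of that metaprogramming (illustrative names, not the PR's actual code; C++11's std::move algorithm stands in here for the GFX_TIMSORT_MOVE_RANGE macro), the dispatch could look like:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

// Trivially copyable types: assign() reduces to a single memmove().
template <typename Iter, typename T>
void copy_to_tmp_impl(std::vector<T>& tmp, Iter begin, std::size_t len,
                      std::true_type /* trivially copyable */) {
    tmp.assign(begin, begin + len);
}

// Everything else: the original move loop through a back_inserter.
template <typename Iter, typename T>
void copy_to_tmp_impl(std::vector<T>& tmp, Iter begin, std::size_t len,
                      std::false_type) {
    tmp.clear();
    tmp.reserve(len);
    std::move(begin, begin + len, std::back_inserter(tmp));
}

// Tag dispatch: pick the fast path when the value type allows it.
template <typename Iter>
void copy_to_tmp(std::vector<typename std::iterator_traits<Iter>::value_type>& tmp,
                 Iter begin, std::size_t len) {
    using T = typename std::iterator_traits<Iter>::value_type;
    copy_to_tmp_impl(tmp, begin, len, std::is_trivially_copyable<T>());
}
```

Both overloads produce the same contents; only the trivially-copyable path can collapse to a bulk memory copy.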

First I had to make two preparatory changes, though:

  1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode). The changes @mattreecebentley made to auto-detect C++11 were mostly good, but they specifically rejected anything with _LIBCPP_VERSION set. Not sure why.

Rather than one large binary expression I instead used a larger (but hopefully easier to understand) ifdef decision tree. This can also be overridden on the command line with -DGFX_TIMSORT_USE_CXX11=[0|1]. As before, we default to using C++11 constructs unless we see some evidence that it won't work. However, we now let modern versions of clang/libc++ pass.

  2. The parts of the TimSort class that don't vary based on LessFunction are now in their own set of classes, such as TimSortState.

This is partially just a cleanup I needed to make some template metaprogramming less gross. However, it's a good idea in any case. It's not unusual for a program to need to sort a type of data multiple ways, which means expanding the TimSort<RandomAccessIterator,LessFunction> template multiple times. With this change, the compiler can reuse the expansion of TimSortState<RandomAccessIterator> between them. If nothing else, this should compile faster.
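As a hedged illustration of the ifdef decision tree mentioned in point 1 (the PR's exact branches may differ; the 20150422 cutoff corresponds to the libstdc++ snapshot that shipped with gcc 5.1):

```cpp
#include <cstddef>  // pull in the STL's identification macros first

#ifndef GFX_TIMSORT_USE_CXX11   // a -D on the command line wins
#  if __cplusplus < 201103L
#    define GFX_TIMSORT_USE_CXX11 0   // not compiling in C++11 mode at all
#  elif defined(_LIBCPP_VERSION)
#    define GFX_TIMSORT_USE_CXX11 1   // modern libc++ is fine in C++11 mode
#  elif defined(__GLIBCXX__) && __GLIBCXX__ < 20150422
#    define GFX_TIMSORT_USE_CXX11 0   // libstdc++ older than gcc 5.1
#  else
#    define GFX_TIMSORT_USE_CXX11 1   // default: assume C++11 works
#  endif
#endif
```

The key property is that each branch is a single, readable test, and an unrecognized STL falls through to the optimistic default.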

Now with those things out of the way, I could get to the meat of the change: replacing the std::vector<value_t> tmp_; with a custom TimSortMergeSpace<> type.

This class allocates itself like a vector, but the only "setter" is a move_in() method that replaces its contents via move construction (similar to what the std::back_inserter loop was doing). We don't construct elements before they're needed (even if we allocated space for them), so it will work even for non-default-constructible types. The move loop is faster than before since we don't need to re-check capacity at every insertion.

However, on C++11 we do even better: we use template-specialization to provide an alternate implementation of this data type for types that pass std::is_trivially_copyable<>. The big advantage is that we can just use std::memcpy() to refill the merge buffer. The code is also simpler in general since we don't need to worry about construction/destruction of the buffer elements.

Since a lot of the overall cost of the timsort algorithm is spent merging, making this data structure as fast as possible is important. This change makes sorting randomized sequences about 10% faster when working with trivially-copyable types.

While I was there I also replaced the std::vector<run> pending_ with my own TimSortRunStack<> type. This doesn't have the same importance for performance, but it's another place where we don't really need the full STL vector support... just a simple resizable stack. Since I was replacing one vector I thought it was more consistent to just replace both. This also removes the header dependency on <vector>.

RESULTS: "make bench" on Xcode 10.1, Mac Pro:

  • RANDOMIZED SEQUENCE [int] size=100K
    Before: 0.851
    After: 0.745

  • RANDOMIZED SEQUENCE [std::string] size=100K
    Before: 5.389
    After: 3.735

The improvement with int is due to the vector replacement. The bigger improvement with std::string is making C++11 work with the libc++ STL so that move optimizations get applied.

@@ -10,7 +10,7 @@

#include "timsort.hpp"

#if ENABLE_STD_MOVE
#if GFX_TIMSORT_USE_CXX11
Author


The ENABLE_STD_MOVE macro was actually removed by the e39ca95 change in 2017, so this #warning was erroneously always printing "disabled"

I think the cleanest thing to do is to just let the GFX_TIMSORT_USE_CXX11 define leak out of timsort.hpp (i.e. it's not #undef'd like the other macros) so it can be tested here.

@mitchblank
Author

Huh, it looks like the linux+clang environment (not sure which STL) that Travis is using doesn't have the type traits stuff in the expected place:

clang++ -I. -I/opt/brew/include -L/opt/brew/lib -Wall -Wextra -g  -std=c++11 test/test.cpp -lboost_unit_test_framework -o .bin/test-with-cpp11
In file included from test/test.cpp:11:
./timsort.hpp:430:61: error: no member named 'is_trivially_copyable' in
      namespace 'std'

...which is odd since I'm including <type_traits>. I'll have to investigate that a little more, although I might not have time until next weekend at the earliest. Hopefully others can look through this PR in the meantime even if it can't be merged yet.

@Morwenn
Member

Morwenn commented Jan 14, 2019

Concerning your issue with Travis: the Linux Clang uses libstdc++ so you need to install a matching g++ along with Clang.

Otherwise, I had made similar changes to the timsort implementation based on this one that I ship in cpp-sort: https://github.com/Morwenn/cpp-sort/blob/master/include/cpp-sort/detail/timsort.h

I only replaced the big vector with a std::unique_ptr and leveraged std::move to get the free optimizations for trivially copyable types. I once tried to optimize further with std::memcpy but there wasn't any noticeable difference, so I gave up on those changes (I guess std::memmove only costs a few additional comparisons when the ranges don't overlap). I would have contributed my improvements back, but since they heavily use C++11 and C++14 features I didn't bother :/

Now I would be interested in comparing the relative speed of our improvements to see whether I could steal some parts of yours. I will benchmark those later to see whether there is a noticeable difference.

@mitchblank
Author

Concerning your issue with Travis: the Linux Clang uses libstdc++

Yeah, though it's not clear to me which version of libstdc++ is being used. The messages indicate that gcc 4.8.2 is installed in the environment, but the __GLIBCXX__ < 20150422 check should disable C++11 if compiling against a libstdc++ older than the one that shipped with gcc 5.1. So I'm guessing it's compiled against a different STL than the gcc one. Unless _LIBCPP_VERSION is also set?

libstdc++ got std::is_trivially_copyable a bit later than the rest of <type_traits>, but it was still in 2014, so I would expect it to be available at that __GLIBCXX__ level: https://gcc.gnu.org/viewcvs/gcc/trunk/libstdc%2B%2B-v3/include/std/type_traits?view=patch&r1=216032&r2=216031&pathrev=216032

@mitchblank
Author

I guess that std::memmove only costs a few additional comparisons when the ranges don't overlap)

Well, at least the glibc implementation always copies backwards in the case that src < dst. Whether that has any impact I don't know -- probably not.

Basically performance-wise:

  • Replacing the back_inserter with std::move(): an obvious improvement once you get rid of that per-element loop. timsort<int> gets ~9% faster.
  • ...Then replacing the vector with a custom buffer: a much smaller win, but still seemingly statistically significant... probably another ~2% or so.
  • ...The std::move()->memcpy() change was the last thing I did. I wouldn't swear that it's faster, but over a bunch of manual runs of bench it seemed to give slightly better numbers on average. This one was a sub-1% improvement, though, so it's almost noise.

I left the memcpy() change in since there isn't any theoretical reason that it wouldn't be faster. My only concern is whether it's safe to do &*in_begin to convert the iterator to a pointer for any RandomAccessIterator.

I think it's probably OK here -- most of the caveats concern the case where the iterator points at end() and a debug STL build complains in operator*(), but that's not an issue here since in_begin always points to a real input element. In theory an iterator that returns a proxy object might not work... but then is it likely that std::is_trivially_copyable<std::iterator_traits<T>::value_type> is true anyway?

@mitchblank
Author

OK, still busted in travis. I'll have to set up a Ubuntu 14.04.5 LTS VM when I get a chance and see if it replicates there with clang.

@mitchblank
Author

OK, installed a VM. Issue is fairly clear.

By default 14.04 LTS came with a libstdc++ snapshot from gcc 4.8 (4.8.4-2ubuntu1~14.04.4). However, because they packaged it themselves, it has __GLIBCXX__ set to 20150426.

I know you're not supposed to rely on the exact value of __GLIBCXX__, but it's not clear what the best way is to do this feature detection in a header-only library. I'll dig around and see if I can find something reasonable to key off of.

@mitchblank
Author

It seems the documented way of doing this is to rely on _GLIBCXX_RELEASE instead, but that only exists starting with gcc 7.1, so that would have the unfortunate effect of denying C++11 usage to gcc 5/6.
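Sketched as preprocessor logic (the macro name HAVE_CXX11_TRAITS is illustrative, and the date cutoff is the gcc 5.1 snapshot value mentioned earlier in the thread):

```cpp
#include <cstddef>  // any libstdc++ header defines the version macros

#if defined(_GLIBCXX_RELEASE)
// Major GCC version; only exists from gcc 7.1 on.
#  define HAVE_CXX11_TRAITS (_GLIBCXX_RELEASE >= 5)
#elif defined(__GLIBCXX__)
// Snapshot date; unreliable because distro rebuilds change it.
#  define HAVE_CXX11_TRAITS (__GLIBCXX__ >= 20150422)
#else
#  define HAVE_CXX11_TRAITS 1  // not libstdc++; assume a modern STL
#endif
```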

@mitchblank
Author

OK, it took some ugly hacks, but it now rejects the incomplete C++11 support in the 5-year-old Ubuntu headers that Travis is using, letting the CI tests pass.

@mattreecebentley
Contributor

mattreecebentley commented Jan 14, 2019

I only replaced the big vector by a std::unique_ptr and leveraged std::move to get the free optimizations for trvially copyable types. I once try to optimize more to use std::memcpy but there wasn't any noticeable difference so I gave up on those changes

Depends on what optimization level your compiler is at. Did you test debug mode? Debug speed is just as important for many fields, including gaming. It's unclear whether you're comparing std::move to memcpy here, or memcpy to memmove.

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode)  The changes @mattreecebentley made to auto-detect C++11 were mostly good but they specifically rejected anything with `_LIBCPP_VERSION` set.  Not sure why.

Thanks for that - I think at the time there was no type traits support in libc++; that's why. But thanks for updating me -- now I have to update a bunch of my code too.

@mitchblank
Author

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With -std=c++11 -O2 -march=native:

------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
memory_memcpy        8350 ns       8344 ns      82896
memory_memmove       8341 ns       8338 ns      84025
memory_copy          8352 ns       8347 ns      83763
memory_move          8365 ns       8360 ns      83269

With -O0 they're a few percent slower but, again, all basically the same. Probably it's using generic mem*() implementations?

memory_memcpy        8753 ns       8744 ns      78766
memory_memmove       8733 ns       8724 ns      80247
memory_copy          8726 ns       8720 ns      79595
memory_move          8719 ns       8714 ns      79808

Now if you compiled with an STL that didn't understand modern type_traits you might see some wider differences... although it's still possible an optimization phase would detect the copy-loop and replace it.

@mattreecebentley
Contributor

mattreecebentley commented Jan 15, 2019

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With -std=c++11 -O2 -march=native:

Yes, I noted in the blog that clang optimizes this at -O2; I didn't realise it did so at -O0 as well. The main point is that you can't rely on the compiler. GCC only optimized it at -O3, but I brought it up and it may be fixed in a future revision.

@Morwenn
Member

Morwenn commented Jan 15, 2019

As far as I know, every STL that supports C++14 (which is my own target) already has the type traits and uses compile-time dispatch to optimize std::copy and std::move down to std::memcpy and std::memmove when the iterator types can be treated as pointers to trivially copyable types, so apart from an overlap check with std::memmove there shouldn't be much of a difference. The difference wrt std::fill vs. std::memset in the blog post is explained by the fact that libstdc++ only optimizes std::fill to std::memset for character types (there is probably a reason, but I don't know it). Moreover, libc++ recently removed the optimization for std::fill because their built-in memset wasn't constexpr (but they might bring the optimization back with std::is_constant_evaluated).

If you want cool memory optimizations, tvanslyke/timsort-cpp implements more of them than usual algorithms do, but as far as I know the whole reinterpret_cast gig in that project is UB by the standard, despite the author taking great care to avoid alignment issues.

@mitchblank
Author

Yes, I agree with @Morwenn -- as long as the STL is taking advantage of C++11 <type_traits>, then the std::copy()->memmove() transformation should always happen before the optimizer even sees the code.

A std::fill->memset() transformation requires the compiler to detect that the int-wise initializing loop can be replaced with a byte-wise memset() (which depends on the int constant matching a particular pattern), so you won't get that at -O0.

That tvanslyke timsort version is interesting... unfortunately I need a more portable C++ version (fast on C++11, but still working on older standards), which is why I'm focused on improving this gfx one. His changes to binary_sort are interesting; I was looking at the same area for possible future work.

@mattreecebentley
Contributor

Like I said, it wasn't an issue with libstdc++, it was an issue with GCC. Cheers-

c_.deallocate(c_.startp_, c_.alloc_limitp_ - c_.startp_);
}
void push_back(RandomAccessIterator const runBase, LengthType const runLen) {
struct run *nend = c_.endp_ + 1;


FYI, this results in undefined behavior the first time around, since c_.endp_ == nullptr due to being initialized by contents() : startp_(0), endp_(0), alloc_limitp_(0).

I think the easiest fix is just to check if (c_.endp_ == c_.alloc_limitp_) and compute c_.endp_ + 1 only on the else branch.
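A self-contained sketch of that fix (names like RunStack and grow() are illustrative, not the PR's): test capacity before forming endp_ + 1, so no pointer arithmetic ever happens on a null endp_.

```cpp
#include <cstddef>
#include <cstdlib>

struct run { int base; int len; };

class RunStack {
public:
    RunStack() : startp_(nullptr), endp_(nullptr), alloc_limitp_(nullptr) {}
    ~RunStack() { std::free(startp_); }
    void push_back(int base, int len) {
        // When the stack is empty all three pointers are null, so
        // endp_ == alloc_limitp_ also covers the first-call case.
        if (endp_ == alloc_limitp_)
            grow();
        endp_->base = base;
        endp_->len = len;
        ++endp_;
    }
    std::size_t size() const { return static_cast<std::size_t>(endp_ - startp_); }
    run& operator[](std::size_t i) { return startp_[i]; }
private:
    void grow() {
        std::size_t old = static_cast<std::size_t>(endp_ - startp_);
        std::size_t cap = old ? old * 2 : 8;
        run* p = static_cast<run*>(std::realloc(startp_, cap * sizeof(run)));
        startp_ = p;
        endp_ = p + old;
        alloc_limitp_ = p + cap;
    }
    run* startp_;
    run* endp_;
    run* alloc_limitp_;
};
```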
