Replace std::vector for better speed; also fix clang C++11 #25
base: 2.x.y
Conversation
```diff
@@ -10,7 +10,7 @@
 #include "timsort.hpp"

-#if ENABLE_STD_MOVE
+#if GFX_TIMSORT_USE_CXX11
```
The `ENABLE_STD_MOVE` macro was actually removed by the e39ca95 change in 2017, so this `#warning` was erroneously always printing "disabled". I think the cleanest thing to do is to just let the `GFX_TIMSORT_USE_CXX11` define leak out of timsort.hpp (i.e. it's not `#undef`'d like the other macros) so it can be tested here.
Huh, it looks like the Linux+clang environment (not sure which STL) that Travis is using doesn't have the type-traits stuff in the expected place:

...which is odd since I'm including
Concerning your issue with Travis: the Linux Clang uses libstdc++, so you need to install a matching g++ along with Clang.

Otherwise, I had made similar changes to the timsort implementation based on this one that I ship in cpp-sort: https://github.com/Morwenn/cpp-sort/blob/master/include/cpp-sort/detail/timsort.h

I only replaced the big vector by a … Now I would be interested in comparing the relative speed of our improvements to see whether I could steal some parts of yours. I will benchmark those later to see whether there is a noticeable difference.
Yeah, it's not clear to me what version of libstdc++ is being used, though. The messages indicate that gcc 4.8.2 is installed in the environment, but the libstdc++ got
Well, at least the glibc implementation always copies backwards in the case that … Basically, performance-wise:
I left the … I think it's probably OK here -- most of the caveats about that concern the case where the iterator is pointing at …
OK, still busted in Travis. I'll have to set up an Ubuntu 14.04.5 LTS VM when I get a chance and see if it replicates there with clang.
OK, installed a VM. The issue is fairly clear. By default 14.04 LTS came with a libstdc++ snapshot from gcc 4.8 (… I know you're not supposed to rely on the exact value of …
It seems the documented way of doing this is to rely instead on …
OK, it took some ugly hacks, but it now rejects the incomplete C++11 support in the 5-year-old Ubuntu headers that Travis is using, letting the CI tests pass.
(force-pushed 351641b to 9cd4359)
Depends on what optimization level your compiler is at. Did you test debug mode? Debug speed is just as important for many fields, including gaming. It's unclear whether you're talking about comparing `std::move` to `memcpy` here, or `memcpy` to `memmove`. Some more context: https://www.plflib.org/blog.htm#notmystdfilln
Thanks for that - I think at the time there was no type-traits support in libc++, that's why. But thanks for updating me; now I have to update a bunch of my code too.
That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With …

With …

Now, if you compiled with an STL that didn't understand modern …
In my application I need to sort a vector of pointers. I was profiling timsort and was surprised at how much time was being spent in `copy_to_tmp()`. This method was implemented as:

```cpp
tmp_.clear();
tmp_.reserve(len);
GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_));
```

Unfortunately this performs badly on simple types like pointers. The `GFX_TIMSORT_MOVE_RANGE()` macro itself can reduce to a single `memmove()`, but using `std::back_inserter` (which is required for move-only types) breaks this optimization. Instead of a nice SSE-optimized memory copy, you end up with an element-by-element loop with a vector capacity check each time.

As an experiment I did some metaprogramming so that trivially-copyable types would just do:

```cpp
tmp_.assign(begin, begin + len);
```

This basically fixed the problem, but it's still not *quite* perfect, since the non-trivial case is still doing extra capacity checks that aren't required. I did a more aggressive fix that replaced the `std::vector` use entirely with a custom data structure (since we don't really need a general-purpose vector here, just a managed array), which bought another couple percent in speed.

First I had to make two preparatory changes, though:

1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode). The changes @mattreecebentley made to auto-detect C++11 were mostly good, but they specifically rejected anything with `_LIBCPP_VERSION` set. Not sure why. Rather than one large binary expression, I instead used a larger (but hopefully easier to understand) ifdef decision tree. This can also be overridden on the command line with `-DGFX_TIMSORT_USE_CXX11=[0|1]`. As before, we default to using C++11 constructs unless we see some evidence that it won't work. However, we now let modern versions of clang/libc++ pass.

2. The parts of the `TimSort` class that don't vary based on `LessFunction` are now in their own set of classes, such as `TimSortState`. This is partially just a cleanup I needed to make some template metaprogramming less gross. However, it's a good idea in any case. It's not unusual for a program to need to sort a type of data multiple ways, which means expanding the `TimSort<RandomAccessIterator,LessFunction>` template multiple times. With this change, the compiler can reuse the expansion of `TimSortState<RandomAccessIterator>` between them. If nothing else, this should compile faster.

Now with that out of the way, I could get to the meat of the change: replacing the `std::vector<value_t> tmp_;` with a custom `TimSortMergeSpace<>` type. This class allocates itself like a vector, but the only "setter" is a `move_in()` method that replaces its contents via move construction (similar to what the `std::back_inserter` loop was doing). We don't construct elements before they're needed (even if we allocated them), so it will work even for non-default-constructible types. The move loop is faster than before, since we don't need to re-check for capacity at every insertion.

However, on C++11 we do even better: we use template specialization to provide an alternate implementation of this data type for types that pass `std::is_trivially_copyable<>`. The big advantage is that we can just use `std::memcpy()` to refill the merge buffer. The code is also simpler in general, since we don't need to worry about construction/destruction of the buffer elements.

Since a lot of the overall cost of the timsort algorithm is spent merging, making this data structure as fast as possible is important. This change makes sorting randomized sequences about 10% faster when working with trivially-copyable types.

While I was there I also replaced the `std::vector<run> pending_` with my own `TimSortRunStack<>` type. This doesn't have the same importance for performance, but it's another place where we don't really need the full STL vector support -- just a simple resizable stack. Since I was replacing one vector, I thought it was more consistent to just replace both. This also removes the header dependency on `<vector>`.

RESULTS: "make bench" on Xcode 10.1, Mac Pro:

RANDOMIZED SEQUENCE [int] size=100K
Before: 0.851  After: 0.745

RANDOMIZED SEQUENCE [std::string] size=100K
Before: 5.389  After: 3.735

The improvement with `int` is due to the vector replacement. The bigger improvement with `std::string` is making C++11 work with the libc++ STL, so that move optimizations get applied.
(force-pushed 9cd4359 to 5ceb868)
Yes, I noted in the blog that clang optimizes this at -O2; I didn't realise it did it at -O0 as well. The main point is, you can't rely on the compiler. In GCC it only optimized at -O3, but I brought it up and it may be fixed in a future revision.
As far as I know, every STL that supports C++14 (which is my own target) already has the type traits and uses compile-time dispatch to optimize …

If you want cool memory optimizations, tvanslyke/timsort-cpp implements more of them than usual algorithms, but as far as I know, the whole …
Yes, I agree with @Morwenn -- as long as the STL is taking advantage of C++11, a …

That tvanslyke timsort version is interesting... unfortunately I need a more portable C++ version (fast on C++11, but still working on others), which is why I'm focused on improving this gfx one. His changes to the binary_sort are interesting; I was looking in the same area for possible future work.
Like I said, it wasn't an issue with libstdc++, it was an issue with GCC. Cheers-
(force-pushed 7faa2da to 6fd48e9)
```cpp
    c_.deallocate(c_.startp_, c_.alloc_limitp_ - c_.startp_);
}

void push_back(RandomAccessIterator const runBase, LengthType const runLen) {
    struct run *nend = c_.endp_ + 1;
```
FYI, this results in undefined behavior the first time around, since `c_.endp_ == nullptr` due to being initialized by `contents() : startp_(0), endp_(0), alloc_limitp_(0)`. I think the easiest fix is just to check `if (c_.endp_ == c_.alloc_limitp_)` and compute `c_.endp_ + 1` only on the else branch.
First, sorry that this PR contains more than one thing. I have a few other changes I want to propose for timsort, and I'll try to keep those in standalone PRs. Because of the need for C++11 support to test this, I ended up having to fix that bug as part of this performance work. I can break the libc++ changes out into a separate PR if you'd rather merge that first.