Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Replace std::vector; also fix clang C++11
In my application I need to sort a vector of pointers. I was profiling timsort and was surprised at how much time was being spent in copy_to_tmp() This method was implemented as: tmp_.clear(); tmp_.reserve(len); GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_)); Unfortunately this performs badly on simple types like pointers. THe GFX_TIMSORT_MOVE_RANGE() macro itself can reduce to a single memmove() but using std::back_inserter (which is required for move-only types) breaks this optimization. Instead of a nice SSE-optimized memory copy you end up with an element-by-element loop with a vector capacity check each time. As an experiment I did some metaprogramming so that trivially_constructable types would just do: tmp_.assign(begin, begin + len); This basiscally fixed the problem, but it's still not *quite* perfect, since the non-trivial case still is doing extra capacity checks that aren't required. I did a more aggressive fix that replaced the std::vector use entirely with a custom datastructure (since we don't really need a general purpose vector here, just a managed array) which bought another couple percent in speed. First I had to make two prepratory changes, though: 1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode) The changes @mattreecebentley made to auto-detect C++11 were mostly good but they specifically rejected anything with "_LIBCPP_VERSION" set. Not sure why. Rather than one large binary expression I instead used a larger (but hopefully easier to understand) ifdef decision tree. This can also be overridden on the command-line with -DGFX_TIMSORT_USE_CXX11=[0|1] As before we default to using C++11 constructs unless we see some evidence that it won't work. However we now let modern versions of clang/libc++ pass. 2. The parts of the "TimSort" class that don't very based on LessFunction are now in their own set of classes such as "TimSortState" This is partially just a cleanup I needed to make some template metaprogramming less gross. However, it's a good idea in any case. It's not unusual for a program to need to sort a type of data multiple ways, which means expanding the "TimSort<RandomAccessIterator,LessFunction>" multiple times. With this change, the compiler can reuse the expansion of "TimSortState<RandomAccessIterator>" between them. If nothing else, this should compile faster. Now with that out of the way, I could get to the meat of the change: replacing the "std::vector<value_t> tmp_;" with a custom "TimSortMergeSpace<>" type. This class allocates itself like a vector, but the only "setter" is a "move_in()" method that replaces its contents via move construction (similar to what the std::back_inserter loop was doing) We don't construct elements before they're needed (even if we allocated them) so it will work even for non-default-constructable types. The move-loop is faster than before since we don't need to re-check for capacity at every insertion. However, on C++11 we do even better: we use template-specialization to provide an alternate implementation of this data type for types that pass std::is_trivially_copyable<>. The big advantage is that we can just use std::memcpy() to refill the merge buffer. The code is also simpler in general since we don't need to worry about construction/destruction of the buffer elements. Since a lot of the overall cost of the timsort algorithm is spent merging, making this data structure as fast as possible is important. This change makes soring randomized sequences about 10% faster when working with trivially-copyable types. While I was there I also replaced the "std::vector<run> pending_" with my own "TimSortRunStack<>" type. This doesn't have the same importance for performance, but it's another place where we don't really need the full STL vector support... just a simple resizable stack. Since I was replacing vector I thought it was more consistent to just replace both. This also removes the header depenedncy on <vector> RESULTS: "make bench" on Xcode 10.1, Mac Pro: RANDOMIZED SEQUENCE [int] size=100K Before: 0.851 After: 0.745 RANDOMIZED SEQUENCE [std::string] size=100K Before: 5.389 After: 3.735 The improvement with "int" is due to the vector replacement. The bigger improvement with "std::string" is making C++11 work with the libc++ STL so that move optimizations get applied.
- Loading branch information