Benchmark for implementations of behavior "take-in-shuffled-cycles".
Given a list: List<T>
and a number n: Int
. Algorithm needs to uniformly randomly produce one of sequences bound with following conditions:
- every element of the sequence is some element from the
list
, - if you break the sequence into sections of length
list.size
(last section can be smaller) than every section will have no repetitions (under the condition that initiallist
has no repetitions).
First idea is to concatenate enough shuffled copies of list and take enough elements. Straightforward implementation
(0 until sectionsNumber).flatMap {
list.shuffled()
}.take(n)
gets the name StraightforwardShuffledCopies
.
Idea of optimization is that for the last section one can stop shuffle algorithm when it produced enough elements. This optimization make turn out to be substantial if n << list.size
.
Also when reimplementing shuffle
in own code one can reuse intermediate collection (which turned out to be a bad idea).
The flow of this approach reminds sequence
generation.
Omitting some details we got something like this:
sequence {
while (true) {
section.reset() // whatever it means
list.indices.forEach { i ->
val j = random.nextInt(list.size - i) + i
yield(section[j])
section[j] = section[i]
// we don't need to actually set `section[i]` element since it will not be used
}
}
}.take(n)
This implementations is called ViaSequence
.
Unfortunately sequence
generation involves some overhead which is (as shown below) much bigger than the economy from optimizations, at least for single-threaded use.
So our next step is to get rid of sequence
mechanisms rewriting the algorithm in an old-fashioned style, like this:
val fullSectionsNum = n / list.size
repeat(fullSectionsNum) {
result.addAll(list.shuffled(random))
}
val incompleteSectionSize = n - fullSectionsNum * list.size
val section = list.toMutableList()
for (i in 0 until incompleteSectionSize) {
val j = random.nextInt(list.size - i) + i
result.add(section[j])
section[j] = section[i]
}
This implementation is called OptimizedShuffledCopies
.
Benchmark aspects.
- Runs numbers. Empirically found number of
4096
seems to be good enough for all purposes. list
size. We use256
as a regular value but also try64
and1024
.- Elements to take. We use this parameter in form of factor upon
list size
-take cycles
. Start with3.5
but also try0.5
(to cover the case where not all elements are used),9.5
(as a bigger number),1
and5
(to cover case when result consists of complete cycles). - Single/multi threaded.
Below used terms
seq
for sequential i.e. single-threaded andpar
for parallel i.e. using all cores available (via coroutines).
Technical details.
Start each benchmark with enough (0x10000
) JIT-warmup runs.
The results of the actual runs are not used, but optimizer doesn't punish us for that.
CPU used in actual benchmarking is i7-3770 @ 3.40 GHz (4 actual cores), OS is Win 10.
-
In single-threaded mode
ViaSequence
is always an outsider with typical factor of1.5
against competitors (sometimes more than3
). Even for its best case withtake cycles < 1
(where optimization actually works) it is slightly worse thanStraightforwardShuffledCopies
. And unsurprisingly the wort case is for integertake cycles
. Blamesequence
generation overhead. -
In the same conditions
OptimizedShuffledCopies
is noticeably better thanStraightforwardShuffledCopies
also with typical factor around1.5
. Totally the optimization seems to be a worth investment (if one is actually needed in your profile). -
For
OptimizedShuffledCopies
it turned out that for complete segments usage ofaddAll(list.shuffled())
is a little cheaper than reuse of intermediate collection and resetting it. I think that is because in the former caseSystem.arraycopy
is used. -
Everyone benefits from parallelism, but
StraightforwardShuffledCopies
does worse than others, so in a couple of tests is loses toViaSequence
. I don't have an explanation for that. What is even more interesting, ifIterable<T>.shuffled(Random)
is replaced withIterable<T>.shuffled()
this implementation becomes slower in multi-threaded mode than in single-thread (which remains the same). The reason for it is that in latter casejava.util.Random
is used ultimately. And although it is thread-safe it may become quite inefficient if intensively used from several threads simultaneously. Shuffle scenario is one of the best examples of such situation.
Long story short optimized version is always faster than straightforward one typically around 1.5
times.
But one need to write it in old-fashioned style since sequence
generation overhead eats all of the advantage (and even more).
Benchmark[mode: seq, runs: 4096, list size: 256, take cycles: 3.5]
21188 nanos -- StraightforwardShuffledCopies
29714 nanos -- ViaSequence
8809 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 64, take cycles: 3.5]
4346 nanos -- StraightforwardShuffledCopies
7230 nanos -- ViaSequence
2111 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 1024, take cycles: 3.5]
62011 nanos -- StraightforwardShuffledCopies
124952 nanos -- ViaSequence
35701 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 256, take cycles: 0.5]
3180 nanos -- StraightforwardShuffledCopies
4164 nanos -- ViaSequence
1837 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 256, take cycles: 9.5]
38166 nanos -- StraightforwardShuffledCopies
82904 nanos -- ViaSequence
23844 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 256, take cycles: 1.0]
2414 nanos -- StraightforwardShuffledCopies
8202 nanos -- ViaSequence
2372 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: seq, runs: 4096, list size: 256, take cycles: 5.0]
13789 nanos -- StraightforwardShuffledCopies
43085 nanos -- ViaSequence
12100 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 256, take cycles: 3.5]
7601 nanos -- StraightforwardShuffledCopies
8763 nanos -- ViaSequence
3397 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 64, take cycles: 3.5]
1892 nanos -- StraightforwardShuffledCopies
2686 nanos -- ViaSequence
803 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 1024, take cycles: 3.5]
19724 nanos -- StraightforwardShuffledCopies
30796 nanos -- ViaSequence
10372 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 256, take cycles: 0.5]
1186 nanos -- StraightforwardShuffledCopies
1772 nanos -- ViaSequence
657 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 256, take cycles: 9.5]
10932 nanos -- StraightforwardShuffledCopies
23662 nanos -- ViaSequence
5656 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 256, take cycles: 1.0]
796 nanos -- StraightforwardShuffledCopies
2437 nanos -- ViaSequence
1042 nanos -- OptimizedShuffledCopies
================================================================
Benchmark[mode: par, runs: 4096, list size: 256, take cycles: 5.0]
4218 nanos -- StraightforwardShuffledCopies
10963 nanos -- ViaSequence
3865 nanos -- OptimizedShuffledCopies
================================================================