added vectorization to generate_n #6215

Johan511 · 2023-03-29T19:02:43Z

added vectorization to generate_n

hpx::generate_n calls std::generate_n. In parallel mode it splits the work into chunks and calls generate_n on each chunk.
Previously no execution policy was passed to std::generate_n, so it defaulted to seq. This PR changes that and passes seq or unseq, depending on the hpx::execution policy specified by the user.
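To illustrate the idea, here is a minimal sketch of selecting a sequential or an unsequenced kernel from the user's policy. It is not the code in this PR: the policy structs, `is_unsequenced`, `generate_chunk` and `generate_n_sketch` are hypothetical stand-ins written in plain standard C++, and the unsequenced kernel is only a plain indexed loop where a real implementation would add a vectorization hint.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the execution policies (not HPX types).
struct seq_policy {};
struct unseq_policy {};
struct par_policy {};
struct par_unseq_policy {};

// Hypothetical trait: does the policy permit unsequenced execution?
template <typename Policy>
struct is_unsequenced : std::false_type {};
template <>
struct is_unsequenced<unseq_policy> : std::true_type {};
template <>
struct is_unsequenced<par_unseq_policy> : std::true_type {};

// Sequential kernel: defer to std::generate_n.
template <typename Iter, typename F>
Iter generate_chunk(std::false_type, Iter first, std::size_t count, F&& f)
{
    return std::generate_n(first, count, f);
}

// Unsequenced kernel (assumes random-access iterators): a plain indexed
// loop. A real implementation would annotate it with a vectorization hint.
template <typename Iter, typename F>
Iter generate_chunk(std::true_type, Iter first, std::size_t count, F&& f)
{
    using diff_t = typename std::iterator_traits<Iter>::difference_type;
    diff_t const n = static_cast<diff_t>(count);
    for (diff_t i = 0; i != n; ++i)
        first[i] = f();
    return first + n;
}

// Tag dispatch: pick the kernel from the policy type at compile time.
template <typename Policy, typename Iter, typename F>
Iter generate_n_sketch(Policy, Iter first, std::size_t count, F f)
{
    return generate_chunk(
        typename is_unsequenced<Policy>::type{}, first, count, f);
}
```

With these stand-ins, `generate_n_sketch(par_unseq_policy{}, v.begin(), v.size(), [] { return 42; })` takes the unsequenced kernel, while `seq_policy` and `par_policy` take the `std::generate_n` path.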

par_unseq: (graph not preserved)

par: (graph not preserved)

hkaiser · 2023-03-31T19:40:54Z

@Johan511 could you create graphs that use the same y-axis limits, please?

hkaiser · 2023-03-31T19:52:10Z

The code you proposed touches on sequential operations only. Could you measure the sequential speedup as well?

Johan511 · 2023-04-01T04:22:51Z

The change applies to par_unseq too, because the parallel version of generate_n works by calling the sequential generate on each chunk. I will post speedups for unseq soon.
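A rough sketch of that chunking, under stated assumptions: `for_each_chunk` and `chunked_generate_n` are hypothetical names, and the chunks are processed one after another to keep the example dependency-free, whereas a parallel runtime would schedule each chunk as its own task. The point is only that whatever kernel runs per chunk (seq or unseq) is what every chunk of the parallel version executes.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: split [first, first + count) into at most
// num_chunks contiguous chunks and call kernel(chunk_first, chunk_size)
// on each one. Assumes random-access iterators.
template <typename Iter, typename Kernel>
void for_each_chunk(
    Iter first, std::size_t count, std::size_t num_chunks, Kernel kernel)
{
    if (count == 0 || num_chunks == 0)
        return;

    using diff_t = typename std::iterator_traits<Iter>::difference_type;

    // Ceiling division so that every element lands in some chunk.
    std::size_t const chunk_size = (count + num_chunks - 1) / num_chunks;

    for (std::size_t offset = 0; offset < count; offset += chunk_size)
    {
        std::size_t const n = std::min(chunk_size, count - offset);
        // In a parallel runtime this call would be one task per chunk.
        kernel(first + static_cast<diff_t>(offset), n);
    }
}

// Hypothetical chunked generate_n: each chunk runs the sequential kernel.
template <typename Iter, typename F>
void chunked_generate_n(
    Iter first, std::size_t count, std::size_t num_chunks, F f)
{
    for_each_chunk(first, count, num_chunks,
        [&f](Iter chunk_first, std::size_t n) {
            std::generate_n(chunk_first, n, f);
        });
}
```

For 10 elements and 3 requested chunks, this sketch produces chunks of 4, 4 and 2 elements.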

Johan511 · 2023-04-08T23:38:39Z

Discarded runs that took more than 0.4 ms.

unseq:
mean: 0.15

seq: 0.2
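For reference, a minimal sketch of how such a filtered mean could be computed. `mean_below_cutoff` is a hypothetical helper, not the script used for the numbers above, and the sample values in the usage note are made up.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: mean of all samples that do not exceed `cutoff`.
// Returns std::nullopt when every sample was discarded.
std::optional<double> mean_below_cutoff(
    std::vector<double> const& samples, double cutoff)
{
    double sum = 0.0;
    std::size_t kept = 0;

    for (double s : samples)
    {
        if (s <= cutoff)    // discard only runs that took more than cutoff
        {
            sum += s;
            ++kept;
        }
    }

    if (kept == 0)
        return std::nullopt;
    return sum / static_cast<double>(kept);
}
```

For example, `mean_below_cutoff({0.25, 0.125, 1.0}, 0.5)` drops the 1.0 sample and averages the remaining two. Note that a fixed cutoff can bias the comparison if one variant has more slow runs than the other.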

Johan511 · 2023-04-09T06:48:18Z

Please note that the performance gains are not very significant.
The gains are minimal because std::generate_n already performs very similarly when compiled with the -O3 flag. Google Benchmark results are attached.

Without -O3 flag

Benchmark            Time         CPU          Iterations
BM_gen_n_par         6889328 ns   6888372 ns   76
BM_gen_n_par_unseq   2665548 ns   2665325 ns   257

With -O3 flag

Benchmark            Time         CPU          Iterations
BM_gen_n_par         124210 ns    124210 ns    4038
BM_gen_n_par_unseq   159027 ns    159020 ns    4949

hkaiser · 2023-04-09T14:56:02Z

> Please note that the performance gains are actually not very significant. Reason for minimal performance gains is because std::generate_n has very similar performance when compiled with -O3 flag. Google bench results are attached.
>
> Without -O3 flag
>
> Benchmark            Time         CPU          Iterations
> BM_gen_n_par         6889328 ns   6888372 ns   76
> BM_gen_n_par_unseq   2665548 ns   2665325 ns   257
>
> With -O3 flag
>
> Benchmark            Time         CPU          Iterations
> BM_gen_n_par         124210 ns    124210 ns    4038
> BM_gen_n_par_unseq   159027 ns    159020 ns    4949

You should always enable all optimizations for performance measurements.

Johan511 · 2023-04-09T15:05:25Z

The -O3 flag seems to try to vectorize most loops. Should I try compiling HPX with the -O2 flag and compare the performance of the vectorized and non-vectorized versions?

Oftentimes the gains from explicit vectorization seem to be minimal, as -O3 already seems to vectorize the loops.

hkaiser · 2023-05-03T13:38:28Z

@Johan511 could you please rebase this onto master, now that the release is out?

StellarBot · 2023-05-16T22:03:36Z

Performance test report

HPX Performance

Comparison

BENCHMARK	FORK_JOIN_EXECUTOR	PARALLEL_EXECUTOR	SCHEDULER_EXECUTOR
For Each	-	??	-

Info

Property	Before	After
HPX Datetime	2023-05-10T12:07:53+00:00	2023-05-16T21:41:46+00:00
HPX Commit	`dcb5415`	`3f93250`
Clustername	rostam	rostam
Datetime	2023-05-10T14:50:18.616050-05:00	2023-05-16T17:00:01.775607-05:00
Compiler	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname	medusa08.rostam.cct.lsu.edu	medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK	NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch	-

Info

Property	Before	After
HPX Datetime	2023-05-10T12:07:53+00:00	2023-05-16T21:41:46+00:00
HPX Commit	`dcb5415`	`3f93250`
Clustername	rostam	rostam
Datetime	2023-05-10T14:52:35.047119-05:00	2023-05-16T17:02:24.950778-05:00
Compiler	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname	medusa08.rostam.cct.lsu.edu	medusa08.rostam.cct.lsu.edu
Envfile

Comparison

BENCHMARK	FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR	PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR	SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add	(=)	(=)	(=)
Stream Benchmark - Scale	(=)	(=)	(=)
Stream Benchmark - Triad	(=)	(=)	(=)
Stream Benchmark - Copy	(=)	(=)	(=)

Info

Property	Before	After
HPX Datetime	2023-05-10T12:07:53+00:00	2023-05-16T21:41:46+00:00
HPX Commit	`dcb5415`	`3f93250`
Clustername	rostam	rostam
Datetime	2023-05-10T14:52:52.237641-05:00	2023-05-16T17:02:44.582158-05:00
Compiler	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1	/opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname	medusa08.rostam.cct.lsu.edu	medusa08.rostam.cct.lsu.edu
Envfile

Explanation of Symbols

Symbol	MEANING
=	No performance change (confidence interval within ±1%)
(=)	Probably no performance change (confidence interval within ±2%)
(+)/(-)	Very small performance improvement/degradation (≤1%)
+/-	Small performance improvement/degradation (≤5%)
++/--	Large performance improvement/degradation (≤10%)
+++/---	Very large performance improvement/degradation (>10%)
?	Probably no change, but quite large uncertainty (confidence interval within ±5%)
??	Unclear result, very large uncertainty (±10%)
???	Something unexpected…

hkaiser · 2023-07-23T21:04:10Z

inspect was reporting:

/libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp

*I* missing #include (type_traits) for symbol std::true_type on line 57

Please rebase one more time to pull in all changes from master.

srinivasyadav18 · 2023-10-21T18:10:42Z

@hkaiser I have rebased and added unit tests to ensure that the generate_n algorithm works with the unseq and par_unseq execution policies.

As for performance, seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from using unseq.
However, compared with -fno-tree-vectorize -O3, which disables auto-vectorization, there is a 3-5x speedup.

hkaiser · 2023-10-23T14:49:02Z

retest lsu

hkaiser · 2023-10-23T14:50:07Z

> And for the performance, both seq and unseq are almost generating same assembly with Release Mode (uses -O3). Because compiler is able to vectorize the loops in seq mode also, there is seems to be no extra gains using unseq.
> However, if compared with -fno-tree-vectorize -O3 which disables auto vectorization, there is a 3-5x speed up.

Can we construct test cases where the compiler is not able to vectorize things on its own?

Johan511 requested review from aurianer, biddisco, hkaiser and msimberg as code owners March 29, 2023 19:02

hkaiser added type: enhancement type: compatibility issue category: algorithms labels Mar 31, 2023

hkaiser added this to the 1.10.0 milestone Mar 31, 2023

Johan511 force-pushed the generate_n-par_unseq branch from 1234486 to 5e4f001 Compare May 16, 2023 21:41

added vectorisation to generate_n

c0d05c9

Signed-off-by: Johan511 <[email protected]>

srinivasyadav18 force-pushed the generate_n-par_unseq branch from 5e4f001 to c0d05c9 Compare October 21, 2023 16:44

srinivasyadav18 added 2 commits October 21, 2023 12:09

Add missing #include <type_traits>

b0686fd

Add generaten unseq unit tests

07fe128

msimberg removed request for msimberg, biddisco and aurianer November 1, 2023 08:36

hkaiser removed this from the 1.10.0 milestone May 3, 2024

hkaiser added this to the 1.11.0 milestone May 3, 2024
