Optimize merge algorithm for data sizes equal to or greater than 4M items #1933
Conversation
Force-pushed from a6164fd to d4721ca.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_merge.h (outdated review comments, resolved)
auto __scratch_idx = __global_idx / __base_diag_part;

_split_point_t __start;
if (__global_idx % __base_diag_part != 0)
We discussed offline the approach of partitioning based on SLM size and then working within the partitioned blocks in the second kernel.
One advantage of this method (beyond working within SLM for all diagonals in this kernel) would be that there is no work-item divergence from a branch and mod operation like this one. The first partitioning kernel would be lightweight, serving basically only to establish bounds for the second kernel. The second kernel would then work within SLM-loaded data, search for all diagonals within that block, and do the serial merge, with all work items doing the same work (with a possible exception for the zeroth work item). A serial sketch of this structure follows.
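To make the suggestion concrete, here is a minimal serial C++ sketch of that structure. Everything in it is an illustrative assumption rather than the PR's code: find_split_point is a classic merge-path bisection, the block size is arbitrary, and the std::vector copies stand in for SLM; on a device, each block would be one work-group staging its two sub-ranges into a local_accessor.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Merge-path bisection: for output diagonal `diag` (0 <= diag <= |a| + |b|),
// return (r1, r2) with r1 + r2 == diag such that a[0..r1) and b[0..r2) are
// exactly the first `diag` elements of the merged output (a wins ties).
template <typename T, typename Comp>
std::pair<std::size_t, std::size_t>
find_split_point(const std::vector<T>& a, const std::vector<T>& b, std::size_t diag, Comp comp)
{
    std::size_t lo = diag > b.size() ? diag - b.size() : 0;
    std::size_t hi = std::min(diag, a.size());
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (comp(b[diag - 1 - mid], a[mid]))
            hi = mid;     // a b-element precedes a[mid]: take fewer from a
        else
            lo = mid + 1; // a[mid] belongs to the first `diag` outputs
    }
    return {lo, diag - lo};
}

int main()
{
    std::vector<int> a{1, 3, 5, 7, 9, 11, 13, 15};
    std::vector<int> b{2, 4, 6, 8, 10, 12, 14, 16};
    std::vector<int> out(a.size() + b.size());

    const std::size_t block = 8; // stand-in for an SLM-sized block
    const std::size_t nblocks = (out.size() + block - 1) / block;

    // "Kernel 1": lightweight partitioning, one split point per block boundary.
    std::vector<std::pair<std::size_t, std::size_t>> sp(nblocks + 1);
    for (std::size_t g = 0; g <= nblocks; ++g)
        sp[g] = find_split_point(a, b, std::min(g * block, out.size()), std::less<int>{});

    // "Kernel 2": each block touches only its two bounded sub-ranges; on a
    // device these would be staged into SLM and every work item would run
    // the same bounded diagonal search plus serial merge, with no mod/branch.
    for (std::size_t g = 0; g < nblocks; ++g)
    {
        std::vector<int> la(a.begin() + sp[g].first, a.begin() + sp[g + 1].first);
        std::vector<int> lb(b.begin() + sp[g].second, b.begin() + sp[g + 1].second);
        std::merge(la.begin(), la.end(), lb.begin(), lb.end(), out.begin() + g * block);
    }
    return 0;
}
```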
std::forward<_ExecutionPolicy>(__exec), std::forward<_Range1>(__rng1), std::forward<_Range2>(__rng2),
    std::forward<_Range3>(__rng3), __comp);
}
else
Can we remove this else branch and its kernel? std::uint32_t has a well-defined maximum, and we know __n < 4 * 1'048'576 in this branch, so it can always be indexed with this type.
Good point. But as another solution, we could also use the std::uint16_t type for smaller data sizes.
Do you have an estimate of how much performance benefit the 16-bit indexing kernel brings? I think it would be best to weigh the impact of this kernel against the increase in JIT time. If the performance benefit is significant, then I am in favor of keeping it.
No perf profit, removed.
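For illustration, a hedged sketch of the index-type dispatch discussed in this thread; __merge_dispatch, __merge_impl, and the exact cutoffs are hypothetical stand-ins for the real submitter logic, with the 16-bit tier shown only to mark what was dropped:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical merge implementation templated on the index type used inside
// the kernel; each extra instantiation is one more kernel to JIT-compile.
template <typename _IndexT>
void __merge_impl(std::size_t __n)
{
    // ... diagonal search and serial merge, indexing with _IndexT ...
}

void __merge_dispatch(std::size_t __n)
{
    // In the small-data branch __n < 4 * 1'048'576, which always fits in
    // std::uint32_t, so no separate 64-bit kernel is needed there.
    if (__n <= std::numeric_limits<std::uint16_t>::max())
        __merge_impl<std::uint16_t>(__n); // 16-bit tier: dropped, no measured gain
    else if (__n <= std::numeric_limits<std::uint32_t>::max())
        __merge_impl<std::uint32_t>(__n);
    else
        __merge_impl<std::uint64_t>(__n);
}

int main()
{
    __merge_dispatch(std::size_t{1} << 22); // 4M items: still the 32-bit tier
    return 0;
}
```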
Force-pushed from 0eb4649 to 129898c.
…introduce new function __find_start_point_in Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes Signed-off-by: Sergey Kopienko <[email protected]>
…using __parallel_merge_submitter_large for merge data equal or greater then 4M items Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…fix compile error Signed-off-by: Sergey Kopienko <[email protected]>
…fix Kernel names Signed-off-by: Sergey Kopienko <[email protected]>
…rename template parameter names in __parallel_merge_submitter Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from d8366c3 to 58eacbf.
…fix review comment Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from 8f756f0 to 1b6cd34.
…fix review comment Signed-off-by: Sergey Kopienko <[email protected]>
In this PR we optimize the merge algorithm for data sizes equal to or greater than 4M items. The main idea is that we do two submits:

1. In the first submit we find split points on some "base" subset of the diagonals.
2. In the second submit we find split points on all the other diagonals and run a serial merge for each diagonal (as before). But when we search for the split point on the current diagonal, we set up index limits for rng1 and rng2: for these limits we load the split point data of the previous and next "base" diagonals, calculated in step (1).

Applying this approach we get a good performance profit for the biggest data sizes with float and int data types. As an additional profit, we get a significant performance boost for small and middle data sizes in the merge_sort algorithm. A serial sketch of the scheme is shown below.
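A serial C++ emulation of the two-submit scheme, for illustration only: chunk, base_step, and find_split_point_in here are stand-ins chosen for the sketch (the PR's helper is called __find_start_point_in and the real phases are SYCL kernel submissions), so this shows the idea, not the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Merge-path bisection restricted to the window between two already-known
// split points lo_sp and hi_sp; with lo_sp = {0, 0} and
// hi_sp = {a.size(), b.size()} it degenerates to the ordinary full search,
// which is how the sparse "base" diagonals are handled below.
template <typename T, typename Comp>
std::pair<std::size_t, std::size_t>
find_split_point_in(const std::vector<T>& a, const std::vector<T>& b, std::size_t diag,
                    std::pair<std::size_t, std::size_t> lo_sp,
                    std::pair<std::size_t, std::size_t> hi_sp, Comp comp)
{
    std::size_t lo = std::max(lo_sp.first, diag > hi_sp.second ? diag - hi_sp.second : std::size_t{0});
    std::size_t hi = std::min({hi_sp.first, diag, a.size()});
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        if (comp(b[diag - 1 - mid], a[mid]))
            hi = mid;     // a b-element precedes a[mid]: take fewer from a
        else
            lo = mid + 1; // a[mid] belongs to the first `diag` outputs
    }
    return {lo, diag - lo}; // (elements taken from a, elements taken from b)
}

int main()
{
    std::vector<int> a{1, 1, 4, 6, 9, 12, 13, 20};
    std::vector<int> b{2, 3, 5, 7, 8, 10, 11, 21};
    std::vector<int> out(a.size() + b.size());
    const std::size_t chunk = 2;     // outputs each "work item" merges serially
    const std::size_t base_step = 4; // every 4th diagonal is a "base" diagonal
    const std::size_t items = (out.size() + chunk - 1) / chunk;
    const std::size_t nbase = items / base_step + 1;

    // Submit 1: full-range split point search, but only on the sparse base
    // diagonals; the results play the role of the scratch data in the PR.
    std::vector<std::pair<std::size_t, std::size_t>> base_sp(nbase + 1);
    for (std::size_t j = 0; j <= nbase; ++j)
        base_sp[j] = find_split_point_in(a, b, std::min(j * base_step * chunk, out.size()),
                                         {0, 0}, {a.size(), b.size()}, std::less<int>{});

    // Submit 2: every diagonal bisects only between its neighbouring base
    // split points, then runs the serial merge of its chunk (as before).
    for (std::size_t i = 0; i < items; ++i)
    {
        const std::size_t diag = i * chunk;
        auto [r1, r2] = find_split_point_in(a, b, diag, base_sp[i / base_step],
                                            base_sp[i / base_step + 1], std::less<int>{});
        for (std::size_t k = diag; k < std::min(diag + chunk, out.size()); ++k)
            out[k] = (r1 < a.size() && (r2 >= b.size() || !(b[r2] < a[r1]))) ? a[r1++] : b[r2++];
    }
    return 0; // out now holds the stable merge of a and b
}
```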