Fix illegal access in mean/stdev, add Kahan summation to sum #2223

Merged
merged 4 commits into rapidsai:branch-24.04 from stat_bugs on Mar 16, 2024

Conversation

mfoerste4
Collaborator

This PR addresses #2204 and #2205.

  • fixes an illegal access in the row-wise mean kernel and adds test coverage
  • fixes an illegal access in the row-wise stdev kernel and adds test coverage
  • modifies the sum kernels to use Kahan/Neumaier summation per thread, and increases the load per thread to benefit from this (see the sketch below)

FYI, @tfeher
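For reference, a minimal sketch of per-thread Kahan/Neumaier compensated summation, assuming a row-major [N x D] input and one thread per output column; the kernel name and signature are illustrative, not the code in this PR:

    // Sketch only: each thread owns one column and accumulates all rows with
    // Neumaier compensation, so rounding error no longer grows with N.
    template <typename Type>
    __global__ void compensated_col_sum(const Type* data, Type* out, int D, int N)
    {
      const int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (col >= D) return;

      Type sum = Type(0);  // running sum
      Type c   = Type(0);  // compensation term
      for (int row = 0; row < N; ++row) {
        const Type v = data[row * D + col];  // row-major layout assumed
        const Type t = sum + v;
        // Neumaier: compensate with whichever operand lost low-order bits
        if (fabs(sum) >= fabs(v))
          c += (sum - t) + v;
        else
          c += (v - t) + sum;
        sum = t;
      }
      out[col] = sum + c;  // fold the compensation into the final value
    }

Increasing the number of elements each thread processes, as the PR does, amortizes the extra compensation arithmetic.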

@mfoerste4 mfoerste4 requested a review from a team as a code owner March 13, 2024 16:15
@mfoerste4 mfoerste4 self-assigned this Mar 13, 2024
@github-actions github-actions bot added the cpp label Mar 13, 2024
@mfoerste4 mfoerste4 added the non-breaking (Non-breaking change) and bug (Something isn't working) labels and removed the cpp label Mar 13, 2024
Contributor

@tfeher tfeher left a comment

Thanks Malte for the PR. It looks good overall. Could you please check the perf impact and share some details on the achieved accuracy?

cpp/test/stats/mean.cu (resolved, outdated)
cpp/test/stats/minmax.cu (resolved, outdated)
cpp/test/stats/stddev.cu (resolved)
cpp/test/stats/sum.cu (resolved)
Comment on lines +76 to 78
raft::myAtomicAdd(smu + thisColId, thread_sum);
__syncthreads();
if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);
Contributor

As discussed offline, we are still losing accuracy here, because we cannot do atomic compensated summation. In a follow-up PR we should strive to improve this. A few notes:

  • Within a block: instead of shared memory atomics, could we do a hierarchical reduction and keep the compensation?
  • Across blocks: one could consider using a mutex to guard access. That is done in fusedl2NN, and it might make sense to sync with @mdoijade to discuss pros / cons. Alternatively, dump the per-block values to temp space and run a second compensated reduction over them (see the sketch below).
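A rough sketch of the second option, assuming a temporary [num_blocks x D] buffer of per-block partial sums written by a first pass; names and layout are assumptions, not RAFT API:

    // Pass 2 of the two-pass idea: one thread per column reduces the per-block
    // partials with Neumaier compensation; no atomics are needed here because
    // each column has a single writer.
    template <typename Type>
    __global__ void reduce_block_partials(const Type* partials, Type* out, int D, int num_blocks)
    {
      const int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (col >= D) return;

      Type sum = Type(0), c = Type(0);
      for (int b = 0; b < num_blocks; ++b) {
        const Type v = partials[b * D + col];  // partial sum of block b for this column
        const Type t = sum + v;
        if (fabs(sum) >= fabs(v))
          c += (sum - t) + v;
        else
          c += (v - t) + sum;
        sum = t;
      }
      out[col] = sum + c;
    }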

Contributor

I believe you would need to use extra smem here, smu[ColsPerBlk * RowsPerBlkPerIter], and store each output as smu[thisColId * RowsPerBlkPerIter + thisRowId] = thread_sum. Then a single warp (warp 0) can sum the RowsPerBlkPerIter values per column with the Kahan algorithm if RowsPerBlkPerIter is small; for a larger RowsPerBlkPerIter such as 32, you can use a shfl-based reduction with the Kahan algorithm applied in each of its 5 iterations (see the sketch below).
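A possible shape of that shfl-based variant, assuming a full warp participates and Type is float or double; this is a sketch of the suggestion above, not code from the PR:

    // Warp-level reduction that carries a Kahan/Neumaier compensation term
    // through each of the log2(32) = 5 shuffle steps. Only lane 0's result is
    // meaningful; the other lanes hold partial values, as in a plain shfl sum.
    template <typename Type>
    __device__ Type warp_compensated_sum(Type sum, Type c)
    {
      for (int offset = 16; offset > 0; offset >>= 1) {
        const Type other_sum = __shfl_down_sync(0xffffffffu, sum, offset);
        const Type other_c   = __shfl_down_sync(0xffffffffu, c, offset);
        c += other_c;                    // merge the partner's compensation first
        const Type t = sum + other_sum;  // then add its sum with compensation
        if (fabs(sum) >= fabs(other_sum))
          c += (sum - t) + other_sum;
        else
          c += (other_sum - t) + sum;
        sum = t;
      }
      return sum + c;
    }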

Collaborator Author

@mfoerste4 mfoerste4 Mar 15, 2024

Yes, within the block we can use a second shared-memory atomicAdd to store the compensation. With the current blockdim we only have 4 threads adding their intermediate values. I tried that, but decided to skip it for now, as long as the addition across blocks is not compensated anyway.

Collaborator Author

@mfoerste4 mfoerste4 Mar 15, 2024

Suggested change

Current:

    raft::myAtomicAdd(smu + thisColId, thread_sum);
    __syncthreads();
    if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);

Proposed:

    __shared__ Type smu[ColsPerBlk];
    __shared__ Type sc[ColsPerBlk];
    if (threadIdx.x < ColsPerBlk) {
      smu[threadIdx.x] = Type(0);
      sc[threadIdx.x]  = Type(0);
    }
    __syncthreads();
    // compensate for block addition
    {
      const Type old_sum = atomicAdd(smu + thisColId, thread_sum);
      const Type t       = old_sum + thread_sum;
      if (abs(old_sum) >= abs(thread_sum)) {
        thread_c += (old_sum - t) + thread_sum;
      } else {
        thread_c += (thread_sum - t) + old_sum;
      }
      raft::myAtomicAdd(sc + thisColId, thread_c);
    }
    __syncthreads();
    if (threadIdx.x < ColsPerBlk && colId < D)
      raft::myAtomicAdd(mu + colId, smu[thisColId] + sc[thisColId]);

    }
    thread_sum += thread_c;  // fold the per-thread compensation in before the block reduce
    Type acc = BlockReduce(temp_storage).Sum(thread_sum);  // was: BlockReduce(temp_storage).Sum(thread_data)
Contributor

This is not compensated, right?

Collaborator Author

The BlockReduce is not, which is why the compensation is added to the value beforehand.

@mfoerste4
Collaborator Author

Thanks @tfeher and @mdoijade for the review. I ran a comparison between different approaches to summing up a large array of constant values:
https://nvidia-my.sharepoint.com/:x:/p/mfoerster/EdK2cPvlX9ZDgMHLFtSHaWUB_0UEWrXwDWhXUszEFwdYwg?e=ts7fVV
At least for this example, the block-wise compensation did not change the result at all, but this might be different if the values decompose differently.
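For context, a tiny host-side illustration of the kind of comparison above: summing many copies of a constant in float, naive vs Neumaier-compensated. The numbers are only indicative and are not taken from the linked sheet:

    // Naive vs compensated float summation of a constant value; compiles with
    // nvcc or any C++ compiler.
    #include <cmath>
    #include <cstdio>

    int main()
    {
      const int   n = 1 << 24;  // 16M elements
      const float v = 0.1f;     // not exactly representable in binary

      float naive = 0.0f;
      float sum = 0.0f, c = 0.0f;
      for (int i = 0; i < n; ++i) {
        naive += v;               // plain accumulation stalls once sum >> v
        const float t = sum + v;  // Neumaier-compensated accumulation
        if (std::fabs(sum) >= std::fabs(v))
          c += (sum - t) + v;
        else
          c += (v - t) + sum;
        sum = t;
      }
      const double reference = static_cast<double>(n) * static_cast<double>(v);
      std::printf("naive: %.1f  compensated: %.1f  reference: %.1f\n",
                  naive, sum + c, reference);
      return 0;
    }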

@github-actions github-actions bot added the cpp label Mar 15, 2024
Contributor

@tfeher tfeher left a comment

Thanks Malte for the updates! Also many thanks to @mdoijade for the suggestions for improvement. I think we should add those in a follow-up PR and settle for the current state in this release, since it already brings a significant accuracy improvement. LGTM!

@cjnolet
Member

cjnolet commented Mar 16, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7335267 into rapidsai:branch-24.04 Mar 16, 2024
71 checks passed
@mfoerste4 mfoerste4 deleted the stat_bugs branch March 19, 2024 10:02
Labels: bug, cpp, non-breaking
Development: successfully merging this pull request may close #2204 and #2205.

4 participants