Fix illegal acces mean/stdev, sum add Kahan Summation #2223

mfoerste4 · 2024-03-13T16:15:33Z

This PR addresses #2204 and #2205.

fixes illegal access / test coverage for mean row-wise kernel
fixes illegal access / test coverage for stdev row-wise kernel
modified sum kernels to utilize Kahan/Neumaier summation per thread, also increase load per thread to benefit from this

tfeher

Thanks Malte for the PR. It looks good, overall. Could you please check perf impact, and share some details on achieved accuracy?

cpp/test/stats/mean.cu

cpp/test/stats/minmax.cu

cpp/test/stats/stddev.cu

cpp/test/stats/sum.cu

tfeher · 2024-03-13T21:42:29Z

cpp/include/raft/stats/detail/sum.cuh

+  raft::myAtomicAdd(smu + thisColId, thread_sum);
  __syncthreads();
  if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);


As discussed offline, we are still loosing accuracy here, because we cannot do atomic compensated summation. In a follow up PR, we should strive to improve this. A few notes:

Within block: instead of shared memory atomics, could we do hierarchical reduction and keep the compensation?

Across blocks: one could consider using a mutex to guard access. That is done in fusedl2NN and it might make sense to sync with @mdoijade to discuss pros / cons. Alternatively, dump values per block to temp space, and run a second compensated reduction over them.

I believe you would need to make use of extra smem here smu[ColsPerBlk * RowsPerBlkPerIter ] then store the each outputs something like smu[ thisColId * RowsPerBlkPerIter + thisRowId ] = thread_sum , followed by per-thread working on summing up RowsPerBlkPerIter from a single warp0 with kahan algo if RowsPerBlkPerIter is small and for larger RowsPerBlkPerIter like 32 you can use shfl based reduction with kahan algo applied on each of its 5 iteration.

Yes, within the block we can use a second shared memory atomicAdd to store the compensation. With the current blockdim we only have 4 threads adding their intermediate values. I tried that but decided to skip for now until addition across blocks is not compensated afterwards.

Suggested change

raft::myAtomicAdd(smu + thisColId, thread_sum);

__syncthreads();

if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);

__shared__ Type smu[ColsPerBlk];

__shared__ Type sc[ColsPerBlk];

if (threadIdx.x < ColsPerBlk) {

smu[threadIdx.x] = Type(0);

sc[threadIdx.x] = Type(0);

}

__syncthreads();

// compensate for block addition

{

const Type old_sum = atomicAdd(smu + thisColId, thread_sum);

const Type t = block_sum + thread_sum;

if (abs(old_sum) >= abs(thread_sum)) {

thread_c += (block_sum - t) + thread_sum;

} else {

thread_c += (thread_sum - t) + block_sum;

}

raft::myAtomicAdd(sc + thisColId, thread_c);

}

__syncthreads();

if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId] + sc[thisColId]);

tfeher · 2024-03-13T21:43:17Z

cpp/include/raft/stats/detail/sum.cuh

  }
-  Type acc = BlockReduce(temp_storage).Sum(thread_data);
+  thread_sum += thread_c;
+  Type acc = BlockReduce(temp_storage).Sum(thread_sum);


This is not compensated right?

The BlockReduce is not, which is why the compensation is added to the value beforehand.

cpp/include/raft/stats/detail/sum.cuh

mfoerste4 · 2024-03-15T16:03:04Z

Thanks @tfeher and @mdoijade for the review. I did run a comparison between different approaches of summing up a large array of constant values:
https://nvidia-my.sharepoint.com/:x:/p/mfoerster/EdK2cPvlX9ZDgMHLFtSHaWUB_0UEWrXwDWhXUszEFwdYwg?e=ts7fVV
At least for this example the Blockwise compensation did not change the result at all, but this might be different in case the value decomposition is different.

tfeher

Thanks Malte for the updates! Also many thanks to @mdoijade for the suggestions for improvement. I think we shall add those in a follow up PR, and settle with the current state for this release, since it already brings significant accuracy improvement. LGTM!

cjnolet · 2024-03-16T04:25:41Z

/merge

illegal acces mean/stdev, Kahan Summation

bf53579

mfoerste4 requested a review from a team as a code owner March 13, 2024 16:15

mfoerste4 self-assigned this Mar 13, 2024

github-actions bot added the cpp label Mar 13, 2024

mfoerste4 added non-breaking Non-breaking change bug Something isn't working and removed cpp labels Mar 13, 2024

tfeher requested changes Mar 13, 2024

View reviewed changes

mdoijade reviewed Mar 15, 2024

View reviewed changes

cpp/include/raft/stats/detail/sum.cuh Outdated Show resolved Hide resolved

mfoerste4 added 2 commits March 15, 2024 15:28

add more testcases

6cc865f

review suggestion

225b6c4

Merge branch 'branch-24.04' into stat_bugs

e448e63

github-actions bot added the cpp label Mar 15, 2024

tfeher approved these changes Mar 15, 2024

View reviewed changes

rapids-bot bot merged commit 7335267 into rapidsai:branch-24.04 Mar 16, 2024
71 checks passed

mfoerste4 deleted the stat_bugs branch March 19, 2024 10:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix illegal acces mean/stdev, sum add Kahan Summation #2223

Fix illegal acces mean/stdev, sum add Kahan Summation #2223

mfoerste4 commented Mar 13, 2024

tfeher left a comment

tfeher Mar 13, 2024

mdoijade Mar 15, 2024

mfoerste4 Mar 15, 2024 •

edited

Loading

mfoerste4 Mar 15, 2024 •

edited

Loading

tfeher Mar 13, 2024

mfoerste4 Mar 15, 2024

mfoerste4 commented Mar 15, 2024

tfeher left a comment

cjnolet commented Mar 16, 2024

-  raft::myAtomicAdd(smu + thisColId, thread_sum);
-  __syncthreads();
-  if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);
+  __shared__ Type smu[ColsPerBlk];
+  __shared__ Type sc[ColsPerBlk];
+  if (threadIdx.x < ColsPerBlk) {
+    smu[threadIdx.x] = Type(0);
+    sc[threadIdx.x] = Type(0);
+  }
+  __syncthreads();
+// compensate for block addition
+  {
+    const Type old_sum = atomicAdd(smu + thisColId, thread_sum);
+    const Type t         = block_sum + thread_sum;
+    if (abs(old_sum) >= abs(thread_sum)) {
+      thread_c += (block_sum - t) + thread_sum;
+    } else {
+      thread_c += (thread_sum - t) + block_sum;
+    }
+    raft::myAtomicAdd(sc + thisColId, thread_c);
+  }
+  __syncthreads();
+  if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId] + sc[thisColId]);

Fix illegal acces mean/stdev, sum add Kahan Summation #2223

Fix illegal acces mean/stdev, sum add Kahan Summation #2223

Conversation

mfoerste4 commented Mar 13, 2024

tfeher left a comment

Choose a reason for hiding this comment

tfeher Mar 13, 2024

Choose a reason for hiding this comment

mdoijade Mar 15, 2024

Choose a reason for hiding this comment

mfoerste4 Mar 15, 2024 • edited Loading

Choose a reason for hiding this comment

mfoerste4 Mar 15, 2024 • edited Loading

Choose a reason for hiding this comment

tfeher Mar 13, 2024

Choose a reason for hiding this comment

mfoerste4 Mar 15, 2024

Choose a reason for hiding this comment

mfoerste4 commented Mar 15, 2024

tfeher left a comment

Choose a reason for hiding this comment

cjnolet commented Mar 16, 2024

mfoerste4 Mar 15, 2024 •

edited

Loading

mfoerste4 Mar 15, 2024 •

edited

Loading