Fix illegal access mean/stdev, sum add Kahan Summation #2223
Conversation
Thanks Malte for the PR. It looks good overall. Could you please check the perf impact and share some details on the achieved accuracy?
raft::myAtomicAdd(smu + thisColId, thread_sum);
__syncthreads();
if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);
As discussed offline, we are still losing accuracy here, because we cannot do atomic compensated summation. In a follow-up PR, we should strive to improve this. A few notes:
- Within the block: instead of shared memory atomics, could we do a hierarchical reduction and keep the compensation?
- Across blocks: one could consider using a mutex to guard access. That is done in fusedl2NN, and it might make sense to sync with @mdoijade to discuss the pros and cons. Alternatively, dump the per-block values to temporary space and run a second compensated reduction over them.
I believe you would need to use extra smem here, smu[ColsPerBlk * RowsPerBlkPerIter], and store each output as something like smu[thisColId * RowsPerBlkPerIter + thisRowId] = thread_sum. Then have a single warp (warp0) sum up the RowsPerBlkPerIter values per thread with the Kahan algorithm if RowsPerBlkPerIter is small; for larger RowsPerBlkPerIter, like 32, you can use a shfl-based reduction with the Kahan algorithm applied in each of its 5 iterations.
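The idea of carrying the compensation through a tree reduction can be modeled serially like this: each lane holds a (sum, compensation) pair, and at every reduction step two pairs are merged with a two-sum, so no rounding error is dropped until the very end. This is an illustrative C++ model of the log2(width)-step pattern, not the actual shfl-based kernel:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct KahanPair {
  float sum;  // partial sum of this lane
  float c;    // accumulated rounding error (compensation)
};

// Merge two partial (sum, compensation) pairs with a two-sum step,
// as one iteration of a shuffle-based reduction would.
KahanPair kahan_merge(KahanPair a, KahanPair b) {
  float t   = a.sum + b.sum;
  float err = (std::fabs(a.sum) >= std::fabs(b.sum)) ? (a.sum - t) + b.sum
                                                     : (b.sum - t) + a.sum;
  return {t, a.c + b.c + err};
}

// Serial model of a tree reduction over `lanes.size()` lanes, e.g. the
// 5 iterations of a width-32 shuffle-down reduction when size is 32.
float tree_reduce(std::vector<KahanPair> lanes) {
  for (std::size_t stride = lanes.size() / 2; stride >= 1; stride /= 2)
    for (std::size_t i = 0; i < stride; ++i)
      lanes[i] = kahan_merge(lanes[i], lanes[i + stride]);
  return lanes[0].sum + lanes[0].c;
}
```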
Yes, within the block we can use a second shared memory atomicAdd to store the compensation. With the current blockdim, only 4 threads add their intermediate values. I tried that but decided to skip it for now, since the addition across blocks is not compensated afterwards anyway.
Suggested change:
- raft::myAtomicAdd(smu + thisColId, thread_sum);
- __syncthreads();
- if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId]);
+ __shared__ Type smu[ColsPerBlk];
+ __shared__ Type sc[ColsPerBlk];
+ if (threadIdx.x < ColsPerBlk) {
+   smu[threadIdx.x] = Type(0);
+   sc[threadIdx.x]  = Type(0);
+ }
+ __syncthreads();
+ // two-sum step: compensate the shared-memory block addition
+ {
+   const Type old_sum = atomicAdd(smu + thisColId, thread_sum);
+   const Type t       = old_sum + thread_sum;
+   if (abs(old_sum) >= abs(thread_sum)) {
+     thread_c += (old_sum - t) + thread_sum;
+   } else {
+     thread_c += (thread_sum - t) + old_sum;
+   }
+   raft::myAtomicAdd(sc + thisColId, thread_c);
+ }
+ __syncthreads();
+ if (threadIdx.x < ColsPerBlk && colId < D) raft::myAtomicAdd(mu + colId, smu[thisColId] + sc[thisColId]);
}
Suggested change:
- Type acc = BlockReduce(temp_storage).Sum(thread_data);
+ thread_sum += thread_c;
+ Type acc = BlockReduce(temp_storage).Sum(thread_sum);
This is not compensated, right?
The BlockReduce is not, which is why the compensation is added to the value beforehand.
Thanks @tfeher and @mdoijade for the review. I did run a comparison between different approaches to summing up a large array of constant values:
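(The measured numbers are not preserved in this thread.) A minimal CPU model of the strategy this PR lands on, per-chunk Kahan-compensated partial sums whose compensation is folded in before a plain, uncompensated final reduction, could look like the sketch below. The chunk count and array contents are arbitrary illustrative choices, not the PR's benchmark:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Each "thread" (chunk) accumulates its slice with Neumaier/Kahan
// compensation, then folds the compensation into its partial sum.
// The partials are combined with a plain reduction, mirroring
// `thread_sum += thread_c;` before the uncompensated BlockReduce.
float chunked_kahan_sum(const std::vector<float>& xs, std::size_t chunks) {
  std::vector<float> partial(chunks, 0.0f);
  const std::size_t len = (xs.size() + chunks - 1) / chunks;
  for (std::size_t t = 0; t < chunks; ++t) {
    float sum = 0.0f, c = 0.0f;
    const std::size_t end = std::min(xs.size(), (t + 1) * len);
    for (std::size_t i = t * len; i < end; ++i) {
      const float x = xs[i];
      const float s = sum + x;
      c += (std::fabs(sum) >= std::fabs(x)) ? (sum - s) + x : (x - s) + sum;
      sum = s;
    }
    partial[t] = sum + c;  // fold compensation into the partial
  }
  float total = 0.0f;  // plain reduction over the (few) partials
  for (float p : partial) total += p;
  return total;
}
```

Because each partial is already accurate and only a small number of partials remain, the final uncompensated reduction loses little accuracy compared to a fully naive sum.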
Thanks Malte for the updates! Also many thanks to @mdoijade for the suggestions for improvement. I think we should add those in a follow-up PR and settle for the current state for this release, since it already brings a significant accuracy improvement. LGTM!
/merge
This PR addresses #2204 and #2205.
FYI, @tfeher