Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617
Labels: cuda array, good first issue, hard, speculative
A single array may be used concurrently on different devices (when it's backed by unified memory), or just in different streams, in which case you don't want to synchronize the different streams involved. For example (pseudocode):
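A minimal sketch of this pattern, assuming a unified-memory array shared by two tasks. The kernel `scale_column!`, the array size, and the launch configuration are hypothetical and chosen only for illustration:

```julia
using CUDA

# Hypothetical kernel: scales one column of `a` in place.
function scale_column!(a, col, factor)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(a, 1)
        @inbounds a[i, col] *= factor
    end
    return nothing
end

h = rand(Float32, 1024, 2)
# One array shared by both tasks; unified memory also allows multi-device use.
a = cu(h; unified=true)

@sync begin
    # Each Julia task launches on its own task-local stream.
    @async begin
        @cuda threads=256 blocks=4 scale_column!(a, 1, 2f0)
        synchronize()  # wait for this task's stream only
    end
    @async begin
        # Passing `a` here is an access from a different stream than the one above.
        @cuda threads=256 blocks=4 scale_column!(a, 2, 3f0)
        synchronize()
    end
end
```

The two kernels touch disjoint columns, so they could in principle overlap; whether they actually do depends on the hardware and on the synchronization discussed in this issue.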
Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:
CUDA.jl/src/memory.jl, lines 565 to 569 in a4a9166
This was identified in #2615, but note that this doesn't necessarily involve multiple GPUs, and would manifest when attempting to overlap kernel execution as well.
It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:
- ... `synchronize()` on each exit path outside of an `@async` block to even make it possible to read the data in a valid manner;

The first point is crucial to me. I don't want to have to explain to users that they basically can't safely use `CuArray`s in an `@async` block without having to explain the asynchronous nature of GPU computing.

To illustrate the second point:
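A minimal sketch of what user code might have to look like if cross-task synchronization were never automatic. The array names and sizes are assumptions, not taken from the original comment:

```julia
using CUDA

a = CUDA.ones(Float32, 1024)

t = @async begin
    b = a .+ 1f0      # queued asynchronously on the task-local stream
    # Without automatic synchronization, the task itself would need to wait
    # for its stream before returning, and do so on every exit path.
    synchronize()
    b
end

# The parent task runs on a different stream and reads the fetched array.
b = fetch(t)
result = Array(b)
```

With the current automatic behavior, the explicit `synchronize()` is not needed for correctness here, because use of `b` from the other stream is detected.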
Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is a synchronization of the task-local stream before the task ends, so that you can safely `fetch` values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead use a GPU version of `@async` that inserts this synchronization automatically? Seems like that would hurt portability, though.

Note that this also wouldn't entirely obviate the tracking mechanism: we still need to know which stream was last used by an array operation so that we can efficiently free the array (in a way that only synchronizes that stream and not the whole device). The same applies to tracking the owning device: we now automatically enable P2P access when accessing memory from another device.
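As a rough sketch of the GPU-aware `@async` idea, the helper below wraps a closure in a task and synchronizes the task-local stream before the task returns. The name `gpu_async` is hypothetical and not part of CUDA.jl, and it is written as a function taking a closure rather than as a macro only to keep the example short:

```julia
using CUDA

# Hypothetical helper (not a CUDA.jl API): run `f` in a new task and
# synchronize that task's stream before handing back the result.
function gpu_async(f)
    return @async begin
        result = f()
        CUDA.synchronize()  # synchronizes the current (task-local) stream
        result
    end
end

a = CUDA.ones(Float32, 1024)

t = gpu_async() do
    a .+ 1f0
end

b = fetch(t)  # the work on the task's stream has completed at this point
```

This only covers the normal return path; a real implementation would also have to decide what to do when `f` throws, and code using it would no longer be portable to plain `@async`.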
Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.
cc @vchuravy