
Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617

Open
maleadt opened this issue Jan 11, 2025 · 1 comment
Labels
cuda array (Stuff about CuArray) · good first issue (Good for newcomers) · hard (This is difficult) · speculative (Not sure about this one yet)

Comments


maleadt commented Jan 11, 2025

A single array may be used concurrently on different devices (when it's backed by unified memory), or just on different streams, in which case you don't want to synchronize the streams involved against each other. For example (pseudocode):

a = cu(rand(N, 2))

@async begin
  @cuda kernel(a[:, 1])
end

@async begin
  @cuda kernel(a[:, 2])
end

Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:

CUDA.jl/src/memory.jl

Lines 565 to 569 in a4a9166

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

This was identified in #2615, but note that the problem doesn't necessarily involve multiple GPUs: it also manifests when attempting to overlap kernel execution on a single device.
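
For concreteness, the scenario above can be written out as follows; the kernel body, array sizes, and use of @view are illustrative additions, not part of the original report:

using CUDA

function kernel(x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] += 1f0
    end
    return
end

N = 1024
a = cu(rand(Float32, N, 2))

t1 = @async begin
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 1])
end

t2 = @async begin
    # this launch may currently wait on the first one, because accessing `a`
    # from a different task-local stream triggers the synchronization above
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 2])
end

wait.((t1, t2))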


It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

  1. it would re-introduce the (IMO) surprising and hard-to-explain behavior of having to explicitly call synchronize() on each exit path out of an @async block just to make it possible to read the data in a valid manner;
  2. we cannot easily identify when the synchronization happens between different tasks, unless we also track which tasks operate on an array, which doesn't seem straightforward.

The first point is crucial to me. I don't want to have to explain to users that they basically can't safely use CuArrays in an @async block without having to explain the asynchronous nature of GPU computing.

To illustrate the second point:

device!(0)
a = cu(rand(N, 2))
@cuda kernel(a[:, 1])
device!(1)
@cuda kernel(a[:, 2])

# is detected the same as

device!(0)
a = cu(rand(N, 2))
@async begin
  device!(0)
  @cuda kernel(a[:, 1])
end
@async begin
  device!(1)
  @cuda kernel(a[:, 2])
end

Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is a synchronization of the task-local stream before the task ends, so that you can safely fetch values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead use a GPU version of @async that inserts this synchronization automatically? Seems like that would hurt portability, though.
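
A minimal sketch of what such a macro could look like; the name @gpu_async and its implementation are hypothetical, not an existing CUDA.jl API:

# hypothetical: an @async variant that synchronizes the task-local stream
# before the task returns, so fetched values are safe to read on the host
macro gpu_async(expr)
    quote
        @async begin
            local result = $(esc(expr))
            CUDA.synchronize()  # synchronize this task's stream before returning
            result
        end
    end
end

# usage (reusing `a`, `kernel`, and `N` from the example above):
t = @gpu_async begin
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 1])
    @view a[:, 1]
end
result = fetch(t)   # the data behind `result` is ready to use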

Note that this also wouldn't entirely obviate the tracking mechanism: We still need to know which stream was last used by an array operation so that we can efficiently free the array (in a way that only synchronizes that stream and not the whole device). The same applies to tracking the owning device: We now automatically enable P2P access when accessing memory from another device.


Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.

cc @vchuravy

maleadt added the cuda array, hard, and speculative labels on Jan 11, 2025

maleadt commented Jan 11, 2025

For the easy workaround, I'd add a simple synchronized field or so to the Managed struct and have maybe_synchronize bail out if it's set. Some non-exported (as they're supposed to be temporary) interface methods like CUDA.enable_synchronization!(a::CuArray, enabled::Bool=true) could be used to toggle that.
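
A rough sketch of that workaround; the flag is named no_sync here for clarity, the Managed struct is simplified, and managed_of is a hypothetical accessor, so the real CUDA.jl internals differ:

# sketch only: simplified Managed wrapper with the proposed opt-out flag
mutable struct Managed{M}
    mem::M
    stream::CuStream
    no_sync::Bool   # the proposed opt-out flag ("synchronized or so" above)
end

function maybe_synchronize(managed::Managed)
    managed.no_sync && return   # bail out if the opt-out flag is set
    synchronize(managed.stream)
    return
end

# non-exported, intended-to-be-temporary toggle, as proposed
function enable_synchronization!(a::CuArray, enabled::Bool=true)
    managed_of(a).no_sync = !enabled   # `managed_of` is a hypothetical accessor
    return a
end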
