
Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617

Open
maleadt opened this issue Jan 11, 2025 · 1 comment
Labels
cuda array (Stuff about CuArray) · good first issue (Good for newcomers) · hard (This is difficult) · speculative (Not sure about this one yet)

Comments


maleadt commented Jan 11, 2025

A single array may be used concurrently on different devices (when it's backed by unified memory), or just on different streams, in which case you don't want to synchronize the streams involved against each other. For example (pseudocode):

a = cu(rand(N, 2))

@async begin
  @cuda kernel(a[:, 1])
end

@async begin
  @cuda kernel(a[:, 2])
end

Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:

CUDA.jl/src/memory.jl

Lines 565 to 569 in a4a9166

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

This was identified in #2615, but note that the problem doesn't necessarily involve multiple GPUs: it also manifests when attempting to overlap kernel execution on a single device.
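
For concreteness, the scenario above can be written out as follows; the kernel body, array sizes, and use of @view are illustrative additions, not part of the original report:

using CUDA

function kernel(x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] += 1f0
    end
    return
end

N = 1024
a = cu(rand(Float32, N, 2))

t1 = @async begin
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 1])
end

t2 = @async begin
    # this launch may currently wait on the first one, because accessing `a`
    # from a different task-local stream triggers the synchronization above
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 2])
end

wait.((t1, t2))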


It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

  1. it would re-introduce the (IMO) surprising and hard-to-explain behavior of having to explicitly call synchronize() on each exit path out of an @async block just to make it possible to read the data in a valid manner;
  2. we cannot easily identify when the synchronization happens between different tasks, unless we also track which tasks operate on an array, which doesn't seem straightforward.

The first point is crucial to me. I don't want to have to explain to users that they basically can't safely use CuArrays in an @async block without having to explain the asynchronous nature of GPU computing.

To illustrate the second point:

device!(0)
a = cu(rand(N, 2))
@cuda kernel(a[:, 1])
device!(1)
@cuda kernel(a[:, 2])

# is detected the same as

device!(0)
a = cu(rand(N, 2))
@async begin
  device!(0)
  @cuda kernel(a[:, 1])
end
@async begin
  device!(1)
  @cuda kernel(a[:, 2])
end

Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is a synchronization of the task-local stream before the task ends, so that you can safely fetch values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead use a GPU version of @async that inserts this synchronization automatically? Seems like that would hurt portability, though.
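
A minimal sketch of what such a macro could look like; the name @gpu_async and its implementation are hypothetical, not an existing CUDA.jl API:

# hypothetical: an @async variant that synchronizes the task-local stream
# before the task returns, so fetched values are safe to read on the host
macro gpu_async(expr)
    quote
        @async begin
            local result = $(esc(expr))
            CUDA.synchronize()  # synchronize this task's stream before returning
            result
        end
    end
end

# usage (reusing `a`, `kernel`, and `N` from the example above):
t = @gpu_async begin
    @cuda threads=256 blocks=cld(N, 256) kernel(@view a[:, 1])
    @view a[:, 1]
end
result = fetch(t)   # the data behind `result` is ready to use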

Note that this also wouldn't entirely obviate the tracking mechanism: We still need to know which stream was last used by an array operation so that we can efficiently free the array (in a way that only synchronizes that stream and not the whole device). The same applies to tracking the owning device: We now automatically enable P2P access when accessing memory from another device.


Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.

cc @vchuravy

maleadt added the cuda array, hard, and speculative labels on Jan 11, 2025

maleadt commented Jan 11, 2025

For the easy workaround, I'd add a simple synchronized field or so to the Managed struct and have maybe_synchronize bail out if it's set. Some non-exported (as they're supposed to be temporary) interface methods like CUDA.enable_synchronization!(a::CuArray, enabled::Bool=true) could be used to toggle that.
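
A rough sketch of that workaround; the flag is named no_sync here for clarity, the Managed struct is simplified, and managed_of is a hypothetical accessor, so the real CUDA.jl internals differ:

# sketch only: simplified Managed wrapper with the proposed opt-out flag
mutable struct Managed{M}
    mem::M
    stream::CuStream
    no_sync::Bool   # the proposed opt-out flag ("synchronized or so" above)
end

function maybe_synchronize(managed::Managed)
    managed.no_sync && return   # bail out if the opt-out flag is set
    synchronize(managed.stream)
    return
end

# non-exported, intended-to-be-temporary toggle, as proposed
function enable_synchronization!(a::CuArray, enabled::Bool=true)
    managed_of(a).no_sync = !enabled   # `managed_of` is a hypothetical accessor
    return a
end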
