Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KernelAbstractions version of GPUArrays #525

Closed
wants to merge 4 commits into from

Conversation

leios
Copy link
Contributor

@leios leios commented Mar 29, 2024

Right, tests are passing locally, but still a bunch of small things to do:

  • Rebase / squash
  • Cleanup (make sure the workflows rely on the appropriate branches for testing)
  • Docs
  • Corresponding PRs
    • AMDGPU
    • CUDA
    • Metal
    • oneAPI
  • Check for performance regressions

This PR supersedes #451 and should be ready for review next week. Just giving everyone a sneak peek now.

Also seems to fix #530

@leios leios mentioned this pull request Mar 29, 2024
4 tasks
Project.toml Outdated Show resolved Hide resolved
@leios leios force-pushed the yoyoyo_rebase_time branch from d05e5cd to 0800eee Compare March 30, 2024 14:33
@leios
Copy link
Contributor Author

leios commented Mar 30, 2024

This PR breaks the interface to CUDA, so the buildkite tests will fail unless I point to a specific PR (working on that now). How do I do that with buildkite?

Also: if this is merged, we might want to create a new release

@maleadt
Copy link
Member

maleadt commented Apr 5, 2024

Maybe temporarily change the Pkg invocation during CI to pick up the accompanying back-end PRs:

Pkg.develop(; name="CUDA")

@leios
Copy link
Contributor Author

leios commented May 22, 2024

Ok, can't quite figure out the oneAPI test failures. It is breaking here: https://github.com/leios/GPUArrays.jl/blob/yoyoyo_rebase_time/test/testsuite/statistics.jl#L66 (and at a similar line above).

Here is the compare function:

function compare(f, AT::Type{<:AbstractGPUArray}, xs...; kwargs...)
    # copy on the CPU, adapt on the GPU, but keep Ref's
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), xs)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x)
, xs)

    cpu_out = f(cpu_in...)
    gpu_out = f(gpu_in...)

    test_result(cpu_out, gpu_out; kwargs...)
end

and here are the relevant test_result functions:

function test_result(a::AbstractArray{T}, b::AbstractArray{T};
                     kwargs...) where {T<:NTuple{N,<:Number} where {N}}
    ET = eltype(T)
    ≈(reinterpret(ET, collect(a)), reinterpret(ET, collect(b)); kwargs...)
end
function test_result(as::NTuple{N,Any}, bs::NTuple{N,Any}; kwargs...) where {N}
    all(zip(as, bs)) do (a, b)
        test_result(a, b; kwargs...)
    end
end

So I took the bodies from these and just pasted them into the test:

        @testset "cov" begin
            s = 100
            @test compare(cov, AT, rand(ET, s))
            @test compare(cov, AT, rand(ET, s, 2))

            # copy of `compare` contents
            f = A->cov(A; dims=2)
            rand_array = rand(ET, s, 2)
            cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), rand_array)
            gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), rand_array)
            cpu_out = f(cpu_in)
            gpu_out = f(gpu_in)
            
            # test_result prints `true` here
            println("cov:", '\t', ET, '\t', AT, '\t', cpu_out == gpu_out, '\t',
                    test_result(cpu_out, gpu_out))
            @test compare(A->cov(A; dims=2), AT, rand(ET, s, 2))
            if ET <: Real
                @test compare(cov, AT, rand(ET(1):ET(100), s))
            end

This returns true.

So now I don't understand why the compare(...) function isn't passing, but the same lines are when just put into the function. Also interesting note: the tests all pass when the dummy array size is smaller (like 10).

Also: I am just testing this by throwing it against CI because I couldn't find an Intel GPU...

@leios
Copy link
Contributor Author

leios commented Jun 6, 2024

oneAPI passes the tests now. I don't know why. The best I can figure is that we are somehow no longer triggering JuliaGPU/oneAPI.jl#442 (which seems to be hardware dependent)

@leios
Copy link
Contributor Author

leios commented Jun 11, 2024

check

Quick note for performance regressions. I ran all the tests on this branch (blue) and plotted them here against master (orange / red). In general, they are the same speed. I also ran the cases where master was faster separately and found that these tests are still generally the same speed.

I can look into this in more detail by automating the process, but I think it might be better to come up with a specific test case where KA is almost certainly slower than the current GPUArrays DSL. It would be a good idea to list all the reasons why KA could be slow in an issue or something so we can tackle them.

@leios leios force-pushed the yoyoyo_rebase_time branch 2 times, most recently from dcd7468 to 38f4302 Compare June 27, 2024 12:37
src/host/broadcast.jl Outdated Show resolved Hide resolved
@leios leios force-pushed the yoyoyo_rebase_time branch from 38f4302 to c370523 Compare June 27, 2024 16:31
@leios leios marked this pull request as ready for review June 27, 2024 18:11
@leios
Copy link
Contributor Author

leios commented Jun 27, 2024

Ok, it seems like this branch works and is ready for review. There is some overall cleanup left, but I'll do that after.

@maleadt
Copy link
Member

maleadt commented Jul 22, 2024

Oh, I didn't realize this was ready for review. We should get this merged!

About the CUDA.jl branch, leios/CUDA.jl@b085472, what is the reason this requires a separate CuArrayBackend?

@leios
Copy link
Contributor Author

leios commented Jul 22, 2024

You are right. All the XArrayBackends can be removed. Let me mess around and test locally.

The main thing that stalled the PR is that I couldn't figure out the CI on the CUDA side and got swamped with other things.

@maleadt
Copy link
Member

maleadt commented Jul 22, 2024

Now that we're past JuliaCon I should have the time to help out, so feel free to just list issues here.

In parallel, I'll be looking at packaging POCL so that we can hopefully move forwards on an improved CPU back-end for KA.jl too.

@leios
Copy link
Contributor Author

leios commented Jul 22, 2024

I was literally just about to create an issue in KA about that. I'll go ahead and rebase everything up for this (these) PRs

@leios leios force-pushed the yoyoyo_rebase_time branch from c370523 to c1f6283 Compare July 22, 2024 09:19
@leios
Copy link
Contributor Author

leios commented Jul 22, 2024

Just rebased up (also had to revert the enzyme stuff). All tests pass locally on AMDGPU. Could we rerun the CI to make sure the errors are consistent on each backend's master / main?

@leios leios force-pushed the yoyoyo_rebase_time branch from c1f6283 to 93e212b Compare July 22, 2024 12:16
@leios
Copy link
Contributor Author

leios commented Jul 22, 2024

So the main problem is with launch_heuristic and launch_configuration as well as KernelAbstractions.launch_config. These are kinda conflicting and I'm not actually sure how to write the launch_heuristic function in CUDA/gpuarrays.jl.

This one doesn't work because we need to read in an ndrange when doing the config.

@inline function GPUArrays.launch_heuristic(::CUDABackend, f::F, args::Vararg{Any,N};
                                            elements::Int, elements_per_thread::Int) where {F,N}

    obj = f(CUDABackend())
    ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, nothing,
                                                                  nothing)

    # this might not be the final context, since we may tune the workgroupsize
    ctx = KA.mkcontext(obj, ndrange, iterspace)
    kernel = @cuda launch=false obj.f(ctx, args...)

    # launching many large blocks) lowers performance, as observed with broadcast, so cap
    # the block size if we don't have a grid-stride kernel (which would keep the grid small)
    if elements_per_thread > 1
        launch_configuration(kernel.fun)
    else
        launch_configuration(kernel.fun; max_threads=256)
    end
end

The tests passed earlier because we weren't calling the right launch_heuristic from CUDA.

@leios leios force-pushed the yoyoyo_rebase_time branch 2 times, most recently from 7936cf7 to 3560049 Compare July 23, 2024 09:05
@maleadt
Copy link
Member

maleadt commented Jul 25, 2024

Something is really wrong with Metal.jl's map! on this PR:

julia> map!(-, Metal.zeros(Float32, 1), Metal.ones(Float32, 1))
1-element MtlVector{Float32, Private}:
 -1.0

julia> map!(-, Metal.zeros(Float32, 2), Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 0.0
 0.0

julia> Metal.zeros(Float32, 2) .= .-(Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 -1.0
 -1.0

It doesn't seem launch configuration related because I can see 2 threads being launched here, as expected.

@leios
Copy link
Contributor Author

leios commented Jul 25, 2024

Map goes through launch_heuristic on Metal's side. I might have really messed up the logic somehow? AMDGPU doesn't have launch_heuristic and instead falls back to the default one defined here (in GPUArrays). It seems to be fine:

julia> map!(-, AMDGPU.zeros(Float32, 1), AMDGPU.ones(Float32, 1))
1-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0

julia> map!(-, AMDGPU.zeros(Float32, 2), AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

julia> AMDGPU.zeros(Float32, 2) .= .-(AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

I might be tired, but those look right to me. I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328

@maleadt
Copy link
Member

maleadt commented Jul 25, 2024

I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328

That doesn't help, as the launch configuration seems correct: config = (threads = 2, blocks = 1, elements_per_thread = 1), so the launch heuristic isn't to blame (and to be sure, I made it fall back to the GPUArrays definition, which didn't change a thing).

@maleadt
Copy link
Member

maleadt commented Jul 25, 2024

This looks like a miscompilation in Metal.jl. I'll investigate. For a workaround: removing the for i in 1:nelem seems to help, so maybe we should get rid of the grid-stride processing here already. elements_per_thread is going to be always 1 anyway, and it would simplify the "scary" indexing in the kernel ((J-1)*nelem + j). It would probably be best to add grid-stride indexing support to KA.jl before revisiting this in the future.

@maleadt
Copy link
Member

maleadt commented Jul 25, 2024

Actually, found another workaround: add @inbounds to the CartesianIndices constructor.

@leios
Copy link
Contributor Author

leios commented Jul 25, 2024

Just to be clear, there are 2 options:

  1. @inbounds J_c = CartesianIndices(axes(bc))[(J-1)*nelem + j]
  2. Remove the launch_heuristic / elements_per_thread approach.

You are in favor of 2 with the hopes of tackling it in KA down the road?

@maleadt
Copy link
Member

maleadt commented Jul 25, 2024

Yeah. As I mentioned on Slack, I haven't noticed much performance improvements from grid stride loops here, and seeing how it uglifies the indexing I'd prefer we just get rid of it right now until KA.jl properly supports them.

In the case we need to do manual CartesianIndex construction though, it should be annotated @inbounds, not to avoid the Metal.jl compiler bug but to avoid the unnecessary trap branch.

@leios
Copy link
Contributor Author

leios commented Jul 26, 2024

Running out of time to keep debugging this today, so I'll just write everything down.

After removing the launch_heuristic and elements_per_thread approach, I am getting errors with map for certain array sizes and also in minimum!, maximum! and extrema! for different array sizes. The issue with broadcast is weird:

BoundsError: attempt to access Base.Broadcast.Broadcasted{JLArrays.JLArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{JLArrays.JLDeviceArray{ComplexF64, 2}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}} at index [1]

So it's failing to access a Broadcasted of size (2,2) at index 1. Note that the same lines in the REPL do work:

map!(+, AMDGPU.zeros(Float32, 10,10), AMDGPU.ones(Float32, 10,1))

So the errors only appear when running ] test for broadcasting.

The other issue is with minimum!, maximum! and extrema! for different array sizes. Again, I am struggling to exactly pinpoint this issue, but it happens when sz = (10, 10, 10) and red = (1, 10, 10). It doesn't break on other array sizes. To be clear sz = (10, 10, 10) and red = (1, 1, 1) seems to work.

Long story short. There's something wrong with map. I am sure it's just a stupid typo that I can't quite pick out right now:

function Base.map!(f, dest::AnyGPUArray, xs::AbstractArray...)
    # custom broadcast, ignoring the container size mismatches
    # (avoids the reshape + view that our mapreduce impl has to do)
    indices = LinearIndices.((dest, xs...))
    common_length = minimum(length.(indices))
    common_length==0 && return

    bc = Broadcast.instantiate(Broadcast.broadcasted(f, xs...))
    if bc isa Broadcast.Broadcasted
        bc = Broadcast.preprocess(dest, bc)
    end

    # grid-stride kernel
    @kernel function map_kernel(dest, bc)
        j = @index(Global, Linear)
        @inbounds dest[j] = bc[j]
    end

    kernel = map_kernel(get_backend(dest))
    config = KernelAbstractions.launch_config(kernel, common_length, nothing)
    kernel(dest, bc; ndrange = config[1], workgroupsize = config[2])

    if eltype(dest) <: BrokenBroadcast
        throw(ArgumentError("Map operation resulting in $(eltype(eltype(dest))) is not GPU compatible"))
    end

    return dest
end

@maleadt
Copy link
Member

maleadt commented Aug 29, 2024

FYI, although I haven't had the time to look into this further, the Metal miscompilation has been fixed.

@leios
Copy link
Contributor Author

leios commented Aug 29, 2024

Oh, great! Tbh, I had to put this on hold for August because our daycare is closed this month and I'm juggling childcare duties. I should be able to pick it back up in September

@leios leios force-pushed the yoyoyo_rebase_time branch from 6aae0a8 to 56fbd8f Compare September 16, 2024 12:15
@leios leios force-pushed the yoyoyo_rebase_time branch from c7b99a6 to c2bd9f4 Compare September 16, 2024 12:56
@leios
Copy link
Contributor Author

leios commented Sep 16, 2024

I couldn't quite get 00c8dd4 to work, so I reverted it to check to see if Metal would build.

It seems like now CUDA and AMDGPU (locally) both pass, but I'm not sure what's going on with Metal and oneAPI

@maleadt
Copy link
Member

maleadt commented Sep 19, 2024

The launch_heuristic/launch_configuration is not something that can stay, though. I have some time tomorrow and next week, so I can have a look if you want.

@maleadt
Copy link
Member

maleadt commented Sep 20, 2024

#559 is looking good (all CI green), so let's close this branch on favor of it. Huge thanks to @leios for doing the grunt work!

@maleadt maleadt closed this Sep 20, 2024
@vchuravy
Copy link
Member

Congratulations @leios !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GPUArrays seeding does not generate reproducible values past a certain array size
3 participants