KernelAbstractions version of GPUArrays #525
d05e5cd to 0800eee
This PR breaks the interface to CUDA, so the Buildkite tests will fail unless I point to a specific PR (working on that now). How do I do that with Buildkite? Also, if this is merged, we might want to create a new release.
Maybe temporarily change the Pkg invocation during CI to pick up the accompanying back-end PRs: GPUArrays.jl/.buildkite/pipeline.yml Line 13 in 4623226
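A minimal sketch of what such a temporary Pkg invocation could look like, assuming the accompanying back-end PR lives on a fork branch. The URL and branch name are placeholders, not the values used in this PR:

```julia
using Pkg

# Placeholder fork URL and branch name (hypothetical); substitute the
# repository and branch of the accompanying back-end PR.
backend_pr = Pkg.PackageSpec(url = "https://github.com/<user>/CUDA.jl", rev = "<branch>")

Pkg.develop(path = ".")  # use the GPUArrays checkout under test
Pkg.add(backend_pr)      # install the back-end from the PR branch instead of a release
```

Once the back-end PR is merged and released, the invocation can go back to installing the registered package.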
Ok, can't quite figure out the oneAPI test failures. It is breaking here: https://github.com/leios/GPUArrays.jl/blob/yoyoyo_rebase_time/test/testsuite/statistics.jl#L66 (and at a similar line above). Here is the compare function:
and here are the relevant
So I took the bodies from these and just pasted them into the test:
This returns […]. So now I don't understand why the […]. Also, I am just testing this by throwing it against CI because I couldn't find an Intel GPU...
cc5662a to 1fd096b
oneAPI passes the tests now. I don't know why. The best I can figure is that we are somehow no longer triggering JuliaGPU/oneAPI.jl#442 (which seems to be hardware dependent).
Quick note on performance regressions. I ran all the tests on this branch (blue) and plotted them here against master (orange / red). In general, they run at the same speed. I also reran separately the cases where master was faster and found that those tests are still generally the same speed. I can look into this in more detail by automating the process, but I think it might be better to come up with a specific test case where KA is almost certainly slower than the current GPUArrays DSL. It would be a good idea to list all the reasons why KA could be slow in an issue or something so we can tackle them.
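If the comparison does get automated, it could be as simple as the sketch below. The helper name `regressions`, the 10% tolerance, and all timings are made up for illustration; real numbers would come from running the test suite on each branch.

```julia
# Hypothetical helper: return the tests that are slower on the branch than on
# master by more than `tol` (relative), sorted with the worst slowdown first.
function regressions(master::Dict{String,Float64}, branch::Dict{String,Float64}; tol = 0.10)
    slower = Pair{String,Float64}[]
    for (name, t_master) in master
        haskey(branch, name) || continue          # skip tests missing on the branch
        ratio = branch[name] / t_master           # > 1 means the branch is slower
        ratio > 1 + tol && push!(slower, name => ratio)
    end
    return sort!(slower; by = last, rev = true)
end

# Made-up timings in seconds, keyed by placeholder test names.
master = Dict("broadcast" => 1.00, "mapreduce" => 2.00, "random" => 0.50)
branch = Dict("broadcast" => 1.02, "mapreduce" => 2.60, "random" => 0.49)
slow = regressions(master, branch)
```

Anything that shows up in `slow` would be a candidate for the list of reasons why KA could be slow.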
dcd7468 to 38f4302
38f4302 to c370523
Ok, it seems like this branch works and is ready for review. There is some overall cleanup left, but I'll do that after.
Oh, I didn't realize this was ready for review. We should get this merged! About the CUDA.jl branch, leios/CUDA.jl@b085472, what is the reason this requires a separate CuArrayBackend? |
You are right. All the […]. The main thing that stalled the PR is that I couldn't figure out the CI on the CUDA side and got swamped with other things.
Now that we're past JuliaCon I should have the time to help out, so feel free to just list issues here. In parallel, I'll be looking at packaging POCL so that we can hopefully move forwards on an improved CPU back-end for KA.jl too. |
I was literally just about to create an issue in KA about that. I'll go ahead and rebase everything for this PR (and the related ones).
c370523 to c1f6283
Just rebased (also had to revert the Enzyme stuff). All tests pass locally on AMDGPU. Could we rerun the CI to make sure the errors are consistent on each backend's master / main?
c1f6283 to 93e212b
So the main problem is with […]. This one doesn't work because we need to read in an ndrange when doing the config.
The tests passed earlier because we weren't calling the right […].
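For context on why the ndrange matters here: a launch configuration cannot pick a group count without knowing the total number of work items. A minimal CPU-only sketch of that dependency; `launch_config` and its `max_groupsize` default are hypothetical and not part of GPUArrays or KernelAbstractions:

```julia
# Hypothetical helper: derive a launch configuration from the total number of
# work items (ndrange) and an upper bound on the workgroup size.
function launch_config(ndrange::Integer; max_groupsize::Integer = 256)
    ndrange > 0 || throw(ArgumentError("ndrange must be positive"))
    groupsize = min(ndrange, max_groupsize)  # never more threads per group than elements
    ngroups = cld(ndrange, groupsize)        # round up so every element is covered
    return (; groupsize, ngroups)
end

cfg = launch_config(1000)
```

With `ndrange = 1000` and the assumed limit of 256, this yields four groups of 256 threads, so the last group is only partly used.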
7936cf7 to 3560049
Something is really wrong with Metal.jl's
It doesn't seem launch configuration related because I can see 2 threads being launched here, as expected. |
Map goes through
I might be tired, but those look right to me. I just pushed a commit to Metal that removes […].
That doesn't help, as the launch configuration seems correct: |
This looks like a miscompilation in Metal.jl. I'll investigate. For a workaround: removing the […].
Actually, found another workaround: add […].
Just to be clear, there are 2 options:
You are in favor of option 2, with the hope of tackling it in KA down the road?
Yeah. As I mentioned on Slack, I haven't noticed much of a performance improvement from grid stride loops here, and seeing how they uglify the indexing I'd prefer we just get rid of them for now until KA.jl properly supports them. In the case we need to do manual […].
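For readers weighing the indexing trade-off mentioned above, here is a CPU-only sketch that simulates a grid-stride loop next to plain one-element-per-thread indexing. The function names and `nthreads` are illustrative, and no GPU package is involved:

```julia
# Grid-stride style: each simulated thread starts at its own index and jumps
# ahead by the total thread count until it runs past the end of the array.
function scale_gridstride!(out, x, a; nthreads = 4)
    for tid in 1:nthreads                 # stands in for threads running in parallel
        i = tid
        while i <= length(x)
            out[i] = a * x[i]
            i += nthreads                 # stride equals the total number of threads
        end
    end
    return out
end

# Direct style: one thread per element, guarded by a bounds check.
function scale_direct!(out, x, a; nthreads = length(x))
    for tid in 1:nthreads
        tid <= length(x) || continue      # surplus threads do nothing
        out[tid] = a * x[tid]
    end
    return out
end

x = collect(1.0:10.0)
y_stride = scale_gridstride!(similar(x), x, 2.0)
y_direct = scale_direct!(similar(x), x, 2.0)
```

Both produce the same result; the direct form needs at least as many threads as elements, while the grid-stride form works with any thread count at the cost of the extra loop around the index.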
Running out of time to keep debugging this today, so I'll just write everything down. After removing the
So it's failing to access a
So the errors only appear when running […]. The other issue is with […]. Long story short: there's something wrong with map. I am sure it's just a stupid typo that I can't quite pick out right now:
FYI, although I haven't had the time to look into this further, the Metal miscompilation has been fixed.
Oh, great! Tbh, I had to put this on hold for August because our daycare is closed this month and I'm juggling childcare duties. I should be able to pick it back up in September.
6aae0a8 to 56fbd8f
c7b99a6 to c2bd9f4
I couldn't quite get 00c8dd4 to work, so I reverted it to see whether Metal would build. It seems like CUDA and AMDGPU (locally) both pass now, but I'm not sure what's going on with Metal and oneAPI.
The |
Congratulations @leios!
Right, tests are passing locally, but still a bunch of small things to do:
This PR supersedes #451 and should be ready for review next week. Just giving everyone a sneak peek now.
Also seems to fix #530