Reenable building kineto, add CUPTI dep #305
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition. I do have some suggestions for making it better though... For recipe/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12689415146. Examine the logs at this URL for more detail.
Hmm, shouldn't my pull request start CI jobs now, or did I misunderstand the purpose of requesting open-gpu-server access?
It definitely gets overloaded...
Could you launch the CI for me?
@mgorny - just launched CI. We should skip most builds if we're still testing to save resources, but I am assuming builds are likely to pass?
Kinda — one of the three packages has been causing issues in #298, so I'm trying to find out which one it was. Feel free to cancel the osx and aarch64 builds; it happened on linux-64.
In fact, it only happened for generic blas + CUDA, which is IMO the only job that we'd need to be running here.
There's a second layer of access control in https://github.com/conda-forge/.cirun/blob/master/.access.yml, but you should be pulled in through the reference to the …
Except that this time mkl + cpu failed/crashed :-/. Though I'm not sure if it's really kineto-related or a fluke.
Well, okay, then it's a pre-existing fluke. Unfortunate, but irrelevant here. Let's try the next package.
Huh? We haven't seen whether CI passes with cupti yet. The single test failure on a non-CUDA job was irrelevant.
Ah, sorry. I got confused by the jobs being cancelled.
This reverts commit 193d481.
That was me manually pruning the list so that we only run the single job that's relevant.
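For reference, conda-forge CI jobs are generated from the YAML files under .ci_support/, so pruning that directory is what restricts the matrix. A minimal sketch, with a made-up config filename standing in for the real generic BLAS + CUDA linux-64 one:

```bash
# Sketch only: keep a single .ci_support config so CI runs just one job.
# "linux_64_blas_implgeneric_cuda_compiler_version12.yaml" is a placeholder,
# not the actual filename in this feedstock.
keep="linux_64_blas_implgeneric_cuda_compiler_version12.yaml"
for f in .ci_support/*.yaml; do
  if [ "$(basename "$f")" != "$keep" ]; then
    git rm "$f"
  fi
done
git commit -m "TEMP: restrict CI to the generic BLAS + CUDA linux-64 job"
# Revert this commit before merging, as with the revert above.
```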
OK, perhaps I spoke too soon - I hadn't considered the possibility that … might have something to do with kineto being enabled. If the CUDA job ends up passing, we can rerun the MKL+CPU job to see whether it was a fluke or it reproduces.
Oh my, it is kineto after all! I'm going to handle the other two changes, the triton dep (#166) and the ccache instructions, in a separate pull request.
Part of that is already in the Windows PR. Could you review there?
Oh, sorry, didn't check mail in time. Will do.
Ok, so I've been able to reproduce the test failures on a GPU-enabled host, and they were failing due to CUDA running out of GPU memory. Besides that, I'm seeing a lot of:
Which leads me to the following:
Great that you managed to debug this!
Sounds reasonable to me
Perhaps …
Hmm, after some more testing: it seems that perhaps it could be circumvented by either skipping … We could also consider using …
It seems that I was overly optimistic after all, and even non-large tests eventually start running out of memory. Trying with lower …
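A sketch of the kind of knob meant here, assuming the test suite is driven through pytest-xdist; the worker count and test path are illustrative, not the values used in this recipe:

```bash
# Sketch: fewer xdist workers means fewer test processes allocating CUDA
# memory at the same time, trading GPU memory pressure for wall-clock time.
python -m pytest -v -n 2 test/test_torch.py   # "-n 2" and the path are illustrative
```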
The plot thickens: the tests were apparently passing only because of high … Should I skip them as flaky?
Yes, please skip them with a comment. Doesn't sound flaky to me though, more likely a test that has a bug in the sense that it implicitly depends on high parallelism.
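A hedged sketch of what such a skip could look like in the test invocation; the test id below is hypothetical, since the real names only appear in the CI logs:

```bash
# Hypothetical test id, shown only to illustrate the mechanism.
# Skipped with a comment, as requested: the test implicitly depends on
# high parallelism rather than being genuinely flaky.
python -m pytest -v test/test_profiler.py \
  --deselect "test/test_profiler.py::TestProfiler::test_assumes_high_parallelism"
```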
Okay, this time I've tested all 4 Python versions.
Due to the extreme runtime in emulation, and the almost non-existent variance between Python versions, this is a better trade-off than testing nothing, or being stuck in emulation for hours.
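One possible shape for that trade-off, sketched as a guard in the recipe's test script. It assumes target_platform and PY_VER are available (conda-build sets both during builds); the choice of 3.12 as the tested version is only an example:

```bash
# Sketch: under emulated linux-aarch64, run the test suite for a single
# Python version and skip it for the others to keep emulation time bounded.
if [[ "${target_platform}" == "linux-aarch64" && "${PY_VER}" != "3.12" ]]; then
  echo "Skipping emulated test run for Python ${PY_VER}"
  exit 0
fi
python -m pytest -v test/
```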
Another failure (possibly flaky)
Sigh, now we hit what looks like a random crash.
There's one failure on the re-enabled aarch64 + CUDA tests.
Given that the test is named …, I'm skipping it. The profiling test that crashed also has some specific assumptions that seem very tight, so I'll skip that one too for now.
…onda-forge-pinning 2025.01.09.11.45.06
Dammit, some more pointless failures:
Wow, the CUDA MKL build looks even worse
And it's not just accuracy problems:
This is weird, because the tests passed for me. I could imagine precision problems in the CUDA tests, since I don't have an NVIDIA GPU on my machine, but non-CUDA failures are really weird.
The new failures were CUDA-only, so that part at least is in line with your testing.
Well, that mkl failure appears to happen on CPU. Lemme try again locally on a GPU-enabled host.
Are you referring to the test class / name (i.e. … is green, whereas … is red)?
This bit:
It explicitly says it's running on CPU.
And yeah, can't reproduce on the GPU-enabled host either.
In my years of building and using pytorch/tensorflow and related scientific software, I've seen that if the host or the runner is under any kind of memory pressure, the tests can start to fail. Memory allocation failures are real, and memory not being initialized correctly can manifest as real bugs. I would:
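As a generic aside (not the specific recommendations referred to above), checking a runner for memory pressure during a test run only needs standard tools:

```bash
# Host RAM/swap and GPU memory at a glance while tests are running.
free -h
nvidia-smi --query-gpu=memory.used,memory.total --format=csv   # only meaningful on a GPU runner
```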
Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)

Fixes #76
Let's try enabling new dependencies separately, to see which one caused CI problems.