[FEA] Leverage CUDA 12.1 programmatic GPU core dump controls #9370

Open
jlowe opened this issue Oct 3, 2023 · 4 comments
Assignees
Labels
feature request New feature or request

Comments

@jlowe
Member

jlowe commented Oct 3, 2023

Drivers that are compatible with CUDA 12.1+ can control GPU core dumps programmatically; see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__COREDUMP.html#group__CUDA__COREDUMP. This could remove the limitation encountered in #9238, where we cannot always programmatically control the environment variables of an executor process.

We should add native bindings to use these APIs and the ability to safely detect when they are available.
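
For illustration only, here is a minimal sketch of what the "safely detect" step could look like. It is written in Python with `ctypes` rather than as the native bindings proposed above, and it assumes a Linux driver library named `libcuda.so.1`. The helper names and the version constant are hypothetical. `cuDriverGetVersion` and `cuCoredumpSetAttributeGlobal` are driver API entry points; the sketch only looks them up and does not change any core dump setting.

```python
import ctypes

# Assumption: the core dump attribute APIs arrive with CUDA 12.1 drivers.
# cuDriverGetVersion reports versions as 1000 * major + 10 * minor.
MIN_COREDUMP_DRIVER_VERSION = 12010  # CUDA 12.1


def version_supports_coredump_api(driver_version: int) -> bool:
    """Pure check, so a 12.0.x driver (12000) simply disables the feature."""
    return driver_version >= MIN_COREDUMP_DRIVER_VERSION


def coredump_api_available(library_name: str = "libcuda.so.1") -> bool:
    """Return True only if the driver is new enough and exports the symbol."""
    try:
        driver = ctypes.CDLL(library_name)
    except OSError:
        return False  # no driver library could be loaded on this machine

    # Symbol lookup raises AttributeError (hasattr -> False) on drivers
    # that do not export the entry point.
    if not hasattr(driver, "cuCoredumpSetAttributeGlobal"):
        return False
    if not hasattr(driver, "cuDriverGetVersion"):
        return False

    driver.cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    driver.cuDriverGetVersion.restype = ctypes.c_int
    version = ctypes.c_int(0)
    if driver.cuDriverGetVersion(ctypes.byref(version)) != 0:  # 0 is CUDA_SUCCESS
        return False
    return version_supports_coredump_api(version.value)


if __name__ == "__main__":
    print("GPU core dump API available:", coredump_api_available())
```

Checking both the reported version and the presence of the symbol means that a build against a newer toolkit falls back quietly on an older driver instead of failing at load time.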

@jlowe added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Oct 3, 2023
@mattahrens
Collaborator

As part of this, we can start compiling on CUDA 12.2 (R535).

@GaryShen2008
Collaborator

GaryShen2008 commented Oct 10, 2023

Hi @jlowe, just to double-check: even after we build the JNI on CUDA 12.2, since this core dump feature is only supported on drivers that are CUDA 12.1+ compatible, your code will automatically detect an older driver version such as 12.0.x and then disable the feature. Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?

@jlowe
Member Author

jlowe commented Oct 10, 2023

Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?

Yes. We will need a test pipeline against a CUDA 12.0 driver to help verify there are no regressions there.

@pxLi
Collaborator

pxLi commented Oct 11, 2023

Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?

Yes. We will need a test pipeline against a CUDA 12.0 driver to help verify there are no regressions there.

Thanks for the clarification!
Can we also add some CI flags to mark the cases that should be verified on older drivers? This would save a lot of resources and time, since we could run only the tests with specific labels and skip the others.

4 participants