[FEA] Leverage CUDA 12.1 programmatic GPU core dump controls #9370

jlowe · 2023-10-03T15:27:40Z

Drivers that are CUDA 12.1+ compatible provide the ability to programmatically control GPU core dumps; see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__COREDUMP.html#group__CUDA__COREDUMP. This can remove the limitations encountered in #9238, where we cannot always programmatically control the environment variables of an executor process.

We should add native bindings to use these APIs and the ability to safely detect when they are available.
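To illustrate the "safely detect" part, here is a minimal sketch of probing the loaded driver for the core dump entry points. It is not the plugin's implementation: the helper names `load_cuda_driver` and `has_coredump_api` are hypothetical, the library name `libcuda.so.1` assumes Linux, and it assumes the driver exports the functions under their documented names (a real binding would more likely resolve them through `cuGetProcAddress`).

```python
import ctypes

# Entry points listed in the CUDA driver API "Coredump Attributes Control" group.
COREDUMP_SYMBOLS = (
    "cuCoredumpGetAttribute",
    "cuCoredumpGetAttributeGlobal",
    "cuCoredumpSetAttribute",
    "cuCoredumpSetAttributeGlobal",
)


def load_cuda_driver(name="libcuda.so.1"):
    """Return a handle to the CUDA driver library, or None if it cannot be loaded.

    The default library name is an assumption (Linux naming).
    """
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def has_coredump_api(driver):
    """Return True only if every core dump entry point resolves in the driver."""
    if driver is None:
        return False
    # ctypes raises AttributeError for a missing symbol, so hasattr is a safe probe.
    return all(hasattr(driver, symbol) for symbol in COREDUMP_SYMBOLS)
```

A caller would treat `has_coredump_api(load_cuda_driver())` returning `False` as "fall back to the environment variable approach" rather than as an error.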

mattahrens · 2023-10-03T21:55:48Z

As part of this, we can start compiling on CUDA 12.2 (R535).

GaryShen2008 · 2023-10-10T03:21:57Z

Hi @jlowe, just to double-check: even after we build JNI on CUDA 12.2, since this core dump feature is only supported by drivers that are CUDA 12.1+ compatible, your code will automatically detect an older driver version such as 12.0.x and then disable the feature. Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?

jlowe · 2023-10-10T13:42:08Z

> Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?

Yes. We will need a test pipeline against a CUDA 12.0 driver to help verify there are no regressions there.
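As a rough sketch of the version gate discussed here (an illustration under stated assumptions, not the actual plugin code): `cuDriverGetVersion` reports the supported CUDA version encoded as `1000 * major + 10 * minor`, so 12.1 is `12010`. The helper names and the `libcuda.so.1` library name are hypothetical.

```python
import ctypes

# CUDA 12.1, encoded the way cuDriverGetVersion reports it.
MIN_COREDUMP_DRIVER = 12010


def decode_driver_version(encoded):
    """Split an encoded driver version into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def query_driver_version(lib_name="libcuda.so.1"):
    """Return the encoded driver version, or None if the driver is unusable."""
    try:
        driver = ctypes.CDLL(lib_name)
        get_version = driver.cuDriverGetVersion
    except (OSError, AttributeError):
        return None
    get_version.argtypes = [ctypes.POINTER(ctypes.c_int)]
    get_version.restype = ctypes.c_int
    version = ctypes.c_int(0)
    if get_version(ctypes.byref(version)) != 0:  # 0 is CUDA_SUCCESS
        return None
    return version.value


def coredump_controls_enabled(encoded):
    """Enable the feature only on drivers that are CUDA 12.1+ compatible."""
    return encoded is not None and encoded >= MIN_COREDUMP_DRIVER
```

With this gate, a 12.0.x driver (`12000`) or a missing driver (`None`) simply leaves the feature off, which is the behavior a CUDA 12.0 test pipeline would verify.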

pxLi · 2023-10-11T01:00:33Z

> Make sure no failure on the old Driver with CUDA 12.0.x. Am I right?
>
> Yes. We will need a test pipeline against a CUDA 12.0 driver to help verify there are no regressions there.

Thanks for the clarification!
Can we also add some flags for CI to mark the cases that should be verified on older drivers? This would save a lot of resources and time, since we could enable only the tests with specific labels and skip the rest.

jlowe added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels Oct 3, 2023

mattahrens assigned jlowe Oct 3, 2023

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Oct 3, 2023

GaryShen2008 mentioned this issue Oct 10, 2023

[BUILD] Build JNI and cuDF on cuda 12.2 (Driver R535) for cuda12 classifier NVIDIA/spark-rapids-jni#1484

Closed
