Support CUDA 12.2 #1320

Merged
merged 13 commits into from
Feb 10, 2024

jameslamb
Member

@jameslamb jameslamb commented Jan 11, 2024

Description

Notes for Reviewers

This is part of ongoing work to build and test packages against CUDA 12.2.2 across all of RAPIDS.

For more details see:

Planning a second round of PRs to revert these references back to a proper branch-24.{nn} release branch of shared-workflows once rapidsai/shared-workflows#166 is merged.

(created with rapids-reviser)

@github-actions github-actions bot added the conda Related to conda and conda configuration label Jan 11, 2024
@jameslamb jameslamb changed the title add CUDA 12.2 support for conda packages and wheels WIP: add CUDA 12.2 support for conda packages and wheels Jan 11, 2024
@jameslamb jameslamb changed the title WIP: add CUDA 12.2 support for conda packages and wheels WIP: (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels Jan 11, 2024
@jameslamb jameslamb changed the title WIP: (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels Jan 11, 2024
@jameslamb jameslamb marked this pull request as ready for review January 11, 2024 22:51
@jameslamb jameslamb requested a review from a team as a code owner January 11, 2024 22:51
@jakirkham jakirkham added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 5 - DO NOT MERGE Hold off on merging; see PR for details labels Jan 13, 2024
@jameslamb jameslamb changed the base branch from branch-24.02 to branch-24.04 January 22, 2024 15:39
rapids-bot bot pushed a commit that referenced this pull request Jan 25, 2024
…endencies.yaml (#1329)

Contributes to rapidsai/build-planning#13.

Updates `update-version.sh` to correctly handle RAPIDS dependencies like `cudf-cu12==24.2.*`.

This also pulls in some dependency refactoring originally added in #1320, which allows greater use of dependencies.yaml globs (and therefore less maintenance effort to support new CUDA versions).

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1329
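The kind of pin rewrite that commit message describes can be illustrated with a minimal Python sketch. This is not the contents of `update-version.sh` (a shell script whose body is not shown here); the helper names `short_version` and `bump_pins`, and the default package list, are assumptions made for illustration only.

```python
import re


def short_version(release: str) -> str:
    """Drop leading zeros from each component: '24.04' -> '24.4'."""
    return ".".join(str(int(part)) for part in release.split("."))


def bump_pins(text: str, release: str, packages=("cudf", "rmm")) -> str:
    """Rewrite pins like 'cudf-cu12==24.2.*' to point at a new release.

    `packages` is a hypothetical list of project names; an optional
    '-cu<NN>' suffix on the package name is preserved.
    """
    names = "|".join(re.escape(p) for p in packages)
    pin = re.compile(rf"\b(?P<name>(?:{names})(?:-cu\d+)?)==\d+\.\d+\.\*")
    new = short_version(release)
    return pin.sub(lambda m: f"{m.group('name')}=={new}.*", text)


print(bump_pins("cudf-cu12==24.2.*", "24.04"))
```

Matching the optional suffix in the pattern is what lets one rule cover both the conda-style name (`cudf`) and the wheel-style name (`cudf-cu12`).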
@jameslamb jameslamb changed the title (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels Support CUDA 12.2 Jan 25, 2024
@jameslamb jameslamb requested a review from jakirkham January 26, 2024 14:46
@jameslamb
Member Author

After merging in latest branch-24.04, one build (conda-python-tests (12.0.1, ubuntu22.04, arm64, 3.10, a100)) is failing like this:

self = <numba.cuda.cudadrv.driver.CtypesLinker object at 0xfffe56f4c790>
ptx = b'//\n// Generated by NVIDIA NVVM Compiler\n//\n// Compiler Build ID: CL-33567101\n// Cuda compilation tools, release ...t%rd84, %rd83, %rd82;\n\tld.global.u64 \t%rd85, [%rd74];\n\tst.global.u8 \t[%rd84], %rd85;\n\n$L__BB0_27:\n\tret;\n\n}'
name = '<cudapy-ptx>'

    def add_ptx(self, ptx, name='<cudapy-ptx>'):
        ptxbuf = c_char_p(ptx)
        namebuf = c_char_p(name.encode('utf8'))
        self._keep_alive += [ptxbuf, namebuf]
        try:
            driver.cuLinkAddData(self.handle, enums.CU_JIT_INPUT_PTX,
                                 ptxbuf, len(ptx), namebuf, 0, None, None)
        except CudaAPIError as e:
>           raise LinkerError("%s\n%s" % (e, self.error_log))
E           numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
E           ptxas application ptx input, line 9; fatal   : Unsupported .version 8.3; current version is '8.2'

/opt/conda/envs/test/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py:2830: LinkerError
------------------------------ Captured log call -------------------------------
ERROR    numba.cuda.cudadrv.driver:driver.py:396 Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR

(build link)

I don't see a similar error in other recent runs of the pr workflow here: https://github.com/rapidsai/cuspatial/actions/workflows/pr.yaml

I found numba/numba#8961 (comment) which suggests that maybe this error is about a mismatch between the driver version and the PTX numba is using. That corresponds with this warning I see in the build:

/opt/conda/envs/test/lib/python3.10/site-packages/cudf/utils/_numba.py:18: UserWarning: CUDA Toolkit is newer than CUDA driver. Numba features will not work in this configuration.

nvidia-smi output from that failing job shows that v535 of the driver is being used.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   31C    P0              32W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
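To make the suspected mismatch concrete, here is a small Python sketch that reads the `ptxas` message from the traceback and maps both PTX versions back to CUDA releases. The mapping table reflects the PTX ISA versions that shipped with CUDA 12.0 through 12.3; the function name `explain_ptx_mismatch` and the returned keys are made up for this illustration and are not part of numba or cudf.

```python
import re

# PTX ISA version that shipped with each CUDA 12.x toolkit release.
PTX_BY_CUDA = {"12.0": "8.0", "12.1": "8.1", "12.2": "8.2", "12.3": "8.3"}

ERROR_RE = re.compile(
    r"Unsupported \.version (\d+\.\d+); current version is '(\d+\.\d+)'"
)


def explain_ptx_mismatch(message: str):
    """Return the PTX versions in a ptxas error and the CUDA releases they imply.

    Returns None when the message is not a PTX version error.
    """
    match = ERROR_RE.search(message)
    if match is None:
        return None
    requested, supported = match.groups()
    cuda_by_ptx = {ptx: cuda for cuda, ptx in PTX_BY_CUDA.items()}
    return {
        "ptx_requested": requested,
        "ptx_supported": supported,
        # CUDA release whose toolkit emits the requested PTX version
        "toolkit_guess": cuda_by_ptx.get(requested),
        # CUDA release matching what the driver can consume
        "driver_guess": cuda_by_ptx.get(supported),
    }


msg = ("ptxas application ptx input, line 9; fatal   : "
       "Unsupported .version 8.3; current version is '8.2'")
print(explain_ptx_mismatch(msg))
```

Read this way, the error is consistent with PTX produced by CTK 12.3 components being handed to a driver that reports CUDA 12.2, which is what the `nvidia-smi` output shows.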

I see cuda-version=12.0 being installed.

+ cuda-version                       12.0  hffde075_2                conda-forge       21kB

... but lots of other packages are coming from CTK 12.3:

...
+ cuda-cccl_linux-aarch64              12.3.101  h579c4fd_0                          conda-forge
+ cuda-cudart-static_linux-aarch64     12.3.101  hac28a21_0                          conda-forge
+ cuda-cudart_linux-aarch64            12.3.101  hac28a21_0                          conda-forge
+ cuda-nvvm-dev_linux-aarch64          12.3.107  h579c4fd_0                          conda-forge
+ cuda-crt-dev_linux-aarch64           12.3.107  h579c4fd_0                          conda-forge
...
+ cuda-cudart-dev_linux-aarch64        12.3.101  hac28a21_0                          conda-forge
+ cuda-nvcc-dev_linux-aarch64          12.3.107  h579c4fd_0                          conda-forge
...
+ libnvjitlink                         12.3.101  hac28a21_0                          conda-forge
...
+ cuda-nvrtc                           12.3.107  hac28a21_0                          conda-forge
+ cuda-nvvm-tools                      12.3.107  hac28a21_0                          conda-forge
+ cuda-crt-tools                       12.3.107  h579c4fd_0                          conda-forge
+ cuda-nvvm-impl                       12.3.107  hac28a21_0                          conda-forge
+ nvcomp                                  3.0.5  hed029d7_0                          conda-forge
+ cuda-cudart-static                   12.3.101  hac28a21_0                          conda-forge
+ cuda-cudart                          12.3.101  hac28a21_0                          conda-forge
...
+ libcusparse                        12.2.0.103  hac28a21_0                          conda-forge
+ libcublas                            12.3.4.1  hac28a21_0                          conda-forge
+ cuda-nvcc-tools                      12.3.107  hac28a21_0                          conda-forge
+ cuda-cudart-dev                      12.3.101  hac28a21_0                          conda-forge
+ ucx                                    1.15.0  h0461c73_3                          conda-forge
+ cuda-python                            12.2.1  py310hbbb1677_0                     conda-forge
+ numba                                  0.57.1  py310h16a1930_0                     conda-forge
...
+ libcusolver                        11.5.4.101  hac28a21_0                          conda-forge
+ cuda-nvcc-impl                       12.3.107  hac28a21_0                          conda-forge
...
+ libkvikio                           24.04.00a  cuda12_240126_g4bd5378_0            rapidsai-nightly
+ librmm                             24.04.00a5  cuda12_240126_g19d8ef96_5           rapidsai-nightly
+ rmm                                24.04.00a5  cuda12_py310_240126_g19d8ef96_5     rapidsai-nightly
+ libcudf                           24.04.00a12  cuda12_240119_ge939ee19e8_12        rapidsai-nightly
+ cudf                              24.04.00a12  cuda12_py310_240119_ge939ee19e8_12  rapidsai-nightly
+ cuproj                            24.04.00a12  cuda12_py310_240126_gab9c4de8_12    /tmp/python_channel
+ libcuspatial                      24.04.00a12  cuda12_240126_gab9c4de8_12          /tmp/cpp_channel
+ cuspatial                         24.04.00a12  cuda12_py310_240126_gab9c4de8_12    /tmp/python_channel
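A check like the one done by eye above could be scripted. The following is a rough Python sketch, assuming install lines shaped like the ones in this log (`+ name version build channel`); the function names are hypothetical. It only looks at packages whose names start with `cuda-`, since libraries such as `libcusparse` and `libcusolver` follow their own version numbering, and it skips `cuda-python`, which is versioned separately.

```python
import re

# Matches lines such as "+ cuda-cudart   12.3.101  hac28a21_0  conda-forge".
INSTALL_RE = re.compile(r"^\s*\+\s+(?P<name>\S+)\s+(?P<version>\S+)")


def parse_installs(log: str) -> dict:
    """Map package name -> version for every '+' line in the log."""
    packages = {}
    for line in log.splitlines():
        match = INSTALL_RE.match(line)
        if match:
            packages[match.group("name")] = match.group("version")
    return packages


def toolkit_mismatches(packages: dict) -> dict:
    """Return cuda-* packages whose major.minor differs from cuda-version."""
    pinned = packages.get("cuda-version")
    if pinned is None:
        return {}
    skip = {"cuda-version", "cuda-python"}
    mismatched = {}
    for name, version in packages.items():
        if not name.startswith("cuda-") or name in skip:
            continue
        major_minor = ".".join(version.split(".")[:2])
        if major_minor != pinned:
            mismatched[name] = version
    return mismatched


SAMPLE = """\
+ cuda-version          12.0  hffde075_2       conda-forge
...
+ cuda-cudart       12.3.101  hac28a21_0       conda-forge
+ cuda-nvrtc        12.3.107  hac28a21_0       conda-forge
+ libcusparse     12.2.0.103  hac28a21_0       conda-forge
+ cuda-python         12.2.1  py310hbbb1677_0  conda-forge
+ numba               0.57.1  py310h16a1930_0  conda-forge
"""
print(toolkit_mismatches(parse_installs(SAMPLE)))
```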

Interestingly, it looks like the CUDA 12.2 build is correctly getting CTK 12.2 stuff.

(build link)

My first guess is that this is about a missed condition in dependencies.yaml. The 12.0 job is the only arm64 one in the test matrix. I'll investigate that.

@jameslamb
Copy link
Member Author

Ah! It looks like `cuda-version` is getting upgraded when the package is installed!

- cuda-version                             12.0  hffde075_2                          conda-forge             Cached
+ cuda-version                             12.3  h32bc705_2                          conda-forge               21k
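Spotting this kind of silent upgrade in a long transaction log can be automated. Below is a minimal Python sketch that pairs `-` (removed) and `+` (installed) lines by package name and reports the ones whose version changed. It assumes the line layout shown in this log; `version_changes` is a hypothetical helper, not something provided by conda or mamba.

```python
import re

# "-" lines are removals, "+" lines are installs; name and version follow.
CHANGE_RE = re.compile(r"^\s*(?P<sign>[-+])\s+(?P<name>\S+)\s+(?P<version>\S+)")


def version_changes(log: str) -> dict:
    """Return {name: (old_version, new_version)} for packages that were replaced."""
    removed, added = {}, {}
    for line in log.splitlines():
        match = CHANGE_RE.match(line)
        if not match:
            continue
        target = removed if match.group("sign") == "-" else added
        target[match.group("name")] = match.group("version")
    return {
        name: (removed[name], added[name])
        for name in removed
        if name in added and removed[name] != added[name]
    }


TRANSACTION = """\
- cuda-version   12.0  hffde075_2  conda-forge  Cached
+ cuda-version   12.3  h32bc705_2  conda-forge     21k
"""
print(version_changes(TRANSACTION))
```

A CI step could fail the job whenever `cuda-version` appears in the result, which would surface the problem before any test runs.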

(build link)

I suspect this might be exactly the same issue as we observed over in cucim:

@jakirkham
Member

Toggling for CI

@jakirkham jakirkham closed this Jan 29, 2024
@jakirkham jakirkham reopened this Jan 29, 2024
@jakirkham
Member

Still seeing the same issue James described above. Here's the relevant snippet from CI

  - cuda-version                             12.0  hffde075_2                          conda-forge             Cached
  + cuda-version                             12.3  h32bc705_2                          conda-forge               21kB

@bdice
Contributor

bdice commented Jan 30, 2024

> Still seeing the same issue James described above. Here's the relevant snippet from CI
>
>     - cuda-version                             12.0  hffde075_2                          conda-forge             Cached
>     + cuda-version                             12.3  h32bc705_2                          conda-forge               21kB

This should be resolved by the fixes discussed in rapidsai/build-planning#8 (comment).

@jakirkham
Member

Indeed this is looking better. Thanks Bradley! 🙏

Next up is the Pandas 2 upgrade: #1338

@bdice bdice removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Feb 10, 2024
@bdice
Contributor

bdice commented Feb 10, 2024

/merge

@rapids-bot rapids-bot bot merged commit eeab2a5 into rapidsai:branch-24.04 Feb 10, 2024
61 checks passed
rapids-bot bot pushed a commit that referenced this pull request Feb 20, 2024
Follow-up to #1320

For all GitHub Actions configs, replaces uses of the `test-cuda-12.2` branch on `shared-workflows`
with `branch-24.04`, now that rapidsai/shared-workflows#166 has been merged.

### Notes for Reviewers

This is part of ongoing work to build and test packages against CUDA 12.2 across all of RAPIDS.

For more details see:

* rapidsai/build-planning#7

*(created with `rapids-reviser`)*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #1343
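The branch swap described in that follow-up commit amounts to a targeted text replacement in the workflow files. Here is a hedged Python sketch of the idea: the function name is hypothetical, and the workflow file name in the sample (`conda-cpp-build.yaml`) is only an illustrative placeholder, not taken from this PR.

```python
import re


def retarget_shared_workflows(text: str,
                              old_ref: str = "test-cuda-12.2",
                              new_ref: str = "branch-24.04") -> str:
    """Point `uses: rapidsai/shared-workflows/...@old_ref` lines at new_ref.

    Other `uses:` entries (different repositories or refs) are left untouched.
    """
    pattern = re.compile(
        r"(uses:\s*rapidsai/shared-workflows/\S+@)" + re.escape(old_ref) + r"\b"
    )
    return pattern.sub(lambda m: m.group(1) + new_ref, text)


# Illustrative workflow fragment; the file name is a placeholder.
WORKFLOW = """\
jobs:
  checks:
    uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-build.yaml@test-cuda-12.2
  other:
    uses: actions/checkout@v4
"""
print(retarget_shared_workflows(WORKFLOW))
```

Anchoring the pattern on the repository path keeps the replacement from touching unrelated occurrences of the ref string elsewhere in the file.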
Labels
conda Related to conda and conda configuration improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Status: Done
4 participants