Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

alexbarghi-nv · 2024-10-03T14:28:13Z

We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the cugraph-gnn repository going forward, where we can pin a specific pytorch version for testing.

ci/test_python.sh

raydouglass

Looks good to me

…x-pytorch

jameslamb · 2024-10-03T21:19:40Z

(Summarizing some offline conversations, to get this into the public record here on GitHub)

For the last few days (unsure how long), CI jobs here targeting branch-24.10 have been silently getting 24.12 nightly packages. This PR fixes that, and that's exposing a dependency conflict for cugraph-pyg.

conda cannot install cugraph-pyg and pytorch>=2.3,<2.4 together, because there are not any pyg packages that support pytorch>=2.3.

full conda solve error trace (click me)

Looking for: ['cugraph-pyg=24.10', "pytorch[version='>=2.3,<2.4']", 'ogb']
Pinned packages:
  - python 3.10.*
Could not solve for environment specs
The following packages are incompatible
├─ cugraph-pyg 24.10**  is installable with the potential options
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|24.10.00a86|24.10.00a94] would require
│  │  └─ pyg >=2.5,<2.6  with the potential options
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 1.12.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1], which can be installed;
│  │     │     ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1] would require
│  │     │     │  └─ python >=3.7,<3.8.0a0 , which can be installed;
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1] would require
│  │     │     │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  │     │     └─ pytorch [1.12.0|1.12.1|...|2.3.1] would require
│  │     │        └─ python >=3.9,<3.10.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 1.13.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1], which can be installed;
│  │     │     └─ pytorch [1.13.0|1.13.1|...|2.3.1] would require
│  │     │        └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.0.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     └─ pytorch [2.0.0|2.0.1], which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.1.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [2.1.0|2.1.1|2.1.2], which can be installed;
│  │     │     └─ pytorch [2.1.0|2.1.2|...|2.3.1] would require
│  │     │        └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.2.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
│  │     │     └─ pytorch [2.2.0|2.2.1|2.2.2], which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  │     └─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │        └─ python >=3.9,<3.10.0a0 , which can be installed;
│  ├─ cugraph-pyg 24.10.00a0 would require
│  │  └─ cugraph 24.10.00a0.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a17 would require
│  │  └─ cugraph 24.10.00a17.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a19 would require
│  │  └─ cugraph 24.10.00a19.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a22 would require
│  │  └─ cugraph 24.10.00a22.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a28 would require
│  │  └─ cugraph 24.10.00a28.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a38 would require
│  │  └─ cugraph 24.10.00a38.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a39 would require
│  │  └─ cugraph 24.10.00a39.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a40 would require
│  │  └─ cugraph 24.10.00a40.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a44 would require
│  │  └─ cugraph 24.10.00a44.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a45 would require
│  │  └─ cugraph 24.10.00a45.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a48 would require
│  │  └─ cugraph 24.10.00a48.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a49 would require
│  │  └─ cugraph 24.10.00a49.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a50 would require
│  │  └─ cugraph 24.10.00a50.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a52 would require
│  │  └─ cugraph 24.10.00a52.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a53 would require
│  │  └─ cugraph 24.10.00a53.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a54 would require
│  │  └─ cugraph 24.10.00a54.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a55 would require
│  │  └─ cugraph 24.10.00a55.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a56 would require
│  │  └─ cugraph 24.10.00a56.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a57 would require
│  │  └─ cugraph 24.10.00a57.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a58 would require
│  │  └─ cugraph 24.10.00a58.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a59 would require
│  │  └─ cugraph 24.10.00a59.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a60 would require
│  │  └─ cugraph 24.10.00a60.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a61 would require
│  │  └─ cugraph 24.10.00a61.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a63 would require
│  │  └─ cugraph 24.10.00a63.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a65 would require
│  │  └─ cugraph 24.10.00a65.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a66 would require
│  │  └─ cugraph 24.10.00a66.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a68 would require
│  │  └─ cugraph 24.10.00a68.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a69 would require
│  │  └─ cugraph 24.10.00a69.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a70 would require
│  │  └─ cugraph 24.10.00a70.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a71 would require
│  │  └─ cugraph 24.10.00a71.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a72 would require
│  │  └─ cugraph 24.10.00a72.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a73 would require
│  │  └─ cugraph 24.10.00a73.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a74 would require
│  │  └─ cugraph 24.10.00a74.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a75 would require
│  │  └─ cugraph 24.10.00a75.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a77 would require
│  │  └─ cugraph 24.10.00a77.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a78 would require
│  │  └─ cugraph 24.10.00a78.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a79 would require
│  │  └─ cugraph 24.10.00a79.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a80 would require
│  │  └─ cugraph 24.10.00a80.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a81 would require
│  │  └─ cugraph 24.10.00a81.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a82 would require
│  │  └─ cugraph 24.10.00a82.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a83 would require
│  │  └─ cugraph 24.10.00a83.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require
│  │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require
│  │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  └─ cugraph-pyg [24.10.00a87|24.10.00a88|24.10.00a89|24.10.00a91|24.10.00a93] would require
│     ├─ pyg >=2.5,<2.6 , which can be installed (as previously explained);
│     └─ pytorch >=2.3,<2.4.0a0  with the potential options
│        ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
│        └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported;
├─ libtorch is installable with the potential options
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_generic_*_2, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_generic_*_3, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_mkl_*_102, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_mkl_*_103, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda112_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda112_*_303, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda118_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda118_*_303, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda120_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda120_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_4, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_3, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_103, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_104, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda112_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda112_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_304, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_304, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda120_*_301, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda120_*_300, which can be installed;
│  └─ libtorch 2.4.0 would require
│     └─ pytorch 2.4.0 cuda120_*_301, which can be installed;
└─ pytorch >=2.3,<2.4  is installable with the potential options
   ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
   └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported.
[rapids-conda-retry] conda returned exit code: 1
[rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'Connection broken:', 'ConnectionError:', 'DependencyNeedsBuildingError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached', segfault exit code 139
[rapids-conda-retry

how to reproduce this (click me)

docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=922571b6db5f721a287897b3c5acc81b3fe07f69 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-rockylinux8-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

rapids-logger "Downloading artifacts from previous jobs"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-logger "Generate Python testing dependencies"
rapids-dependency-file-generator \
  --output conda \
  --file-key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg

conda activate test_cugraph_pyg

CONDA_CUDA_VERSION="11.8"
PYG_URL="https://data.pyg.org/whl/torch-2.3.0+cu118.html"

rapids-mamba-retry install \
    --channel "${CPP_CHANNEL}" \
    --channel "${PYTHON_CHANNEL}" \
    --channel pyg \
    "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pytorch>=2.3,<2.4" \
    "ogb"

This only shows up in the conda-python-tests / 12.5.1, 3.12, amd64, ubuntu22.04, v100, latest-driver, latest-deps job, because that's the only one where cugraph-pyg installation is currently tested on PRs.

cugraph/ci/test_python.sh

Lines 187 to 189 in 5fad435

    
           if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then 
        
             if [[ "${RUNNER_ARCH}" != "ARM64" ]]; then 
        
               rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg

The PyTorch floor here was raised to pytorch>=2.3,<2.4 in #4615. Logs from that CI job on that PR show the issue:

cugraph                   24.12.00a16     cuda11_py310_240928_g59f70dd1b_16    rapidsai-nightly
cugraph-pyg               24.12.00a16     py310_240928_g59f70dd1b_16    rapidsai-nightly
...
pyg                       2.5.2           py310_torch_2.1.0_cpu    pyg
...
pytorch                   2.1.2           cuda118_py310h6f85f1b_304    conda-forge

(build link)

cugraph-pyg==24.12.* at that point still allow pytorch==2.1.* to be installed, which allowed conda to find a solution with pyg.

So what can we do?

Ideally, there would be pyg packages supporting pytorch>=2.3. It seemed like this PR from around months ago might have added that: pyg-team/pytorch_geometric#9240.

But there are not PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc.

The options I can think of:

relax PyTorch dependency for cugraph-pyg back to pytorch>=2.2
delay cugraph-pyg=24.10 release until there are pyg packages supporting PyTorch 2.3
build pyg packages supporting PyTorch from source and host them on RAPIDS-controlled channels

jameslamb · 2024-10-03T22:02:41Z

update on #4690 (comment)

After offline discussion with @alexbarghi-nv @jakirkham @tingyu66 , we decided to replace uses of pyg::pyg conda packages with conda-forge::pytorch_geometric.

commit: f267c77

They're built from the same sources, and conda-forge::pytorch_geometric is a noarch package without an explicit PyTorch constraint.

jakirkham

Thanks James! 🙏

AIUI this matches what we discussed

Also grepped for any remaining pyg dependency lines to fix and didn't find any

Included one informational note below, but no action needed

Approving to unblock

ci/build_docs.sh

jameslamb · 2024-10-04T13:45:40Z

All of the build and test jobs are now passing, and spot-checking the logs it looks to me like they're using the correct, expected versions of dependencies 🎉

The docs-build is broken, like this:

Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x7f0c6433e4d0> for event 'builder-inited' threw an exception (exception: no module named cugraph_dgl.convert)

(build link)

The most recent docs build (yesterday) did "succeed" .... but only by using 24.08 packages 😱

  + pytorch                                 2.1.2  cuda118_py310h6f85f1b_304          conda-forge            27MB
  ...
  + dgl                                     1.1.3  cuda112py310hdbdccad_2             conda-forge            44MB
  ...
  + pyg                                     2.5.2  py310_torch_2.1.0_cpu              pyg                     1MB
  ...
  + cugraph                              24.08.00  cuda11_py310_240808_gfc880db0c_0   rapidsai                2MB
  + cugraph-service-server               24.08.00  py310_240808_gfc880db0c_0          rapidsai               44kB
  + cugraph-pyg                          24.08.00  py310_240808_gfc880db0c_0          rapidsai              142kB
  + cugraph-dgl                          24.08.00  py310_0                            rapidsai              122kB

(build link)

It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages.

In my experience with sphinx, this type of "no module" error often means "there was an ImportError when trying to import that module", which can point to these other explanations:

a missing dependency
breaking change in some dependency's import paths
shared library loading error

There absolutely is a cugraph_dgl.convert module: https://github.com/rapidsai/cugraph/blob/branch-24.10/python/cugraph-dgl/cugraph_dgl/convert.py

I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.

code to do that (click me)

docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=f267c771707d4007c6869b4a0a79feb3e0c27700 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-ubuntu22.04-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-dependency-file-generator \
  --output conda \
  --file-key docs \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n docs
conda activate docs

if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
  CONDA_CUDA_VERSION="11.8"
  DGL_CHANNEL="dglteam/label/cu118"
else
  CONDA_CUDA_VERSION="12.1"
  DGL_CHANNEL="dglteam/label/cu121"
fi

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  --channel conda-forge \
  --channel nvidia \
  --channel "${DGL_CHANNEL}" \
  "libcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-dgl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-server=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-client=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "libcugraph_etl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraphops=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibwholegraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  pytorch \
  "cuda-version=${CONDA_CUDA_VERSION}"

python -c "import cugraph_dgl.convert"

DGL backend not selected or invalid. Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph_dgl/init.py", line 18, in
from cugraph_dgl.graph import Graph
...
File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph/utilities/utils.py", line 410, in getattr
raise RuntimeError(f"This feature requires the {self.name} " "package/module")
RuntimeError: This feature requires the dgl package/module

Following that code shared above, that can reproduced without actually invoking sphinx-build:

python -c "import cugraph_dgl.convert"

Walking down the trace:

python -c "import dgl"

ModuleNotFoundError: No module named 'torchdata'

conda install -c conda-forge torchdata
python -c "import dgl"

ModuleNotFoundError: No module named 'pydantic'

conda install -c conda-forge pydantic
python -c "import dgl"

FileNotFoundError: Cannot find DGL C++ graphbolt library at /opt/conda/envs/docs/lib/python3.10/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.post300.so

So what do we do?

I'm not sure.

Looks like dgl's dependency on torchdata was removed in August:

Those seem to have not made it in until v2.4.0 (https://github.com/dmlc/dgl/releases/tag/v2.4.0)

Here in cugraph's CI, we're getting v2.1.0

  + dgl                                2.1.0.cu118  py310_0                              dglteam/label/cu118      606MB

(build link)

I'm not sure how to fix this. The cu118 label for this package doesn't have packages newer than v2.1.0:

https://anaconda.org/dglteam/dgl/files?version=&channel=cu118

Maybe we want the th23_cu118 label instead, now that cugraph is using PyTorch 2.3?

https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118

jameslamb · 2024-10-04T16:53:50Z

Summarizing recent commits:

dgl appears to have changed its versioning scheme for conda packages. The latest release of dgl (v2.4.0) has not been published to the dglteam channel under the main tag... they're now tags and version numbers that encode the supported PyTorch version and CUDA version.

Here in the 24.10 release of cugraph-dgl we want to support PyTorch 2.3 and CUDA 11.8, so I've switched cugraph-dgl to this runtime requirement:

dgl >= 2.4.0.th23.cu*

and requiring this label on the dglteam channel

--channel dglteam/label/th23_cu118

As @alexbarghi-nv pointed out to me, something similar is being done in cugraph-gnn already: rapidsai/cugraph-gnn#10

For wheels, I've updated the cugraph-dgl wheels' dependency on dgl (only enforced via a pip install in a script, not wheel metadata) from dgl==2.0.0 to dgl==2.2.1... the latest version that wheels have been published for.

jameslamb · 2024-10-07T00:55:00Z

I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting... latest nightlies of cugraph, nx-cugraph, cudf, etc., PyTorch 2.3, and numpy 2.x.

Thanks for the help everyone!

jameslamb · 2024-10-07T00:55:04Z

/merge

jakirkham · 2024-10-07T03:23:09Z

Thanks James! 🙏

## Summary Follow-up to #4690. Proposes consolidating stuff like this in CI scripts: ```shell pip install A pip install B pip install C ``` Into this: ```shell pip install A B C ``` ## Benefits of these changes Reduces the risk of creating a broken environment with incompatible packages. Unlike `conda`, `pip` does not evaluate the requirements of all installed packages when you run `pip` install. Installing `torch` and `cugraph-dgl` at the same time, for example, gives us a chance to find out about packaging issues like *"`cugraph-dgl` and `torch` have conflicting requirements on `{other_package}`"* at CI time. Similar change from `cudf`: rapidsai/cudf#16575 Authors: - James Lamb (https://github.com/jameslamb) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Alex Barghi (https://github.com/alexbarghi-nv) URL: #4701

Another steps towards completing the work started in #53 Fixes #15 Contributes to rapidsai/build-planning#111 Proposes changes to get CI running on pull requests for `cugraph-pyg` and `cugraph-dgl` ## Notes for Reviewers Workflows for nightly builds and publishing nightly packages are intentionally not included here. See #58 (comment) Notebook tests are intentionally not added here... they'll be added in the next PR. Pulls in changes from these other upstream PRs that had not been ported over to this repo: * rapidsai/cugraph#4690 * rapidsai/cugraph#4393 Authors: - James Lamb (https://github.com/jameslamb) - Alex Barghi (https://github.com/alexbarghi-nv) Approvers: - Alex Barghi (https://github.com/alexbarghi-nv) - Bradley Dice (https://github.com/bdice) URL: #59

set correct pytorch range in ci

2ddb404

alexbarghi-nv requested a review from a team as a code owner October 3, 2024 14:28

alexbarghi-nv requested a review from raydouglass October 3, 2024 14:28

alexbarghi-nv self-assigned this Oct 3, 2024

alexbarghi-nv added bug Something isn't working non-breaking Non-breaking change ci labels Oct 3, 2024

alexbarghi-nv added this to the 24.10 milestone Oct 3, 2024

jameslamb reviewed Oct 3, 2024

View reviewed changes

ci/test_python.sh Outdated Show resolved Hide resolved

jameslamb and others added 2 commits October 3, 2024 09:30

Update ci/test_python.sh

d0d1ea1

add version constraints in conda installs

922571b

jameslamb mentioned this pull request Oct 3, 2024

Ensure cached packages installed in CI test phase rapidsai/build-planning#14

Open

jameslamb changed the title ~~Constrain the PyTorch Version in CI Runs~~ Constrain versions of PyTorch and CI artifacts in CI Runs Oct 3, 2024

raydouglass approved these changes Oct 3, 2024

View reviewed changes

Merge branch 'branch-24.10' of github.com:rapidsai/cugraph into ci-fi…

8364839

…x-pytorch

use conda-forge::torch_geometric instead of pyg::pyg

f267c77

jameslamb requested review from a team as code owners October 3, 2024 21:57

jameslamb requested a review from KyleFromNVIDIA October 3, 2024 21:57

github-actions bot added python conda labels Oct 3, 2024

jakirkham approved these changes Oct 3, 2024

View reviewed changes

ci/build_docs.sh Show resolved Hide resolved

BradReesWork approved these changes Oct 4, 2024

View reviewed changes

pin to dgl 2.4 for conda packages, 2.2.1 for wheels

2124570

jameslamb requested a review from a team as a code owner October 4, 2024 16:36

jameslamb added 2 commits October 4, 2024 09:36

th23 modifier

2ad8e87

use correct channel for conda builds

f880686

rlratzel approved these changes Oct 4, 2024

View reviewed changes

fix channel, use new location for dgl wheels

317b86b

jameslamb changed the title ~~Constrain versions of PyTorch and CI artifacts in CI Runs~~ Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 Oct 4, 2024

rapids-bot bot merged commit 3789b70 into rapidsai:branch-24.10 Oct 7, 2024
131 checks passed

jameslamb mentioned this pull request Oct 7, 2024

Updates docs to describe nx-cugraph based on latest updates for 24.10 #4694

Merged

jakirkham mentioned this pull request Oct 7, 2024

nx-cugraph: add NX_CUGRAPH_AUTOCONFIG=True env var to enable full zero-code change #4685

Merged

jameslamb mentioned this pull request Oct 7, 2024

[NO MRG] Test CI #4679

Closed

alexbarghi-nv deleted the ci-fix-pytorch branch October 7, 2024 15:09

This was referenced Oct 7, 2024

combine pip install calls in wheel-testing scripts #4701

Merged

Pin RAPIDS nightly version in conda installs rapidsai/build-planning#106

Closed

jameslamb mentioned this pull request Oct 21, 2024

add PR CI for cugraph-pyg and cugraph-dgl rapidsai/cugraph-gnn#59

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

alexbarghi-nv commented Oct 3, 2024

raydouglass left a comment

jameslamb commented Oct 3, 2024

jameslamb commented Oct 3, 2024

jakirkham left a comment

jameslamb commented Oct 4, 2024

jameslamb commented Oct 4, 2024

jameslamb commented Oct 7, 2024

jameslamb commented Oct 7, 2024

jakirkham commented Oct 7, 2024

Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

Conversation

alexbarghi-nv commented Oct 3, 2024

raydouglass left a comment

Choose a reason for hiding this comment

jameslamb commented Oct 3, 2024

So what can we do?

jameslamb commented Oct 3, 2024

jakirkham left a comment

Choose a reason for hiding this comment

jameslamb commented Oct 4, 2024

So what do we do?

jameslamb commented Oct 4, 2024

jameslamb commented Oct 7, 2024

jameslamb commented Oct 7, 2024

jakirkham commented Oct 7, 2024