Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690

Merged: 9 commits, Oct 7, 2024

alexbarghi-nv
Member

We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the cugraph-gnn repository going forward, where we can pin a specific pytorch version for testing.

@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner October 3, 2024 14:28
@alexbarghi-nv alexbarghi-nv self-assigned this Oct 3, 2024
@alexbarghi-nv alexbarghi-nv added the bug, non-breaking, and ci labels Oct 3, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.10 milestone Oct 3, 2024
ci/test_python.sh (outdated review thread, resolved)
@jameslamb jameslamb changed the title from "Constrain the PyTorch Version in CI Runs" to "Constrain versions of PyTorch and CI artifacts in CI Runs" Oct 3, 2024
Member

@raydouglass raydouglass left a comment

Looks good to me

@jameslamb
Member

(Summarizing some offline conversations, to get this into the public record here on GitHub)

For the last few days (unsure how long), CI jobs here targeting branch-24.10 have been silently getting 24.12 nightly packages. This PR fixes that, and that's exposing a dependency conflict for cugraph-pyg.

conda cannot install cugraph-pyg and pytorch>=2.3,<2.4 together, because there are not any pyg packages that support pytorch>=2.3.

full conda solve error trace (click me)
Looking for: ['cugraph-pyg=24.10', "pytorch[version='>=2.3,<2.4']", 'ogb']
Pinned packages:
  - python 3.10.*
Could not solve for environment specs
The following packages are incompatible
├─ cugraph-pyg 24.10**  is installable with the potential options
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|24.10.00a86|24.10.00a94] would require
│  │  └─ pyg >=2.5,<2.6  with the potential options
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 1.12.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1], which can be installed;
│  │     │     ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1] would require
│  │     │     │  └─ python >=3.7,<3.8.0a0 , which can be installed;
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1] would require
│  │     │     │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  │     │     └─ pytorch [1.12.0|1.12.1|...|2.3.1] would require
│  │     │        └─ python >=3.9,<3.10.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 1.13.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|1.13.0|1.13.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1], which can be installed;
│  │     │     └─ pytorch [1.13.0|1.13.1|...|2.3.1] would require
│  │     │        └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.0.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     └─ pytorch [2.0.0|2.0.1], which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.1.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [2.1.0|2.1.1|2.1.2], which can be installed;
│  │     │     └─ pytorch [2.1.0|2.1.2|...|2.3.1] would require
│  │     │        └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ pytorch 2.2.*  with the potential options
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│  │     │     ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
│  │     │     └─ pytorch [2.2.0|2.2.1|2.2.2], which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │     ├─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │     │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  │     └─ pyg [2.5.0|2.5.1|2.5.2] would require
│  │        └─ python >=3.9,<3.10.0a0 , which can be installed;
│  ├─ cugraph-pyg 24.10.00a0 would require
│  │  └─ cugraph 24.10.00a0.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a17 would require
│  │  └─ cugraph 24.10.00a17.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a19 would require
│  │  └─ cugraph 24.10.00a19.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a22 would require
│  │  └─ cugraph 24.10.00a22.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a28 would require
│  │  └─ cugraph 24.10.00a28.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a38 would require
│  │  └─ cugraph 24.10.00a38.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a39 would require
│  │  └─ cugraph 24.10.00a39.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a40 would require
│  │  └─ cugraph 24.10.00a40.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a44 would require
│  │  └─ cugraph 24.10.00a44.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a45 would require
│  │  └─ cugraph 24.10.00a45.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a48 would require
│  │  └─ cugraph 24.10.00a48.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a49 would require
│  │  └─ cugraph 24.10.00a49.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a50 would require
│  │  └─ cugraph 24.10.00a50.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a52 would require
│  │  └─ cugraph 24.10.00a52.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a53 would require
│  │  └─ cugraph 24.10.00a53.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a54 would require
│  │  └─ cugraph 24.10.00a54.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a55 would require
│  │  └─ cugraph 24.10.00a55.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a56 would require
│  │  └─ cugraph 24.10.00a56.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a57 would require
│  │  └─ cugraph 24.10.00a57.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a58 would require
│  │  └─ cugraph 24.10.00a58.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a59 would require
│  │  └─ cugraph 24.10.00a59.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a60 would require
│  │  └─ cugraph 24.10.00a60.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a61 would require
│  │  └─ cugraph 24.10.00a61.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a63 would require
│  │  └─ cugraph 24.10.00a63.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a65 would require
│  │  └─ cugraph 24.10.00a65.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a66 would require
│  │  └─ cugraph 24.10.00a66.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a68 would require
│  │  └─ cugraph 24.10.00a68.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a69 would require
│  │  └─ cugraph 24.10.00a69.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a70 would require
│  │  └─ cugraph 24.10.00a70.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a71 would require
│  │  └─ cugraph 24.10.00a71.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a72 would require
│  │  └─ cugraph 24.10.00a72.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a73 would require
│  │  └─ cugraph 24.10.00a73.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a74 would require
│  │  └─ cugraph 24.10.00a74.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a75 would require
│  │  └─ cugraph 24.10.00a75.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a77 would require
│  │  └─ cugraph 24.10.00a77.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a78 would require
│  │  └─ cugraph 24.10.00a78.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a79 would require
│  │  └─ cugraph 24.10.00a79.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a80 would require
│  │  └─ cugraph 24.10.00a80.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a81 would require
│  │  └─ cugraph 24.10.00a81.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a82 would require
│  │  └─ cugraph 24.10.00a82.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg 24.10.00a83 would require
│  │  └─ cugraph 24.10.00a83.* , which does not exist (perhaps a missing channel);
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require
│  │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  ├─ cugraph-pyg [24.10.00a84|24.10.00a85|...|24.10.00a93] would require
│  │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  └─ cugraph-pyg [24.10.00a87|24.10.00a88|24.10.00a89|24.10.00a91|24.10.00a93] would require
│     ├─ pyg >=2.5,<2.6 , which can be installed (as previously explained);
│     └─ pytorch >=2.3,<2.4.0a0  with the potential options
│        ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
│        ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
│        └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported;
├─ libtorch is installable with the potential options
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_generic_*_2, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_generic_*_3, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_mkl_*_102, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cpu_mkl_*_103, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda112_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda112_*_303, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda118_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda118_*_303, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda120_*_302, which can be installed;
│  ├─ libtorch 2.1.0 would require
│  │  └─ pytorch 2.1.0 cuda120_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_4, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_generic_*_3, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_103, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cpu_mkl_*_104, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda112_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda112_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda118_*_304, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_301, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_303, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.1.2 would require
│  │  └─ pytorch 2.1.2 cuda120_*_304, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda120_*_301, which can be installed;
│  ├─ libtorch 2.3.0 would require
│  │  └─ pytorch 2.3.0 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.3.1 would require
│  │  └─ pytorch 2.3.1 cuda120_*_300, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_generic_*_1, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_generic_*_0, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_mkl_*_100, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cpu_mkl_*_101, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda118_*_300, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda118_*_301, which can be installed;
│  ├─ libtorch 2.4.0 would require
│  │  └─ pytorch 2.4.0 cuda120_*_300, which can be installed;
│  └─ libtorch 2.4.0 would require
│     └─ pytorch 2.4.0 cuda120_*_301, which can be installed;
└─ pytorch >=2.3,<2.4  is installable with the potential options
   ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [1.12.0|1.12.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [1.13.0|1.13.1|...|2.3.1], which can be installed (as previously explained);
   ├─ pytorch [2.1.0|2.1.2|...|2.3.1], which can be installed (as previously explained);
   └─ pytorch [2.3.0|2.3.1] conflicts with any installable versions previously reported.
[rapids-conda-retry] conda returned exit code: 1
[rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'Connection broken:', 'ConnectionError:', 'DependencyNeedsBuildingError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached', segfault exit code 139
[rapids-conda-retry
how to reproduce this (click me)
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=922571b6db5f721a287897b3c5acc81b3fe07f69 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-rockylinux8-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

rapids-logger "Downloading artifacts from previous jobs"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-logger "Generate Python testing dependencies"
rapids-dependency-file-generator \
  --output conda \
  --file-key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg

conda activate test_cugraph_pyg

CONDA_CUDA_VERSION="11.8"
PYG_URL="https://data.pyg.org/whl/torch-2.3.0+cu118.html"

rapids-mamba-retry install \
    --channel "${CPP_CHANNEL}" \
    --channel "${PYTHON_CHANNEL}" \
    --channel pyg \
    "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
    "pytorch>=2.3,<2.4" \
    "ogb"

This only shows up in the conda-python-tests / 12.5.1, 3.12, amd64, ubuntu22.04, v100, latest-driver, latest-deps job, because that's the only one where cugraph-pyg installation is currently tested on PRs.

cugraph/ci/test_python.sh

Lines 187 to 189 in 5fad435

if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
  if [[ "${RUNNER_ARCH}" != "ARM64" ]]; then
    rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg

The PyTorch floor here was raised to pytorch>=2.3,<2.4 in #4615. Logs from that CI job on that PR show the issue:

cugraph                   24.12.00a16     cuda11_py310_240928_g59f70dd1b_16    rapidsai-nightly
cugraph-pyg               24.12.00a16     py310_240928_g59f70dd1b_16    rapidsai-nightly
...
pyg                       2.5.2           py310_torch_2.1.0_cpu    pyg
...
pytorch                   2.1.2           cuda118_py310h6f85f1b_304    conda-forge

(build link)

cugraph-pyg==24.12.* at that point still allowed pytorch==2.1.* to be installed, which let conda find a solution with pyg.
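This kind of silent drift (a branch-24.10 job resolving to 24.12 nightlies) is something a small post-install check could catch. Here is a minimal sketch, assuming `conda list`-style rows (name, version, build, channel) on standard input. The function names `matches_release` and `find_release_mismatches` are hypothetical and are not part of the RAPIDS CI tooling.

```shell
#!/usr/bin/env bash

# Succeeds if version $1 (e.g. "24.12.00a16") belongs to release $2 (e.g. "24.10").
matches_release() {
    local version="$1"
    local release="$2"
    case "${version}" in
        "${release}".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads rows of "name version build channel" on stdin and prints "name version"
# for each package named in $2.. whose version is outside release $1.
find_release_mismatches() {
    local release="$1"
    shift
    local name version _rest pkg
    while read -r name version _rest; do
        for pkg in "$@"; do
            if [ "${name}" = "${pkg}" ] && ! matches_release "${version}" "${release}"; then
                echo "${name} ${version}"
            fi
        done
    done
}
```

In a CI script this might be fed from `conda list`, for example `conda list | find_release_mismatches "${RAPIDS_VERSION_MAJOR_MINOR}" cugraph cugraph-pyg`, failing the job whenever it prints anything.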

So what can we do?

Ideally, there would be pyg packages supporting pytorch>=2.3. It seemed like this PR from months ago might have added that: pyg-team/pytorch_geometric#9240.

But there are no PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc.


The options I can think of:

  • relax PyTorch dependency for cugraph-pyg back to pytorch>=2.2
  • delay cugraph-pyg=24.10 release until there are pyg packages supporting PyTorch 2.3
  • build pyg packages supporting PyTorch 2.3 from source and host them on RAPIDS-controlled channels

@jameslamb
Member

update on #4690 (comment)

After offline discussion with @alexbarghi-nv, @jakirkham, and @tingyu66, we decided to replace uses of pyg::pyg conda packages with conda-forge::pytorch_geometric.

commit: f267c77

They're built from the same sources, and conda-forge::pytorch_geometric is a noarch package without an explicit PyTorch constraint.

Member

@jakirkham jakirkham left a comment

Thanks James! 🙏

AIUI this matches what we discussed

Also grepped for any remaining pyg dependency lines to fix and didn't find any

Included one informational note below, but no action needed

Approving to unblock

ci/build_docs.sh (review thread, resolved)
@jameslamb
Member

All of the build and test jobs are now passing, and spot-checking the logs it looks to me like they're using the correct, expected versions of dependencies 🎉

The docs-build is broken, like this:

Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x7f0c6433e4d0> for event 'builder-inited' threw an exception (exception: no module named cugraph_dgl.convert)

(build link)

The most recent docs build (yesterday) did "succeed" .... but only by using 24.08 packages 😱

  + pytorch                                 2.1.2  cuda118_py310h6f85f1b_304          conda-forge            27MB
  ...
  + dgl                                     1.1.3  cuda112py310hdbdccad_2             conda-forge            44MB
  ...
  + pyg                                     2.5.2  py310_torch_2.1.0_cpu              pyg                     1MB
  ...
  + cugraph                              24.08.00  cuda11_py310_240808_gfc880db0c_0   rapidsai                2MB
  + cugraph-service-server               24.08.00  py310_240808_gfc880db0c_0          rapidsai               44kB
  + cugraph-pyg                          24.08.00  py310_240808_gfc880db0c_0          rapidsai              142kB
  + cugraph-dgl                          24.08.00  py310_0                            rapidsai              122kB

(build link)

It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages.

In my experience with sphinx, this type of "no module" error often means "there was an ImportError when trying to import that module", which can point to these other explanations:

  • a missing dependency
  • breaking change in some dependency's import paths
  • shared library loading error

There absolutely is a cugraph_dgl.convert module: https://github.com/rapidsai/cugraph/blob/branch-24.10/python/cugraph-dgl/cugraph_dgl/convert.py

I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.

code to do that (click me)
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=f267c771707d4007c6869b4a0a79feb3e0c27700 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-ubuntu22.04-py3.10 \
    bash

RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-dependency-file-generator \
  --output conda \
  --file-key docs \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n docs
conda activate docs

if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
  CONDA_CUDA_VERSION="11.8"
  DGL_CHANNEL="dglteam/label/cu118"
else
  CONDA_CUDA_VERSION="12.1"
  DGL_CHANNEL="dglteam/label/cu121"
fi

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  --channel conda-forge \
  --channel nvidia \
  --channel "${DGL_CHANNEL}" \
  "libcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-dgl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-server=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-client=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "libcugraph_etl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraphops=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibwholegraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  pytorch \
  "cuda-version=${CONDA_CUDA_VERSION}"

python -c "import cugraph_dgl.convert"

DGL backend not selected or invalid. Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph_dgl/__init__.py", line 18, in <module>
    from cugraph_dgl.graph import Graph
  ...
  File "/opt/conda/envs/docs/lib/python3.10/site-packages/cugraph/utilities/utils.py", line 410, in __getattr__
    raise RuntimeError(f"This feature requires the {self.name} " "package/module")
RuntimeError: This feature requires the dgl package/module

Following the code shared above, that can be reproduced without actually invoking sphinx-build:

python -c "import cugraph_dgl.convert"

Walking down the trace:

python -c "import dgl"

ModuleNotFoundError: No module named 'torchdata'

conda install -c conda-forge torchdata
python -c "import dgl"

ModuleNotFoundError: No module named 'pydantic'

conda install -c conda-forge pydantic
python -c "import dgl"

FileNotFoundError: Cannot find DGL C++ graphbolt library at /opt/conda/envs/docs/lib/python3.10/site-packages/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.post300.so

So what do we do?

I'm not sure.

Looks like dgl's dependency on torchdata was removed in August:

Those seem to have not made it in until v2.4.0 (https://github.com/dmlc/dgl/releases/tag/v2.4.0)

Here in cugraph's CI, we're getting v2.1.0

  + dgl                                2.1.0.cu118  py310_0                              dglteam/label/cu118      606MB

(build link)

I'm not sure how to fix this. The cu118 label for this package doesn't have packages newer than v2.1.0:

https://anaconda.org/dglteam/dgl/files?version=&channel=cu118

Maybe we want the th23_cu118 label instead, now that cugraph is using PyTorch 2.3?

https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118

@jameslamb jameslamb requested a review from a team as a code owner October 4, 2024 16:36
@jameslamb
Member

Summarizing recent commits:

dgl appears to have changed its versioning scheme for conda packages. The latest release of dgl (v2.4.0) has not been published to the dglteam channel under the main label. Instead, packages are now published under labels and version numbers that encode the supported PyTorch version and CUDA version.

Here in the 24.10 release of cugraph-dgl we want to support PyTorch 2.3 and CUDA 11.8, so I've switched cugraph-dgl to this runtime requirement:

dgl >= 2.4.0.th23.cu*

and requiring this label on the dglteam channel

--channel dglteam/label/th23_cu118
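The label above appears to follow a pattern: `th` plus the PyTorch major and minor version, then `_cu` plus the CUDA major and minor version. Here is a minimal sketch of deriving it, assuming that pattern holds for other version combinations (only `th23_cu118` is confirmed in this thread). The helper names `major_minor_digits` and `dgl_channel_label` are hypothetical.

```shell
#!/usr/bin/env bash

# Prints the first two dot-separated components of a version with the dot
# removed, e.g. "11.8.0" -> "118". Expects at least "major.minor".
major_minor_digits() {
    local version="$1"
    local major="${version%%.*}"
    local rest="${version#*.}"
    local minor="${rest%%.*}"
    echo "${major}${minor}"
}

# Builds a dglteam channel label from a PyTorch version ($1) and a CUDA
# version ($2). ASSUMPTION: labels follow "th<torch>_cu<cuda>"; check that the
# label actually exists on anaconda.org before relying on it.
dgl_channel_label() {
    local torch_version="$1"
    local cuda_version="$2"
    echo "dglteam/label/th$(major_minor_digits "${torch_version}")_cu$(major_minor_digits "${cuda_version}")"
}
```

A CI script could then pass `--channel "$(dgl_channel_label 2.3 "${RAPIDS_CUDA_VERSION}")"` instead of hard-coding one label per CUDA version.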

As @alexbarghi-nv pointed out to me, something similar is being done in cugraph-gnn already: rapidsai/cugraph-gnn#10

For wheels, I've updated the cugraph-dgl wheels' dependency on dgl (only enforced via a pip install in a script, not wheel metadata) from dgl==2.0.0 to dgl==2.2.1... the latest version that wheels have been published for.

@jameslamb jameslamb changed the title from "Constrain versions of PyTorch and CI artifacts in CI Runs" to "Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4" Oct 4, 2024
@jameslamb
Member

I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting... latest nightlies of cugraph, nx-cugraph, cudf, etc., PyTorch 2.3, and numpy 2.x.

Thanks for the help everyone!

@jameslamb
Member

/merge

@rapids-bot rapids-bot bot merged commit 3789b70 into rapidsai:branch-24.10 Oct 7, 2024
131 checks passed
@jakirkham
Member

Thanks James! 🙏

@jameslamb jameslamb mentioned this pull request Oct 7, 2024
@alexbarghi-nv alexbarghi-nv deleted the ci-fix-pytorch branch October 7, 2024 15:09
rapids-bot bot pushed a commit that referenced this pull request Oct 10, 2024
## Summary

Follow-up to #4690.

Proposes consolidating stuff like this in CI scripts:

```shell
pip install A
pip install B
pip install C
```

Into this:

```shell
pip install A B C
```

## Benefits of these changes

Reduces the risk of creating a broken environment with incompatible packages. Unlike `conda`, `pip` does not evaluate the requirements of all installed packages when you run `pip install`.

Installing `torch` and `cugraph-dgl` at the same time, for example, gives us a chance to find out about packaging issues like *"`cugraph-dgl` and `torch` have conflicting requirements on `{other_package}`"* at CI time.

Similar change from `cudf`: rapidsai/cudf#16575

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Alex Barghi (https://github.com/alexbarghi-nv)

URL: #4701
rapids-bot bot pushed a commit to rapidsai/cugraph-gnn that referenced this pull request Oct 30, 2024
Another step towards completing the work started in #53

Fixes #15

Contributes to rapidsai/build-planning#111

Proposes changes to get CI running on pull requests for `cugraph-pyg` and `cugraph-dgl`

## Notes for Reviewers

Workflows for nightly builds and publishing nightly packages are intentionally not included here. See #58 (comment)

Notebook tests are intentionally not added here... they'll be added in the next PR.

Pulls in changes from these other upstream PRs that had not been ported over to this repo:

* rapidsai/cugraph#4690
* rapidsai/cugraph#4393

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - Bradley Dice (https://github.com/bdice)

URL: #59
Labels: bug, ci, conda, non-breaking, python