This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

[RELEASE] wholegraph v24.10 #225

Merged
merged 19 commits into from
Oct 9, 2024

Conversation

raydouglass
Member

❄️ Code freeze for branch-24.10 and v24.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.10 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.10 into main for the release

raydouglass and others added 19 commits July 19, 2024 15:07
Forward-merge branch-24.08 into branch-24.10
Forward-merge branch-24.08 into branch-24.10
It looks like the `Dockerfile` in this repo is fairly old (PyTorch 22.10). I don't know if it is useful -- we have largely deleted Dockerfiles in each RAPIDS repo now that we have devcontainers.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/linhu-nv
  - Brad Rees (https://github.com/BradReesWork)

URL: #184
Allow users to specify the entry size on each rank.

        node_feat_wm_embedding = wgth.create_embedding(
            ...
            embedding_entry_partition=[283071, 401722, 356680, 329221, 238065, 238060, 217897, 384313]
        )

1. embedding_entry_partition[i] is the number of embedding entries stored on rank i.
2. If embedding_entry_partition is None, the embedding is partitioned equally across ranks.
3. Only chunked device memory and distributed host/device memory are supported.
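
To make the rank-to-entry mapping concrete, here is a minimal sketch in plain Python. It is not part of the wholegraph API: `rank_offsets` and `owner_rank` are hypothetical helpers. It also assumes the partition sizes must sum to the total number of entries, and that an "equal" split gives each rank the ceiling of total / world size with the last ranks taking the remainder. The library may handle both cases differently.

```python
from bisect import bisect_right
from itertools import accumulate


def rank_offsets(total_entries, world_size, embedding_entry_partition=None):
    """Return the starting entry index of every rank, plus a final end offset.

    Hypothetical helper for illustration only; not a wholegraph function.
    """
    if embedding_entry_partition is None:
        # Assumption: equal partitioning uses ceil(total / world_size) per rank,
        # and the trailing ranks receive whatever is left.
        per_rank = -(-total_entries // world_size)
        sizes = [
            max(0, min(per_rank, total_entries - rank * per_rank))
            for rank in range(world_size)
        ]
    else:
        sizes = list(embedding_entry_partition)
        if len(sizes) != world_size:
            raise ValueError("expected one partition size per rank")
        # Assumption: the sizes must cover every entry exactly once.
        if sum(sizes) != total_entries:
            raise ValueError("partition sizes must sum to total_entries")
    return [0, *accumulate(sizes)]


def owner_rank(entry_id, offsets):
    """Return the rank that stores the given global entry index."""
    if not 0 <= entry_id < offsets[-1]:
        raise IndexError("entry_id out of range")
    # bisect_right skips past ranks whose range ends at or before entry_id.
    return bisect_right(offsets, entry_id) - 1


# The partition from the example above, spread over 8 ranks.
partition = [283071, 401722, 356680, 329221, 238065, 238060, 217897, 384313]
offsets = rank_offsets(sum(partition), len(partition), partition)
first_entry_rank = owner_rank(0, offsets)
last_entry_rank = owner_rank(sum(partition) - 1, offsets)
```

With an uneven partition like this one, rank 1 holds the most entries and rank 6 the fewest, which is the kind of imbalance the option is meant to let users express.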

Authors:
  - https://github.com/zhuofan1123

Approvers:
  - https://github.com/linhu-nv
  - Brad Rees (https://github.com/BradReesWork)

URL: #194
…rsion (#203)

Contributes to rapidsai/build-planning#58.

`scikit-build-core==0.10.0` was released today (https://github.com/scikit-build/scikit-build-core/releases/tag/v0.10.0), and wheel-building configurations across RAPIDS are incompatible with it.

This proposes upgrading to that version and fixing configuration here in a way that:

* is compatible with that new `scikit-build-core` version
* takes advantage of the forward-compatibility mechanism (`minimum-version`) that `scikit-build-core` provides, to reduce the risk of needing to do this again in the future

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #203
We have many users running the [Kubeflow training operator](https://github.com/kubeflow/training-operator) who are also interested in using Wholegraph. For our MPIJobs users, many of them still use [HorovodRun](https://github.com/horovod/horovod/tree/master) as the startup command. Therefore, we want to add HorovodRun as one of the Wholegraph launch agents so our users can use Wholegraph on top of Kubeflow.

The new function will be similar to the existing MPI launcher agent, in that the horovod library is only imported on demand. The horovod.tensorflow library will be used solely for the Horovod initialization call, because of an issue with horovod.torch (see horovod/horovod#4009). After Horovod is initialized, the program can continue to run normal PyTorch code within each rank, just as it does with the mpi4py launcher.
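
The on-demand import pattern described above might look roughly like the sketch below. The function name `horovod_rank_info` is hypothetical and this is not the code added to wholegraph. It only uses the standard Horovod calls `init()`, `rank()`, `size()`, `local_rank()`, and `local_size()` from `horovod.tensorflow`.

```python
def horovod_rank_info():
    """Initialize Horovod and return (rank, world_size, local_rank, local_size).

    Hypothetical helper for illustration; the real launcher agent may differ.
    """
    try:
        # Imported inside the function so Horovod is only required when this
        # launcher is actually selected.
        import horovod.tensorflow as hvd
    except ImportError as exc:
        raise RuntimeError(
            "The Horovod launcher needs horovod with TensorFlow support installed"
        ) from exc

    # horovod.tensorflow is used only for initialization and rank discovery.
    hvd.init()
    return hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_size()
```

A script started with `horovodrun` could call this once at startup and pass the returned ranks to whatever communicator setup follows, then continue with ordinary PyTorch code.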

fixes #201

Authors:
  - Tommy Li (https://github.com/Tomcli)

Approvers:
  - https://github.com/linhu-nv
  - Brad Rees (https://github.com/BradReesWork)

URL: #200
A few small tweaks to `update-version.sh` for alignment across RAPIDS. This PR removes the `UCX_PY` version HTTP call from `update-version.sh` because it is not used.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #204
This PR updates pre-commit hooks to the latest versions that are supported without causing style check errors.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #206
Contributes to rapidsai/build-planning#88

Finishes the work of dropping Python 3.9 support.

This project stopped building / testing against Python 3.9 as of rapidsai/shared-workflows#235.
This PR updates configuration and docs to reflect that.

## Notes for Reviewers

### How I tested this

Checked that there were no remaining uses like this:

```shell
git grep -E '3\.9'
git grep '39'
git grep 'py39'
```

And similar for variations on Python 3.8 (to catch things that were missed the last time this was done).

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #209
This PR removes the NumPy<2 pin.  `wholegraph` does not appear to be a heavy user of NumPy or CuPy, so it should be fine to simply remove the pin.
For other RAPIDS projects that depend on them more heavily, the just-released CuPy 13.3.0 was required for sufficiently good CuPy/NumPy interoperability.

Authors:
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - https://github.com/jakirkham

URL: #208
This PR updates rapidsai/pre-commit-hooks to version 0.4.0.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #213
Contributes to rapidsai/build-planning#40

This PR adds support for Python 3.12.

## Notes for Reviewers

This is part of ongoing work to add Python 3.12 support across RAPIDS.
It temporarily introduces a build/test matrix including Python 3.12, from rapidsai/shared-workflows#213.

A follow-up PR will revert back to pointing at the `branch-24.10` branch of `shared-workflows` once all
RAPIDS repos have added Python 3.12 support.

### This will fail until all dependencies have been updated to Python 3.12

CI here is expected to fail until all of this project's upstream dependencies support Python 3.12.

This can be merged whenever all CI jobs are passing.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #214
Just adds the existing license to the `pylibwholegraph` conda recipe.

Authors:
  - Ray Douglass (https://github.com/raydouglass)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #215
Contributes to rapidsai/build-planning#102

Fixes #217

## Notes for Reviewers

### How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly), by pointing at the branch from rapidsai/shared-workflows#246.

Observed the exact same failures with CUDA 11.4 reported in rapidsai/build-planning#102.

```text
...
  + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST 
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966022370/job/30453393224?pr=218))

Pushed a commit adding a floor of `nccl>=2.18.1.1`. Saw all tests pass with CUDA 11.4 😁 

```text
...
  + nccl                     2.22.3.1  hee583db_1                  conda-forge          131MB
...
(various log messages showing all tests passed)
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966210441/job/30454147250?pr=218))

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/linhu-nv
  - https://github.com/jakirkham

URL: #218
Follow-up to #218 

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.
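
As a rough illustration of what a version floor means for the two NCCL builds seen in the logs of #218, the sketch below compares dotted version strings numerically. `parse_version` and `meets_floor` are hypothetical helpers that handle only purely numeric versions. They are not how conda or pip resolve the `>=2.19` constraint.

```python
def parse_version(text):
    """Split a purely numeric dotted version such as '2.22.3.1' into integers."""
    return [int(part) for part in text.split(".")]


def meets_floor(installed, floor):
    """Return True if the installed version is at or above the floor."""
    have, need = parse_version(installed), parse_version(floor)
    # Pad the shorter version with zeros so '2.19' and '2.19.0' compare equal.
    width = max(len(have), len(need))
    have += [0] * (width - len(have))
    need += [0] * (width - len(need))
    return have >= need


# The two NCCL builds from the CI logs, checked against the new floor.
old_build_ok = meets_floor("2.10.3.1", "2.19")
new_build_ok = meets_floor("2.22.3.1", "2.19")
```

Under this check the 2.10.3.1 build that produced the `ncclCommSplit` lookup errors falls below the floor, while the 2.22.3.1 build satisfies it.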

cc @linhu-nv for awareness

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #223
@raydouglass raydouglass requested review from a team as code owners October 4, 2024 19:46
@raydouglass raydouglass requested review from KyleFromNVIDIA and removed request for a team October 4, 2024 19:46

copy-pr-bot bot commented Oct 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@raydouglass raydouglass merged commit 83243a3 into main Oct 9, 2024
725 of 741 checks passed

10 participants