[BUG]: wholegraph feature store tests failing on CUDA 11.8.0, Python 3.11, arm64 (nightly tests) #4817

Open
jameslamb opened this issue Dec 9, 2024 · 1 comment
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

jameslamb (Member) commented Dec 9, 2024

Version

24.12

Which installation method(s) does this occur on?

Conda

Describe the bug.

Noticed failures like the following in the conda-python-tests / 11.8.0, 3.11, arm64, ubuntu20.04, a100, latest-driver, latest-deps job on 24.12 nightly tests:

FAILED tests/data_store/test_gnn_feat_storage_wholegraph.py::test_feature_storage_wholegraph_backend - assert 0 > 0
FAILED tests/data_store/test_gnn_feat_storage_wholegraph.py::test_feature_storage_wholegraph_backend_mg - assert 0 > 0
= 2 failed, 3250 passed, 762 skipped, 51 deselected, 988 warnings in 5770.76s (1:36:10) =
Full stacktrace:
___________________ test_feature_storage_wholegraph_backend ____________________
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
                             ^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/threadexception.py", line 92, in pytest_runtest_call
    yield from thread_exception_runtest_hook()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/threadexception.py", line 68, in thread_exception_runtest_hook
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/unraisableexception.py", line 95, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/unraisableexception.py", line 70, in unraisable_exception_runtest_hook
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/logging.py", line 846, in pytest_runtest_call
    yield from self._runtest_for(item, "call")
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/logging.py", line 829, in _runtest_for
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/capture.py", line 880, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/skipping.py", line 257, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
    item.runtest()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/python.py", line 1627, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/cugraph/cugraph/python/cugraph/cugraph/tests/data_store/test_gnn_feat_storage_wholegraph.py", line 82, in test_feature_storage_wholegraph_backend
    assert world_size > 0
AssertionError: assert 0 > 0
----------------------------- Captured stdout call -----------------------------
gpu count: 0
__________________ test_feature_storage_wholegraph_backend_mg __________________
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
                             ^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/threadexception.py", line 92, in pytest_runtest_call
    yield from thread_exception_runtest_hook()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/threadexception.py", line 68, in thread_exception_runtest_hook
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/unraisableexception.py", line 95, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/unraisableexception.py", line 70, in unraisable_exception_runtest_hook
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/logging.py", line 846, in pytest_runtest_call
    yield from self._runtest_for(item, "call")
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/logging.py", line 829, in _runtest_for
    yield
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/capture.py", line 880, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/skipping.py", line 257, in pytest_runtest_call
    return (yield)
            ^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
    item.runtest()
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/python.py", line 1627, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/test/lib/python3.11/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/cugraph/cugraph/python/cugraph/cugraph/tests/data_store/test_gnn_feat_storage_wholegraph.py", line 100, in test_feature_storage_wholegraph_backend_mg
    assert world_size > 0
AssertionError: assert 0 > 0
----------------------------- Captured stdout call -----------------------------
gpu count: 0

(build link)

Saw them again on a manual re-run of that test job a few hours later.

Minimum reproducible example

See CI links above.

Note that this only happens in that one conda-python-tests job, not in any other configuration and not with wheels.
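
Both tests fail at `assert world_size > 0` with `gpu count: 0` in the captured stdout, so one quick check on the affected runner is what the environment itself reports about GPUs. The following is a minimal sketch, assuming the test derives `world_size` from PyTorch's device count (the issue does not say so). `gpu_visibility_report` is a hypothetical helper, not part of cuGraph.

```python
import importlib
import os


def gpu_visibility_report():
    """Collect basic facts about GPU visibility in the current process.

    Hypothetical triage helper: it never raises when PyTorch is missing,
    it records None for the PyTorch fields instead.
    """
    report = {
        # Unset (None) means the CUDA runtime may use every device it finds.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        report["torch_cuda_build"] = None
        report["torch_device_count"] = None
    else:
        # torch.version.cuda is None for a CPU-only PyTorch build.
        report["torch_cuda_build"] = torch.version.cuda
        # Returns 0 when no CUDA device is usable from this process.
        report["torch_device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If PyTorch is installed and `torch_cuda_build` is `None`, that would point at a CPU-only PyTorch build. A CUDA build reporting a device count of 0 would point at the driver or at device visibility instead.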

Relevant log output

N/A

Environment details

See CI links above.

Output of the last 'conda install' before the failing tests:

Package Version Build Channel Size
───────────────────────────────────────────────────────────────────────────────────────────────────────────────
Install:
───────────────────────────────────────────────────────────────────────────────────────────────────────────────

  • cugraph-service-client 24.12.00a87 py311_241209_g58075dd39_87 /tmp/python_channel 58kB
  • click 8.1.7 unix_pyh707e725_1 conda-forge Cached
  • toolz 1.0.0 pyhd8ed1ab_1 conda-forge 52kB
  • cloudpickle 3.1.0 pyhd8ed1ab_1 conda-forge 26kB
  • locket 1.0.0 pyhd8ed1ab_0 conda-forge 8kB
  • zipp 3.21.0 pyhd8ed1ab_1 conda-forge Cached
  • sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge 26kB
  • tblib 3.0.0 pyhd8ed1ab_0 conda-forge 17kB
  • zict 3.0.0 pyhd8ed1ab_1 conda-forge 36kB
  • xyzservices 2024.9.0 pyhd8ed1ab_1 conda-forge 47kB
  • attrs 24.2.0 pyh71513ae_1 conda-forge 56kB
  • aiohappyeyeballs 2.4.4 pyhd8ed1ab_1 conda-forge 19kB
  • pynvml 11.5.3 pyhd8ed1ab_1 conda-forge 48kB
  • partd 1.4.2 pyhd8ed1ab_0 conda-forge 21kB
  • importlib-metadata 8.5.0 pyha770c72_1 conda-forge Cached
  • yaml 0.2.5 hf897c2e_2 conda-forge Cached
  • msgpack-python 1.1.0 py311hc07b1fb_0 conda-forge 102kB
  • psutil 6.1.0 py311ha879c10_0 conda-forge Cached
  • tornado 6.4.2 py311h5487e9b_0 conda-forge 860kB
  • lz4 4.3.3 py311h2db3614_1 conda-forge 40kB
  • libwebp-base 1.4.0 h31becfc_0 conda-forge 364kB
  • contourpy 1.3.1 py311hc07b1fb_0 conda-forge 289kB
  • libjpeg-turbo 3.0.0 h31becfc_1 conda-forge 647kB
  • lerc 4.0.0 h4de3ea5_0 conda-forge 262kB
  • libdeflate 1.22 h86ecc28_0 conda-forge 70kB
  • libpng 1.6.44 hc4a20ef_0 conda-forge 295kB
  • pthread-stubs 0.4 h86ecc28_1002 conda-forge 8kB
  • xorg-libxdmcp 1.1.5 h57736b2_0 conda-forge 21kB
  • xorg-libxau 1.0.11 h86ecc28_1 conda-forge 16kB
  • frozenlist 1.5.0 py311ha879c10_0 conda-forge 61kB
  • multidict 6.1.0 py311h58d527c_1 conda-forge 64kB
  • propcache 0.2.1 py311ha879c10_0 conda-forge 53kB
  • libnl 3.11.0 h86ecc28_0 conda-forge 769kB
  • liblzma-devel 5.6.3 h86ecc28_1 conda-forge 377kB
  • xz-gpl-tools 5.6.3 h2dbfc1b_1 conda-forge 33kB
  • xz-tools 5.6.3 h86ecc28_1 conda-forge 96kB
  • libgpg-error 1.51 h05609ea_1 conda-forge 278kB
  • attr 2.5.1 h4e544f5_1 conda-forge 75kB
  • cytoolz 1.0.0 py311h5487e9b_1 conda-forge 388kB
  • pyyaml 6.0.2 py311ha879c10_1 conda-forge Cached
  • libtiff 4.7.0 hca96517_2 conda-forge 465kB
  • freetype 2.12.1 hf0a5ef3_2 conda-forge 642kB
  • libxcb 1.17.0 h262b8f6_0 conda-forge 397kB
  • yarl 1.18.3 py311ha879c10_0 conda-forge 152kB
  • xz 5.6.3 h2dbfc1b_1 conda-forge 23kB
  • libgcrypt-lib 1.11.0 h86ecc28_2 conda-forge 635kB
  • libcap 2.71 h51d75a7_0 conda-forge 107kB
  • openjpeg 2.5.2 h0d9d63b_0 conda-forge 375kB
  • lcms2 2.16 h922389a_0 conda-forge 296kB
  • libudev1 256.9 h1187dce_2 conda-forge 152kB
  • libsystemd0 256.9 hd54d049_0 conda-forge 431kB
  • pillow 11.0.0 py311hb2a0dd2_0 conda-forge 42MB
  • rdma-core 54.0 h1d056c8_1 conda-forge 1MB
  • ucx 1.17.0 h587c540_3 conda-forge 7MB
  • ucx-proc 1.0.0 gpu rapidsai 5kB
  • pylibraft 24.12.00a47 cuda11_py311_241209_g0af5afb0_47 rapidsai-nightly 319kB
  • libcugraphops 24.12.00a8 cuda11_241209_gcd7356b5_8 rapidsai-nightly 74MB
  • libucxx 0.41.00a cuda11_241209_g243d143_33 rapidsai-nightly 265kB
  • ucx-py 0.41.00a13 py311_241209_g38af753_13 rapidsai-nightly 396kB
  • ucxx 0.41.00a cuda11_py3.11_241209_g243d143_33 rapidsai-nightly 488kB
  • aiosignal 1.3.1 pyhd8ed1ab_1 conda-forge 13kB
  • dask-core 2024.11.2 pyhff2d567_1 conda-forge 903kB
  • bokeh 3.6.2 pyhd8ed1ab_1 conda-forge 5MB
  • distributed 2024.11.2 pyhff2d567_1 conda-forge 802kB
  • dask-expr 1.1.19 pyhd8ed1ab_0 conda-forge 186kB
  • dask 2024.11.2 pyhff2d567_1 conda-forge 8kB
  • libcugraph 24.12.00a87 cuda11_241209_g58075dd39_87 /tmp/cpp_channel 566MB
  • aiohttp 3.11.9 py311h58d527c_0 conda-forge 912kB
  • rapids-dask-dependency 24.12.00a8 py_0 rapidsai-nightly 19kB
  • pylibcugraph 24.12.00a87 cuda11_py311_241209_g58075dd39_87 /tmp/python_channel 695kB
  • distributed-ucxx 0.41.00a py3.11_241209_g243d143_33 rapidsai-nightly 61kB
  • dask-cudf 24.12.00a400 cuda11_py311_241209_g439321edb4_400 rapidsai-nightly 139kB
  • dask-cuda 24.12.00a15 py311_241209_g075f8be_15 rapidsai-nightly 284kB
  • raft-dask 24.12.00a47 cuda11_py311_241209_g0af5afb0_47 rapidsai-nightly 241kB
  • cugraph 24.12.00a87 cuda11_py311_241209_g58075dd39_87 /tmp/python_channel 1MB
  • cugraph-service-server 24.12.00a87 py311_241209_g58075dd39_87 /tmp/python_channel 54kB

Summary:

Install: 76 packages


Other/Misc.

@jakirkham and @alexbarghi-nv saw these exact tests fail exactly this way in https://github.com/rapidsai/cugraph/pull/4703#issuecomment-2403280592. There, the root cause was some mix of PyTorch and cupy versions (I think).
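
The 'conda install' output above does not list PyTorch or cupy, so their versions have to be read from the test environment itself. The following is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the distribution names `torch` and `cupy` are assumptions (cupy wheels, for example, are published under names such as `cupy-cuda11x`).

```python
from importlib import metadata


def installed_versions(names=("torch", "cupy")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Comparing this output between the failing arm64 job and a passing job would show whether the two environments resolved different PyTorch or cupy versions.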

Code of Conduct

- [x] I agree to follow cuGraph's Code of Conduct
- [x] I have searched the [open bugs](https://github.com/rapidsai/cugraph/issues?q=is%3Aopen+is%3Aissue+label%3Abug) and have found no duplicates for this bug report

jameslamb added the "? - Needs Triage" and "bug" labels on Dec 9, 2024
jameslamb changed the title from "[BUG]: wholegraph feature store tests failing on CUDA 11.8.0, Python 3.11, arm64" to "[BUG]: wholegraph feature store tests failing on CUDA 11.8.0, Python 3.11, arm64 (nightly tests)" on Dec 9, 2024

vyasr (Contributor) commented Dec 9, 2024

#4808 was supposed to fix this. It looks like something is off there.
