Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: Add DASK edgelist and graph support to the Dataset API #4035

Merged
merged 11 commits into from
Jan 9, 2024

Conversation

huiyuxie
Copy link
Contributor

@huiyuxie huiyuxie commented Dec 2, 2023

Hi! I choose to go further with some simple work other than docs. This PR is going to close #3218.

Here is what I have done in this PR:

  1. Added get_dask_edgelist() and get_dask_graph() (and another internal helper function __download_dask_csv()) to Dataset API.
  2. Executed all necessary tests for these new functions.
  3. Improved existing functions in the Dataset API and conducted tests to verify improvements.

Here are some additional details regarding this PR:

  1. The building and testing were conducted using version 23.12 instead of the default 24.02. Since Cugraph-ops library is no longer open, I failed to build from source using version 24.02. I built and tested the code in version 23.12 and then transferred the updated file to 24.02 before creating this PR. (I would appreciate any guidance on how to build from version 24.02 for external contributors).
  2. All tests from the test file have passed, but some warnings remain, as shown below
============================================================ warnings summary ============================================================
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
  /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/cudf/core/index.py:3284: FutureWarning: cudf.StringIndex is deprecated and will be removed from cudf in a future version. Use cudf.Index with the appropriate dtype instead.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

I think above warnings came from the function call from_dask_cudf_edgelist but currently I have no idea how to remove them. I will do my best to address it if anyone has any ideas about it.

  1. The get_edgelist() function returns a deep copy of the object, but this is not supported for get_dask_edgelist() since only shallow copy is allowed for Dask cuDF dataframe (see docs). This will lead to a problem where if a user modifies the dataframe, the changes will be reflected in the internal self._edgelist object. So when get_dask_graph() is called later, the resulting graph will differ from the one directly constructed from the data file.
  2. I am uncertain about the requirements for (1) Identifying datasets and (2) Adding them to Dataset. If there is a need to add another function for determining whether a dataset requires MG handling based on its size, or to tag the dataset metadata (.yaml file) to indicate the necessity for MG processing, please let me know. Also, I welcome any suggestions for further features.
  3. When I ran pytest on other test files, the most common warnings were
/home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/dask_cudf/io/csv.py:79: FutureWarning: `chunksize` is deprecated and will be removed in the future. Please use `blocksize` instead.

The keyword chunksize is no longer in use (check docs here). I have checked all related functions in the repository and found that they currently use chunksize. If there is a need to change them to blocksize, I will create another PR to address this issue.

Any comments and suggestions are welcome!

@huiyuxie huiyuxie requested a review from a team as a code owner December 2, 2023 10:02
Copy link

copy-pr-bot bot commented Dec 2, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@huiyuxie
Copy link
Contributor Author

huiyuxie commented Dec 2, 2023

Also, I added @pytest.mark.skip(reason="MG not supported on CI") to each MG test in test file at last (after testing). If it should be removed for some reasons, please let me know.

Please help me add appropriate labels to this PR.

Next week is my final week, so I may sometimes delay my responses. I will try my best to check for any updates and comments here in my spare time. :)

@huiyuxie huiyuxie changed the title [FEA]: Add Dask edgelist and graph support to to Dataset API [FEA]: Add DASK edgelist and graph support to the Dataset API Dec 2, 2023
@rlratzel rlratzel added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 4, 2023
@nv-rliu nv-rliu assigned nv-rliu and unassigned nv-rliu Dec 4, 2023
@nv-rliu nv-rliu self-requested a review December 4, 2023 17:59
@github-actions github-actions bot added the python label Dec 7, 2023
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR. I think it looks good but I have a couple suggestions.

python/cugraph/cugraph/datasets/dataset.py Outdated Show resolved Hide resolved
python/cugraph/cugraph/datasets/dataset.py Outdated Show resolved Hide resolved
@rlratzel
Copy link
Contributor

/ok to test

@huiyuxie
Copy link
Contributor Author

Sorry for the delay! I have modified the file download method based on suggestions and all the tests have been passed again.

@huiyuxie huiyuxie requested a review from rlratzel December 21, 2023 17:07
@rlratzel

This comment was marked as duplicate.

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rlratzel

This comment was marked as duplicate.

@rlratzel
Copy link
Contributor

rlratzel commented Jan 9, 2024

/merge

@rapids-bot rapids-bot bot merged commit c7b720d into rapidsai:branch-24.02 Jan 9, 2024
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA]: Add MG DASK edgelist and graph support to Datasets API
3 participants