[FEA]: Add DASK edgelist and graph support to the Dataset API (#4035)
Hi! I chose to take on some simple work beyond docs. This PR closes #3218.

Here is what I have done in this PR:

1. Added `get_dask_edgelist()` and `get_dask_graph()` (plus an internal helper function, `__download_dask_csv()`) to the Dataset API.
2. Ran all necessary tests for these new functions.
3. Improved existing functions in the Dataset API and ran tests to verify the improvements.

Here are some additional details regarding this PR:

1. Building and testing were done on version 23.12 instead of the default 24.02. Since the cugraph-ops library is no longer open source, I could not build 24.02 from source. I built and tested the code on 23.12 and then transferred the updated files to 24.02 before creating this PR. (I would appreciate any guidance on how external contributors can build 24.02.)
2. All tests in the test file pass, but some warnings remain, as shown below:

   ```bash
   ============================================================ warnings summary ============================================================
   cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
   cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
   cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
   cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
     /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/cudf/core/index.py:3284: FutureWarning: cudf.StringIndex is deprecated and will be removed from cudf in a future version. Use cudf.Index with the appropriate dtype instead.
       warnings.warn(

   -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
   ```

   I think these warnings come from the call to `from_dask_cudf_edgelist`, but I currently have no idea how to remove them. I will do my best to address this if anyone has ideas.
3. The `get_edgelist()` function returns a deep copy of the object, but `get_dask_edgelist()` cannot do the same, because a Dask cuDF dataframe only supports shallow copies (see the [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.DataFrame.copy)). As a result, if a user modifies the returned dataframe, the changes are reflected in the internal `self._edgelist` object, and a later call to `get_dask_graph()` will produce a graph that differs from one constructed directly from the data file.
4. I am uncertain about the requirements for (1) identifying datasets and (2) adding them to Dataset. If another function is needed to decide, based on size, whether a dataset requires MG handling, or if the dataset metadata (.yaml file) should be tagged to indicate that MG processing is necessary, please let me know. I also welcome suggestions for further features.
5. When I ran pytest on other test files, the most common warning was:

   ```bash
   /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/dask_cudf/io/csv.py:79: FutureWarning: `chunksize` is deprecated and will be removed in the future. Please use `blocksize` instead.
   ```

   The `chunksize` keyword is deprecated in favor of `blocksize` (see the [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.read_csv)). I checked all related functions in the repository and found that they currently use `chunksize`. If they should be changed to `blocksize`, I will create another PR to address this.

Any comments and suggestions are welcome!
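Regarding the `cudf.StringIndex` `FutureWarning` described above: one possible stopgap, until the call that triggers it is updated, is to filter that specific warning around the call site. The sketch below uses only the standard-library `warnings` module. `noisy_call` and `call_quietly` are hypothetical names introduced for illustration and are not part of cuGraph or cuDF; `noisy_call` merely stands in for whatever library call emits the warning. Note that this hides the warning rather than fixing its cause.

```python
import warnings


def noisy_call():
    """Hypothetical stand-in for a library call that emits the FutureWarning."""
    warnings.warn(
        "cudf.StringIndex is deprecated and will be removed from cudf in a "
        "future version. Use cudf.Index with the appropriate dtype instead.",
        FutureWarning,
    )
    return "graph"


def call_quietly(func, *args, **kwargs):
    """Run func while ignoring only the StringIndex deprecation warning."""
    # catch_warnings() restores the previous filter state on exit, so the
    # filter added here does not leak into the rest of the test session.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            # The message is a regex matched against the start of the text.
            message=r"cudf\.StringIndex is deprecated",
            category=FutureWarning,
        )
        return func(*args, **kwargs)


result = call_quietly(noisy_call)
```

In a pytest suite, the same effect can be had per test with the `@pytest.mark.filterwarnings` marker, which keeps the suppression visible next to the test that needs it. Whether suppressing the warning is acceptable here, as opposed to changing the code that triggers it, is a call for the maintainers.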
Authors:
  - Huiyu Xie (https://github.com/huiyuxie)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4035