[FEA]: Add parameter to prevent persisted edgelists in `datasets` API #4241

nv-rliu · 2024-03-14T19:57:58Z

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Low (would be nice)

Please provide a clear description of problem this feature solves

When `cugraph.datasets` objects are used to clean up MG tests (e.g. #4197), they often need to store edge-lists for both SG (cudf) and MG (dask_cudf) usage. However, the current implementation of datasets persists whichever edge-list was loaded first, so tests need constant calls to unload to avoid getting back the wrong type.

This also interfered with CI, because edge-lists were persisted between files.

Describe your ideal solution

Similar to how MG algorithms have a flag that developers use for testing/debugging (perform_expensive_check), perhaps the datasets API should also have a flag that is set when used for testing purposes in order to automatically check for preexisting edge-lists and unload them.

from cugraph.datasets import karate
df = karate.get_edgelist()
ddf = karate.get_dask_edgelist()  # returns the persisted cudf.DataFrame instead of a dask_cudf.DataFrame

# proposed solution
df = karate.get_edgelist(auto_unload=True) # prevents edge-list from persisting for test usage
ddf = karate.get_dask_edgelist(auto_unload=True)
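
To make the proposed behavior concrete, here is a minimal sketch using a toy stand-in class rather than cuGraph itself. `ToyDataset`, its `_read` helper, and the dictionary return values are all hypothetical. The reading of `auto_unload` (drop any persisted edge-list, then return a fresh load without storing it) is one possible interpretation of the proposal, not a description of how cuGraph implements it.

```python
class ToyDataset:
    """Hypothetical stand-in for a cugraph.datasets object (not the real class)."""

    def __init__(self, edges):
        self._edges = edges      # pretend this is the file on disk
        self._edgelist = None    # the persisted edge-list

    def _read(self, kind):
        # Tag each load so SG ("cudf") and MG ("dask_cudf") results can be told apart.
        return {"kind": kind, "edges": list(self._edges)}

    def unload(self):
        self._edgelist = None

    def _get(self, kind, auto_unload):
        if auto_unload:
            # Assumed semantics: discard anything persisted earlier and
            # hand back a fresh load without storing it.
            self.unload()
            return self._read(kind)
        if self._edgelist is None:
            self._edgelist = self._read(kind)
        return self._edgelist

    def get_edgelist(self, auto_unload=False):
        return self._get("cudf", auto_unload)

    def get_dask_edgelist(self, auto_unload=False):
        return self._get("dask_cudf", auto_unload)


toy = ToyDataset([(0, 1), (1, 2)])
toy.get_edgelist()                                # persists an SG edge-list
leaked = toy.get_dask_edgelist()                  # the problem: SG data comes back
fresh = toy.get_dask_edgelist(auto_unload=True)   # the proposal: fresh MG load
```

In this sketch `leaked["kind"]` is still `"cudf"`, mirroring the behavior described above, while the `auto_unload=True` call yields a `"dask_cudf"` result and leaves nothing persisted on the object.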

Describe any alternatives you have considered

Since this issue only affects tests, an alternative could be to use fixtures that perform the "check and unload" steps in each unit test.
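
A rough sketch of that alternative, written as a plain context manager so it runs without any test framework. `FakeDataset` and `clean_dataset` are hypothetical names, and the only dataset behavior assumed is an `unload()` method that clears the persisted edge-list. With pytest, a generator with the same unload/yield/unload shape could be decorated with `@pytest.fixture` instead.

```python
from contextlib import contextmanager


class FakeDataset:
    """Hypothetical stand-in: persists its edge-list until unload() is called."""

    def __init__(self):
        self._edgelist = None

    def get_edgelist(self):
        if self._edgelist is None:
            self._edgelist = [(0, 1), (1, 2)]
        return self._edgelist

    def unload(self):
        self._edgelist = None


@contextmanager
def clean_dataset(dataset):
    dataset.unload()        # "check and unload" before the test body runs
    try:
        yield dataset
    finally:
        dataset.unload()    # unload again so nothing leaks into the next test


ds = FakeDataset()
ds.get_edgelist()           # simulate an earlier test leaving state behind

with clean_dataset(ds) as d:
    was_clean_on_entry = d._edgelist is None
    d.get_edgelist()

is_clean_on_exit = ds._edgelist is None
```

This keeps the datasets API unchanged, at the cost of every test having to opt in to the wrapper.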

Additional context

This is part of a general effort to improve readability of the MG tests #4187

Code of Conduct

I agree to follow cuGraph's Code of Conduct
I have searched the open feature requests and have found no duplicates for this feature request


Closes #4241

This PR adds an additional check to the `get_edgelist()` and `get_dask_edgelist()` functions in the Datasets API. When an edge-list is retrieved, the type of the internal `self._edgelist` is now checked to confirm that the object is the SG or MG kind being requested. In addition, minor improvements have been made to `utils/test_dataset.py` to make its type checks more thorough.

Authors:
- Ralph Liu (https://github.com/nv-rliu)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)

URL: #4256
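
The kind of type check described here might look roughly like the following. This is an illustrative sketch, not the code merged in #4256: `SGFrame` and `MGFrame` are hypothetical stand-ins for `cudf.DataFrame` and `dask_cudf.DataFrame`, and `CheckedDataset` with its `_load_if_wrong_type` helper is invented for the example.

```python
class SGFrame(list):
    """Hypothetical stand-in for cudf.DataFrame."""


class MGFrame(list):
    """Hypothetical stand-in for dask_cudf.DataFrame."""


class CheckedDataset:
    def __init__(self, edges):
        self._edges = edges
        self._edgelist = None

    def _load_if_wrong_type(self, expected_type):
        # Reload when nothing is persisted yet (isinstance(None, ...) is False)
        # or when the persisted edge-list is the other kind.
        if not isinstance(self._edgelist, expected_type):
            self._edgelist = expected_type(self._edges)
        return self._edgelist

    def get_edgelist(self):
        return self._load_if_wrong_type(SGFrame)

    def get_dask_edgelist(self):
        return self._load_if_wrong_type(MGFrame)


checked = CheckedDataset([(0, 1), (1, 2)])
sg = checked.get_edgelist()         # loads and persists an SGFrame
mg = checked.get_dask_edgelist()    # type mismatch detected, reloads as MGFrame
sg_again = checked.get_edgelist()   # mismatch again, reloads as SGFrame
```

Compared with an opt-in flag, a check like this needs no change at the call site: a mismatched persisted edge-list is simply replaced on the next request.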

nv-rliu added the `feature request` and `? - Needs Triage` labels Mar 14, 2024

nv-rliu added this to the 24.06 milestone Mar 14, 2024

nv-rliu added the `improvement` and `python` labels and removed the `? - Needs Triage` label Mar 14, 2024

nv-rliu self-assigned this Mar 15, 2024

nv-rliu mentioned this issue Mar 18, 2024

Add Additional Checks to get_edgelist and get_dask_edgelist #4256

Merged

rapids-bot bot closed this as completed in #4256 May 2, 2024
