-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize the drop-duplicate functionality #4095
Merged
rapids-bot
merged 37 commits into
rapidsai:branch-24.02
from
jnke2016:branch-24.02_enable-drop-duplicates
Feb 2, 2024
Merged
Changes from 13 commits
Commits
Show all changes
37 commits
Select commit
Hold shift + click to select a range
883644c
add SG support for dropping multi-edges through the CAPI
jnke2016 6113946
add MG support for dropping multi-edges and deprecaate parameter
jnke2016 f92d936
fix style
jnke2016 64ec680
fix copyright
jnke2016 2e4b0a7
Merge remote-tracking branch 'upstream/branch-24.02' into branch-24.0…
jnke2016 0695fed
Merge remote-tracking branch 'upstream/branch-24.02' into branch-24.0…
jnke2016 0442e53
fix typo
jnke2016 fd98039
fix typo
jnke2016 1d87370
reorder arguments
jnke2016 1d39ec6
Merge remote-tracking branch 'upstream/branch-24.02' into branch-24.0…
jnke2016 c4d9580
add 'do_expensive_check'
jnke2016 ad62f33
fix style
jnke2016 779bd2d
update graph creation warning description
jnke2016 5900bd8
update docstrings
jnke2016 705788f
fix style
jnke2016 5fa81c4
drop duplicates when viewing the edgelist
jnke2016 c85f622
drop duplicates when viewing edges and update tests
jnke2016 8f7436e
fix style
jnke2016 7c165e7
remove debug print
jnke2016 fc5e627
remove unused import
jnke2016 50a11c0
Merge remote-tracking branch 'upstream/branch-24.02' into branch-24.0…
jnke2016 f3f1c3f
undo changes to test
jnke2016 77fe0f5
drop duplicate edges, update tests and copyright
jnke2016 7db4a10
fix style
jnke2016 b3077cf
uncommment test
jnke2016 bdcd215
fix typo
jnke2016 2c6892a
remove outdated comment
jnke2016 ff5373a
remove debug print
jnke2016 1efef76
update tests
jnke2016 ebbd33f
fix style
jnke2016 ed76013
update number of edges count
jnke2016 890f885
fix style
jnke2016 ccc5189
reset changes to sg
jnke2016 63a669b
update tests
jnke2016 a6d767f
fix style
jnke2016 e7ea077
update copyright
jnke2016 ad2602c
Merge remote-tracking branch 'upstream/branch-24.02' into branch-24.0…
jnke2016 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# Copyright (c) 2019-2023, NVIDIA CORPORATION. | ||
# Copyright (c) 2019-2024, NVIDIA CORPORATION. | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
|
@@ -15,6 +15,7 @@ | |
import cudf | ||
import dask_cudf | ||
from dask.distributed import default_client | ||
import warnings | ||
|
||
|
||
def symmetrize_df( | ||
|
@@ -54,6 +55,7 @@ def symmetrize_df( | |
Name of the column in the data frame containing the weight ids | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
edges instead of dropping them. | ||
|
||
|
@@ -84,6 +86,12 @@ def symmetrize_df( | |
if multi: | ||
return result | ||
else: | ||
warnings.warn( | ||
"Multi is deprecated and the removal of multi edges will no longer be " | ||
"supported from 'symmetrize'. Multi edges will be removed upon creation " | ||
"of graph instance.", | ||
FutureWarning, | ||
) | ||
vertex_col_name = src_name + dst_name | ||
result = result.groupby(by=[*vertex_col_name], as_index=False).min() | ||
return result | ||
|
@@ -128,6 +136,7 @@ def symmetrize_ddf( | |
Name of the column in the data frame containing the weight ids | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
edges instead of dropping them. | ||
|
||
|
@@ -167,6 +176,12 @@ def symmetrize_ddf( | |
if multi: | ||
return result | ||
else: | ||
warnings.warn( | ||
"Multi is deprecated and the removal of multi edges will no longer be " | ||
"supported from 'symmetrize'. Multi edges will be removed upon creation " | ||
"of graph instance.", | ||
FutureWarning, | ||
) | ||
vertex_col_name = src_name + dst_name | ||
result = _memory_efficient_drop_duplicates( | ||
result, vertex_col_name, len(workers) | ||
|
@@ -181,6 +196,7 @@ def symmetrize( | |
value_col_name=None, | ||
multi=False, | ||
symmetrize=True, | ||
do_expensive_check=False, | ||
): | ||
""" | ||
Take a dataframe of source destination pairs along with associated | ||
|
@@ -208,6 +224,7 @@ def symmetrize( | |
weights column name. | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same comment here. |
||
edges instead of dropping them. | ||
|
||
|
@@ -234,8 +251,9 @@ def symmetrize( | |
if "edge_id" in input_df.columns and symmetrize: | ||
raise ValueError("Edge IDs are not supported on undirected graphs") | ||
|
||
csg.null_check(input_df[source_col_name]) | ||
csg.null_check(input_df[dest_col_name]) | ||
if do_expensive_check: # FIXME: Optimize this check as it is currently expensive | ||
csg.null_check(input_df[source_col_name]) | ||
csg.null_check(input_df[dest_col_name]) | ||
|
||
if isinstance(input_df, dask_cudf.DataFrame): | ||
output_df = symmetrize_ddf( | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also mention the warning statement along with
Deprecated
, otherwise it is confusing for the user.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also will suggest filing an issue for tracking this deprecation.