-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize the drop-duplicate functionality #4095
Changes from 12 commits
883644c
6113946
f92d936
64ec680
2e4b0a7
0695fed
0442e53
fd98039
1d87370
1d39ec6
c4d9580
ad62f33
779bd2d
5900bd8
705788f
5fa81c4
c85f622
8f7436e
7c165e7
fc5e627
50a11c0
f3f1c3f
77fe0f5
7db4a10
b3077cf
bdcd215
2c6892a
ff5373a
1efef76
ebbd33f
ed76013
890f885
ccc5189
63a669b
a6d767f
e7ea077
ad2602c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# Copyright (c) 2019-2023, NVIDIA CORPORATION. | ||
# Copyright (c) 2019-2024, NVIDIA CORPORATION. | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
|
@@ -15,6 +15,7 @@ | |
import cudf | ||
import dask_cudf | ||
from dask.distributed import default_client | ||
import warnings | ||
|
||
|
||
def symmetrize_df( | ||
|
@@ -54,6 +55,7 @@ def symmetrize_df( | |
Name of the column in the data frame containing the weight ids | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
edges instead of dropping them. | ||
|
||
|
@@ -84,6 +86,11 @@ def symmetrize_df( | |
if multi: | ||
return result | ||
else: | ||
warnings.warn( | ||
"'multi' is deprecated as the removal of multi edges is " | ||
"only supported when creating a multi-graph", | ||
FutureWarning, | ||
) | ||
vertex_col_name = src_name + dst_name | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I'm missing something: does this mean a user who wants a MultiGraph could also want to remove the multiple edges? Why would they want a MultiGraph without multiple edges? It seems like this option would be more useful for a Graph, but it's being deprecated here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
No, It means that the only way to drop multi edges is by creating a multi-graph. In fact prior to that, we were using the flag There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, thanks. I think my concerns will go away when Vibhu's feedback is addressed. |
||
result = result.groupby(by=[*vertex_col_name], as_index=False).min() | ||
return result | ||
|
@@ -128,6 +135,7 @@ def symmetrize_ddf( | |
Name of the column in the data frame containing the weight ids | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
edges instead of dropping them. | ||
|
||
|
@@ -167,6 +175,11 @@ def symmetrize_ddf( | |
if multi: | ||
return result | ||
else: | ||
warnings.warn( | ||
"'multi' is deprecated as the removal of multi edges is " | ||
"only supported when creating a multi-graph", | ||
FutureWarning, | ||
) | ||
vertex_col_name = src_name + dst_name | ||
result = _memory_efficient_drop_duplicates( | ||
result, vertex_col_name, len(workers) | ||
|
@@ -181,6 +194,7 @@ def symmetrize( | |
value_col_name=None, | ||
multi=False, | ||
symmetrize=True, | ||
do_expensive_check=False, | ||
): | ||
""" | ||
Take a dataframe of source destination pairs along with associated | ||
|
@@ -208,6 +222,7 @@ def symmetrize( | |
weights column name. | ||
|
||
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same comment here. |
||
edges instead of dropping them. | ||
|
||
|
@@ -234,8 +249,9 @@ def symmetrize( | |
if "edge_id" in input_df.columns and symmetrize: | ||
raise ValueError("Edge IDs are not supported on undirected graphs") | ||
|
||
csg.null_check(input_df[source_col_name]) | ||
csg.null_check(input_df[dest_col_name]) | ||
if do_expensive_check: # FIXME: Optimize this check as it is currently expensive | ||
csg.null_check(input_df[source_col_name]) | ||
csg.null_check(input_df[dest_col_name]) | ||
|
||
if isinstance(input_df, dask_cudf.DataFrame): | ||
output_df = symmetrize_ddf( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also mention the warning statement along with
Deprecated
, otherwise it is confusing for the user.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also will suggest filing an issue for tracking this deprecation.