-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable expression-based Dask Dataframe support #4325
Conversation
python/cugraph/cugraph/structure/graph_implementation/simpleDistributedGraph.py
Outdated
Show resolved
Hide resolved
output_df[value_col.columns], | ||
output_df[list(value_col.columns)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: This may be a dask-expr bug? Column projection using anything other than a list seems fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not observing this bug locally anymore, but I'd still like to keep this precaution in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we leave a code comment? Is it worth raising a tracking issue on cuGraph for follow up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be honest: I don't actually think this "fix" is required, because removing it doesn't seem to cause test failures for me locally (was probably specific to an earlier combination of dask/dask-expr/dask-cudf). However, I left it for now because it will take a long time for "real CI" to tell me that it actually is a problem.
With that said, I'll be happy to give it a try now that I'm realizing it's only cudf/dask-cudf that is about to freeze (and not cugraph).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try dropping the list
then?
dask_label_df = dask_cudf.from_dask_dataframe(dask_label_df) | ||
dask_label_df = dask_label_df.to_backend("cudf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_dask_dataframe
is now deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that dask-expr has some dispatching/plugin support for different DataFrame implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - Dask documentation is here: https://docs.dask.org/en/latest/how-to/selecting-the-collection-backend.html, and Dask-cudf: https://docs.rapids.ai/api/dask-cudf/stable/#dataframe-creation-from-in-memory-formats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Given the complexity of dask's various dispatching mechanisms, I'm not expecting anything other than "pandas" and "cudf" the ever be implemented - Though it's technically possible.
# Avoid "p2p" shuffling in dask for now | ||
config.set({"dataframe.shuffle.method": "tasks"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"p2p" should work fine, but it will rarely provide a performance benefit. It seems best to minimize "optional" changes until the query-planning migration is finished.
from dask_cudf.core import DataFrame as dcDataFrame | ||
from dask_cudf.core import Series as daskSeries | ||
from dask_cudf import DataFrame as dcDataFrame | ||
from dask_cudf import Series as daskSeries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: All imports from dask_cudf.core
should be avoided, because these imports are always using "legacy" dask-cudf. Importing from the top-level dask_cudf
module are automatically routed to the proper API. There is no way to protect against dask_cudf.core
imports yet, because some query-planning logic still needs to find/use specific legacy code.
.to_frame() | ||
.sort_values(0) | ||
.to_frame(name="0") | ||
.sort_values("0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another "precaution" (using numerical column names still seems "fragile" in dask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highlighting the OPS relevant changes. Namely dropping old Dask workarounds (environment variables that are no longer needed) as the underlying issue was resolved.
# TODO: Enable dask query planning (by default) once some bugs are fixed. | ||
# xref: https://github.com/rapidsai/cudf/issues/15027 | ||
export DASK_DATAFRAME__QUERY_PLANNING=False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AIUI this is one of the OPS relevant changes. Basically removing a workaround that is no longer needed
# TODO: Enable dask query planning (by default) once some bugs are fixed. | ||
# xref: https://github.com/rapidsai/cudf/issues/15027 | ||
export DASK_DATAFRAME__QUERY_PLANNING=False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the other one. So same change as before just in another place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Rick! 🙏
Based on your comment above, do we want to drop this workaround?
output_df[value_col.columns], | ||
output_df[list(value_col.columns)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try dropping the list
then?
Co-authored-by: jakirkham <[email protected]>
/merge |
[WIP] I'm using this PR to debug/add support for
DASK_DATAFRAME__QUERY_PLANNING=True
.NOTES:
Series
reductions dask/dask-expr#1041 [Merged]ToFrame
dask/dask-expr#1044