Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable expression-based Dask Dataframe support #4325

Merged
merged 27 commits into from
May 28, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Apr 9, 2024

[WIP] I'm using this PR to debug/add support for DASK_DATAFRAME__QUERY_PLANNING=True.

NOTES:

@alexbarghi-nv alexbarghi-nv added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 24, 2024
output_df[value_col.columns],
output_df[list(value_col.columns)],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: This may be a dask-expr bug? Column projection using anything other than a list seems fragile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not observing this bug locally anymore, but I'd still like to keep this precaution in place.

Copy link
Member

@jakirkham jakirkham May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we leave a code comment? Is it worth raising a tracking issue on cuGraph for follow up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest: I don't actually think this "fix" is required, because removing it doesn't seem to cause test failures for me locally (was probably specific to an earlier combination of dask/dask-expr/dask-cudf). However, I left it for now because it will take a long time for "real CI" to tell me that it actually is a problem.

With that said, I'll be happy to give it a try now that I'm realizing it's only cudf/dask-cudf that is about to freeze (and not cugraph).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try dropping the list then?

@rjzamora rjzamora changed the title [DNM][WIP] Debug expression-based Dask Dataframe support [WIP] Debug expression-based Dask Dataframe support May 15, 2024
dask_label_df = dask_cudf.from_dask_dataframe(dask_label_df)
dask_label_df = dask_label_df.to_backend("cudf")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_dask_dataframe is now deprecated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that dask-expr has some dispatching/plugin support for different DataFrame implementations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Given the complexity of dask's various dispatching mechanisms, I'm not expecting anything other than "pandas" and "cudf" the ever be implemented - Though it's technically possible.

Comment on lines +40 to +41
# Avoid "p2p" shuffling in dask for now
config.set({"dataframe.shuffle.method": "tasks"})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"p2p" should work fine, but it will rarely provide a performance benefit. It seems best to minimize "optional" changes until the query-planning migration is finished.

Comment on lines -18 to +19
from dask_cudf.core import DataFrame as dcDataFrame
from dask_cudf.core import Series as daskSeries
from dask_cudf import DataFrame as dcDataFrame
from dask_cudf import Series as daskSeries
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: All imports from dask_cudf.core should be avoided, because these imports are always using "legacy" dask-cudf. Importing from the top-level dask_cudf module are automatically routed to the proper API. There is no way to protect against dask_cudf.core imports yet, because some query-planning logic still needs to find/use specific legacy code.

Comment on lines -102 to +103
.to_frame()
.sort_values(0)
.to_frame(name="0")
.sort_values("0")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another "precaution" (using numerical column names still seems "fragile" in dask)

@rjzamora rjzamora changed the title [WIP] Debug expression-based Dask Dataframe support Enable expression-based Dask Dataframe support May 20, 2024
@rjzamora rjzamora marked this pull request as ready for review May 20, 2024 14:23
@rjzamora rjzamora requested review from a team as code owners May 20, 2024 14:23
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting the OPS relevant changes. Namely dropping old Dask workarounds (environment variables that are no longer needed) as the underlying issue was resolved.

Comment on lines -6 to -9
# TODO: Enable dask query planning (by default) once some bugs are fixed.
# xref: https://github.com/rapidsai/cudf/issues/15027
export DASK_DATAFRAME__QUERY_PLANNING=False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI this is one of the OPS relevant changes. Basically removing a workaround that is no longer needed

Comment on lines -6 to -9
# TODO: Enable dask query planning (by default) once some bugs are fixed.
# xref: https://github.com/rapidsai/cudf/issues/15027
export DASK_DATAFRAME__QUERY_PLANNING=False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other one. So same change as before just in another place

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rick! 🙏

Based on your comment above, do we want to drop this workaround?

output_df[value_col.columns],
output_df[list(value_col.columns)],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try dropping the list then?

python/cugraph/cugraph/structure/symmetrize.py Outdated Show resolved Hide resolved
@BradReesWork BradReesWork added this to the 24.06 milestone May 28, 2024
@BradReesWork
Copy link
Member

/merge

@rapids-bot rapids-bot bot merged commit 3156569 into rapidsai:branch-24.06 May 28, 2024
136 checks passed
@rjzamora rjzamora deleted the debug-dask-expr branch May 28, 2024 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarks ci improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants