-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update documentation for Dask cuDF #16671
Update documentation for Dask cuDF #16671
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
cc @quasiben @pentschev @wence- @charlesbluca @madsbk (happy to get feedback from any of you with time and interest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, thanks @rjzamora !
I've left some suggestions on a single comment, for whatever reason GH is not letting me comment on each line individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason GH is not allowing me to comment on each line individually, so I'm doing it all here.
Line 1: the link to the image is wrong, it should be ../../img/rapids_logo.png
, I think that should be allowed but if not you may need to symlink it instead.
Line 73: Unoptimzed
-> Unoptimized
.
Line 108: # Use memory pool for faster allocations
-> # Use 90% of GPU memory as a pool for faster allocations
.
Line 116: result.compute() # ...
is it intentional that you're calling compute()
? That is generally a suboptimal practice and having that in official docs tend to encourage that practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason GH is not allowing me to comment on each line individually, so I'm doing it all here
Weird!
Line 116: result.compute() # ... is it intentional that you're calling compute()? That is generally a suboptimal practice and having that in official docs tend to encourage that practice.
Good catch! That was supposed to be computing a groupby aggregation, but it seems I copied over the wrong example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I changed the example to a groupby aggregation (which is typically safe to compute), and added this note:
"This example uses compute
to materialize a concrete cudf.DataFrame
object in local memory. Never call compute
on a large collection that cannot fit comfortably in the memory of a single GPU! See Dask's documentation on managing computation for more details."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue here, I can't comment on the diff lines. Replacing a symlink with a separate file is probably an edge case for the GitHub UI.
General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.
Line 105: Just creating the client
variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.
Good point - I'll address this in a follow up.
Line 105: Just creating the client variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.
Yes, this is the common/suggested pattern in dask (unless you are working with multiple clusters at the same time. Which is extremely rare)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is a huge improvement over the |
/merge |
Oops, just realized I never got a proper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for writing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue here, I can't comment on the diff lines. Replacing a symlink with a separate file is probably an edge case for the GitHub UI.
General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.
Line 105: Just creating the client
variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.
Oops. I didn't realize my approval would merge this PR. @rjzamora Can you apply my suggestions here? #16671 (comment) Also we need to make sure this information is all in sync with the dask-cudf docs, which don't mention query planning and other important topics. https://docs.rapids.ai/api/dask-cudf/stable/ |
Yeah, my fault! No worries - I will submit a follow-up today to update the API docs and I'll include your suggestion to use consistent language. |
Follow up to #16671 - Moves most of the information recently added to the dask-cudf README into the dask-cudf Sphinx documentation - Adds a "Quick-start" example to the simplified dask-cudf README Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Bradley Dice (https://github.com/bdice) - Benjamin Zaitlen (https://github.com/quasiben) - Peter Andreas Entschev (https://github.com/pentschev) URL: #16765
Description
General documentation update for Dask cuDF:
README.md
file todask_cudf
(this is currently a symlink to cudf's README, which isn't terribly helpful)dask.dataframe
API (rather than the explicitdask_cudf
API)to_backend
APIChecklist