Migrate dask-cudf README improvements to dask-cudf sphinx docs #16765
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,9 +5,9 @@ | |
"id": "4c6c548b", | ||
"metadata": {}, | ||
"source": [ | ||
"# 10 Minutes to cuDF and Dask-cuDF\n", | ||
"# 10 Minutes to cuDF and Dask cuDF\n", | ||
"\n", | ||
"Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly towards new users.\n", | ||
"Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask cuDF, geared mainly towards new users.\n", | ||
"\n", | ||
"## What are these Libraries?\n", | ||
"\n", | ||
|
@@ -18,13 +18,14 @@ | |
"[Dask cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) by calling [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n", | ||
"\n", | ||
"\n", | ||
"> [!NOTE] \n", | ||
"> This notebook uses the explicit Dask cuDF API (`dask_cudf`) for clarity. However, we strongly recommend that you use Dask's [configuration infrastructure](https://docs.dask.org/en/latest/configuration.html) to set the `\"dataframe.backend\"` to `\"cudf\"`, and work with the `dask.dataframe` API directly. Please see the [Dask cuDF documentation](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) for more information.\n", | ||
"<div class=\"alert alert-block alert-info\">\n", | ||
"<b>Note:</b> This notebook uses the explicit Dask cuDF API (dask_cudf) for clarity. However, we strongly recommend that you use Dask's <a href=\"https://docs.dask.org/en/latest/configuration.html\">configuration infrastructure</a> to set the \"dataframe.backend\" option to \"cudf\", and work with the Dask DataFrame API directly. Please see the <a href=\"https://github.com/rapidsai/cudf/tree/main/python/dask_cudf\">Dask cuDF documentation</a> for more information.\n", | ||
"</div>\n", | ||
"\n", | ||
"\n", | ||
"## When to use cuDF and Dask-cuDF\n", | ||
"## When to use cuDF and Dask cuDF\n", | ||
"\n", | ||
"If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF." | ||
"If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask cuDF." | ||
] | ||
}, | ||
{ | ||
|
@@ -115,7 +116,7 @@ | |
"source": [ | ||
"ds = dask_cudf.from_cudf(s, npartitions=2)\n", | ||
"# Note the call to head here to show the first few entries, unlike\n", | ||
"# cuDF objects, dask-cuDF objects do not have a printing\n", | ||
"# cuDF objects, Dask-cuDF objects do not have a printing\n", | ||
I don't want to annoy you with this @rjzamora, but just wanted to point out in case you missed that you left a few
I think it's fine to hyphenate Dask cuDF when it is used as an adjective. |
||
"# representation that shows values since they may not be in local\n", | ||
"# memory.\n", | ||
"ds.head(n=3)" | ||
|
@@ -331,11 +332,11 @@ | |
"id": "b17db919", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we will convert our cuDF dataframe into a dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here `.head()` to look at the first few values). In the general case (see the end of this notebook), the data in `ddf` will be distributed across multiple GPUs.\n", | ||
"Now we will convert our cuDF dataframe into a Dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here `.head()` to look at the first few values). In the general case (see the end of this notebook), the data in `ddf` will be distributed across multiple GPUs.\n", | ||
"\n", | ||
"In this small case, we could call `ddf.compute()` to obtain a cuDF object from the dask-cuDF object. In general, we should avoid calling `.compute()` on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. Hence, throughout this notebook we will generally call `.head()` to inspect the first few values of a dask-cuDF dataframe, occasionally calling out places where we use `.compute()` and why.\n", | ||
"In this small case, we could call `ddf.compute()` to obtain a cuDF object from the Dask-cuDF object. In general, we should avoid calling `.compute()` on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. Hence, throughout this notebook we will generally call `.head()` to inspect the first few values of a Dask-cuDF dataframe, occasionally calling out places where we use `.compute()` and why.\n", | ||
"\n", | ||
"*To understand more of the differences between how cuDF and dask-cuDF behave here, visit the [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after this one.*" | ||
"*To understand more of the differences between how cuDF and Dask cuDF behave here, visit the [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after this one.*" | ||
] | ||
}, | ||
{ | ||
|
@@ -1680,7 +1681,7 @@ | |
"id": "7aa0089f", | ||
"metadata": {}, | ||
"source": [ | ||
"Note here we call `compute()` rather than `head()` on the dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back)." | ||
"Note here we call `compute()` rather than `head()` on the Dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back)." | ||
] | ||
}, | ||
{ | ||
|
@@ -2393,7 +2394,7 @@ | |
"id": "f6094cbe", | ||
"metadata": {}, | ||
"source": [ | ||
"Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." | ||
"Applying functions to a `Series`. Note that applying user defined functions directly with Dask cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." | ||
] | ||
}, | ||
{ | ||
|
@@ -3492,7 +3493,7 @@ | |
"id": "5ac3b004", | ||
"metadata": {}, | ||
"source": [ | ||
"Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." | ||
"Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask cuDF." | ||
] | ||
}, | ||
{ | ||
|
@@ -4181,7 +4182,7 @@ | |
"id": "aa8a445b", | ||
"metadata": {}, | ||
"source": [ | ||
"To convert the first few entries to pandas, we similarly call `.head()` on the dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert." | ||
"To convert the first few entries to pandas, we similarly call `.head()` on the Dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert." | ||
] | ||
}, | ||
{ | ||
|
@@ -4899,7 +4900,7 @@ | |
"id": "787eae14", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that for the dask-cuDF case, we use `dask_cudf.read_csv` in preference to `dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across multiple GPUs and handle larger CSV files that would fit in memory on a single GPU." | ||
"Note that for the Dask-cuDF case, we use `dask_cudf.read_csv` in preference to `dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across multiple GPUs and handle larger CSV files that would fit in memory on a single GPU." | ||
] | ||
}, | ||
{ | ||
|
Do these changes actually format it correctly when viewing the notebook? This is a markdown cell, so shouldn't it be formatted as markdown? This is a real question: I have no idea what it should look like, and I didn't check myself to see how it looks after the changes.
Yeah, I just made this change after seeing that the cell was "off" in my browser: https://docs.rapids.ai/api/cudf/24.10/user_guide/10min/
I had to do a bit of research to learn that you need to use HTML to make a note like this work in a Jupyter notebook.
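For anyone following along, here is a minimal sketch of what the HTML-based note looks like as a notebook markdown cell, using only the standard library. The helper name `make_note_cell` and the cell id `example-note` are hypothetical and not part of this PR; the `alert alert-block alert-info` classes are the ones used in the diff above.

```python
import json


def make_note_cell(note_html: str, cell_id: str = "example-note") -> dict:
    """Build a markdown cell dict whose body is an HTML alert box.

    Hypothetical helper for illustration only. The cell layout mirrors the
    fields visible in the diff ("id", "metadata", "source").
    """
    lines = [
        '<div class="alert alert-block alert-info">\n',
        f"<b>Note:</b> {note_html}\n",
        "</div>",
    ]
    return {
        "cell_type": "markdown",
        "id": cell_id,
        "metadata": {},
        "source": lines,
    }


cell = make_note_cell('Set the "dataframe.backend" option to "cudf".')

# json.dumps escapes the inner double quotes, which is why the diff
# shows \"dataframe.backend\" inside the notebook source.
serialized = json.dumps(cell, indent=1)
print(serialized)
```

Whether the box is styled depends on the renderer, so it is still worth checking the built docs page as well as the notebook view.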