Migrate dask-cudf README improvements to dask-cudf sphinx docs #16765

Merged: 17 commits, Sep 16, 2024
5 changes: 3 additions & 2 deletions docs/cudf/source/user_guide/10min.ipynb
@@ -18,8 +18,9 @@
"[Dask cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) by calling [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n",
"\n",
"\n",
"> [!NOTE] \n",
"> This notebook uses the explicit Dask cuDF API (`dask_cudf`) for clarity. However, we strongly recommend that you use Dask's [configuration infrastructure](https://docs.dask.org/en/latest/configuration.html) to set the `\"dataframe.backend\"` to `\"cudf\"`, and work with the `dask.dataframe` API directly. Please see the [Dask cuDF documentation](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) for more information.\n",
"<div class=\"alert alert-block alert-info\">\n",
"<b>Note:</b> This notebook uses the explicit Dask cuDF API (dask_cudf) for clarity. However, we strongly recommend that you use Dask's <a href=\"https://docs.dask.org/en/latest/configuration.html\">configuration infrastructure</a> to set the \"dataframe.backend\" option to \"cudf\", and work with the Dask DataFrame API directly. Please see the <a href=\"https://github.com/rapidsai/cudf/tree/main/python/dask_cudf\">Dask cuDF documentation</a> for more information.\n",
"</div>\n",
Comment on lines +21 to +23
Member:
Do these changes actually format it correctly when viewing the notebook? This is a markdown cell; shouldn't it be formatted as markdown? This is a real question: I have no idea what it should look like, and I didn't check myself how it renders after the changes.

Member Author:

Yeah, I just made this change after seeing that the cell looked "off" in my browser: https://docs.rapids.ai/api/cudf/24.10/user_guide/10min/

I had to do a bit of research to learn that you need to use HTML to make a note like this work in a Jupyter notebook.

"\n",
"\n",
"## When to use cuDF and Dask-cuDF\n",
rjzamora marked this conversation as resolved.
210 changes: 159 additions & 51 deletions docs/dask_cudf/source/index.rst
@@ -3,39 +3,42 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

Welcome to dask-cudf's documentation!
Welcome to Dask cuDF's documentation!
=====================================

**Dask-cuDF** (pronounced "DASK KOO-dee-eff") is an extension
**Dask cuDF** (pronounced "DASK KOO-dee-eff") is an extension
library for the `Dask <https://dask.org>`__ parallel computing
framework that provides a `cuDF
<https://docs.rapids.ai/api/cudf/stable/>`__-backed distributed
dataframe with the same API as `Dask dataframes
<https://docs.dask.org/en/stable/dataframe.html>`__.
framework. When installed, Dask cuDF is automatically registered
as the ``"cudf"`` dataframe backend for
`Dask DataFrame <https://docs.dask.org/en/stable/dataframe.html>`__.

.. note::
Neither Dask cuDF nor Dask DataFrame provide support for multi-GPU
or multi-node execution on their own. You must also deploy a
`dask.distributed <https://distributed.dask.org/en/stable/>`__ cluster
to leverage multiple GPUs. We strongly recommend using `Dask CUDA
<https://docs.rapids.ai/api/dask-cuda/stable/>`__ to simplify the
setup of the cluster, taking advantage of all features of the GPU
and networking hardware.

If you are familiar with Dask and `pandas <pandas.pydata.org>`__ or
`cuDF <https://docs.rapids.ai/api/cudf/stable/>`__, then Dask-cuDF
`cuDF <https://docs.rapids.ai/api/cudf/stable/>`__, then Dask cuDF
should feel familiar to you. If not, we recommend starting with `10
minutes to Dask
<https://docs.dask.org/en/stable/10-minutes-to-dask.html>`__ followed
by `10 minutes to cuDF and Dask-cuDF
by `10 minutes to cuDF and Dask cuDF
<https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html>`__.

When running on multi-GPU systems, `Dask-CUDA
<https://docs.rapids.ai/api/dask-cuda/stable/>`__ is recommended to
simplify the setup of the cluster, taking advantage of all features of
the GPU and networking hardware.

Using Dask-cuDF
Using Dask cuDF
---------------

When installed, Dask-cuDF registers itself as a dataframe backend for
Dask. This means that in many cases, using cuDF-backed dataframes requires
only small changes to an existing workflow. The minimal change is to
select cuDF as the dataframe backend in :doc:`Dask's
configuration <dask:configuration>`. To do so, we must set the option
``dataframe.backend`` to ``cudf``. From Python, this can be achieved
like so::
The Dask DataFrame API (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simply use the `Dask configuration <dask:configuration>` system to
set the ``"dataframe.backend"`` option to ``"cudf"``. From Python,
this can be achieved like so::

import dask

dask.config.set({"dataframe.backend": "cudf"})

@@ -44,60 +47,165 @@
Alternatively, you can set ``DASK_DATAFRAME__BACKEND=cudf`` in the
environment before running your code.
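
The same effect can be achieved from within Python by setting the
variable before ``dask`` is first imported. A minimal sketch, assuming
Dask reads its environment-based configuration at import time::

    import os

    # Must run before dask.dataframe is imported
    os.environ["DASK_DATAFRAME__BACKEND"] = "cudf"

    import dask.dataframe as dd

    # cuDF-backed, thanks to the environment-based configuration
    df = dd.from_dict({"a": range(10)}, npartitions=1)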

Dataframe creation from on-disk formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your workflow creates Dask dataframes from on-disk formats
(for example using :func:`dask.dataframe.read_parquet`), then setting
the backend may well be enough to migrate your workflow.

For example, consider reading a dataframe from parquet::
Once this is done, the public Dask DataFrame API will leverage
``cudf`` automatically when a new DataFrame collection is created
from an on-disk format using any of the following ``dask.dataframe``
functions:
rjzamora marked this conversation as resolved.

import dask.dataframe as dd
* :func:`dask.dataframe.read_parquet`
* :func:`dask.dataframe.read_json`
* :func:`dask.dataframe.read_csv`
* :func:`dask.dataframe.read_orc`
* :func:`dask.dataframe.read_hdf`
* :func:`dask.dataframe.from_dict`

# By default, we obtain a pandas-backed dataframe
df = dd.read_parquet("data.parquet", ...)
For example::

import dask.dataframe as dd

To obtain a cuDF-backed dataframe, we must set the
``dataframe.backend`` configuration option::
# By default, we obtain a Pandas-backed dataframe
rjzamora marked this conversation as resolved.
df = dd.read_parquet("data.parquet", ...)

import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})
# This gives us a cuDF-backed dataframe
# This now gives us a cuDF-backed dataframe
df = dd.read_parquet("data.parquet", ...)

This code will use cuDF's GPU-accelerated :func:`parquet reader
<cudf.read_parquet>` to read partitions of the data.
When other functions are used to create a new collection
(e.g. :func:`from_map`, :func:`from_pandas`, :func:`from_delayed`,
and :func:`from_array`), the backend of the new collection will
depend on the inputs to those functions. For example::

import pandas as pd
import cudf

# This gives us a Pandas-backed dataframe
rjzamora marked this conversation as resolved.
dd.from_pandas(pd.DataFrame({"a": range(10)}))

# This gives us a cuDF-backed dataframe
dd.from_pandas(cudf.DataFrame({"a": range(10)}))
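
The same input-dependent behavior applies to the other creation
functions. A minimal sketch using :func:`from_delayed` (the
``make_part`` helper is hypothetical)::

    import dask
    import dask.dataframe as dd
    import cudf

    @dask.delayed
    def make_part(i):
        # Hypothetical helper: each delayed task returns one cuDF partition
        return cudf.DataFrame({"a": range(i * 5, (i + 1) * 5)})

    # This gives us a cuDF-backed dataframe, because every delayed
    # object produces a cudf.DataFrame
    df = dd.from_delayed([make_part(i) for i in range(4)])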

An existing collection can always be moved to a specific backend
using the :func:`dask.dataframe.DataFrame.to_backend` API::

# This ensures that we have a cuDF-backed dataframe
df = df.to_backend("cudf")

# This ensures that we have a Pandas-backed dataframe
rjzamora marked this conversation as resolved.
df = df.to_backend("pandas")

The explicit Dask cuDF API
~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to providing the ``"cudf"`` backend for Dask DataFrame,
Dask cuDF also provides an explicit ``dask_cudf`` API::

import dask_cudf

# This always gives us a cuDF-backed dataframe
df = dask_cudf.read_parquet("data.parquet", ...)

This API is used implicitly by the Dask DataFrame API when the ``"cudf"``
backend is enabled. Therefore, using it directly will not provide any
performance benefit over the CPU/GPU-portable ``dask.dataframe`` API.
Also, some parts of the explicit API are incompatible with
automatic query planning (see the next section).

Query Planning
~~~~~~~~~~~~~~

Dask cuDF now provides automatic query planning by default (RAPIDS 24.06+).
As long as the ``"dataframe.query-planning"`` configuration is set to
``True`` (the default) when ``dask.dataframe`` is first imported, `Dask
Expressions <https://github.com/dask/dask-expr>`__ will be used under the hood.

For example, the following code will automatically benefit from projection
pushdown (pruning unneeded columns from the Parquet read) when the result
is computed::

df = dd.read_parquet("/my/parquet/dataset/")
result = df.sort_values('B')['A']

Unoptimized expression graph (``result.pprint()``)::

Projection: columns='A'
SortValues: by=['B'] shuffle_method='tasks' options={}
ReadParquetFSSpec: path='/my/parquet/dataset/' ...

Simplified expression graph (``result.simplify().pprint()``)::

Projection: columns='A'
SortValues: by=['B'] shuffle_method='tasks' options={}
ReadParquetFSSpec: path='/my/parquet/dataset/' columns=['A', 'B'] ...

.. note::
Dask will automatically simplify the expression graph (within
:func:`optimize`) when the result is converted to a task graph
(via :func:`compute` or :func:`persist`). You do not need to call
:func:`simplify` yourself.
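
In other words, a plain :func:`compute` call is enough to run the
optimized plan::

    # Optimization (including simplification) happens automatically here
    result.compute()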


Using Multiple GPUs and Multiple Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever possible, Dask cuDF (i.e. Dask DataFrame) will automatically try
to partition your data into chunks that are small enough to fit comfortably
in the memory of a single GPU. This means the compute tasks needed to
execute a query can often be streamed through a single GPU process for
out-of-core computing, and that those same tasks can be executed in
parallel across a multi-GPU cluster.

In order to execute your Dask workflow on multiple GPUs, you will
typically need to use `Dask CUDA <https://docs.rapids.ai/api/dask-cuda/stable/>`__
to deploy a distributed Dask cluster, and
rjzamora marked this conversation as resolved.
`Distributed <https://distributed.dask.org/en/stable/client.html>`__
to define a client object. For example::

from dask_cuda import LocalCUDACluster
from distributed import Client
import dask.dataframe as dd

if __name__ == "__main__":

    client = Client(
        LocalCUDACluster(
            CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
            rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
            enable_cudf_spill=True,  # Improve device memory stability
            local_directory="/fast/scratch/",  # Use fast local storage for spilling
        )
    )

    df = dd.read_parquet("/my/parquet/dataset/")
    agg = df.groupby('B').sum()
    agg.compute()  # This will use the cluster defined above

.. note::
This example uses :func:`compute` to materialize a concrete
``cudf.DataFrame`` object in local memory. Never call :func:`compute`
on a large collection that cannot fit comfortably in the memory of a
single GPU! See Dask's `documentation on managing computation
<https://distributed.dask.org/en/stable/manage-computation.html>`__
for more details.
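
When the full result is too large to collect into local memory, a
common alternative is to write it back to disk in parallel. A minimal
sketch (the output path is hypothetical)::

    # Write the large result from the workers instead of collecting it
    agg = df.groupby('B').sum()
    agg.to_parquet("/my/parquet/output/")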

Dataframe creation from in-memory formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Please see the `Dask CUDA <https://docs.rapids.ai/api/dask-cuda/stable/>`__
documentation for more information about deploying GPU-aware clusters
(including `best practices
<https://docs.rapids.ai/api/dask-cuda/stable/examples/best-practices/>`__).

If you already have a dataframe in memory and want to convert it to a
cuDF-backend one, there are two options depending on whether the
dataframe is already a Dask one or not. If you have a Dask dataframe,
then you can call :func:`dask.dataframe.to_backend` passing ``"cudf"``
as the backend; if you have a pandas dataframe then you can either
call :func:`dask.dataframe.from_pandas` followed by
:func:`~dask.dataframe.to_backend` or first convert the dataframe with
:func:`cudf.from_pandas` and then parallelise this with
:func:`dask_cudf.from_cudf`.

API Reference
-------------

Generally speaking, Dask-cuDF tries to offer exactly the same API as
Dask itself. There are, however, some minor differences mostly because
Generally speaking, Dask cuDF tries to offer exactly the same API as
Dask DataFrame. There are, however, some minor differences mostly because
cuDF does not :doc:`perfectly mirror <cudf:user_guide/PandasCompat>`
the pandas API, or because cuDF provides additional configuration
flags (these mostly occur in data reading and writing interfaces).

As a result, straightforward workflows can be migrated without too
much trouble, but more complex ones that utilise more features may
need a bit of tweaking. The API documentation describes details of the
differences and all functionality that Dask-cuDF supports.
differences and all functionality that Dask cuDF supports.

.. toctree::
:maxdepth: 2