From 1e028ea324fd0f609ed505728cb453f189f79ca5 Mon Sep 17 00:00:00 2001
From: rjzamora
Date: Tue, 17 Sep 2024 17:54:08 -0700
Subject: [PATCH] address code review

---
 docs/dask_cudf/source/best_practices.rst | 54 +++++++++++++-----------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/docs/dask_cudf/source/best_practices.rst b/docs/dask_cudf/source/best_practices.rst
index e53e10e7859..657a33a4751 100644
--- a/docs/dask_cudf/source/best_practices.rst
+++ b/docs/dask_cudf/source/best_practices.rst
@@ -9,34 +9,27 @@ effectively.
 .. note::
   Since Dask cuDF is a backend extension for
   `Dask DataFrame `__,
-  many of the details discussed in the `Dask DataFrames Best Practices
+  the guidelines discussed in the `Dask DataFrames Best Practices
   `__
-  documentation also apply to Dask cuDF.
+  documentation also apply to Dask cuDF (excluding any pandas-specific
+  details).
 
 
 Deployment and Configuration
 ----------------------------
 
-Use Dask DataFrame Directly
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Although Dask cuDF provides a public ``dask_cudf`` Python module, we
-strongly recommended that you use the CPU/GPU portable ``dask.dataframe``
-API instead. Simply use the `Dask configuration `__
-system to set the ``"dataframe.backend"`` option to ``"cudf"``, and
-the ``dask_cudf`` module will be imported and used implicitly.
-
 Use Dask-CUDA
 ~~~~~~~~~~~~~
 
-In order to execute a Dask workflow on multiple GPUs, a Dask "cluster" must
+In order to execute a Dask workflow on multiple GPUs, a Dask cluster must
 be deployed with `Dask-CUDA `__
 and/or `Dask.distributed `__.
 
 When running on a single machine, the `LocalCUDACluster `__ convenience
 function is strongly recommended. No matter how many GPUs are
-available on the machine (even one!), using Dask-CUDA has important advantages
-over default (threaded) execution::
+available on the machine (even one!), using `Dask-CUDA has many advantages
+`__
+over default (threaded) execution. Just to list a few::
 
 * Dask-CUDA makes it easy to pin workers to specific devices.
 * Dask-CUDA makes it easy to configure memory-spilling options.
@@ -71,6 +64,14 @@ JIT unspilling is likely to produce better protection from out-of-memory
 `__ for further details and guidance.
 
+Use the Dask DataFrame API
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Although Dask cuDF provides a public ``dask_cudf`` Python module, we
+strongly recommend that you use the CPU/GPU portable ``dask.dataframe``
+API instead. Simply use the `Dask configuration `__
+system to set the ``"dataframe.backend"`` option to ``"cudf"``, and
+the ``dask_cudf`` module will be imported and used implicitly.
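+
+For example, a minimal (illustrative) setup might look like this, where the
+Parquet path is a hypothetical placeholder::
+
+    import dask
+    import dask.dataframe as dd
+
+    # Tell Dask DataFrame to create cudf-backed (GPU) partitions
+    dask.config.set({"dataframe.backend": "cudf"})
+
+    # The familiar Dask DataFrame API now uses dask_cudf under the hood
+    df = dd.read_parquet("/path/to/data.parquet")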
 
 Reading Data
 ------------
 
@@ -78,11 +79,16 @@ Reading Data
 Tune the partition size
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-The ideal partition size is typically between 2-10% of the memory capacity
-of a single GPU. Increasing the partition size will typically reduce the
-number of tasks in your workflow and improve the GPU utilization for each
-task. However, if the partitions are too large, the risk of OOM errors can
-become significant.
+The ideal partition size is usually between 1/16 and 1/8 the memory
+capacity of a single GPU. Increasing the partition size will typically
+reduce the number of tasks in your workflow and improve the GPU utilization
+for each task. However, if the partitions are too large, the risk of OOM
+errors can become significant.
+
+.. note::
+  As a general rule of thumb, aim for 1/16 within shuffle-intensive workflows
+  (e.g. large-scale sorting and joining), and 1/8 otherwise. For pathologically
+  skewed data distributions, it may be necessary to target 1/32 or smaller.
 
 The best way to tune the partition size is to begin with appropriate sized
 partitions when the DataFrame collection is first created by a function
@@ -107,11 +113,11 @@ The most important arguments to :func:`read_parquet` are ``blocksize`` and
 ``aggregate_files``::
 
 ``blocksize``: Use this argument to specify the maximum partition size.
-The default is `"256 MiB"`, but larger values are usually more performant
-(e.g. `1 GiB` is usually safe). Dask will use the ``blocksize`` value to map
-a discrete number of Parquet row-groups (or files) to each output partition.
-This mapping will only account for the uncompressed storage size of each
-row group, which is usually smaller than the correspondng ``cudf.DataFrame``.
+The default is `"256 MiB"`, but larger values are usually more performant.
+Dask will use the ``blocksize`` value to map a discrete number of Parquet
+row-groups (or files) to each output partition. This mapping will only
+account for the uncompressed storage size of each row group, which is
+usually smaller than the corresponding ``cudf.DataFrame``.
 
 ``aggregate_files``: Use this argument to specify whether Dask is allowed
 to map multiple files to the same DataFrame partition. The default is ``False``,
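+
+For example, the following (illustrative) call combines the two arguments
+described above; the dataset path is a hypothetical placeholder::
+
+    import dask.dataframe as dd
+
+    # Allow multiple files per partition, with each output partition
+    # capped at ~1 GiB of uncompressed storage.
+    df = dd.read_parquet(
+        "/path/to/dataset/",
+        blocksize="1GiB",
+        aggregate_files=True,
+    )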