From 3332717c1f6b774fb5a1b4fea67bbcde1861d2b4 Mon Sep 17 00:00:00 2001
From: rjzamora
Date: Wed, 18 Sep 2024 06:53:11 -0700
Subject: [PATCH] more revisions

---
 docs/dask_cudf/source/best_practices.rst | 52 +++++++++++++++---
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/docs/dask_cudf/source/best_practices.rst b/docs/dask_cudf/source/best_practices.rst
index 657a33a4751..e8a357ad53a 100644
--- a/docs/dask_cudf/source/best_practices.rst
+++ b/docs/dask_cudf/source/best_practices.rst
@@ -64,6 +64,20 @@ JIT unspilling is likely to produce better protection from out-of-memory
 `__ for further details and guidance.
 
+Use RMM
+~~~~~~~
+
+Memory allocations in cuDF are significantly faster and more efficient when
+the `RAPIDS Memory Manager (RMM) `__
+library is used on worker processes. In most cases, the best way to manage
+memory is by initializing an RMM pool on each worker before executing a
+workflow. When using :class:`LocalCUDACluster`, this is easily accomplished
+by setting ``rmm_pool_size`` to a large fraction of the total device memory
+(e.g. ``0.9``).
+
+See the `Dask-CUDA memory-management documentation
+`__
+for more details.
+
 Use the Dask DataFrame API
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -86,23 +100,22 @@ for each task. However, if the partitions are too large, the risk of OOM
 errors can become significant.
 
 .. note::
-    As a general rule of thumb, aim for 1/16 within shuffle-intensive workflows
+    As a general rule of thumb, aim for 1/16 in shuffle-intensive workflows
     (e.g. large-scale sorting and joining), and 1/8 otherwise. For pathologically
     skewed data distributions, it may be necessary to target 1/32 or smaller.
 
-The best way to tune the partition size is to begin with appropriate sized
-partitions when the DataFrame collection is first created by a function
-like :func:`read_parquet`, :func:`read_csv`, or :func:`from_map`. For
-example, both :func:`read_parquet` and :func:`read_csv` expose a
-``blocksize`` argument for adjusting the maximum partition size.
+It is easiest to tune the partition size when the DataFrame collection is
+first created by a function like :func:`read_parquet`, :func:`read_csv`,
+or :func:`from_map`. For example, both :func:`read_parquet` and :func:`read_csv`
+expose a ``blocksize`` argument for adjusting the maximum partition size.
 
 If the partition size cannot be tuned effectively at creation time, the
 `repartition `__ method can be used as a last resort.
 
-Use Parquet files
-~~~~~~~~~~~~~~~~~
+Use Parquet
+~~~~~~~~~~~
 
 `Parquet `__ is the recommended file format
 for Dask cuDF. It provides efficient columnar storage and enables
@@ -113,16 +126,17 @@ The most important arguments to :func:`read_parquet` are ``blocksize`` and
 ``aggregate_files``::
 
 ``blocksize``: Use this argument to specify the maximum partition size.
-The default is `"256 MiB"`, but larger values are usually more performant.
-Dask will use the ``blocksize`` value to map a discrete number of Parquet
-row-groups (or files) to each output partition. This mapping will only
-account for the uncompressed storage size of each row group, which is
-usually smaller than the correspondng ``cudf.DataFrame``.
+The default is ``"256 MiB"``, but larger values are usually more performant
+on GPUs with more than 8 GiB of memory. Dask will use the ``blocksize``
+value to map a discrete number of Parquet row-groups (or files) to each
+output partition. This mapping will only account for the uncompressed
+storage size of each row group, which is usually smaller than the
+corresponding ``cudf.DataFrame``.
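+
+For example, a larger partition size can be requested when the collection is
+created. The snippet below is only a sketch: the dataset path and the
+``blocksize`` value are illustrative and should be adapted to the available
+GPU memory::
+
+    import dask_cudf
+
+    # Map roughly 1 GiB of uncompressed Parquet data to each partition.
+    # The resulting cudf.DataFrame partitions are usually somewhat larger.
+    df = dask_cudf.read_parquet("/path/to/dataset/", blocksize="1GiB")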
 
 ``aggregate_files``: Use this argument to specify whether Dask is allowed
-to map multiple files to the same DataFrame partition. The default is ``False``,
-but ``aggregate_files=True`` is usually more performant when the dataset
-contains many files that are smaller than half of ``blocksize``.
+to map multiple files to the same DataFrame partition. The default is
+``False``, but ``aggregate_files=True`` is usually more performant when
+the dataset contains many files that are smaller than half of ``blocksize``.
 
 .. note::
     Metadata collection can be extremely slow when reading from remote
@@ -160,11 +174,11 @@ bottleneck is typically device-to-host memory spilling.
 
 Although every workflow is different, the following guidelines are often
 recommended::
 
-* Use a distributed cluster with Dask-CUDA workers
-* Use native cuDF spilling whenever possible
+* `Use a distributed cluster with Dask-CUDA workers `_
+* `Use native cuDF spilling whenever possible `_
 * Avoid shuffling whenever possible
   * Use ``split_out=1`` for low-cardinality groupby aggregations
-  * Use ``broadcast=True`` for joins when at least one collection comprises a small number of partitions (e.g. ``>=5``)
+  * Use ``broadcast=True`` for joins when at least one collection comprises a small number of partitions (e.g. ``<=5``); see the example below
 * `Use UCX `__ if communication is a bottleneck
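+
+The example below combines several of these guidelines. It is only a sketch:
+the cluster settings, dataset paths, column names, and ``blocksize`` value
+are illustrative and should be adapted to the workflow at hand::
+
+    import dask_cudf
+    from dask.distributed import Client
+    from dask_cuda import LocalCUDACluster
+
+    # Dask-CUDA workers, each with an RMM pool covering 90% of device memory
+    cluster = LocalCUDACluster(rmm_pool_size=0.9)
+    client = Client(cluster)
+
+    # Larger-than-default partitions, with small files aggregated together
+    df = dask_cudf.read_parquet(
+        "/path/to/dataset/",
+        blocksize="1GiB",
+        aggregate_files=True,
+    )
+
+    # Low-cardinality groupby aggregation: one output partition is enough
+    agg = df.groupby("group_key").agg({"value": "mean"}, split_out=1)
+
+    # Join against a much smaller collection by broadcasting it to all workers
+    small = dask_cudf.read_parquet("/path/to/small/dataset/")
+    result = df.merge(small, on="join_key", how="left", broadcast=True)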