update the main branch for 2412 release #480

Merged
merged 23 commits, Dec 24, 2024

Commits (23)
6b1c689
Init version 24.12.0-SNAPSHOT
nvauto Sep 24, 2024
77fb305
Merge pull request #436 from NVIDIA/branch-24.10
nvauto Sep 25, 2024
e596414
Merge pull request #441 from NVIDIA/branch-24.10
nvauto Oct 8, 2024
b4f10d0
Merge pull request #443 from NVIDIA/branch-24.10
nvauto Oct 8, 2024
2641ebf
Disable kvikio remote I/O to avoid openssl dependencies in cudf build…
pxLi Oct 14, 2024
0ce65c7
Merge pull request #445 from NVIDIA/branch-24.10
nvauto Oct 15, 2024
1c36fb9
Merge pull request #447 from NVIDIA/branch-24.10
nvauto Oct 18, 2024
8015ba4
[xgboost] Fix eval dataset issues (#450)
wbo4958 Oct 22, 2024
16ca6ec
Merge pull request #454 from NVIDIA/branch-24.10
nvauto Oct 25, 2024
119897c
Update the python notebooks for the customer churn examples (#452)
NvTimLiu Oct 25, 2024
2121693
Merge pull request #455 from NVIDIA/branch-24.10
nvauto Oct 25, 2024
ca1555a
Add a TPC-DS SF 10 Notebook (#448)
gerashegalov Oct 29, 2024
6e7bfdd
Update "Open in Colab" link (#457)
gerashegalov Oct 30, 2024
d84c593
Include Tools Notebook for EMR (#459)
parthosa Nov 1, 2024
8f67f60
Replace retail pipeline with new notebook file (#463)
YanxuanLiu Nov 5, 2024
27ed958
Run micro-benchmark on GCP (#464)
YanxuanLiu Nov 7, 2024
d989b14
Prepare the arg to disable kvikIO remote IO (#465)
pxLi Nov 11, 2024
af324af
Remove the kvikIO workaround (#466)
pxLi Nov 12, 2024
20556db
Update udf native example cmake version to 3.28.6 (#468)
pxLi Nov 18, 2024
d101065
Updated title in README for EMR and Databricks Tools Notebooks (#469)
parthosa Nov 20, 2024
8c0833f
Publish Optuna-Spark Examples (#462)
rishic3 Dec 14, 2024
e3d16dd
update for 2412 release (#478)
nvliyuan Dec 24, 2024
7da464e
Merge remote-tracking branch 'origin/branch-24.12' into main-2412-rel…
nvliyuan Dec 24, 2024
@@ -21,7 +21,7 @@ Navigate to your home directory in the UI and select **Create** > **File** from
create an `init.sh` script with contents:
```bash
#!/bin/bash
sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.10.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar
sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.12.0/rapids-4-spark_2.12-24.12.0.jar
```
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
Prerequisites section.
@@ -68,7 +68,7 @@ create an `init.sh` script with contents:
```bash
spark.rapids.sql.python.gpu.enabled true
spark.python.daemon.module rapids.daemon_databricks
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-24.10.0.jar:/databricks/spark/python
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-24.12.0.jar:/databricks/spark/python
```
Note that the Python memory pool requires the cudf library, so you need to install cudf on each worker node
(`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python memory pool
2 changes: 1 addition & 1 deletion docs/get-started/xgboost-examples/csp/databricks/init.sh
@@ -1,7 +1,7 @@
sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar
sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar

sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.10.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar
sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.12.0/rapids-4-spark_2.12-24.12.0.jar
sudo wget -O /databricks/jars/xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar
sudo wget -O /databricks/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar
ls -ltr
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

pushd ${SPARK_HOME}
wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-24.10/dockerfile/Dockerfile
wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-24.12/dockerfile/Dockerfile

# Optionally install additional jars into ${SPARK_HOME}/jars/

@@ -5,7 +5,7 @@ For simplicity export the location to these jars. All examples assume the packag
### Download the jars

Download the RAPIDS Accelerator for Apache Spark plugin jar
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar)
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.12.0/rapids-4-spark_2.12-24.12.0.jar)

### Build XGBoost Python Examples

@@ -5,7 +5,7 @@ For simplicity export the location to these jars. All examples assume the packag
### Download the jars

1. Download the RAPIDS Accelerator for Apache Spark plugin jar
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar)
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.12.0/rapids-4-spark_2.12-24.12.0.jar)

### Build XGBoost Scala Examples

Binary file added docs/img/guides/tpcds.png
243 changes: 243 additions & 0 deletions examples/ML+DL-Examples/Optuna-Spark/README.md
@@ -0,0 +1,243 @@
<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/tensorrt_torchtrt_efficientnet/nvidia_logo.png" width="110px">

# Distributed Hyperparameter Tuning

These examples demonstrate distributed hyperparameter tuning with [Optuna](https://optuna.readthedocs.io/en/stable/index.html) on Apache Spark, accelerated with [RAPIDS](https://rapids.ai/) on GPU. We showcase how to set up and tune XGBoost on GPU, with deployment on Spark Standalone or Databricks clusters.

## Contents:
- [Overview](#overview)
- [Examples](#examples)
- [Running Optuna on Spark Standalone](#running-optuna-on-spark-standalone)
- [Setup Database for Optuna](#1-setup-database-for-optuna)
- [Setup Optuna Python Environment](#2-setup-optuna-python-environment)
- [Start Standalone Cluster and Run](#3-start-standalone-cluster-and-run)
- [Running Optuna on Databricks](#running-optuna-on-databricks)
- [Upload Init Script and Notebook](#1-upload-init-script-and-notebook)
- [Create Cluster](#2-create-cluster)
- [Run Notebook](#3-run-notebook)
- [Benchmarks](#benchmarks)
- [How Does it Work?](#how-does-it-work)
- [Implementation Notes](#implementation-notes)

---

## Overview

Optuna is a lightweight Python library for hyperparameter tuning, integrating state-of-the-art hyperparameter optimization algorithms.

At a high level, we optimize hyperparameters in three steps:
1. Wrap model training with an `objective` function that returns a loss metric.
2. In each `trial`, suggest hyperparameters based on previous results.
3. Create a `study` object, which executes the optimization and stores the trial results.

**Local example**: tuning XGBoost with Optuna (from [Optuna docs](https://optuna.org/#code_examples)):
```python
import xgboost as xgb
import optuna

# 1. Define an objective function to be maximized.
def objective(trial):
    ...

    # 2. Suggest values of the hyperparameters using a trial object.
    param = {
        "objective": "binary:logistic",
        "booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    booster = xgb.train(param, dtrain)
    ...
    return accuracy

# 3. Create a study object and optimize the objective function.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

To run **distributed tuning** on Spark, we take the following steps:
1. Each worker receives a copy of the same dataset.
2. Each worker runs a subset of the trials in parallel.
3. Workers write trial results and receive new hyperparameters using a shared database.
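
For orientation, here is a minimal sketch of this pattern using the joblib-spark backend and the MySQL-backed storage configured later in this guide. The dataset path, study name, storage URL, and hyperparameter ranges are illustrative placeholders, not values taken from the notebooks:
```python
import joblib
import optuna
import pandas as pd
import xgboost as xgb
from joblibspark import register_spark

register_spark()  # make the "spark" joblib backend available

# Placeholders -- adjust to your environment:
DATA_PATH = "/dbfs/path/to/train.parquet"                               # dataset visible to every worker
STORAGE_URL = "mysql://optuna_user:optuna_password@<driver-ip>/optuna"  # shared RDB storage
N_WORKERS, TRIALS_PER_WORKER = 8, 12

def objective(trial, dtrain):
    params = {
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "device": "cuda",  # GPU training (XGBoost >= 2.0)
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
    }
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=3, metrics="rmse")
    return cv["test-rmse-mean"].iloc[-1]

def run_trials(n_trials):
    # Step 1: each worker reads its own copy of the full dataset (Worker-I/O).
    df = pd.read_parquet(DATA_PATH)
    dtrain = xgb.DMatrix(df.drop(columns=["label"]), label=df["label"])
    # Steps 2-3: load the shared study and run this worker's share of trials;
    # trial results and new parameter suggestions flow through the database.
    study = optuna.load_study(study_name="optuna-spark-xgb", storage=STORAGE_URL)
    study.optimize(lambda trial: objective(trial, dtrain), n_trials=n_trials)

with joblib.parallel_backend("spark", n_jobs=N_WORKERS):
    joblib.Parallel()(
        joblib.delayed(run_trials)(TRIALS_PER_WORKER) for _ in range(N_WORKERS)
    )
```
The notebooks flesh this out end to end, including study creation and Spark configuration.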

### Examples

We provide **two notebooks**, which differ in backend and implementation. See the [implementation notes](#implementation-notes) for more details.

- `optuna-joblibspark.ipynb`:
- Uses the [Joblib Spark backend](https://github.com/joblib/joblib-spark) to distribute tasks on the Spark cluster.
- Implements *Worker-I/O*, where each worker reads the full dataset from a specified filepath (e.g., distributed file system).
- Builds on [this Databricks example](https://docs.databricks.com/en/machine-learning/automl-hyperparam-tuning/optuna.html).
- `optuna-dataframe.ipynb`:
- Uses Spark dataframes to distribute tasks on the cluster.
- Implements *Spark-I/O*, where Spark reads the dataset from a specified filepath, then duplicates and repartitions it so that each worker task is mapped onto a copy of the dataset.
- Dataframe operations are accelerated on GPU with the [Spark-RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/).

## Running Optuna on Spark Standalone

### 1. Setup Database for Optuna

Optuna offers an RDBStorage option which allows for the persistence of experiments across different machines and processes, thereby enabling Optuna tasks to be distributed.

This section will walk you through setting up MySQL as the backend for RDBStorage in Optuna.

We highly recommend installing MySQL on the driver node. This setup eliminates concerns regarding MySQL connectivity between worker nodes and the driver, simplifying the management of database connections.
(For Databricks, the installation is handled by the init script).

1. Install MySQL:

``` shell
sudo apt install mysql-server
```

2. Configure the MySQL bind address in `/etc/mysql/mysql.conf.d/mysqld.cnf`:

``` shell
bind-address = YOUR_DRIVER_HOST_IP
mysqlx-bind-address = YOUR_DRIVER_HOST_IP
```

3. Restart MySQL:

``` shell
sudo systemctl restart mysql.service
```

4. Set up the user:

```shell
sudo mysql
```

``` mysql
mysql> CREATE USER 'optuna_user'@'%' IDENTIFIED BY 'optuna_password';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL PRIVILEGES ON *.* TO 'optuna_user'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.01 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.01 sec)

mysql> EXIT;
Bye
```

Create a database for Optuna:

``` shell
mysql -u optuna_user -p -e "CREATE DATABASE IF NOT EXISTS optuna"
```

Troubleshooting:
> If you encounter
`"ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)"`,
try the command:
`ln -s /var/run/mysqld/mysqld.sock /tmp/mysql.sock`
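
With the database in place, the driver can create (or re-open) a study backed by it. A minimal sketch, assuming the user, password, and `optuna` database created above (`DRIVER_IP` and the study name are placeholders):
```python
import optuna

# MySQL URL for the user, password, and database created above;
# requires the mysqlclient package (installed in the next section).
storage_url = "mysql://optuna_user:optuna_password@DRIVER_IP/optuna"

study = optuna.create_study(
    study_name="optuna-spark-xgb",  # placeholder study name
    storage=storage_url,
    direction="maximize",
    load_if_exists=True,            # re-open the study if it already exists
)
```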

### 2. Setup Optuna Python Environment

Install the MySQL client and create a conda environment with the required libraries.
We use [RAPIDS](https://docs.rapids.ai/install/#get-rapids) for GPU-accelerated ETL. See the [docs](https://docs.rapids.ai/install/#get-rapids) for version selection.
``` shell
sudo apt install libmysqlclient-dev

conda create -n optuna-spark -c rapidsai -c conda-forge -c nvidia \
    cudf=24.12 cuml=24.12 python=3.10 'cuda-version>=12.0,<=12.5'
conda activate optuna-spark
pip install mysqlclient
pip install optuna joblib joblibspark ipywidgets
```

### 3. Start Standalone Cluster and Run

Configure your standalone cluster settings.
This example creates a local cluster with a single GPU worker:
```shell
export SPARK_HOME=/path/to/spark
export SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 \
-Dspark.worker.resource.gpu.discoveryScript=$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh"
export MASTER=spark://$(hostname):7077; export SPARK_WORKER_INSTANCES=1; export CORES_PER_WORKER=8

${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-worker.sh -c ${CORES_PER_WORKER} -m 16G ${MASTER}
```

You can now run the notebook using the `optuna-spark` Python kernel!
The notebook contains instructions to attach to the standalone cluster.
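
For reference, attaching a session to this standalone master from Python looks roughly like the sketch below; the exact configuration used in the notebooks may differ, and the GPU resource settings are assumptions for a single-GPU worker:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<hostname>:7077")                  # the $MASTER exported above
    .config("spark.executor.cores", "8")                # match CORES_PER_WORKER
    .config("spark.executor.resource.gpu.amount", "1")  # one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")      # one task owns the GPU at a time
    .getOrCreate()
)
```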


## Running Optuna on Databricks

### 1. Upload Init Script and Notebook

- Make sure your [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/tutorial.html) is configured for your Databricks workspace.
- Copy the desired notebook into your Databricks workspace. For example:
```shell
databricks workspace import /Users/[email protected]/optuna/optuna-joblibspark.ipynb --format JUPYTER --file optuna-joblibspark.ipynb
```
- Copy the init script `databricks/init_optuna.sh`:
```shell
databricks workspace import /Users/[email protected]/optuna/init_optuna.sh --format AUTO --file databricks/init_optuna.sh
```

### 2. Create Cluster

*For Databricks Azure*: Use the cluster startup script, which is configured to create a 4-node GPU cluster:
```shell
export INIT_PATH=/Users/[email protected]/optuna/init_optuna.sh
cd databricks
chmod +x start_cluster.sh
./start_cluster.sh
```

Or, create a cluster via the web UI:
- Go to `Compute > Create compute` and set the desired cluster settings.
- Under `Advanced Options > Init Scripts`, upload the init script from your workspace.
- Under `Advanced Options > Spark > Environment variables`, set `LIBCUDF_CUFILE_POLICY=OFF`.
- Make sure to use a GPU cluster and include task GPU resources.

The init script will install the required libraries on all nodes, including RAPIDS and the Spark-RAPIDS plugin for GPU-accelerated ETL. On the driver, it will set up the MySQL server backend.

### 3. Run Notebook

Locate the notebook in your workspace and click on `Connect` to attach it to the cluster. The notebook is ready to run!

## Benchmarks

The graph below compares running times for distributed (8 GPUs) vs. single-GPU hyperparameter tuning, with 100 trials on synthetic regression datasets.

![Databricks benchmarking results](images/runtimes.png)

## How does it work?

The Optuna tasks are serialized into bytes and distributed to Spark workers to run. On the executor side, each Optuna task loads the Optuna study from RDBStorage and then runs its set of trials.

During tuning, the Optuna tasks send intermediate results to RDBStorage for persistence and retrieve the next set of parameters to run, sampled by Optuna on the driver.

**Using JoblibSpark**: each Optuna task is a Spark application with only 1 job, 1 stage, and 1 task, and these applications are submitted from local threads. The `n_jobs` parameter configures the Spark backend to limit how many Spark applications are submitted at the same time.

Optuna with JoblibSpark therefore uses Spark application-level parallelism rather than task-level parallelism. For larger datasets, ensure that a single XGBoost task can run on a single node without CPU/GPU OOM.
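
As a rough illustration (reusing the hypothetical `run_trials` helper from the earlier sketch), `n_jobs` caps concurrency independently of how many Optuna tasks are submitted:
```python
import joblib

# 8 Optuna tasks of 25 trials each, but at most 4 running on the cluster at once.
with joblib.parallel_backend("spark", n_jobs=4):
    joblib.Parallel()(joblib.delayed(run_trials)(25) for _ in range(8))
```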

Application parallelism with JoblibSpark:

![Optuna on JoblibSpark](images/optuna.svg)

### Implementation Notes

###### Data I/O:
Since each worker requires the full dataset to perform hyperparameter tuning, there are two strategies to get the data into worker memory:
- **Worker I/O**: *each worker reads the dataset* from the filepath once the task has begun. In practice, this requires the dataset to be written to a distributed file system accessible to all workers prior to tuning. The `optuna-joblibspark` notebook demonstrates this.
- **Spark I/O**: Spark reads the dataset and *creates a copy of the dataset for each worker*, then maps the tuning task onto each copy. In practice, this enables the code to be chained to other Dataframe operations (e.g. ETL stages) without the intermediate step of writing to DBFS, at the cost of some overhead during duplication. The `optuna-dataframe` notebook demonstrates this.
- To achieve this, we coalesce the input Dataframe to a single partition, and recursively self-union until we have the desired number of copies (number of workers). Thus each partition will contain a duplicate of the entire dataset, and the Optuna task can be mapped directly onto the partitions.
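
A minimal sketch of that replication step (not the notebook's exact code; the data path, result schema, and tuning body are placeholders):
```python
from pyspark.sql import DataFrame

def replicate(df: DataFrame, num_workers: int) -> DataFrame:
    """Return a DataFrame with num_workers partitions, each a full copy of df."""
    single = df.coalesce(1)       # pack the whole dataset into one partition
    out = single
    for _ in range(num_workers - 1):
        out = out.union(single)   # each union appends one more full-copy partition
    return out

def tune_partition(pdf_iter):
    import pandas as pd
    data = pd.concat(pdf_iter)    # the partition holds a complete copy of the dataset
    # ...load the study from shared storage and run this worker's trials on `data`...
    yield pd.DataFrame({"n_rows": [len(data)]})  # placeholder result

replicated = replicate(spark.read.parquet("/path/to/data"), num_workers=8)
results = replicated.mapInPandas(tune_partition, schema="n_rows long")
# results.collect() triggers the tuning tasks, one per partition.
```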


###### Misc:
- Please be aware that Optuna studies will continue where they left off from previous trials; delete and recreate the study if you would like to start anew.
- Optuna in distributed mode is **non-deterministic** (see [this link](https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-obtain-reproducible-optimization-results)), as trials are executed asynchronously by executors. Deterministic behavior can be achieved using Spark barriers to coordinate reads/writes to the database.
- Reading data with GPU using cuDF requires disabling [GPUDirect Storage](https://docs.rapids.ai/api/cudf/nightly/user_guide/io/io/#magnum-io-gpudirect-storage-integration), i.e., setting the environment variable `LIBCUDF_CUFILE_POLICY=OFF`, to be compatible with the Databricks file system. Without GDS, cuDF will use a CPU bounce buffer when reading files, but all parsing and decoding will still be accelerated by the GPU.
- Note that the storage does not persist the state of sampler and pruner instances. To resume a study with a sampler whose seed argument is specified, [the sampler can be pickled](https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/001_rdb.html#resume-study) and returned to the driver alongside the results, as sketched below.
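
For example, a seeded sampler can be pickled and re-attached when resuming (a sketch using the placeholder study name and storage URL from earlier):
```python
import pickle
import optuna

sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(study_name="optuna-spark-xgb", storage=storage_url,
                            sampler=sampler, load_if_exists=True)
# ...run optimization...

# Persist the sampler's state alongside the study results.
with open("sampler.pkl", "wb") as f:
    pickle.dump(study.sampler, f)

# When resuming, restore the pickled sampler so sampling continues from the same state.
with open("sampler.pkl", "rb") as f:
    resumed = optuna.load_study(study_name="optuna-spark-xgb", storage=storage_url,
                                sampler=pickle.load(f))
```
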
4 changes: 4 additions & 0 deletions examples/ML+DL-Examples/Optuna-Spark/images/optuna.svg
@@ -0,0 +1,71 @@
#!/bin/bash
# Copyright (c) 2024, NVIDIA CORPORATION.

set -x

sudo rm -r /var/lib/apt/lists/*
sudo apt clean && sudo apt update --fix-missing -y

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
# setup database for optuna on driver

# install mysql server
sudo apt install -y mysql-server

if [[ ! -f "/etc/mysql/mysql.conf.d/mysqld.cnf" ]]; then
sudo apt remove --purge mysql\*
sudo apt clean && sudo apt update --fix-missing -y
sudo apt install -y mysql-server
fi

if [[ ! -f "/etc/mysql/mysql.conf.d/mysqld.cnf" ]]; then
echo "ERROR: MYSQL installation failed"
exit 1
fi

# configure mysql
BIND_ADDRESS=$DB_DRIVER_IP
MYSQL_CONFIG_FILE="/etc/mysql/mysql.conf.d/mysqld.cnf"
sudo sed -i "s/^bind-address\s*=.*/bind-address = $BIND_ADDRESS/" "$MYSQL_CONFIG_FILE"
sudo sed -i "s/^mysqlx-bind-address\s*=.*/mysqlx-bind-address = $BIND_ADDRESS/" "$MYSQL_CONFIG_FILE"
sudo systemctl restart mysql.service

# setup user
OPTUNA_USER="optuna_user"
OPTUNA_PASSWORD="optuna_password"
sudo mysql -u root -e "
CREATE USER IF NOT EXISTS '$OPTUNA_USER'@'%' IDENTIFIED BY '$OPTUNA_PASSWORD';
GRANT ALL PRIVILEGES ON *.* TO '$OPTUNA_USER'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;"
fi


# rapids import
SPARK_RAPIDS_VERSION=24.12.0
curl -L https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar -o \
/databricks/jars/rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar

# setup cuda: install cudatoolkit 11.8 via runfile approach
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sh cuda_11.8.0_520.61.05_linux.run --silent --toolkit
# reset symlink and update library loading paths
rm /usr/local/cuda
ln -s /usr/local/cuda-11.8 /usr/local/cuda

sudo /databricks/python3/bin/pip3 install \
--extra-index-url=https://pypi.nvidia.com \
"cudf-cu11==24.12.*" "cuml-cu11==24.12.*"

# setup python environment
sudo apt clean && sudo apt update --fix-missing -y
sudo apt install -y pkg-config
sudo apt install -y libmysqlclient-dev
sudo /databricks/python3/bin/pip3 install --upgrade pip
sudo /databricks/python3/bin/pip3 install mysqlclient xgboost
sudo /databricks/python3/bin/pip3 install optuna joblib joblibspark

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
# create optuna database and study
sudo mysql -u $OPTUNA_USER -p$OPTUNA_PASSWORD -e "CREATE DATABASE IF NOT EXISTS optuna;"
fi
set +x