You will need to build the Docker container before using it in the Databricks environment. This can be done with the provided build script. You can customize a few options using environment variables, but at a minimum you will need to set `REPO_BASE` and `TAG_NAME` to a repository where you can push the built image. For example, to push an image to the repository `i-love-spark/rapids-4-spark-databricks`:

```bash
$ REPO_BASE=i-love-spark TAG_NAME=rapids-4-spark-databricks ./build.sh
```

The script will then build an image with the fully qualified tag `i-love-spark/rapids-4-spark-databricks:22.10.0`.
If you set `PUSH=true` and the build completes successfully, the script will push the image to the specified repository. Only do this if you have authenticated with Docker to the repository and you have the appropriate permissions to push image artifacts.

```bash
$ REPO_BASE=i-love-spark TAG_NAME=rapids-4-spark-databricks PUSH=true ./build.sh
```
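If you have not yet authenticated to the registry, a minimal sketch of doing so and confirming that a pushed tag is visible, assuming a Docker Hub style registry and the example tag used above (substitute your own registry host and repository):

```bash
# Log in to the registry hosting the repository (prompts for credentials).
docker login

# Confirm the pushed tag is visible in the registry without pulling all layers.
docker manifest inspect i-love-spark/rapids-4-spark-databricks:22.10.0
```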
Other customizations are possible; see the source of `build.sh` for more information.
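As a quick way to see which variables the script honors, you can scan it for environment variable references; a minimal sketch (the authoritative list of supported variables and their defaults is `build.sh` itself):

```bash
# List the environment variable names referenced in build.sh so you know what
# can be overridden; consult the script for defaults and semantics.
grep -oE '\$\{?[A-Z_][A-Z0-9_]*\}?' build.sh | tr -d '${}' | sort -u
```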
Once this image is pushed to your repository, it is ready to be used in the Databricks environment.
The easiest way to use the RAPIDS Accelerator for Spark on Databricks is to use the pre-built Docker container and Databricks Container Services.
Currently the Docker container supports the following Databricks runtime(s) via Databricks Container Services:

- Databricks 10.4 LTS (Scala 2.12, Spark 3.2.1)
See Customize containers with Databricks Container Services for more information.
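Note that Databricks Container Services must be enabled for your workspace before a cluster can use a custom container. A workspace admin can toggle it in the admin console or, as a sketch assuming the documented `enableDcs` workspace setting, via the workspace configuration API (replace `<databricks-instance>` with your workspace URL):

```bash
# Enable Databricks Container Services for the workspace (admin only).
# Uses a .netrc entry for authentication via curl's -n flag; a bearer token
# header works as well.
curl -X PATCH -n https://<databricks-instance>/api/2.0/workspace-conf \
  -d '{"enableDcs": "true"}'
```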
Create a Databricks cluster by going to **Clusters**, then clicking **+ Create Cluster**. Ensure the cluster meets the prerequisites above by configuring it as follows:
- In the **Databricks runtime version** field, click **Standard** and select **Runtime: 10.4 LTS (Scala 2.12, Spark 3.2.1)** (do NOT use **Runtime: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1)** from the **ML** tab).
- Ensure **Use Photon Acceleration** is disabled.
Note that GPU nodes are not available to be selected at this time for the driver or the workers. Therefore, you will first configure the use of the Docker container before configuring the driver and worker nodes.
- Under **Advanced options**, select the **Docker** tab.
- Select **Use your own Docker container**.
- In the **Docker Image URL** field, enter the image location you pushed to in the build steps.
- Set **Authentication** to **Default** if using a public repository, or configure **Authentication** for the repository to which you pushed the image.
Now you can configure the driver and worker nodes in the main part of the UI.
- Choose the number of workers that matches the number of GPUs you want to use.
- Select a worker type. On AWS, use nodes with 1 GPU each, such as `p3.2xlarge` or `g4dn.xlarge`. p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark workers (although they can be used for the driver node). For Azure, choose GPU nodes such as `Standard_NC6s_v3`. For GCP, choose N1 or A2 instance types with GPUs.
- Select a driver type. Generally, this can be set to the same as the worker, but you can select a node that does NOT include a GPU if you don't plan to do any GPU-related operations on the driver. On AWS, this can be an `i3.xlarge` or larger.
- Ensure **Enable autoscaling** is disabled.
- Under **Advanced options**, select the **Init Scripts** tab.
- In the **Destination** field, select **FILE**.
- In the **Init script path** field, enter `file:/opt/spark-rapids/init.sh`.
- Click **Add**.
- Add any other configs, such as an SSH key, logging, or additional Spark configuration. The Docker container uses the configuration in `00-custom-spark-driver-defaults.conf` by default. Any lines added to **Spark config** in the UI override the corresponding defaults configured in the Docker container.
- Start the cluster. (A sketch of an equivalent cluster definition via the Clusters API follows this list.)
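For reference, the same cluster can also be defined programmatically. The sketch below is a minimal, unofficial example assuming the legacy Databricks CLI (`databricks clusters create`), Clusters API 2.0 field names, an AWS workspace, and a public image; the node types, cluster name, and image URL are illustrative values from this guide, so adjust them and add registry authentication for your environment:

```bash
# Write a minimal cluster spec mirroring the UI steps above (hypothetical values).
cat > rapids-cluster.json <<'EOF'
{
  "cluster_name": "rapids-4-spark-docker",
  "spark_version": "10.4.x-scala2.12",
  "node_type_id": "g4dn.xlarge",
  "driver_node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "i-love-spark/rapids-4-spark-databricks:22.10.0"
  },
  "init_scripts": [
    { "file": { "destination": "file:/opt/spark-rapids/init.sh" } }
  ]
}
EOF

# Create the cluster using the Databricks CLI (requires a configured profile).
databricks clusters create --json-file rapids-cluster.json
```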
If you would like to enable the Alluxio cluster on your Databricks cluster, you will need to add the following configuration to your cluster.
- Edit the desired cluster.
- Under **Advanced options**, select the **Spark** tab.
- In the **Spark config** field, add the following lines. The last two are good starting points when using Alluxio but can be tuned if needed; an equivalent API fragment for these settings is sketched after this list.

  ```
  spark.databricks.io.cached.enabled false
  spark.rapids.alluxio.automount.enabled true
  spark.rapids.sql.coalescing.reader.numFilterParallel 2
  spark.rapids.sql.multiThreadedRead.numThreads 40
  ```
- In the **Environment variables** field, add the following lines:

  ```
  ENABLE_ALLUXIO=1
  ALLUXIO_HOME=/opt/alluxio-2.9.0
  ```
- Customize the Alluxio configuration using the following environment variables if needed. These should be added in the **Environment variables** field if you wish to change them.
  - The default amount of disk space used for Alluxio on the workers is 70%. This can be adjusted using the configuration below.

    ```
    ALLUXIO_STORAGE_PERCENT=70
    ```

  - The default heap size used by the Alluxio Master process is 16GB; this may need to be changed depending on the size of the driver node. Make sure it has enough memory for both the Alluxio Master and the Spark driver processes.

    ```
    ALLUXIO_MASTER_HEAP=16g
    ```

  - To copy the Alluxio Master and Worker logs off of local disk so that you can look at them after the cluster is shut down, set this to a path accessible via rsync. For instance, on Databricks this might be a path in `/dbfs/`.

    ```
    ALLUXIO_COPY_LOG_PATH=/dbfs/somedirectory-for-alluxio-logs/
    ```

  - To copy the Alluxio metrics, which are in Prometheus format, so that you can look at them after the cluster is shut down, set this to a path accessible via rsync. For instance, on Databricks this might be a path in `/dbfs/`.

    ```
    PROMETHEUS_COPY_DATA_PATH=/dbfs/somedirectory-for-alluxio-prometheus-metrics/
    ```

    The saved Prometheus data can be graphed outside of the cluster. For more details, refer to `spark-rapids/docs/get-started/getting-started-alluxio.md` in the spark-rapids documentation.
- Click **Confirm** (if the cluster is currently stopped) or **Confirm and Restart** (if the cluster is currently running).
- Ensure the cluster is started by clicking **Start** if necessary.
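If you manage the cluster through the API instead of the UI, the same Alluxio settings map directly onto the `spark_conf` and `spark_env_vars` fields of the cluster spec. A minimal sketch of just that fragment, which would be merged into a full spec like the one shown earlier and applied with `databricks clusters edit` (assuming the legacy Databricks CLI):

```bash
# Fragment of a Clusters API spec carrying the Alluxio settings from the steps above.
cat > alluxio-fragment.json <<'EOF'
{
  "spark_conf": {
    "spark.databricks.io.cached.enabled": "false",
    "spark.rapids.alluxio.automount.enabled": "true",
    "spark.rapids.sql.coalescing.reader.numFilterParallel": "2",
    "spark.rapids.sql.multiThreadedRead.numThreads": "40"
  },
  "spark_env_vars": {
    "ENABLE_ALLUXIO": "1",
    "ALLUXIO_HOME": "/opt/alluxio-2.9.0"
  }
}
EOF
```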
To verify the Alluxio cluster is working, you can use the Web Terminal:
- Ensure the cluster is fully up and running. Then, in the cluster UI, click the **Apps** tab.
- Click **Launch Web Terminal**.
- In the new tab that opens, you will get a terminal session.
- Run the following command:

  ```
  $ /opt/alluxio-2.8.0/bin/alluxio fsadmin report
  ```
- You should see a line indicating the number of active workers; ensure it equals the number of workers you configured for the cluster:

  ```
  ...
  Live Workers: 2
  ...
  ```
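As an additional optional check from the same Web Terminal, you can list the Alluxio namespace; with `spark.rapids.alluxio.automount.enabled` set to `true`, buckets that the plugin has automounted should appear here once jobs have read data through Alluxio (the install path below matches the command above):

```bash
# List the root of the Alluxio namespace; automounted buckets show up as directories.
$ /opt/alluxio-2.8.0/bin/alluxio fs ls /
```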