Update Spark-RAPIDS-ML PCA (#440)
* Update Spark-RAPIDS-ML PCA

Signed-off-by: Rishi Chandra <[email protected]>

* Reran with standalone

* Fix typo

* Delete Scala example, remove mean-centering

* Update README

* Update standalone setup script, README

* SparkSession init for CI

* remove sparkcontext

---------

Signed-off-by: Rishi Chandra <[email protected]>
rishic3 authored Oct 8, 2024
1 parent 7bf3931 commit 8bc8f9e
Showing 12 changed files with 668 additions and 1,030 deletions.
32 changes: 32 additions & 0 deletions examples/ML+DL-Examples/Spark-Rapids-ML/pca/README.md
@@ -0,0 +1,32 @@
# Spark-Rapids-ML PCA example

This is an example of the GPU-accelerated PCA algorithm from the [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) library, which provides PySpark ML compatible algorithms powered by RAPIDS cuML.
The notebook uses PCA to reduce a random dataset with 2048 feature dimensions down to 3 dimensions, training both the GPU and CPU implementations for comparison.
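
For reference, here is a minimal sketch of that GPU vs. CPU comparison; the column names, row count, and session setup are illustrative assumptions, not taken from the notebook:

```
# Sketch: GPU PCA via Spark-Rapids-ML vs. CPU PCA via Spark MLlib.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Random dataset: 1000 rows of 2048-dimensional vectors stored as array columns.
rows = [(np.random.rand(2048).tolist(),) for _ in range(1000)]
df = spark.createDataFrame(rows, ["features"])

# GPU PCA: a PySpark ML-compatible estimator backed by RAPIDS cuML.
from spark_rapids_ml.feature import PCA as GpuPCA
gpu_model = GpuPCA(k=3, inputCol="features", outputCol="pca_features").fit(df)
gpu_result = gpu_model.transform(df)

# CPU baseline: Spark MLlib PCA expects Vector columns rather than arrays.
from pyspark.ml.feature import PCA as CpuPCA
from pyspark.ml.functions import array_to_vector
vec_df = df.withColumn("features_vec", array_to_vector("features"))
cpu_model = CpuPCA(k=3, inputCol="features_vec", outputCol="pca_features").fit(vec_df)
cpu_result = cpu_model.transform(vec_df)
```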

## Build

Please refer to the Spark-Rapids-ML [README](https://github.com/NVIDIA/spark-rapids-ml/blob/HEAD/python) for environment setup instructions and API usage.

## Download RAPIDS Jar from Maven Central

Download the RAPIDS jar from Maven Central: [rapids-4-spark_2.12-24.08.1.jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar)
Alternatively, see the Spark-Rapids [download page](https://nvidia.github.io/spark-rapids/docs/download.html#download-rapids-accelerator-for-apache-spark-v24081) for version selection.
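
The `start-spark-rapids.sh` script used in the next section attaches this jar for you via the `RAPIDS_JAR` environment variable. For context, a minimal sketch of how the jar and plugin might be wired in when configuring a session manually (the path and configs here are assumptions, not taken from the script):

```
# A minimal sketch, assuming a manually configured SparkSession; the jar path
# is a placeholder for wherever you downloaded the RAPIDS jar.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/rapids-4-spark_2.12-24.08.1.jar")
    # Enable the RAPIDS Accelerator plugin so SQL/DataFrame ops can run on GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .getOrCreate()
)
```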

## Running the Notebooks

Once your environment is set up, follow the steps below to run the notebooks. Make sure `jupyterlab` is installed in the environment.

**Note**: for demonstration purposes, these examples use a local Spark standalone cluster with a single executor, but they should run on any distributed Spark cluster.
```
# set up environment variables
export SPARK_HOME=/path/to/spark
export RAPIDS_JAR=/path/to/rapids.jar
# launch the standalone cluster and Jupyter with PySpark
./start-spark-rapids.sh
# browse to localhost:8888 to view/run the notebooks
# when finished, stop the standalone cluster
${SPARK_HOME}/sbin/stop-worker.sh; ${SPARK_HOME}/sbin/stop-master.sh
```