Update Spark-RAPIDS-ML PCA (#440)
* Update Spark-RAPIDS-ML PCA

Signed-off-by: Rishi Chandra <[email protected]>

* Reran with standalone

* Fix typo

* Delete Scala example, remove mean-centering

* Update README

* Update standalone setup script, README

* SparkSession init for CI

* remove sparkcontext

---------

Signed-off-by: Rishi Chandra <[email protected]>
rishic3 authored Oct 8, 2024
1 parent 7bf3931 commit 8bc8f9e
Showing 12 changed files with 668 additions and 1,030 deletions.
32 changes: 32 additions & 0 deletions examples/ML+DL-Examples/Spark-Rapids-ML/pca/README.md
@@ -0,0 +1,32 @@
# Spark-Rapids-ML PCA example

This is an example of the GPU-accelerated PCA algorithm from the [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) library, which provides PySpark ML compatible algorithms powered by RAPIDS cuML.
The notebook uses PCA to reduce a random dataset with 2048 feature dimensions down to 3 dimensions, training both the GPU and CPU implementations for comparison.
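
For reference, here is a minimal sketch of that GPU vs. CPU comparison; the column names, row count, and session setup are illustrative assumptions, not taken from the notebook:

```
# Sketch: GPU PCA via Spark-Rapids-ML vs. CPU PCA via Spark MLlib.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Random dataset: 1000 rows of 2048-dimensional vectors stored as array columns.
rows = [(np.random.rand(2048).tolist(),) for _ in range(1000)]
df = spark.createDataFrame(rows, ["features"])

# GPU PCA: a PySpark ML-compatible estimator backed by RAPIDS cuML.
from spark_rapids_ml.feature import PCA as GpuPCA
gpu_model = GpuPCA(k=3, inputCol="features", outputCol="pca_features").fit(df)
gpu_result = gpu_model.transform(df)

# CPU baseline: Spark MLlib PCA expects Vector columns rather than arrays.
from pyspark.ml.feature import PCA as CpuPCA
from pyspark.ml.functions import array_to_vector
vec_df = df.withColumn("features_vec", array_to_vector("features"))
cpu_model = CpuPCA(k=3, inputCol="features_vec", outputCol="pca_features").fit(vec_df)
cpu_result = cpu_model.transform(vec_df)
```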

## Build

Please refer to the Spark-Rapids-ML [README](https://github.com/NVIDIA/spark-rapids-ml/blob/HEAD/python) for environment setup instructions and API usage.

## Download RAPIDS Jar from Maven Central

Download the RAPIDS jar from Maven Central: [rapids-4-spark_2.12-24.08.1.jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar)
Alternatively, see the Spark-Rapids [download page](https://nvidia.github.io/spark-rapids/docs/download.html#download-rapids-accelerator-for-apache-spark-v24081) for version selection.
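
The `start-spark-rapids.sh` script used in the next section attaches this jar for you via the `RAPIDS_JAR` environment variable. For context, a minimal sketch of how the jar and plugin might be wired in when configuring a session manually (the path and configs here are assumptions, not taken from the script):

```
# A minimal sketch, assuming a manually configured SparkSession; the jar path
# is a placeholder for wherever you downloaded the RAPIDS jar.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/rapids-4-spark_2.12-24.08.1.jar")
    # Enable the RAPIDS Accelerator plugin so SQL/DataFrame ops can run on GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .getOrCreate()
)
```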

## Running the Notebooks

Once your environment is set up, follow the steps below to run the notebooks. Make sure `jupyterlab` is installed in the environment.

**Note**: for demonstration purposes, these examples use a local Spark standalone cluster with a single executor, but they should run on any distributed Spark cluster.
```
# set up environment variables
export SPARK_HOME=/path/to/spark
export RAPIDS_JAR=/path/to/rapids.jar
# launch the standalone cluster and Jupyter with PySpark
./start-spark-rapids.sh
# browse to localhost:8888 to view/run the notebooks
# when finished, stop the standalone cluster
${SPARK_HOME}/sbin/stop-worker.sh; ${SPARK_HOME}/sbin/stop-master.sh
```