# Spark-Rapids-ML PCA example

This is an example of the GPU-accelerated PCA algorithm from the [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) library, which provides PySpark ML compatible algorithms powered by RAPIDS cuML. The notebook uses PCA to reduce a random dataset with 2048 feature dimensions down to 3 dimensions, training both the GPU and CPU implementations for comparison.
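For orientation, the core of the comparison looks roughly like the sketch below. This is a minimal illustration rather than the notebook's exact code; it assumes an existing SparkSession named `spark`, configured with the RAPIDS jar as described in the sections that follow, and the dataset sizes are illustrative.

```python
import numpy as np
from pyspark.ml.feature import PCA as CpuPCA       # Spark MLlib CPU baseline
from pyspark.ml.linalg import Vectors
from spark_rapids_ml.feature import PCA as GpuPCA  # GPU PCA from Spark-Rapids-ML

# Assumes an existing SparkSession named `spark`.
# Build a random dataset with 2048 feature dimensions.
rows = [(Vectors.dense(r),) for r in np.random.rand(1000, 2048)]
df = spark.createDataFrame(rows, ["features"])

# GPU-accelerated PCA, backed by RAPIDS cuML.
gpu_model = GpuPCA(k=3).setInputCol("features").setOutputCol("pca_features").fit(df)

# CPU PCA from Spark MLlib, driven through the same PySpark ML style API.
cpu_model = CpuPCA(k=3).setInputCol("features").setOutputCol("pca_features").fit(df)

gpu_model.transform(df).show(3)
```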
## Build

Please refer to the Spark-Rapids-ML [README](https://github.com/NVIDIA/spark-rapids-ml/blob/HEAD/python) for environment setup instructions and API usage.
## Download RAPIDS Jar from Maven Central

Download the RAPIDS jar from Maven Central: [rapids-4-spark_2.12-24.08.1.jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar). Alternatively, see the Spark-Rapids [download page](https://nvidia.github.io/spark-rapids/docs/download.html#download-rapids-accelerator-for-apache-spark-v24081) to select a different version.
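If you start PySpark yourself rather than through the launch script below, the downloaded jar is typically attached to the session together with the RAPIDS Accelerator plugin. A hedged sketch, with placeholder paths and illustrative single-GPU resource settings:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-rapids-ml-pca")
    # Path to the jar downloaded above (placeholder).
    .config("spark.jars", "/path/to/rapids-4-spark_2.12-24.08.1.jar")
    # Enable the RAPIDS Accelerator for Apache Spark.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Illustrative GPU resource settings for a single-GPU executor.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)
```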
## Running the Notebooks

Once you have built your environment, follow these instructions to run the notebooks. Make sure `jupyterlab` is installed in the environment.

**Note**: for demonstration purposes, these examples use a local Spark Standalone cluster with a single executor, but you should be able to run them on any distributed Spark cluster.
```
# set up environment variables
export SPARK_HOME=/path/to/spark
export RAPIDS_JAR=/path/to/rapids.jar

# launch the standalone cluster and jupyter with pyspark
./start-spark-rapids.sh

# browse to localhost:8888 to view/run the notebooks

# stop the spark standalone cluster
${SPARK_HOME}/sbin/stop-worker.sh; ${SPARK_HOME}/sbin/stop-master.sh
```