Refine cuSpatial demo to make it clearer for customers #187

Merged: 66 commits, Jul 19, 2022

Commits
8463062
update cuda pub key to avoid GPG error
nvliyuan Jun 24, 2022
85b9491
address review comments
nvliyuan Jul 4, 2022
b7cfe33
address review comments
nvliyuan Jul 4, 2022
81093a4
fit img size
nvliyuan Jul 4, 2022
eeb3206
fit img size
nvliyuan Jul 4, 2022
d4c8071
fit some types
nvliyuan Jul 4, 2022
cfaa7c5
try resize img
nvliyuan Jul 4, 2022
5a11e74
try resize img
nvliyuan Jul 4, 2022
b4a25c9
try resize img
nvliyuan Jul 4, 2022
744db7a
try resize img
nvliyuan Jul 4, 2022
118b60b
try resize img
nvliyuan Jul 4, 2022
7e91fad
fix dockerbuild error
nvliyuan Jul 5, 2022
89feb33
update cuspatial version in build-in-local step
nvliyuan Jul 5, 2022
bac294e
add cpu notebook
nvliyuan Jul 8, 2022
427159f
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 11, 2022
ebe410a
Update the path to make it consistent
nvliyuan Jul 13, 2022
0ca40f8
Merge branch 'cuspatial-refine' of https://github.com/nvliyuan/spark-…
nvliyuan Jul 13, 2022
6f2fda4
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
4e515c3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
5d8ba4b
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
e26c096
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
fd8fb2a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
1a4eb43
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
0790cf3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
f417f64
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
5812cb7
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
a6b8477
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
1f5cdac
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
3d1a992
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
64e3635
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
dc8711e
remove Run CPU Demo step4
nvliyuan Jul 14, 2022
2214bef
break the long sql to multi lines
nvliyuan Jul 14, 2022
c4fdf6a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
d69e86e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
a2a5989
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
18bbeab
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c212136
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c208053
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e8f42d5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
9227671
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c89b822
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
07e9f0f
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e21e5bb
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
f38ee0d
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c4e97a3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
4bd4ad5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
b8f9743
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
2ce7cca
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
36ca185
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
91a143b
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
8304d9a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e741f74
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
3135606
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
cf2715e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
3a04e4e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
a0777d9
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
35e6706
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c2ce8d9
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
803c6f5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
b190d56
try to workarround broken link error
nvliyuan Jul 18, 2022
935d3b9
try to workarround broken link error
nvliyuan Jul 18, 2022
3e6f0ed
try to workarround broken link error
nvliyuan Jul 18, 2022
fa52056
try to workarround broken link error
nvliyuan Jul 18, 2022
6f1628b
try to workarround broken link error
nvliyuan Jul 18, 2022
d9b4d59
try to workarround broken link error
nvliyuan Jul 18, 2022
d61ac6a
revert markdown links checker conf
nvliyuan Jul 19, 2022
Binary file added docs/img/guides/cuspatial/Nyct2000.png
Binary file added docs/img/guides/cuspatial/install-jar.png
Binary file added docs/img/guides/cuspatial/sample-polygon.png
Binary file added docs/img/guides/cuspatial/taxi-zones.png
4 changes: 2 additions & 2 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile
@@ -39,11 +39,11 @@ RUN conda --version
RUN conda install -c conda-forge openjdk=8 maven=3.8.1 -y

# install cuDF dependency.
RUN conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y
RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y

RUN wget --quiet \
https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3-linux-x86_64.tar.gz \
&& tar -xzf cmake-3.21.3-linux-x86_64.tar.gz \
&& rm -rf cmake-3.21.3-linux-x86_64.tar.gz

ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
3 changes: 3 additions & 0 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile.awsdb
@@ -18,6 +18,9 @@ FROM nvidia/cuda:11.2.2-devel-ubuntu18.04

ENV DEBIAN_FRONTEND=noninteractive

# update cuda pub key to avoid GPG error
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# See https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
RUN apt-get update && \
apt-get install --yes --no-install-recommends \
135 changes: 96 additions & 39 deletions examples/UDF-Examples/Spark-cuSpatial/README.md
@@ -5,93 +5,117 @@ It implements a [RapidsUDF](https://nvidia.github.io/spark-rapids/docs/additiona
interface to call the cuSpatial functions through JNI. It can be run at scale on a distributed Spark cluster.
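
For orientation, the sketch below shows how such a JNI-backed Java UDF can be exposed to Spark SQL from PySpark. The UDF name, class name, and return type here are illustrative assumptions, not the demo's actual identifiers; check the demo source for the real ones.

```Python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.appName("cuspatial-udf-demo").getOrCreate()

# Register the compiled Java UDF with Spark SQL. The class name below is a
# hypothetical placeholder; use the class actually shipped in the demo JAR.
spark.udf.registerJavaFunction(
    "point_in_polygon",                            # name used in SQL
    "com.nvidia.spark.rapids.udf.PointInPolygon",  # hypothetical class
    ArrayType(IntegerType()))                      # assumed return type
```

When the registered class also implements `RapidsUDF` and the spark-rapids plugin is enabled, the plugin routes the call to the GPU implementation via JNI; otherwise Spark falls back to the plain row-by-row Java code.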

## Performance
We got the end-2-end time as below table when running with 2009 NYC Taxi trip pickup location,
which includes 168,898,952 points, and 3 sets of polygons(taxi_zone, nyct2000, nycd).
The data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
and [NYC Open data](https://www1.nyc.gov/site/planning/data-maps/open-data.page#district_political).
| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd (71 Complex Polygons)|
We got the end-to-end hot-run times in the table below when running against the 2009 NYC Taxi trip pickup locations,
which include 170,896,055 points, and 3 sets of polygons (taxi_zone, nyct2000, nycd Community-Districts).
The point data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The polygon data can be downloaded from the [taxi_zone dataset](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc),
[nyct2000 dataset](https://data.cityofnewyork.us/City-Government/2000-Census-Tracts/ysjj-vb9j), and
[nycd Community-Districts dataset](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4).

| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd Community-Districts (71 Complex Polygons)|
| ----------- | :---------: | :---------: | :---------: |
| 4-core CPU | 1122.9 seconds | 5525.4 seconds| 6642.7 seconds |
| 1 GPU(Titan V) on local | 4.5 seconds | 5.7 seconds | 6.6 seconds|
| 2 GPU(T4) on Databricks | 9.1 seconds | 10.0 seconds | 12.1 seconds |
| 4-core CPU | 3.9 minutes | 4.0 minutes| 4.1 minutes |
| 1 GPU(T4) on Databricks | 25 seconds | 27 seconds | 28 seconds|
| 2 GPU(T4) on Databricks | 15 seconds | 14 seconds | 17 seconds |
| 4 GPU(T4) on Databricks | 11 seconds | 11 seconds | 12 seconds |

Note: Please update the `x,y` column names to `Start_Lon,Start_Lat` in
the [notebook](./notebooks/cuspatial_sample_db.ipynb) if you test with the downloaded points.
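
Equivalently, you can rename the columns once when preparing the point data instead of editing the notebook. A minimal sketch, assuming the raw pickups were saved as Parquet under a hypothetical path:

```Python
# Map the TLC column names onto the x,y names the notebook expects.
# The paths and the Parquet format are assumptions for illustration.
points = (spark.read.parquet("/data/cuspatial_data/raw_points")
          .selectExpr("Start_Lon AS x", "Start_Lat AS y"))
points.write.mode("overwrite").parquet("/data/cuspatial_data/points")
```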

taxi-zones map:

<img src="../../../docs/img/guides/cuspatial/taxi-zones.png" width="600">

nyct2000 map:

<img src="../../../docs/img/guides/cuspatial/Nyct2000.png" width="600">

nyct-community-districts map:

<img src="../../../docs/img/guides/cuspatial/Nycd-Community-Districts.png" width="600">

## Build
You can build the jar file [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile)
or you can build it [in local](#build-in-local) machine after some prerequisites.
First, build the UDF JAR from source before running this demo.
You can build the JAR [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile),
or [on a local machine](#build-in-local-machine) after installing the prerequisites.

### Build in Docker
1. Build the Docker image from the [Dockerfile](Dockerfile), then run the container.
```Bash
docker build -f Dockerfile . -t build-spark-cuspatial
docker run -it build-spark-cuspatial bash
```
2. Get the code, then run "mvn package".
2. Get the code, then run `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
3. You'll get the jar named like "spark-cuspatial-<version>.jar" in the target folder.
3. You'll get the jar named `spark-cuspatial-<version>.jar` in the target folder.

Note: The Docker environment is only for building the JAR, not for running the application.

### Build in Local:
1. essential build tools:
### Build in local machine:
1. Essential build tools:
- [cmake(>=3.20)](https://cmake.org/download/),
- [ninja(>=1.10)](https://github.com/ninja-build/ninja/releases),
- [gcc(>=9.3)](https://gcc.gnu.org/releases.html)
2. [CUDA Toolkit(>=11.0)](https://developer.nvidia.com/cuda-toolkit)
3. conda: use [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage header files and cmake dependencies
4. [cuspatial](https://github.com/rapidsai/cuspatial): install libcuspatial
```Bash
# get libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.04
# Install libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06
# or use the command below for the nightly (aka SNAPSHOT) version.
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.08
```
5. Get the code, then run "mvn package".
5. Build the JAR using `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
6. You'll get "spark-cuspatial-<version>.jar" in the target folder.

6. `spark-cuspatial-<version>.jar` will be generated in the target folder.

## Run
### Run on-premises clusters: standalone
### GPU Demo on Spark Standalone on-premises cluster
1. Install necessary libraries. Besides `cudf` and `cuspatial`, the `gdal` library that is compatible with the installed `cuspatial` may also be needed.
Install it by running the command below.
```Bash
conda install -c conda-forge libgdal=3.3.1
```
2. Set up [a standalone Spark cluster](/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md). Make sure conda's `lib` directory is included in `LD_LIBRARY_PATH`, so that the Spark executors can load `libcuspatial.so`; one way to do this is sketched below.
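
A minimal sketch of pointing the executors at conda's `lib` directory from the session config; the `/opt/conda` prefix is an assumption, and an equivalent `export` in `spark-env.sh` also works:

```Python
from pyspark.sql import SparkSession

# Make the executors resolve libcuspatial.so from conda's lib directory.
# The /opt/conda prefix is an assumption; adjust to your installation.
spark = (SparkSession.builder
         .appName("spark-cuspatial-demo")
         .config("spark.executorEnv.LD_LIBRARY_PATH", "/opt/conda/lib")
         .getOrCreate())
```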

3. Download spark-rapids jars
* [spark-rapids v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the dataset & jars. Copy the sample dataset from [cuspatial_data](../../../datasets/cuspatial_data.tar.gz) to "/data/cuspatial_data".
Copy spark-rapids & spark-cuspatial-22.08.0-SNAPSHOT.jar to "/data/cuspatial_data/jars".
You can use your own path, but remember to update the paths in "gpu-run.sh" accordingly.
5. Run "gpu-run.sh"
3. Download Spark RAPIDS JAR
* [Spark RAPIDS JAR v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare sample dataset and JARs. Copy the [sample dataset](../../../datasets/cuspatial_data.tar.gz) to `/data/cuspatial_data/`.
Copy Spark RAPIDS JAR and `spark-cuspatial-<version>.jar` to `/data/cuspatial_data/jars/`.
If you built `spark-cuspatial-<version>.jar` in Docker, copy the JAR from the container to the local machine:
```Bash
docker cp YOUR_DOCKER_CONTAINER:/PATH/TO/spark-cuspatial-<version>.jar ./YOUR_LOCAL_PATH
```
Note: update the paths in `gpu-run.sh` accordingly.
5. Run `gpu-run.sh`
```Bash
./gpu-run.sh
```
### Run on AWS Databricks
1. Build the customized docker image [Dockerfile.awsdb](Dockerfile.awsdb) and push to dockerhub so that it can be accessible by AWS Databricks.
### GPU Demo on AWS Databricks
1. Build a customized Docker image using [Dockerfile.awsdb](Dockerfile.awsdb) and push it to a Docker registry, such as [Docker Hub](https://hub.docker.com/), that AWS Databricks can access.
```Bash
# replace with your Docker Hub repo and tag, or any other registry AWS Databricks can access
docker build -f Dockerfile.awsdb . -t <your-dockerhub-repo>:<your-tag>
docker push <your-dockerhub-repo>:<your-tag>
```

2. Follow the [Spark-rapids get-started document](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html#start-a-databricks-cluster) to create a GPU cluster on AWS Databricks.
Something different from the document.
The steps below differ from that document because a custom Docker image is used with Databricks:
* Databricks Runtime Version
You should choose a Standard version of the Runtime version like "Runtime: 9.1 LTS(Scala 2.12, Spark 3.1.2)" and
choose GPU instance type like "g4dn.xlarge". Note that ML runtime does not support customized docker container.
If you choose a ML version, it says "Support for Databricks container services requires runtime version 5.3+"
and the "Confirm" button is disabled.
Choose a non-ML Databricks Runtime such as `Runtime: 9.1 LTS (Scala 2.12, Spark 3.1.2)` and
a GPU AWS instance type such as `g4dn.xlarge`. Note that ML runtimes do not support customized Docker containers:
if you choose an ML version, the message `Support for Databricks container services requires runtime version 5.3+` appears
and the `Confirm` button stays disabled.
* Use your own Docker container
Input "Docker Image URL" as "your-dockerhub-repo:your-tag"
* For the other configurations, you can follow the get-started document.
Input `Docker Image URL` as `your-dockerhub-repo:your-tag`
* Follow the Databricks get-started document for other steps.

3. Copy the sample [cuspatial_data.tar.gz](../../../datasets/cuspatial_data.tar.gz) or your data to DBFS by using Databricks CLI.
```Bash
points
polygons
```
4. Import the Library "spark-cuspatial-22.08.0-SNAPSHOT.jar" to the Databricks, then install it to your cluster.
5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) to your workspace in Databricks. Attach to your cluster, then run it.
The sample points and polygons are randomly generated.

Sample polygons:

<img src="../../../docs/img/guides/cuspatial/sample-polygon.png" width="600">

4. Upload `spark-cuspatial-<version>.jar` to DBFS, then install it on the Databricks cluster.

<img src="../../../docs/img/guides/cuspatial/install-jar.png" width="600">

5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) into your Databricks workspace, attach it to the cluster, and run it.
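
For orientation, the sketch below approximates the notebook's core flow, assuming the sample data layout above and a UDF registered under the hypothetical name `point_in_polygon` (see the registration sketch in the overview):

```Python
# Read the sample points copied to DBFS and run the point-in-polygon UDF.
# The (x, y) schema matches the sample dataset; the UDF name is assumed.
points = spark.read.parquet("dbfs:/data/cuspatial_data/points")
points.createOrReplaceTempView("points")

hits = spark.sql("""
    SELECT x, y, point_in_polygon(x, y) AS polygon_ids
    FROM points
""")
hits.show(5)
```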

### CPU Demo on AWS Databricks
1. Create a Databricks cluster. For example, Databricks Runtime 10.3.

2. Install the Sedona JARs and the Sedona Python library on Databricks using the web UI.
The Sedona version should be 1.1.1-incubating or higher.
* Install the JARs below from Maven coordinates in the Libraries tab:
```Bash
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating
org.datasyslab:geotools-wrapper:1.1.0-25.2
```
* To enable Python support, install the Python package below from PyPI in the Libraries tab:
```Bash
apache-sedona
```
3. From your cluster configuration (Cluster -> Edit -> Configuration -> Advanced options -> Spark), activate the
Sedona functions and the Kryo serializer by adding the lines below to the Spark config:
```Bash
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```

4. Upload the sample data files to DBFS, start the cluster, attach the [notebook](notebooks/spacial-cpu-apache-sedona_db.ipynb) to the cluster, and run it.
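
For reference, here is a hedged sketch of the kind of Sedona query the CPU notebook runs. It assumes the points carry `x,y` columns and the polygons are available as WKT in a `wkt` column with a `zone` label; all of these names are assumptions, and the actual notebook may differ.

```Python
from sedona.register import SedonaRegistrator

# Register Sedona's ST_* SQL functions on the current Spark session.
SedonaRegistrator.registerAll(spark)

spark.read.parquet("dbfs:/data/cuspatial_data/points").createOrReplaceTempView("points")
spark.read.parquet("dbfs:/data/cuspatial_data/polygons").createOrReplaceTempView("polygons_raw")

# Count pickups per polygon with a spatial join on ST_Contains.
counts = spark.sql("""
    SELECT p.zone, COUNT(*) AS pickups
    FROM points t
    JOIN (SELECT zone, ST_GeomFromWKT(wkt) AS geom FROM polygons_raw) p
      ON ST_Contains(p.geom, ST_Point(t.x, t.y))
    GROUP BY p.zone
""")
counts.show()
```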