refine cupsatial demo to make it more clear for customers #187

Merged
merged 66 commits into from
Jul 19, 2022
Changes from 15 commits
8463062
update cuda pub key to avoid GPG error
nvliyuan Jun 24, 2022
85b9491
address review comments
nvliyuan Jul 4, 2022
b7cfe33
address review comments
nvliyuan Jul 4, 2022
81093a4
fit img size
nvliyuan Jul 4, 2022
eeb3206
fit img size
nvliyuan Jul 4, 2022
d4c8071
fit some types
nvliyuan Jul 4, 2022
cfaa7c5
try resize img
nvliyuan Jul 4, 2022
5a11e74
try resize img
nvliyuan Jul 4, 2022
b4a25c9
try resize img
nvliyuan Jul 4, 2022
744db7a
try resize img
nvliyuan Jul 4, 2022
118b60b
try resize img
nvliyuan Jul 4, 2022
7e91fad
fix dockerbuild error
nvliyuan Jul 5, 2022
89feb33
update cuspatial version in build-in-local step
nvliyuan Jul 5, 2022
bac294e
add cpu notebook
nvliyuan Jul 8, 2022
427159f
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 11, 2022
ebe410a
Update the path to make it consistent
nvliyuan Jul 13, 2022
0ca40f8
Merge branch 'cuspatial-refine' of https://github.com/nvliyuan/spark-…
nvliyuan Jul 13, 2022
6f2fda4
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
4e515c3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
5d8ba4b
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
e26c096
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
fd8fb2a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
1a4eb43
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
0790cf3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
f417f64
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
5812cb7
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
a6b8477
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
1f5cdac
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
3d1a992
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
64e3635
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 14, 2022
dc8711e
remove Run CPU Demo step4
nvliyuan Jul 14, 2022
2214bef
break the long sql to multi lines
nvliyuan Jul 14, 2022
c4fdf6a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
d69e86e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
a2a5989
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
18bbeab
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c212136
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c208053
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e8f42d5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
9227671
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c89b822
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
07e9f0f
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e21e5bb
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
f38ee0d
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c4e97a3
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
4bd4ad5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
b8f9743
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
2ce7cca
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
36ca185
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
91a143b
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
8304d9a
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
e741f74
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
3135606
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
cf2715e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
3a04e4e
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
a0777d9
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
35e6706
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
c2ce8d9
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
803c6f5
Update examples/UDF-Examples/Spark-cuSpatial/README.md
nvliyuan Jul 15, 2022
b190d56
try to workarround broken link error
nvliyuan Jul 18, 2022
935d3b9
try to workarround broken link error
nvliyuan Jul 18, 2022
3e6f0ed
try to workarround broken link error
nvliyuan Jul 18, 2022
fa52056
try to workarround broken link error
nvliyuan Jul 18, 2022
6f1628b
try to workarround broken link error
nvliyuan Jul 18, 2022
d9b4d59
try to workarround broken link error
nvliyuan Jul 18, 2022
d61ac6a
revert markdown links checker conf
nvliyuan Jul 19, 2022
Binary file added docs/img/guides/cuspatial/Nyct2000.png
Binary file added docs/img/guides/cuspatial/install-jar.png
Binary file added docs/img/guides/cuspatial/sample-polygon.png
Binary file added docs/img/guides/cuspatial/taxi-zones.png
4 changes: 2 additions & 2 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile
@@ -39,11 +39,11 @@ RUN conda --version
RUN conda install -c conda-forge openjdk=8 maven=3.8.1 -y

# install cuDF dependency.
RUN conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y
RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y

RUN wget --quiet \
https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3-linux-x86_64.tar.gz \
&& tar -xzf cmake-3.21.3-linux-x86_64.tar.gz \
&& rm -rf cmake-3.21.3-linux-x86_64.tar.gz

ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
3 changes: 3 additions & 0 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile.awsdb
@@ -18,6 +18,9 @@ FROM nvidia/cuda:11.2.2-devel-ubuntu18.04

ENV DEBIAN_FRONTEND=noninteractive

# update cuda pub key to avoid GPG error
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# See https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
RUN apt-get update && \
apt-get install --yes --no-install-recommends \
91 changes: 76 additions & 15 deletions examples/UDF-Examples/Spark-cuSpatial/README.md
@@ -5,18 +5,38 @@ It implements a [RapidsUDF](https://nvidia.github.io/spark-rapids/docs/additiona
interface to call the cuSpatial functions through JNI. It can be run on a distributed Spark cluster with scalability.

## Performance
We got the end-2-end time as below table when running with 2009 NYC Taxi trip pickup location,
which includes 168,898,952 points, and 3 sets of polygons(taxi_zone, nyct2000, nycd).
The data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
and [NYC Open data](https://www1.nyc.gov/site/planning/data-maps/open-data.page#district_political).
| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd (71 Complex Polygons)|
The table below shows end-to-end hot-run times measured with the 2009 NYC Taxi trip pickup locations,
which include 170,896,055 points, and 3 sets of polygons (taxi_zone, nyct2000, nycd Community-Districts).
The point data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The polygon data can be downloaded from [taxi_zone dataset](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc),
[nyct2000 dataset](https://data.cityofnewyork.us/City-Government/2000-Census-Tracts/ysjj-vb9j) and
[nycd Community-Districts dataset](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4).

| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd Community-Districts (71 Complex Polygons)|
| ----------- | :---------: | :---------: | :---------: |
| 4-core CPU | 1122.9 seconds | 5525.4 seconds| 6642.7 seconds |
| 1 GPU(Titan V) on local | 4.5 seconds | 5.7 seconds | 6.6 seconds|
| 2 GPU(T4) on Databricks | 9.1 seconds | 10.0 seconds | 12.1 seconds |
| 4-core CPU | 3.9 minutes | 4.0 minutes| 4.1 minutes |
| 1 GPU(T4) on Databricks | 25 seconds | 27 seconds | 28 seconds|
| 2 GPU(T4) on Databricks | 15 seconds | 14 seconds | 17 seconds |
| 4 GPU(T4) on Databricks | 11 seconds | 11 seconds | 12 seconds |
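
The speedup implied by the updated rows (3.9 to 4.1 minutes on CPU versus 11 to 28 seconds on T4 GPUs) can be worked out with a short script. This is only a sketch: the dictionary names and the `speedups` helper are hypothetical, and the inputs are the rounded values from the table, so the ratios are approximate.

```python
# End-to-end hot-run times copied from the updated table, in seconds.
# CPU values are the table's minutes converted to seconds (3.9 min = 234 s, etc.).
# Nothing here is newly measured; the names are illustrative only.
CPU_SECONDS = {"taxi_zones": 234, "nyct2000": 240, "nycd": 246}
GPU_SECONDS = {
    "1 x T4": {"taxi_zones": 25, "nyct2000": 27, "nycd": 28},
    "2 x T4": {"taxi_zones": 15, "nyct2000": 14, "nycd": 17},
    "4 x T4": {"taxi_zones": 11, "nyct2000": 11, "nycd": 12},
}


def speedups(cpu, gpu):
    """Return {environment: {dataset: cpu_time / gpu_time}}, rounded to 1 decimal."""
    return {
        env: {name: round(cpu[name] / secs, 1) for name, secs in times.items()}
        for env, times in gpu.items()
    }


if __name__ == "__main__":
    for env, row in speedups(CPU_SECONDS, GPU_SECONDS).items():
        print(env, row)
```

With the table's rounded values, this gives roughly 9x for one T4 and about 20x to 22x for four T4s relative to the 4-core CPU run.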

Note: If you test with the downloaded point data, update the `x,y` column names to `Start_Lon,Start_Lat` in
the [notebook](./notebooks/cuspatial_sample_db.ipynb).
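
An alternative to editing the notebook is to rename the downloaded columns to `x` and `y` before running it. The sketch below assumes the points are loaded into a PySpark DataFrame (the notebook's actual language and loading code are not shown here); `rename_point_columns` and `DOWNLOADED_TO_NOTEBOOK` are hypothetical names. It relies only on `withColumnRenamed`, which PySpark DataFrames provide.

```python
# Hypothetical helper: works on any DataFrame-like object that offers
# withColumnRenamed(old, new), as PySpark DataFrames do.
DOWNLOADED_TO_NOTEBOOK = {"Start_Lon": "x", "Start_Lat": "y"}


def rename_point_columns(df, mapping=DOWNLOADED_TO_NOTEBOOK):
    """Rename the downloaded longitude/latitude columns to the names the notebook uses."""
    for old, new in mapping.items():
        # withColumnRenamed returns a new DataFrame and is a no-op if `old` is absent.
        df = df.withColumnRenamed(old, new)
    return df
```

In a notebook this would be called as `points_df = rename_point_columns(points_df)`, where `points_df` stands for whatever DataFrame holds the downloaded trip records.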

taxi-zones map:

<img src="../../../docs/img/guides/cuspatial/taxi-zones.png" width="600">

nyct2000 map:

<img src="../../../docs/img/guides/cuspatial/Nyct2000.png" width="600">

nycd-community-districts map:

<img src="../../../docs/img/guides/cuspatial/Nycd-Community-Districts.png" width="600">

## Build
You can build the jar file [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile)
Before running this demo, you need to build the UDF jar file from source.
You can do it [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile),
or you can build it [locally](#build-in-local) after installing some prerequisites.

### Build in Docker
@@ -33,6 +53,8 @@ or you can build it [in local](#build-in-local) machine after some prerequisites
```
3. You'll get the jar named like "spark-cuspatial-<version>.jar" in the target folder.

Note: The docker env is just for building the jar, not for running the application.

### Build in Local:
1. essential build tools:
- [cmake(>=3.20)](https://cmake.org/download/),
@@ -43,21 +65,20 @@ or you can build it [in local](#build-in-local) machine after some prerequisites
4. [cuspatial](https://github.com/rapidsai/cuspatial): install libcuspatial
```Bash
# get libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.04
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06
# or below command for the nightly (aka SNAPSHOT) version.
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.08
```
5. Get the code, then run "mvn package".
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/Spark-cuSpatial/
mvn package
```
6. You'll get "spark-cuspatial-<version>.jar" in the target folder.

6. You'll get "spark-cuspatial-<version>.jar" in the target folder.
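
Because the jar name carries a version, later steps that copy or install it need the exact file name. A minimal sketch for locating it after `mvn package`, assuming the build output is in a `target` directory as the steps above state; `find_cuspatial_jars` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_cuspatial_jars(target_dir):
    """Return spark-cuspatial-*.jar files directly under target_dir, newest first."""
    jars = Path(target_dir).glob("spark-cuspatial-*.jar")
    # Sort by modification time so the most recent build comes first.
    return sorted(jars, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for jar in find_cuspatial_jars("target"):
        print(jar)
```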

## Run
### Run on-premises clusters: standalone
### GPU Demo Run on-premises clusters: standalone
1. Install necessary libraries. Besides `cudf` and `cuspatial`, the `gdal` library that is compatible with the installed `cuspatial` may also be needed.
Install it by running the command below.
```
@@ -69,12 +90,16 @@ or you can build it [in local](#build-in-local) machine after some prerequisites
* [spark-rapids v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the dataset & jars. Copy the sample dataset from [cuspatial_data](../../../datasets/cuspatial_data.tar.gz) to "/data/cuspatial_data".
Copy spark-rapids & spark-cuspatial-22.08.0-SNAPSHOT.jar to "/data/cuspatial_data/jars".
If you build the spark-cuspatial-22.08.0-SNAPSHOT.jar in docker, please copy the jar from docker to local:
```
docker cp your-instance:/root/spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/target/spark-cuspatial-<version>.jar ./your-local-path
```
You can use your own path, but remember to update the paths in "gpu-run.sh" accordingly.
5. Run "gpu-run.sh"
```Bash
./gpu-run.sh
```
### Run on AWS Databricks
### GPU Demo Run on AWS Databricks
1. Build the customized docker image [Dockerfile.awsdb](Dockerfile.awsdb) and push to dockerhub so that it can be accessible by AWS Databricks.
```Bash
# replace your dockerhub repo, your tag or any other repo AWS DB can access
@@ -103,5 +128,41 @@ or you can build it [in local](#build-in-local) machine after some prerequisites
points
polygons
```
The sample points and polygons are randomly generated.

Sample polygons:

<img src="../../../docs/img/guides/cuspatial/sample-polygon.png" width="600">

4. Import the library "spark-cuspatial-22.08.0-SNAPSHOT.jar" into Databricks, then install it on your cluster.

<img src="../../../docs/img/guides/cuspatial/install-jar.png" width="600">

5. Import [cuspatial_sample_db.ipynb](notebooks/cuspatial_sample_db.ipynb) into your Databricks workspace. Attach it to your cluster, then run it.
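
To make the sample data in the steps above more concrete, here is a CPU-only sketch of the kind of computation the demo is assumed to run: testing which randomly generated points fall inside a polygon. The assumption that the core operation is a point-in-polygon test is not stated in this diff, and the ray-casting function below is a generic textbook technique, not the cuSpatial implementation or the demo's own code.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs use the same points
    square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
    points = [(rng.uniform(-2, 6), rng.uniform(-2, 6)) for _ in range(5)]
    for px, py in points:
        print((round(px, 2), round(py, 2)), point_in_polygon(px, py, square))
```

The loop does one edge scan per point and polygon, so the work grows with points times polygon edges, which is the kind of workload that benefits from running in parallel.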

### CPU Demo Run on AWS Databricks
1. Set up a Databricks cluster with Databricks Runtime Version: Standard Runtime 10.3.

2. Install the Sedona jars and Sedona Python on Databricks using the default Databricks web UI.
The Sedona version should be 1.1.1-incubating or higher.
* From the Libraries tab install from Maven Coordinates
```Bash
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating
org.datasyslab:geotools-wrapper:1.1.0-25.2
```
* For enabling python support, from the Libraries tab install from PyPI
```Bash
apache-sedona
```
3. From your cluster configuration (Cluster -> Edit -> Configuration -> Advanced options -> Spark) activate the
Sedona functions and the kryo serializer by adding to the Spark Config
```Bash
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```
4. Restart the cluster because the libraries installed via UI are installed after the cluster has already started,
and therefore the classes specified by the config `spark.sql.extensions`, `spark.serializer`, and `spark.kryo.registrator` are not available
at startup time.

5. Upload the sample data files to dbfs, attach the [notebook](notebooks/spacial-cpu-apache-sedona_db.ipynb) to the cluster, and run all cells.
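
After the restart in step 4, it can be useful to confirm that the three Spark settings from step 3 actually took effect. The sketch below compares a plain dictionary of settings against the expected values; `SEDONA_SPARK_CONF` and `missing_or_wrong` are hypothetical names, and the expected values are copied from the Spark Config block above.

```python
# Expected settings, copied from the Spark Config block in step 3.
SEDONA_SPARK_CONF = {
    "spark.sql.extensions": (
        "org.apache.sedona.viz.sql.SedonaVizExtensions,"
        "org.apache.sedona.sql.SedonaSqlExtensions"
    ),
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
}


def missing_or_wrong(actual_conf, expected=SEDONA_SPARK_CONF):
    """Return the expected keys whose value in actual_conf is absent or different."""
    return [key for key, value in expected.items() if actual_conf.get(key) != value]
```

In a notebook cell, the actual settings can be collected with `dict(spark.sparkContext.getConf().getAll())`, assuming the usual predefined `spark` session, and passed to `missing_or_wrong`; an empty list means all three settings match.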