update docs for xgboost1.7.1 and add python notebooks #252
Changes from all commits: e090e4e, cc08919, 32aa708, 19057a6, 01ea8a7, ef6668c, baf227b, c6bc5c2, 5e4861a, 95b7b79, b130e31, 47fcbff, cc24728, 9b85f1f, 01bf404
@@ -12,11 +12,13 @@ Prerequisites
 * Multi-node clusters with homogenous GPU configuration
 * Software Requirements
   * Ubuntu 18.04, 20.04/CentOS7, CentOS8
-  * CUDA 11.0+
+  * CUDA 11.5+
   * NVIDIA driver compatible with your CUDA
   * NCCL 2.7.8+
-  * Python 3.6+
+  * Python 3.8 or 3.9
   * NumPy
+  * XGBoost 1.7.0+
+  * cudf-cu11

 The number of GPUs in each host dictates the number of Spark executors that can run there.
 Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.
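The sizing rule above (one GPU per executor, executor cores equal to task cores) can be sketched in a few lines. `executor_settings` is a hypothetical helper for illustration only, not part of the examples repo:

```python
# Hypothetical helper: derive Spark sizing for a homogeneous GPU host so
# that each executor owns one GPU and can run exactly one task at a time.
def executor_settings(gpus_per_host: int, cores_per_host: int) -> dict:
    if gpus_per_host <= 0 or cores_per_host < gpus_per_host:
        raise ValueError("need at least one core per GPU")
    cores_per_executor = cores_per_host // gpus_per_host
    return {
        # one executor per GPU on the host
        "spark.executor.resource.gpu.amount": 1,
        "spark.task.resource.gpu.amount": 1,
        # executor cores and task cores must match (1 task per executor)
        "spark.executor.cores": cores_per_executor,
        "spark.task.cpus": cores_per_executor,
    }
```

For example, a host with 4 GPUs and 48 cores yields 12 cores per executor and per task, so each of the 4 executors runs a single task at a time.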
@@ -47,6 +49,14 @@ And here are the steps to enable the GPU resources discovery for Spark 3.1+.
 spark.worker.resource.gpu.amount 1
 spark.worker.resource.gpu.discoveryScript ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
 ```
+3. Install the XGBoost, cudf-cu11, NumPy, and scikit-learn libraries on all nodes before running the XGBoost application.
+
+``` bash
+pip install xgboost
+pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+pip install numpy
+pip install scikit-learn
+```

> Review comment (on `pip install numpy`): do we still install numpy?
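After the installs above, the minimum versions from the prerequisites (for example XGBoost 1.7.0+) can be sanity-checked with a small sketch. `meets_min` is a hypothetical helper; production code may prefer `packaging.version.Version` instead of this tuple comparison:

```python
def _parse(version: str) -> tuple:
    """Split a dotted version string like '1.7.1' into an integer tuple."""
    return tuple(int(part) for part in version.split("."))

def meets_min(installed: str, required: str) -> bool:
    """True if the installed version satisfies the documented minimum."""
    return _parse(installed) >= _parse(required)

# In practice you would compare against a live value such as
# xgboost.__version__ rather than a hard-coded string.
```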

 Get Application Files, Jar and Dataset
 -------------------------------
@@ -182,6 +192,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.gpu_main

 # tree construction algorithm
 export TREE_METHOD=gpu_hist

+# if you enable archive python environment
+export PYSPARK_DRIVER_PYTHON=python
+export PYSPARK_PYTHON=./environment/bin/python
 ```
 Run spark-submit:

@@ -197,8 +211,9 @@ ${SPARK_HOME}/bin/spark-submit
 --driver-memory ${SPARK_DRIVER_MEMORY} \
 --executor-memory ${SPARK_EXECUTOR_MEMORY} \
 --conf spark.cores.max=${TOTAL_CORES} \
---jars ${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
---py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
+--archives your_pyspark_venv.tar.gz#environment \
+--jars ${RAPIDS_JAR} \
+--py-files ${SAMPLE_ZIP} \
 ${MAIN_PY} \
 --mainClass=${EXAMPLE_CLASS} \
 --dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \

Note: include the `--archives` line only if you enabled the archive python environment.
@@ -261,6 +276,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.cpu_main

 # tree construction algorithm
 export TREE_METHOD=hist

+# if you enable archive python environment
+export PYSPARK_DRIVER_PYTHON=python
+export PYSPARK_PYTHON=./environment/bin/python
 ```

 This is the same command as for the GPU example, repeated for convenience:
@@ -271,8 +290,9 @@ ${SPARK_HOME}/bin/spark-submit
 --driver-memory ${SPARK_DRIVER_MEMORY} \
 --executor-memory ${SPARK_EXECUTOR_MEMORY} \
 --conf spark.cores.max=${TOTAL_CORES} \
---jars ${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
---py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
+--archives your_pyspark_venv.tar.gz#environment \
+--jars ${RAPIDS_JAR} \
+--py-files ${SAMPLE_ZIP} \
 ${SPARK_PYTHON_ENTRYPOINT} \
 --mainClass=${EXAMPLE_CLASS} \
 --dataPath=train::${DATA_PATH}/mortgage/output/train/ \

Note: include the `--archives` line only if you enabled the archive python environment.
@@ -12,12 +12,14 @@ Prerequisites
 * Multi-node clusters with homogenous GPU configuration
 * Software Requirements
   * Ubuntu 18.04, 20.04/CentOS7, CentOS8
-  * CUDA 11.0+
+  * CUDA 11.5+
   * NVIDIA driver compatible with your CUDA
   * NCCL 2.7.8+
-  * Python 3.6+
+  * Python 3.8 or 3.9
   * NumPy
+  * XGBoost 1.7.0+
+  * cudf-cu11

 The number of GPUs per NodeManager dictates the number of Spark executors that can run in that NodeManager.
 Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.
@@ -32,6 +34,32 @@ We use `SPARK_HOME` environment variable to point to the Apache Spark cluster.
 And as to how to enable GPU scheduling and isolation for Yarn,
 please refer to [here](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).

+Please make sure to install the XGBoost, cudf-cu11, NumPy, and scikit-learn libraries on all nodes before running the XGBoost application.
+``` bash
+pip install xgboost
+pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+pip install numpy
+pip install scikit-learn
+```

> Review comment (on `pip install numpy`): same as the previous comment (do we still install numpy?)

+You can also create an isolated python environment by using [Virtualenv](https://virtualenv.pypa.io/en/latest/),
+and then directly pass/unpack the archive file and enable the environment on executors
+by leveraging the --archives option or the spark.archives configuration.
+``` bash
+# create an isolated python environment and install libraries
+python -m venv pyspark_venv
+source pyspark_venv/bin/activate
+pip install xgboost
+pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+pip install numpy
+pip install scikit-learn
+pip install venv-pack
+venv-pack -o pyspark_venv.tar.gz
+
+# enable archive python environment on executors
+export PYSPARK_DRIVER_PYTHON=python  # Do not set in cluster modes.
+export PYSPARK_PYTHON=./environment/bin/python
+spark-submit --archives pyspark_venv.tar.gz#environment app.py
+```
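Instead of the `--archives` flag, the packed environment can equivalently be shipped through the `spark.archives` configuration (Spark 3.1+). This is a sketch assuming the same `pyspark_venv.tar.gz` produced above:

```shell
# equivalent to --archives: ship and unpack the packed venv via configuration
export PYSPARK_DRIVER_PYTHON=python   # do not set in cluster modes
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --conf spark.archives=pyspark_venv.tar.gz#environment \
  app.py
```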

 Get Application Files, Jar and Dataset
 -------------------------------
@@ -114,6 +142,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.gpu_main

 # tree construction algorithm
 export TREE_METHOD=gpu_hist

+# if you enable archive python environment
+export PYSPARK_DRIVER_PYTHON=python
+export PYSPARK_PYTHON=./environment/bin/python
 ```

 Run spark-submit:
@@ -129,11 +161,12 @@ ${SPARK_HOME}/bin/spark-submit
 --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh \
 --master yarn \
 --deploy-mode ${SPARK_DEPLOY_MODE} \
+--archives your_pyspark_venv.tar.gz#environment \
 --num-executors ${SPARK_NUM_EXECUTORS} \
 --driver-memory ${SPARK_DRIVER_MEMORY} \
 --executor-memory ${SPARK_EXECUTOR_MEMORY} \
---jars ${RAPIDS_JAR},${XGBOOST4J_JAR} \
---py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
+--jars ${RAPIDS_JAR} \
+--py-files ${SAMPLE_ZIP} \
 ${MAIN_PY} \
 --mainClass=${EXAMPLE_CLASS} \
 --dataPath=train::${DATA_PATH}/mortgage/out/train/ \

Note: include the `--archives` line only if you enabled the archive python environment.
@@ -190,19 +223,24 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.cpu_main

 # tree construction algorithm
 export TREE_METHOD=hist

+# if you enable archive python environment
+export PYSPARK_DRIVER_PYTHON=python
+export PYSPARK_PYTHON=./environment/bin/python
 ```

 This is the same command as for the GPU example, repeated for convenience:
 ``` bash
 ${SPARK_HOME}/bin/spark-submit \
 --master yarn \
+--archives your_pyspark_venv.tar.gz#environment \
 --deploy-mode ${SPARK_DEPLOY_MODE} \
 --num-executors ${SPARK_NUM_EXECUTORS} \
 --driver-memory ${SPARK_DRIVER_MEMORY} \
 --executor-memory ${SPARK_EXECUTOR_MEMORY} \
---jars ${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
---py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
+--jars ${RAPIDS_JAR} \
+--py-files ${SAMPLE_ZIP} \
 ${MAIN_PY} \
 --mainClass=${EXAMPLE_CLASS} \
 --dataPath=train::${DATA_PATH}/mortgage/output/train/ \

Note: include the `--archives` line only if you enabled the archive python environment.
@@ -1,19 +1,18 @@
 # Spark XGBoost Examples

-Spark XGBoost examples here showcase the need for end-to-end GPU acceleration.
+Spark XGBoost examples here showcase the need for ETL+Training pipeline GPU acceleration.
 The Scala based XGBoost examples here use [DMLC’s version](https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark_2.12/).
-For PySpark based XGBoost, please refer to the [Spark-RAPIDS-examples 22.04 branch](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.04) that
-uses [NVIDIA’s Spark XGBoost version](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.4.2-0.3.0/).
+The PySpark based XGBoost examples require [installing RAPIDS via pip](https://rapids.ai/pip.html#install).
 Most data scientists spend a lot of time not only on
 Training models but also processing the large amounts of data needed to train these models.
-As you can see below, XGBoost training on GPUs can be up to 10X and data processing using
-RAPIDS Accelerator can also be accelerated with an end-to-end speed-up of 7X on GPU compared to CPU.
+As you can see below, PySpark+XGBoost training on GPUs can be up to 13X faster, and data processing using
+RAPIDS Accelerator can also be accelerated, with an end-to-end speed-up of 11X on GPU compared to CPU.
 In the public cloud, better performance can lead to significantly lower costs as demonstrated in this [blog](https://developer.nvidia.com/blog/gpu-accelerated-spark-xgboost/).

> Review comment: Can we also have a benchmark testing for xgboost-jvm-gpu?
> Reply: no, but I think we can add it in another PR.

 ![mortgage-speedup](/docs/img/guides/mortgage-perf.png)

-Note that the test result is based on 21 years of [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
-with a 4 A100 GPU and 512 CPU vcores cluster, the performance is affected by many aspects,
+Note that the Training test result is based on 4 years of [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
+with an 8 A100 GPU and 1024 CPU vcores cluster; the performance is affected by many aspects,
 including data size and type of GPU.

 In this folder, there are three blue prints for users to learn about using
@@ -94,6 +93,9 @@ Please follow below steps to run the example notebooks in different notebook env
 - [Jupyter Notebook for Python](/docs/get-started/xgboost-examples/notebook/python-notebook.md)

+Note:
+Updating the default value of `spark.sql.execution.arrow.maxRecordsPerBatch` to a larger number (such as 200000) will
+significantly improve performance by accelerating data transfer between the JVM and the Python process.

 For the CrossValidator job, we need to set `spark.task.resource.gpu.amount=1` to allow only 1 training task to run on each GPU (executor);
 otherwise the customized CrossValidator may schedule more than 1 xgboost training task onto one executor simultaneously and trigger
 [issue-131](https://github.com/NVIDIA/spark-rapids-examples/issues/131).
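Both notes translate directly into spark-submit configuration; the values below are the ones suggested in the text, and `your_app.py` is a hypothetical entrypoint:

```shell
# larger Arrow batches speed up JVM <-> Python transfer;
# one task per GPU keeps the CrossValidator from oversubscribing executors
spark-submit \
  --conf spark.sql.execution.arrow.maxRecordsPerBatch=200000 \
  --conf spark.task.resource.gpu.amount=1 \
  your_app.py
```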
> Review comment: Do we need to install scikit-learn?
> Reply: yes, add `pip install scikit-learn`.