diff --git a/README.md b/README.md index 5f2a246bafe..3713bf599a2 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ via the [RAPIDS](https://rapids.ai) libraries. Documentation on the current release can be found [here](https://nvidia.github.io/spark-rapids/). -To get started and try the plugin out use the [getting started guide](./docs/get-started/getting-started.md). +To get started and try the plugin out use the [getting started guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html). ## Compatibility @@ -17,7 +17,7 @@ Operator compatibility is documented [here](./docs/compatibility.md) ## Tuning To get started tuning your job and get the most performance out of it please start with the -[tuning guide](./docs/tuning-guide.md). +[tuning guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html). ## Configuration @@ -46,7 +46,7 @@ Tests are described [here](tests/README.md). ## Integration The RAPIDS Accelerator For Apache Spark does provide some APIs for doing zero copy data transfer into other GPU enabled applications. It is described -[here](docs/additional-functionality/ml-integration.md). +[here](https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/ml-integration.html). Currently, we are working with XGBoost to try to provide this integration out of the box. @@ -59,8 +59,8 @@ access to any of the memory that RMM is holding. The Qualification and Profiling tools have been moved to [nvidia/spark-rapids-tools](https://github.com/NVIDIA/spark-rapids-tools) repo. -Please refer to [Qualification tool documentation](docs/spark-qualification-tool.md) -and [Profiling tool documentation](docs/spark-profiling-tool.md) +Please refer to [Qualification tool documentation](https://docs.nvidia.com/spark-rapids/user-guide/latest/spark-qualification-tool.html) +and [Profiling tool documentation](https://docs.nvidia.com/spark-rapids/user-guide/latest/spark-profiling-tool.html) for more details on how to use the tools. ## Dependency for External Projects diff --git a/docs/FAQ.md b/docs/FAQ.md deleted file mode 100644 index 1d920bfc7cb..00000000000 --- a/docs/FAQ.md +++ /dev/null @@ -1,654 +0,0 @@ ---- -layout: page -title: Frequently Asked Questions -nav_order: 12 ---- -# Frequently Asked Questions - -* TOC -{:toc} - -### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support? - -Please see [Software Requirements](download.md#software-requirements) section for complete list of -Apache Spark versions supported by RAPIDS plugin. The plugin replaces parts of the physical plan that -Apache Spark considers internal. The code for these plans can change, even between bug fix releases. -As a part of our process, we try to stay on top of these changes and release updates as quickly as possible. - -### Which distributions are supported? - -The RAPIDS Accelerator for Apache Spark officially supports: -- [Apache Spark](get-started/getting-started-on-prem.md) -- [AWS EMR 6.2+](get-started/getting-started-aws-emr.md) -- [Databricks Runtime](get-started/getting-started-databricks.md) -- [Google Cloud Dataproc](get-started/getting-started-gcp.md) -- [Azure Synapse](get-started/getting-started-azure-synapse-analytics.md) -- Cloudera provides the plugin packaged through - [CDS 3.2](https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/cds-3/topics/spark-spark-3-overview.html) - and [CDS 3.3](https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/cds-3/topics/spark-spark-3-overview.html). 
- -Most distributions based on a supported Apache Spark version should work, but because the plugin -replaces parts of the physical plan that Apache Spark considers to be internal. The code for these -plans can change from one distribution to another. We are working with most cloud service providers -to set up testing and validation on their distributions. - -### What CUDA versions are supported? - -CUDA 11.x is currently supported. Please look [here](download.md) for download links for the latest -release. - -### What hardware is supported? - -Please see [Hardware Requirements](download.md#hardware-requirements) section for the list of GPUs that -the RAPIDS plugin has been tested on. It is possible to run the plugin on GeForce desktop hardware with Volta -or better architectures. GeForce hardware does not support -[CUDA forward compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title), -and will need CUDA 11.5 installed. If not, the following error will be displayed: - -``` -ai.rapids.cudf.CudaException: forward compatibility was attempted on non supported HW - at ai.rapids.cudf.Cuda.getDeviceCount(Native Method) - at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(GpuDeviceManager.scala:78) -``` - -More information about cards that support forward compatibility can be found -[here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#faq). - -### How can I check if the RAPIDS Accelerator is installed and which version is running? - -On startup the RAPIDS Accelerator will log a warning message on the Spark driver showing the -version with a message that looks something like this: -``` -WARN RapidsPluginUtils: RAPIDS Accelerator 22.10.0 using cudf 22.10.0. -``` - -The full RAPIDS Accelerator, RAPIDS Accelerator JNI and cudf build properties are logged at `INFO` -level in the Spark driver and executor logs with messages that are similar to the following: -``` -INFO RapidsPluginUtils: RAPIDS Accelerator build: {version=22.10.0-SNAPSHOT, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-09-02T12:41:30Z, revision=66450a3549d7cbb23799ec7be2f6f02b253efb85, cudf_version=22.10.0-SNAPSHOT, branch=HEAD} -INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {version=22.10.0-SNAPSHOT, user=, url=https://github.com/NVIDIA/spark-rapids-jni.git, date=2022-09-02T03:35:21Z, revision=76b71b9ffa1fa4237365b51485d11362cbfb99e5, branch=HEAD} -INFO RapidsPluginUtils: cudf build: {version=22.10.0-SNAPSHOT, user=, url=https://github.com/rapidsai/cudf.git, date=2022-09-02T03:35:21Z, revision=c273da4d6285d6b6f9640585cb3b8cf11310bef6, branch=HEAD} -``` - -### What parts of Apache Spark are accelerated? - -Currently a limited set of SQL and DataFrame operations are supported, please see the -[configs](configs.md) and [supported operations](supported_ops.md) for a more complete list of what -is supported. Some of the MLlib functions, such as `PCA` are supported. -Some of structured streaming is likely to be accelerated, but it has not been an area -of focus right now. Other areas like GraphX or RDDs are not accelerated. - -### Is the Spark `Dataset` API supported? - -The RAPIDS Accelerator supports the `DataFrame` API which is implemented in Spark as `Dataset[Row]`. -If you are using `Dataset[Row]` that is equivalent to the `DataFrame` API. In either case the -operations that are supported for acceleration on the GPU are limited. For example using custom -classes or types with `Dataset` are not supported. 
Neither are using APIs that take `Row` as an input, -or ones that take Scala or Java functions to operate. This includes operators like `flatMap`, `foreach`, -or `foreachPartition`. Such queries will still execute correctly when -using the RAPIDS Accelerator, but it is likely most query operations will not be performed on the -GPU. - -With custom types the `Dataset` API generates query plans that use opaque lambda expressions to -access the custom types. The opaque expressions prevent the RAPIDS Accelerator from translating any -operation with these opaque expressions to the GPU, since the RAPIDS Accelerator cannot determine -how the expression operates. - -### What is the road-map like? - -Please look at the github repository -[https://github.com/nvidia/spark-rapids](https://github.com/nvidia/spark-rapids). It contains issue -tracking and planning for sprints and releases. - -### How much faster will my query run? - -Any single operator isn’t a fixed amount faster. So there is no simple algorithm to see how much -faster a query will run. In addition, Apache Spark can store intermediate data to disk and send it -across the network, both of which we typically see as bottlenecks in real world queries. Generally -for complicated queries where all the processing can run on the GPU we see between 3x and 7x -speedup, with a 4x speedup typical. We have seen as high as 100x in some specific cases. - -### What operators are best suited for the GPU? - -* Group by operations with high cardinality -* Joins with a high cardinality -* Sorts with a high cardinality -* Window operations, especially for large windows -* Complicated processing -* Writing Parquet/ORC -* Reading CSV -* Transcoding (reading an input file and doing minimal processing before writing it out again, -possibly in a different format, like CSV to Parquet) - -### Are there initialization costs? - -From our tests the GPU typically takes about 2 to 3 seconds to initialize when an executor first -starts. If you are only going to run a single query that only takes a few seconds to run this can -be problematic. In general if you are going to do 30 seconds or more of processing within a single -session the overhead can be amortized. - -### How long does it take to translate a query to run on the GPU? - -The time it takes to translate the Apache Spark physical plan to one that can run on the GPU -is proportional to the size of the plan. But, it also depends on the CPU you are -running on and if the JVM has optimized that code path yet. The first queries run in a client will -be worse than later queries. Small queries can typically be translated in a millisecond or two while -larger queries can take tens of milliseconds. In all cases tested the translation time is orders of -magnitude smaller than the total runtime of the query. - -See the entry on [explain](#explain) for details on how to measure this for your queries. - -### How can I tell what will run on the GPU and what will not run on it? - - -An Apache Spark plan is transformed and optimized into a set of operators called a physical plan. -This plan is then run through a set of rules to translate it to a version that runs on the GPU. -If you want to know what will run on the GPU and what will not along with an explanation why you -can set [spark.rapids.sql.explain](configs.md#sql.explain) to `ALL`. If you just want to see the -operators not on the GPU you may set it to `NOT_ON_GPU` (which is the default setting value). 
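For example, assuming the plugin jar is already on the classpath, the full explain output can be requested at submit time in the same `--conf` style used for any other plugin setting:

```shell
# enable the RAPIDS Accelerator and log why each operator is or is not placed on the GPU
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.explain=ALL
```

On an already-running session the same setting can usually be changed with `spark.conf.set("spark.rapids.sql.explain", "ALL")`.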
Be -aware that some queries end up being broken down into multiple jobs, and in those cases a separate -log message might be output for each job. These are logged each time a query is compiled into an -`RDD`, not just when the job runs. -Because of this calling `explain` on a DataFrame will also trigger this to be logged. - -The format of each line follows the pattern -``` -indicator operation operator? explanation -``` - -In this `indicator` is one of the following - * `*` for operations that will run on the GPU - * `@` for operations that could run on the GPU but will not because they are a part of a larger - section of the plan that will not run on the GPU - * `#` for operations that have been removed from the plan. The reason they are removed will be - in the explanation. - * `!` for operations that cannot run on the GPU - -`operation` indicates the type of the operator. - * `Expression` These are typically functions that operate on columns of data and produce a column - of data. - * `Exec` These are higher level operations that operate on an entire table at a time. - * `Partitioning` These are different types of partitioning used when reorganizing data to move to - different tasks. - * `Input` These are different input formats used with a few input statements, but not all. - * `Output` These are different output formats used with a few output statements, but not all. - * `NOT_FOUND` These are for anything that the plugin has no replacement rule for. - -`NAME` is the name of the operator given by Spark. - -`operator?` is an optional string representation of the operator given by Spark. - -`explanation` is a text explanation saying if this will - * run on the GPU - * could run on the GPU but will not because of something outside this operator and an - explanation why - * will not run on the GPU with an explanation why - * will be removed from the plan with a reason why - -Generally if an operator is not compatible with Spark for some reason and is off, the explanation -will include information about how it is incompatible and what configs to set to enable the -operator if you can accept the incompatibility. - -These messages are logged at the WARN level so even in `spark-shell` which by default only logs -at WARN or above you should see these messages. - -This translation takes place in two steps. The first step looks at the plan, figures out what -can be translated to the GPU, and then does the translation. The second step optimizes the -transitions between the CPU and the GPU. -Explain will also log how long these translations took at the INFO level with lines like. - -``` -INFO GpuOverrides: Plan conversion to the GPU took 3.13 ms -INFO GpuOverrides: GPU plan transition optimization took 1.66 ms -``` - -Because it is at the INFO level, the default logging level for `spark-shell` is not going to display -this information. If you want to monitor this number for your queries you might need to adjust your -logging configuration. - -### Why does the plan for the GPU query look different from the CPU query? - -Typically, there is a one to one mapping between CPU stages in a plan and GPU stages. There are a -few places where this is not the case. - -* `WholeStageCodeGen` - The GPU plan typically does not do code generation, and does not support - generating code for an entire stage in the plan. Code generation reduces the cost of processing - data one row at a time. 
The GPU plan processes the data in a columnar format, so the costs - of processing a batch is amortized over the entire batch of data and code generation is not - needed. - -* `ColumnarToRow` and `RowToColumnar` transitions - The CPU version of Spark plans typically process - data in a row based format. The main exception to this is reading some kinds of columnar data, - like Parquet. Transitioning between the CPU and the GPU also requires transitioning between row - and columnar formatted data. - -* `GpuCoalesceBatches` and `GpuShuffleCoalesce` - Processing data on the GPU scales - sublinearly. That means doubling the data does often takes less than half the time. Because of - this we want to process larger batches of data when possible. These operators will try to combine - smaller batches of data into fewer, larger batches to process more efficiently. - -* `SortMergeJoin` - The RAPIDS Accelerator does not support sort merge joins yet. For now, we - translate sort merge joins into shuffled hash joins. Because of this there are times when sorts - may be removed or other sorts added to meet the ordering requirements of the query. - -* `TakeOrderedAndProject` - The `TakeOrderedAndProject` operator will take the top N entries in - each task, shuffle the results to a single executor and then take the top N results from that. - The GPU plan often has more metrics than the CPU versions do, and when we tried to combine all of - these operations into a single stage the metrics were confusing to understand. Instead, we split - the single stage up into multiple smaller parts, so the metrics are clearer. - -### Why does `explain()` show that the GPU will be used even after setting `spark.rapids.sql.enabled` to `false`? - -Apache Spark caches what is used to build the output of the `explain()` function. That cache has no -knowledge about configs, so it may return results that are not up to date with the current config -settings. This is true of all configs in Spark. If you changed -`spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting -query would not change to reflect that config and still show a `SortMergeJoin` even though the new -config might have changed to be a `BroadcastHashJoin` instead. When actually running something like -with `collect`, `show` or `write` a new `DataFrame` is constructed causing Spark to re-plan the -query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain shows -stale results. - -### How are failures handled? - -The RAPIDS Accelerator does not change the way failures are normally handled by Apache Spark. - -### How does the Spark scheduler decide what to do on the GPU vs the CPU? - -Technically the Spark scheduler does not make those decisions. The plugin has a set of rules that -decide if an operation can safely be replaced by a GPU enabled version. We are working on some cost -based optimizations to try and improve performance for some situations where it might be more -efficient to stay on the CPU instead of going back and forth. - -### Is Dynamic Partition Pruning (DPP) Supported? - -Yes, DPP works. - -### Is Adaptive Query Execution (AQE) Supported? - -Any operation that is supported on GPU will stay on the GPU when AQE is enabled. - -#### Why does my query show as not on the GPU when Adaptive Query Execution is enabled? - -When running an `explain()` on a query where AQE is on, it is possible that AQE has not finalized -the plan. 
In this case a message stating `AdaptiveSparkPlan isFinalPlan=false` will be printed at -the top of the physical plan, and the explain output will show the query plan with CPU operators. -As the query runs, the plan on the UI will update and show operations running on the GPU. This can -happen for any AdaptiveSparkPlan where `isFinalPlan=false`. - -``` -== Physical Plan == -AdaptiveSparkPlan isFinalPlan=false -+- ... -``` - -Once the query has been executed you can access the finalized plan on WebUI and in the user code -running on the Driver, e.g. in a REPL or notebook, to confirm that the query has executed on GPU: - -```Python ->>> df=spark.range(0,100).selectExpr("sum(*) as sum") ->>> df.explain() -== Physical Plan == -AdaptiveSparkPlan isFinalPlan=false -+- HashAggregate(keys=[], functions=[sum(id#0L)]) - +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#11] - +- HashAggregate(keys=[], functions=[partial_sum(id#0L)]) - +- Range (0, 100, step=1, splits=16) - - ->>> df.collect() -[Row(sum=4950)] ->>> df.explain() -== Physical Plan == -AdaptiveSparkPlan isFinalPlan=true -+- == Final Plan == - GpuColumnarToRow false - +- GpuHashAggregate(keys=[], functions=[gpubasicsum(id#0L, LongType, false)]), filters=ArrayBuffer(None)) - +- GpuShuffleCoalesce 2147483647 - +- ShuffleQueryStage 0 - +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#64] - +- GpuHashAggregate(keys=[], functions=[partial_gpubasicsum(id#0L, LongType, false)]), filters=ArrayBuffer(None)) - +- GpuRange (0, 100, step=1, splits=16) -+- == Initial Plan == - HashAggregate(keys=[], functions=[sum(id#0L)]) - +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#11] - +- HashAggregate(keys=[], functions=[partial_sum(id#0L)]) - +- Range (0, 100, step=1, splits=16) -``` - -### Are cache and persist supported? - -Yes cache and persist are supported, the cache is GPU accelerated -but still stored on the host memory. -Please refer to [RAPIDS Cache Serializer](./additional-functionality/cache-serializer.md) -for more details. - -### Can I cache data into GPU memory? - -No, that is not currently supported. -It would require much larger changes to Apache Spark to be able to support this. - -### Is PySpark supported? - -Yes - -### Are the R APIs for Spark supported? - -Yes, but we don't actively test them, because the RAPIDS Accelerator hooks into Spark not at -the various language APIs but at the Catalyst level after all the various APIs have converged into -the DataFrame API. - -### Are the Java APIs for Spark supported? - -Yes, but we don't actively test them, because the RAPIDS Accelerator hooks into Spark not at -the various language APIs but at the Catalyst level after all the various APIs have converged into -the DataFrame API. - -### Are the Scala APIs for Spark supported? - -Yes - -### Is the GPU needed on the driver? Are there any benefits to having a GPU on the driver? - -The GPU is not needed on the driver and there is no benefit to having one available on the driver -for the RAPIDS plugin. - -### Are table layout formats supported? - -Yes, there is GPU support for [Delta Lake](./additional-functionality/delta-lake-support.md) and -[Apache Iceberg](./additional-functionality/iceberg-support.md). See the additional support -documentation for specifics on the operations supported for these formats. - -### How many tasks can I run per executor? How many should I run per executor? - -There is no limit on the number of tasks per executor that you can run. 
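As a purely illustrative sketch (the numbers below are placeholders to adapt to your own cluster, not tuned recommendations), an executor with 4 cores and one GPU, where at most 2 of those tasks actively use the GPU at a time, could be requested with:

```shell
# Illustrative values: 4 concurrent tasks per executor sharing one GPU,
# with at most 2 of them executing on the GPU at the same time.
--conf spark.executor.cores=4 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.rapids.sql.concurrentGpuTasks=2
```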
Generally we recommend 2 to -6 tasks per executor and 1 GPU per executor. The GPU typically benefits from having 2 tasks run -in [parallel](configs.md#sql.concurrentGpuTasks) on it at a time, assuming your GPU has enough -memory to support that. Having 2 to 3 times as many tasks off of the GPU as on the GPU allows for -I/O to be run in parallel with the processing. If you increase the tasks too high you can overload -the I/O and starting the initial processing can suffer. But if you have a lot of processing that -cannot be done on the GPU, like complex UDFs, the more tasks you have the more CPU processing you -can throw at it. - -### How are `spark.executor.cores`, `spark.task.resource.gpu.amount`, and `spark.rapids.sql.concurrentGpuTasks` related? - -The `spark.executor.cores` and `spark.task.resource.gpu.amount` configuration settings are inputs -to the Spark task scheduler and control the maximum number of tasks that can be run concurrently -on an executor, regardless of whether they are running CPU or GPU code at any point in time. See -the [Number of Tasks per Executor](tuning-guide.md#number-of-tasks-per-executor) section in the -tuning guide for more details. - -The `spark.rapids.sql.concurrentGpuTasks` configuration setting is specific to the RAPIDS -Accelerator and further limits the number of concurrent tasks that are _actively_ running code on -the GPU or using GPU memory at any point in time. See the -[Number of Concurrent Tasks per GPU](tuning-guide.md#number-of-concurrent-tasks-per-gpu) section -of the tuning guide for more details. - -### Why are multiple GPUs per executor not supported? - -The RAPIDS Accelerator only supports a single GPU per executor because that was a limitation of -[RAPIDS cudf](https://github.com/rapidsai/cudf), the foundation of the Accelerator. Basic support -for working with multiple GPUs has only recently been added to RAPIDS cudf, and there are no plans -for its individual operations to leverage multiple GPUs (e.g.: a single task's join operation -processed by multiple GPUs). - -Many Spark setups avoid allocating too many concurrent tasks to the same executor, and often -multiple executors are run per node on the cluster. Therefore this feature has not been -prioritized, as there has not been a compelling use-case that requires it. - -### Why are multiple executors per GPU not supported? - -There are multiple reasons why this a problematic configuration: -- Apache Spark does not support scheduling a fractional number of GPUs to an executor -- CUDA context switches between processes sharing a single GPU can be expensive -- Each executor would have a fraction of the GPU memory available for processing - -### Is [Multi-Instance GPU (MIG)](https://www.nvidia.com/en-gb/technologies/multi-instance-gpu/) supported? - -Yes, but it requires support from the underlying cluster manager to isolate the MIG GPU instance -for each executor (e.g.: by setting `CUDA_VISIBLE_DEVICES`, -[YARN with docker isolation](https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/MIG-Support) -or other means). - -Note that MIG is not recommended for use with the RAPIDS Accelerator since it significantly -reduces the amount of GPU memory that can be used by the Accelerator for each executor instance. -If the cluster is purpose-built to run Spark with the RAPIDS Accelerator then we recommend running -without MIG. 
Also note that the UCX-based shuffle plugin will not work as well in this -configuration because -[MIG does not support direct GPU to GPU transfers](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#app-considerations). - -However MIG can be advantageous if the cluster is intended to be shared amongst other processes -(like ML / DL jobs). - -### How can I run custom expressions/UDFs on the GPU? - -The RAPIDS Accelerator provides the following solutions for running -user-defined functions on the GPU: - -#### RAPIDS Accelerated UDFs - -UDFs can provide a RAPIDS accelerated implementation which allows the RAPIDS Accelerator to perform -the operation on the GPU. See the [RAPIDS accelerated UDF documentation](additional-functionality/rapids-udfs.md) -for details. - -#### Automatic Translation of Scala UDFs to Apache Spark Operations - -The RAPIDS Accelerator has an experimental byte-code analyzer which can translate some simple -Scala UDFs into equivalent Apache Spark operations in the query plan. The RAPIDS Accelerator then -translates these operations into GPU operations just like other query plan operations. - -The Scala UDF byte-code analyzer is disabled by default and must be enabled by the user via the -[`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) configuration -setting. - -#### Optimize a row-based UDF in a GPU operation - -If the UDF can not be implemented by RAPIDS Accelerated UDFs or be automatically translated to -Apache Spark operations, the RAPIDS Accelerator has an experimental feature to transfer only the -data it needs between GPU and CPU inside a query operation, instead of falling this operation back -to CPU. This feature can be enabled by setting `spark.rapids.sql.rowBasedUDF.enabled` to true. - - -### Why is the size of my output Parquet/ORC file different? - -This can come down to a number of factors. The GPU version often compresses data in smaller chunks -to get more parallelism and performance. This can result in larger files in some instances. We have -also seen instances where the ordering of the data can have a big impact on the output size of the -files. Spark tends to prefer sort based joins, and in some cases sort based aggregations, whereas -the GPU versions are all hash based. This means that the resulting data can come out in a different -order for the CPU and the GPU. This is not wrong, but can make the size of the output data -different because of compression. Users can turn on -[spark.rapids.sql.hashOptimizeSort.enabled](additional-functionality/advanced_configs.md#sql.hashOptimizeSort.enabled) to have -the GPU try to replicate more closely what the output ordering would have been if sort were used, -like on the CPU. - -### Why am I getting the error `Failed to open the timezone file` when reading files? - -When reading from a file that contains data referring to a particular timezone, e.g.: reading -timestamps from an ORC file, the system's timezone database at `/usr/share/zoneinfo/` must contain -the timezone in order to process the data properly. This error often indicates the system is -missing the timezone database. The timezone database is provided by the `tzdata` package on many -Linux distributions. - -### Why am I getting an error when trying to use pinned memory? 
- -``` -Caused by: ai.rapids.cudf.CudaException: OS call failed or operation not supported on this OS - at ai.rapids.cudf.Cuda.hostAllocPinned(Native Method) - at ai.rapids.cudf.PinnedMemoryPool.(PinnedMemoryPool.java:254) - at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:185) - at java.util.concurrent.FutureTask.run(FutureTask.java:264) -``` - -This is typically caused by the IOMMU being enabled. Please see the -[CUDA docs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux) -for this issue. - -To fix it you can either disable the IOMMU, or you can disable using pinned memory by setting -[`spark.rapids.memory.pinnedPool.size`](configs.md#memory.pinnedPool.size) to 0. - -### Why am I getting a buffer overflow error when using the KryoSerializer? -Buffer overflow will happen when trying to serialize an object larger than -[`spark.kryoserializer.buffer.max`](https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization), -and may result in an error such as: -``` -Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 636 - at com.esotericsoftware.kryo.io.Output.require(Output.java:167) - at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251) - at com.esotericsoftware.kryo.io.Output.write(Output.java:219) - at java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859) - at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712) - at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123) - at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107) - ... -``` -Try increasing the -[`spark.kryoserializer.buffer.max`](https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization) -from a default of 64M to something larger, for example 512M. - -### Why am I getting "Unable to acquire buffer" or "Trying to free an invalid buffer"? - -The RAPIDS Accelerator tracks GPU allocations that can be spilled to CPU memory or even disk using -the `RapidsBufferCatalog` class. - -When a fatal Spark exception happens, the executor process will not wait for task threads to complete -before starting to shut down the process. This shutdown includes the `RapidsBufferCatalog`. -This race can cause the catalog to log the following messages that can be ignored. - -The beginning of a shutdown is logged as such: - -``` -INFO com.nvidia.spark.rapids.RapidsBufferCatalog: Closing storage -``` - -After shutdown starts, we could see acquire-after-free exception, where the storage already disposed of the -object (GPU, host memory, and disk) and a task thread is not aware of the shutdown yet: - -``` -ERROR org.apache.spark.executor.Executor: Exception in task 1420.0 in stage 29.0 (TID 39681) -java.lang.IllegalStateException: Unable to acquire buffer for ID: TempSpillBufferId(3122,temp_local_a5806b35-3110-44f4-911c-39d55785e8f5) -``` - -Similarly, a double free is detected (where the first free is the catalog shutdown, and the second free is a -task thread that freed as part of its logic), and will cause a WARN to be logged: - -``` -WARN com.nvidia.spark.rapids.RapidsDeviceMemoryStore: Trying to free an invalid buffer => TempSpillBufferId(3183,temp_local_e7f190f2-a5da-4dbf-b6ae-bd683089c102), size = 406702784, device memory buffer size=406702784 -``` - -### Is speculative execution supported? - -Yes, speculative execution in Spark is fine with the RAPIDS Accelerator plugin. 
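Speculation itself is controlled entirely by the standard Apache Spark settings and needs nothing extra from the plugin; for example (the threshold values shown are illustrative only):

```shell
# consider a task for speculation once 90% of tasks in the stage have finished
# and it is running 1.5x slower than the median task
--conf spark.speculation=true \
--conf spark.speculation.multiplier=1.5 \
--conf spark.speculation.quantile=0.9
```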
- -As with all speculative execution, it may or may not be beneficial depending on the nature of why a -particular task is slow and how easily speculation is triggered. You should monitor your Spark jobs -to see how often task speculation occurs and how often the speculating task (i.e.: the one launched -later) finishes before the slow task that triggered speculation. If the speculating task often -finishes first then that's good, it is working as intended. If many tasks are speculating, but the -original task always finishes first then this is a pure loss, the speculation is adding load to -the Spark cluster with no benefit. - -### Why is my query in GPU mode slower than CPU mode? - -Below are some troubleshooting tips on GPU query performance issue: -* Identify the most time consuming part of the query. You can use the - [Profiling tool](./spark-profiling-tool.md) to process the Spark event log to get more insights of - the query performance. For example, if I/O is the bottleneck, we suggest optimizing the backend - storage I/O performance because the most suitable query type is computation bound instead of - I/O or network bound. - -* Make sure at least the most time consuming part of the query is on the GPU. Please refer to - [Getting Started on Spark workload qualification](./get-started/getting-started-workload-qualification.md) - for more details. Ideally we hope the whole query is fully on the GPU, but if some minor part of - the query, eg. a small JDBC table scan, can not run on the GPU, it won't cause much performance - overhead. If there are some CPU fallbacks, check if those are some known features which can be - enabled by turning on some RAPIDS Accelerator parameters. If the features needed do not exist in - the most recent release of the RAPIDS Accelerator, please file a - [feature request](https://github.com/NVIDIA/spark-rapids/issues) with a minimum reproducing example. - -* Tune the Spark and RAPIDS Accelerator parameters such as `spark.sql.shuffle.partitions`, - `spark.sql.files.maxPartitionBytes` and `spark.rapids.sql.concurrentGpuTasks` as these configurations can affect performance of queries significantly. - Please refer to [Tuning Guide](./tuning-guide.md) for more details. - -### Why is the Avro library not found by RAPIDS? - -If you are getting a warning `Avro library not found by the RAPIDS plugin.` or if you are getting the -`java.lang.NoClassDefFoundError: org/apache/spark/sql/v2/avro/AvroScan` error, make sure you ran the -Spark job by using the `--jars` or `--packages` option followed by the file path or maven path to -RAPIDS jar since that is the preferred way to run RAPIDS accelerator. - -Note, you can add locally installed jars for external packages such as Avro Data Sources and the -RAPIDS Accelerator jars via `spark.driver.extraClassPath` (--driver-class-path in the client mode) -on the driver side, and `spark.executor.extraClassPath` on the executor side. However, you should not -mix the deploy methods for either of the external modules. Either deploy both Spark Avro and RAPIDS -Accelerator jars as local jars via `extraClassPath` settings or use the `--jars` or `--packages` options. - -As a consequence, per [issue-5796](https://github.com/NVIDIA/spark-rapids/issues/5796), if you also -use the RAPIDS Shuffle Manager, your deployment option may be limited to the extraClassPath method. - -### What is the default RMM pool allocator? 
- -Starting from 22.06, the default value for `spark.rapids.memory.gpu.pool` is changed to `ASYNC` from -`ARENA` for CUDA 11.5+. For CUDA 11.4 and older, it will fall back to `ARENA`. - -### What is a `RetryOOM` or `SplitAndRetryOOM` exception? - -In the 23.04 release of the accelerator two new exceptions were added to replace a -regular `OutOfMemoryError` that was thrown before when the GPU ran out of memory. -Originally we used `OutOfMemoryError` like on the CPU thinking that it would help to -trigger GC in case handles pointing to GPU memory were leaked in the JVM heap. But -`OutOfMemoryError` is technically a fatal exception and recovering from it is -not strictly supported. As such Apache Spark treats it as a fatal exception and will -kill the process that sees this exception. This can result in a lot of tasks -being rerun if the GPU runs out of memory. These new exceptions prevent that. They -also provide an indication to various GPU operators that the GPU ran out of memory -and how that operator might be able to recover. `RetryOOM` indicates that the operator -should roll back to a known good spot and then wait until the memory allocation -framework decides that it should be retried. `SplitAndRetryOOM` is used -when there is really only one task unblocked and the only way to recover would be to -roll back to a good spot and try to split the input so that less total GPU memory is -needed. - -These are not implemented for all GPU operations. A number of GPU operations that -use a significant amount of memory have been updated to handle `RetryOOM`, but fewer -have been updated to handle `SplitAndRetryOOM`. If you do run into these exceptions -it is an indication that you are using too much GPU memory. The tuning guide can -help you to reduce your memory usage. Be aware that some algorithms do not have -a way to split their usage, things like window operations over some large windows. -If tuning does not fix the problem please file an issue to help us understand what -operators may need better out of core algorithm support. - -### Encryption Support - -The RAPIDS Accelerator for Apache Spark has several components that may or may not follow -the encryption configurations that Apache Spark provides. The following documents the -exceptions that are known at the time of writing this FAQ entry: - -Local storage encryption (`spark.io.encryption.enabled`) is not supported for spilled buffers that the -plugin uses to help with GPU out-of-memory situations. The RAPIDS Shuffle Manager does not implement -local storage encryption for shuffle blocks when configured for UCX, but it does when configured in -MULTITHREADED mode. - -Network encryption (`spark.network.crypto.enabled`) is not supported in the RAPIDS Shuffle Manager -when configured for UCX, but it is supported when configured in MULTITHREADED mode. - -If your environment has specific encryption requirements for network or IO, please make sure -that the RAPIDS Accelerator suits your needs, and file and issue or discussion if you have doubts -or would like expanded encryption support. - -### Can the Rapids Accelerator work with Spark on Ray (RayDP)? -[RayDP](https://github.com/oap-project/raydp) provides simple APIs for running Spark on -[Ray](https://github.com/ray-project/ray). In order to run the RAPIDS Accelerator with RayDP, -GPUs must be requested as `GPU` resources rather than `gpu` resources. 
This can be done -with the following changes to the typical GPU resource scheduling setup with Spark: - -* Change the contents of the `getGpusResources.sh` script to use `GPU` instead of `gpu` -* Change all of the configs that start with `spark.executor.resource.gpu.` to corresponding configs that start with `spark.executor.resource.GPU.` -* Change `spark.task.resource.gpu.amount` to `spark.task.resource.GPU.amount` - -Note that the RAPIDS Accelerator is not regularly tested against Spark on Ray. - -### I have more questions, where do I go? -We use github to track bugs, feature requests, and answer questions. File an -[issue](https://github.com/NVIDIA/spark-rapids/issues/new/choose) for a bug or feature request. Ask -or answer a question on the [discussion board](https://github.com/NVIDIA/spark-rapids/discussions). diff --git a/docs/additional-functionality/advanced_configs.md b/docs/additional-functionality/advanced_configs.md index f8e31040309..883aab24cdb 100644 --- a/docs/additional-functionality/advanced_configs.md +++ b/docs/additional-functionality/advanced_configs.md @@ -46,7 +46,7 @@ Name | Description | Default Value | Applicable at spark.rapids.python.memory.gpu.allocFraction|The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.allocFraction)), since the executor will share the GPU with its owning Python workers. Half of the rest will be used if not specified|None|Runtime spark.rapids.python.memory.gpu.maxAllocFraction|The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. It supposes to be less than (1 - $(spark.rapids.memory.gpu.maxAllocFraction)), since the executor will share the GPU with its owning Python workers. when setting to 0 it means no limit.|0.0|Runtime spark.rapids.python.memory.gpu.pooling.enabled|Should RMM in Python workers act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. When not specified, It will honor the value of config 'spark.rapids.memory.gpu.pooling.enabled'|None|Runtime -spark.rapids.shuffle.enabled|Enable or disable the RAPIDS Shuffle Manager at runtime. The [RAPIDS Shuffle Manager](rapids-shuffle.md) must already be configured. When set to `false`, the built-in Spark shuffle will be used. |true|Runtime +spark.rapids.shuffle.enabled|Enable or disable the RAPIDS Shuffle Manager at runtime. The [RAPIDS Shuffle Manager](https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html) must already be configured. When set to `false`, the built-in Spark shuffle will be used. |true|Runtime spark.rapids.shuffle.mode|RAPIDS Shuffle Manager mode. "MULTITHREADED": shuffle file writes and reads are parallelized using a thread pool. "UCX": (requires UCX installation) uses accelerated transports for transferring shuffle blocks. "CACHE_ONLY": use when running a single executor, for short-circuit cached shuffle (for testing purposes).|MULTITHREADED|Startup spark.rapids.shuffle.multiThreaded.maxBytesInFlight|The size limit, in bytes, that the RAPIDS shuffle manager configured in "MULTITHREADED" mode will allow to be deserialized concurrently per task. This is also the maximum amount of memory that will be used per task. This should be set larger than Spark's default maxBytesInFlight (48MB). The larger this setting is, the more compressed shuffle chunks are processed concurrently. 
In practice, care needs to be taken to not go over the amount of off-heap memory that Netty has available. See https://github.com/NVIDIA/spark-rapids/issues/9153.|134217728|Startup spark.rapids.shuffle.multiThreaded.reader.threads|The number of threads to use for reading shuffle blocks per executor in the RAPIDS shuffle manager configured in "MULTITHREADED" mode. There are two special values: 0 = feature is disabled, falls back to Spark built-in shuffle reader; 1 = our implementation of Spark's built-in shuffle reader with extra metrics.|20|Startup diff --git a/docs/additional-functionality/delta-lake-support.md b/docs/additional-functionality/delta-lake-support.md deleted file mode 100644 index 724b63337c7..00000000000 --- a/docs/additional-functionality/delta-lake-support.md +++ /dev/null @@ -1,149 +0,0 @@ ---- -layout: page -title: Delta Lake Support -parent: Additional Functionality -nav_order: 8 ---- - -# Delta Lake Support - -The RAPIDS Accelerator for Apache Spark provides limited support for -[Delta Lake](https://delta.io) tables. -This document details the Delta Lake features that are supported. - -## Reading Delta Lake Tables - -### Data Queries - -Delta Lake scans of the underlying Parquet files are presented in the query as normal Parquet -reads, so the Parquet reads will be accelerated in the same way raw Parquet file reads are -accelerated. Reads against tables that have deletion vectors enabled will fallback to the CPU. - -### Metadata Queries - -Reads of Delta Lake metadata, i.e.: the Delta log detailing the history of snapshots, will not -be GPU accelerated. The CPU will continue to process metadata queries on Delta Lake tables. - - -## Writing Delta Lake Tables - -Delta Lake write acceleration is enabled by default. To disable acceleration of Delta Lake -writes, set spark.rapids.sql.format.delta.write.enabled=false. - -### Delta Lake Versions Supported For Write - -The RAPIDS Accelerator supports the following software configurations for accelerating -Delta Lake writes: -- Delta Lake version 2.0.1 on Apache Spark 3.2.x -- Delta Lake version 2.1.1 and 2.2.0 on Apache Spark 3.3.x -- Delta Lake version 2.4.0 on Apache Spark 3.4.x -- Delta Lake on Databricks 10.4 LTS -- Delta Lake on Databricks 11.3 LTS -- Delta Lake on Databricks 12.2 LTS - -Delta Lake writes will not be accelerated on Spark 3.1.x or earlier. - -### Write Operations Supported - -Very limited support is provided for GPU acceleration of table writing. Table writes are only -GPU accelerated if the table is being created via the Spark Catalyst `SaveIntoDataSourceCommand` -operation which is typically triggered via the DataFrame `write` API, e.g.: -`data.write.format("delta").save(...)`. - -Table creation from selection, table insertion from SQL, and table merges are not currently -GPU accelerated. These operations will fallback to the CPU. Writes against tables that have -deletion vectors enabled will also fallback to the CPU. - -#### Automatic Optimization of Writes - -Delta Lake on Databricks has -[automatic optimization](https://docs.databricks.com/optimizations/auto-optimize.html) -features for optimized writes and automatic compaction. - -Optimized writes are supported only on Databricks platforms. The algorithm used is similar but -not identical to the Databricks version. The following table describes configuration settings -that control the operation of the optimized write. 
- -| Configuration | Default | Description | -|-------------------------------------------------------------|---------|--------------------------------------------------------------------------------------------| -| spark.databricks.delta.optimizeWrite.binSize | 512 | Target uncompressed partition size in megabytes | -| spark.databricks.delta.optimizeWrite.smallPartitionFactor | 0.5 | Merge partitions smaller than this factor multiplied by the target partition size | -| spark.databricks.delta.optimizeWrite.mergedPartitionFactor | 1.2 | Avoid combining partitions larger than this factor multiplied by the target partition size | - -Automatic compaction is supported only on Databricks platforms. The algorithm is similar but -not identical to the Databricks version. The following table describes configuration settings -that control the operation of automatic compaction. - -| Configuration | Default | Description | -|---------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------------------------| -| spark.databricks.delta.autoCompact.enabled | false | Enable/disable auto compaction for writes to Delta directories | -| spark.databricks.delta.properties.defaults.autoOptimize.autoCompact | false | Whether to enable auto compaction by default, if spark.databricks.delta.autoCompact.enabled is not set | -| spark.databricks.delta.autoCompact.minNumFiles | 50 | Minimum number of files in the Delta directory before which auto optimize does not begin compaction | - -Note that optimized write support requires round-robin partitioning of the data, and round-robin -partitioning requires sorting across all columns for deterministic operation. If the GPU cannot -support sorting a particular column type in order to support the round-robin partitioning, the -Delta Lake write will fallback to the CPU. - -### RapidsDeltaWrite Node in Query Plans - -A side-effect of performing a GPU accelerated Delta Lake write is a new node will appear in the -query plan, RapidsDeltaWrite. Normally the writing of Delta Lake files is not represented by a -dedicated node in query plans, as it is implicitly covered by higher-level operations such as -SaveIntoDataSourceCommand that wrap the entire query along with the write operation afterwards. -The RAPIDS Accelerator places a node in the plan being written to mark the point at which the -write occurs and adds statistics showing the time spent performing the low-level write operation. - -## Merging Into Delta Lake Tables - -Delta Lake merge acceleration is experimental and is disabled by default. To enable acceleration -of Delta Lake merge operations, set spark.rapids.sql.command.MergeIntoCommand=true and also set -spark.rapids.sql.command.MergeIntoCommandEdge=true on Databricks platforms. - -Merging into Delta Lake tables via the SQL `MERGE INTO` statement or via the DeltaTable `merge` -API on non-Databricks platforms is supported. - -### Limitations with DeltaTable `merge` API on non-Databricks Platforms - -For non-Databricks platforms, the DeltaTable `merge` API directly instantiates a CPU -`MergeIntoCommand` instance and invokes it. This does not go through the normal Spark Catalyst -optimizer, and the merge operation will not be visible in the Spark SQL UI on these platforms. -Since the Catalyst optimizer is bypassed, the RAPIDS Accelerator cannot replace the operation -with a GPU accelerated version. 
As a result, DeltaTable `merge` operations on non-Databricks -platforms will not be GPU accelerated. In those cases the query will need to be modified to use -a SQL `MERGE INTO` statement instead. - -### RapidsProcessDeltaMergeJoin Node in Query Plans - -A side-effect of performing GPU accelerated Delta Lake merge operations is a new node will appear -in the query plan, RapidsProcessDeltaMergeJoin. Normally the Delta Lake merge is performed via -a join and then post-processing of the join via a MapPartitions node. Instead the GPU performs -the join post-processing via this new RapidsProcessDeltaMergeJoin node. - -## Delete Operations on Delta Lake Tables - -Delta Lake delete acceleration is experimental and is disabled by default. To enable acceleration -of Delta Lake delete operations, set spark.rapids.sql.command.DeleteCommand=true and also set -spark.rapids.sql.command.DeleteCommandEdge=true on Databricks platforms. - -Deleting data from Delta Lake tables via the SQL `DELETE FROM` statement or via the DeltaTable -`delete` API is supported. - -### num_affected_rows Difference with Databricks - -The Delta Lake delete command returns a single row result with a `num_affected_rows` column. -When entire partition files in the table are deleted, the open source Delta Lake and RAPIDS -Acclerator implementations of delete can return -1 for `num_affected_rows` since it could be -expensive to open the files and produce an accurate row count. Databricks changed the behavior -of delete operations that delete entire partition files to return the actual row count. -This is only a difference in the statistics of the operation, and the table contents will still -be accurately deleted with the RAPIDS Accelerator. - -## Update Operations on Delta Lake Tables - -Delta Lake update acceleration is experimental and is disabled by default. To enable acceleration -of Delta Lake update operations, set spark.rapids.sql.command.Updatecommand=true and also set -spark.rapids.sql.command.UpdateCommandEdge=true on Databricks platforms. - -Updating data from Delta Lake tables via the SQL `UPDATE` statement or via the DeltaTable -`update` API is supported. diff --git a/docs/additional-functionality/filecache.md b/docs/additional-functionality/filecache.md deleted file mode 100644 index 864544be3af..00000000000 --- a/docs/additional-functionality/filecache.md +++ /dev/null @@ -1,45 +0,0 @@ ---- -layout: page -title: RAPIDS Accelerator File Cache -parent: Additional Functionality -nav_order: 9 ---- - -# RAPIDS Accelerator File Cache - -The RAPIDS Accelerator for Apache Spark provides an optional file cache which may improve -performance of Spark applications that access the same input files multiple times. It caches -portions of remote files being accessed onto the local filesystem of executors to speedup access -if that data is accessed again in the same application. - -## Limitations of the File Cache - -The file cache is only used by Parquet and ORC table scans that have been GPU-accelerated by the -RAPIDS Accelerator. CPU table scans or scans of other data formats will not use the file cache. - -The file cache does not perform well if the executor node's local disks are relatively slow. The -file cache performs best when the local disks are significantly faster than the distributed -filesystem from which data is being cached. Enabling the file cache when the executor local disks -are too slow can cause applications to run slower rather than faster. 
- -## Configuring the File Cache - -File caching is disabled by default. It can be enabled by setting spark.rapids.filecache.enabled -to true. The file cache stores data locally in the same local directories that have been -configured for the Spark executor. - -By default the file cache will use up to half of the available space in the Spark local -directories. To specify an absolute limit, set spark.rapids.filecache.maxBytes to the maximum -number of bytes to use for the file cache in a single executor. For example, setting -spark.rapids.filecache.maxBytes=50g will limit the filecache to 50 gigabytes of local storage per -executor. - -## Tuning File Cache Performance - -### Immutable Input Files - -By default the file cache will detect when a local copy of data is stale with respect to the -remote filesystem. If input files are immutable during the lifetime of the application then it is -recommended to set spark.rapids.filecache.checkStale to false. Note that modern data lakehouse -table formats have immutable files, so even if a data lakehouse table is overwritten/updated, -individual files stored as part of the table data are not modified. diff --git a/docs/additional-functionality/iceberg-support.md b/docs/additional-functionality/iceberg-support.md deleted file mode 100644 index e983ebd6657..00000000000 --- a/docs/additional-functionality/iceberg-support.md +++ /dev/null @@ -1,78 +0,0 @@ ---- -layout: page -title: Apache Iceberg Support -parent: Additional Functionality -nav_order: 7 ---- - -# Apache Iceberg Support - -The RAPIDS Accelerator for Apache Spark provides limited support for -[Apache Iceberg](https://iceberg.apache.org) tables. -This document details the Apache Iceberg features that are supported. - -## Apache Iceberg Versions - -The RAPIDS Accelerator supports Apache Iceberg 0.13.x. Earlier versions of Apache Iceberg are -not supported. - -> **Note!** -> Apache Iceberg in Databricks is not supported by the RAPIDS Accelerator. - -## Reading Tables - -### Metadata Queries - -Reads of Apache Iceberg metadata, i.e.: the `history`, `snapshots`, and other metadata tables -associated with a table, will not be GPU-accelerated. The CPU will continue to process these -metadata-level queries. - -### Row-level Delete and Update Support - -Apache Iceberg supports row-level deletions and updates. Tables that are using a configuration of -`write.delete.mode=merge-on-read` are not supported. - -### Schema Evolution - -Columns that are added and removed at the top level of the table schema are supported. Columns -that are added or removed within struct columns are not supported. - -### Data Formats - -Apache Iceberg can store data in various formats. Each section below details the levels of support -for each of the underlying data formats. - -#### Parquet - -Data stored in Parquet is supported with the same limitations for loading data from raw Parquet -files. See the [Input/Output](../supported_ops.md#inputoutput) documentation for details. The -following compression codecs applied to the Parquet data are supported: -- gzip (Apache Iceberg default) -- snappy -- uncompressed -- zstd - -#### ORC - -The RAPIDS Accelerator does not support Apache Iceberg tables using the ORC data format. - -#### Avro - -The RAPIDS Accelerator does not support Apache Iceberg tables using the Avro data format. - - -### Reader Split Size - -The maximum number of bytes to pack into a single partition when reading files on Spark is normally -controlled by the config `spark.sql.files.maxPartitionBytes`. 
But on Iceberg that doesn't apply. -Iceberg has its own configs to control the split size. See the read options in the - [Iceberg Runtime Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/#runtime-configuration) -documentation for details. One example is to use the `split-size` reader option like: -```scala -spark.read.option("split-size", "24217728").table("someTable") -``` - -## Writing Tables - -The RAPIDS Accelerator for Apache Spark does not accelerate Apache Iceberg writes. Writes -to Iceberg tables will be processed by the CPU. diff --git a/docs/additional-functionality/ml-integration.md b/docs/additional-functionality/ml-integration.md deleted file mode 100644 index bc5ba67bee3..00000000000 --- a/docs/additional-functionality/ml-integration.md +++ /dev/null @@ -1,78 +0,0 @@ ---- -layout: page -title: ML Integration -parent: Additional Functionality -nav_order: 1 ---- -# RAPIDS Accelerator for Apache Spark ML Library Integration - -## Existing ML Libraries - -The RAPIDS Accelerator for Apache Spark can be used to accelerate the ETL portions (e.g., loading -training data from parquet files) of applications using ML libraries with Spark DataFrame APIs. -Examples of such libraries include the original [Apache Spark -MLlib](https://spark.apache.org/mllib/), [XGBoost](https://xgboost.readthedocs.io/en/stable/), -[Spark RAPIDS ML](https://nvidia.github.io/spark-rapids-ml/), and the [DL inference UDF -function](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.functions.predict_batch_udf.html) -introduced in Spark 3.4. The latter three also enable leveraging GPUs (in the case of the DL -inference UDF, indirectly via the underlying DL framework) to accelerate the core ML algorithms, and -thus, in conjunction with the RAPIDS Accelerator for Apache Spark for ETL, can further enhance the -cost-benefit of GPU accelerated Spark clusters. - -For Spark API compatible ML libraries that implement their core ML computations inside pandas UDFs, -such as XGBoost’s pySpark API, Spark RAPIDS ML pySpark API, and the DL inference UDF it is -recommended to enable the RAPIDS Accelerator for Apache Spark’s [support for GPU accelerated pandas -UDFs](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#gpu-support-for-pandas-udf). - -### RMM - -One consideration when using the RAPIDS Accelerator for Apache Spark with a GPU accelerated ML -library is the sharing of GPU memory between the two, as the ML library would typically have a -distinct GPU memory manager from the RAPIDS Accelerator’s RMM instance. Accordingly, you may need -to disable RMM pooling in the RAPIDS Accelerator via the config `spark.rapids.memory.gpu.pool` when -exporting data to an ML library since that library will likely not have access to any of the memory -that the RAPIDS Accelerator’s RMM instance is holding. Similarly, aggressive GPU memory reservation -on the side of the ML library may also need to be disabled, as via these steps in the case of -[Tensorflow](https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth). - -## GPU accelerated ML Library development - -### ColumnarRdd - -When developing a GPU accelerated ML library for Spark, there are cases where you may want to get -access to the raw data on the GPU, preferably without copying it. One use case for this is exporting -the data to the ML library after doing feature extraction. 
To enable this for Scala development, the -RAPIDS Accelerator for Apache Spark provides a simple utility `com.nvidia.spark.rapids.ColumnarRdd` -that can be used to convert a `DataFrame` to an `RDD[ai.rapids.cudf.Table]`. Each `Table` will have -the same schema as the `DataFrame` passed in. - -Note that `Table` is not a typical thing in an `RDD` so special care needs to be taken when working -with it. By default, it is not serializable so repartitioning the `RDD` or any other operator that -involves a shuffle will not work. This is because it is relatively expensive to serialize and -deserialize GPU data using a conventional Spark shuffle. In addition, most of the memory associated -with the `Table` is on the GPU itself. So, each `Table` must be closed when it is no longer needed -to avoid running out of GPU memory. By convention, it is the responsibility of the one consuming the -data to close it when they no longer need it. - -```scala -val df = spark.sql("""select my_column from my_table""") -val rdd: RDD[Table] = ColumnarRdd(df) -// Compute the max of the first column -val maxValue = rdd.map(table => { - val max = table.getColumn(0).max().getLong - // Close the table to avoid leaks - table.close() - max -}).max() -``` - -### Examples of Spark ML Implementations leveraging ColumnarRdd - -Both the Scala Spark PCA -[implementation](https://github.com/NVIDIA/spark-rapids-ml/blob/ab575bc46e55f38ee52906b3c3b55b75f2418459/jvm/src/main/scala/org/apache/spark/ml/linalg/distributed/RapidsRowMatrix.scala) -in Spark RAPIDS ML and XGBoost’s [GPU accelerated Scala -SparkAPI](https://github.com/dmlc/xgboost/blob/f1e9bbcee52159d4bd5f7d25ef539777ceac147c/jvm-packages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/rapids/spark/GpuPreXGBoost.scala) -leverage ColumnarRdd (search for ColumnarRdd in these files) to accelerate data transfer between the -RAPIDS Accelerator for Apache Spark and the respective core ML algorithm computations. XGBoost in -particular enables this when detecting that the RAPIDS Accelerator for Apache Spark is present and -enabled. diff --git a/docs/additional-functionality/rapids-shuffle.md b/docs/additional-functionality/rapids-shuffle.md deleted file mode 100644 index 8e1e8731ce0..00000000000 --- a/docs/additional-functionality/rapids-shuffle.md +++ /dev/null @@ -1,495 +0,0 @@ ---- -layout: page -title: RAPIDS Shuffle Manager -parent: Additional Functionality -nav_order: 5 ---- -# RAPIDS Shuffle Manager - -The RAPIDS Shuffle Manager is an implementation of the `ShuffleManager` interface in Apache Spark -that allows custom mechanisms to exchange shuffle data. We currently expose two modes of operation: -Multi Threaded and UCX. - -In Spark, shuffle managers are configured via the `spark.shuffle.manager` configuration variable. 
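For illustration, a minimal spark-submit sketch might look like the following. This is only a sketch: the `RapidsShuffleManager` class name is version-specific (Apache Spark 3.3.1 is shown here; see the table below for other shims), and the plugin jar path is a placeholder.

```shell
# Sketch only: substitute the shuffle manager class that matches your Spark shim (see the
# table below) and point SPARK_RAPIDS_PLUGIN_JAR at the RAPIDS Accelerator jar on your nodes.
export SPARK_RAPIDS_PLUGIN_JAR=/path/to/rapids-4-spark_2.12.jar

${SPARK_HOME}/bin/spark-submit \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark331.RapidsShuffleManager \
  --conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
  ...
```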
-The following table shows the appropriate configuration to use for each Spark version supported -in our plugin: - -| Spark Shim | spark.shuffle.manager value | -| --------------- | -------------------------------------------------------- | -| 3.1.1 | com.nvidia.spark.rapids.spark311.RapidsShuffleManager | -| 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager | -| 3.1.3 | com.nvidia.spark.rapids.spark313.RapidsShuffleManager | -| 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager | -| 3.2.1 | com.nvidia.spark.rapids.spark321.RapidsShuffleManager | -| 3.2.1 CDH | com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager | -| 3.2.2 | com.nvidia.spark.rapids.spark322.RapidsShuffleManager | -| 3.2.3 | com.nvidia.spark.rapids.spark323.RapidsShuffleManager | -| 3.2.4 | com.nvidia.spark.rapids.spark324.RapidsShuffleManager | -| 3.3.0 | com.nvidia.spark.rapids.spark330.RapidsShuffleManager | -| 3.3.1 | com.nvidia.spark.rapids.spark331.RapidsShuffleManager | -| 3.3.2 | com.nvidia.spark.rapids.spark332.RapidsShuffleManager | -| 3.3.3 | com.nvidia.spark.rapids.spark333.RapidsShuffleManager | -| 3.4.0 | com.nvidia.spark.rapids.spark340.RapidsShuffleManager | -| 3.4.1 | com.nvidia.spark.rapids.spark341.RapidsShuffleManager | -| 3.5.0 | com.nvidia.spark.rapids.spark350.RapidsShuffleManager | -| Databricks 10.4 | com.nvidia.spark.rapids.spark321db.RapidsShuffleManager | -| Databricks 11.3 | com.nvidia.spark.rapids.spark330db.RapidsShuffleManager | -| Databricks 12.2 | com.nvidia.spark.rapids.spark332db.RapidsShuffleManager | - -## Multi-Threaded Mode - -Multi-threaded mode (default) is similar to the built-in Spark shuffle, but it attempts to use -more CPU threads for compute-intensive tasks, such as compression and decompression. - -The multi-threaded shuffle targets the "BypassMergeSortShuffle" shuffle algorithm in Spark, -which is the default when `spark.shuffle.partitions` is 200 or less. - -Minimum configuration: - -```shell ---conf spark.shuffle.manager=com.nvidia.spark.rapids.[shim package].RapidsShuffleManager \ ---conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \ ---conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} -``` - -By default, a thread pool of 20 threads is used for shuffle writes and reads. This -configuration can be independently changed for writers and readers using: -`spark.rapids.shuffle.multiThreaded.[writer|reader].threads`. An appropriate value for these -pools is the number of cores in the system divided by the number of executors per machine. - -On the reader side, when blocks are received from the network, they are queued onto these threads -for decompression and decode. The amount of bytes we allow in flight per Spark task is -controlled by: `spark.rapids.shuffle.multiThreaded.maxBytesInFlight`, and it is set to -128MB-per-task as a default. Note that this memory comes from the Netty off-heap pool, and this -is sized at startup automatically by Netty, but this limit can be controlled by setting -`-Dio.netty.maxDirectMemory=[amount in Bytes]` under `spark.executor.extraJavaOptions`. - -## UCX Mode - ---- -**NOTE:** - -As of the spark-rapids 23.08 release, UCX packages support CUDA 11. -UCX support for CUDA 12 in the RAPIDS Accelerator will be added in a future release. - ---- - -UCX mode (`spark.rapids.shuffle.mode=UCX`) has two components: a spillable cache, and a transport that can utilize -Remote Direct Memory Access (RDMA) and high-bandwidth transfers -within a node that has multiple GPUs. 
This is possible because this mode -utilizes [Unified Communication X (UCX)](https://www.openucx.org/) as its transport. - -- **Spillable cache**: This store keeps GPU data close by where it was produced in device memory, -but can spill in the following cases: - - GPU out of memory: If an allocation in the GPU failed to acquire memory, spill will get triggered - moving GPU buffers to host to allow for the original allocation to succeed. - - Host spill store filled: If the host memory store has reached a maximum threshold - (`spark.rapids.memory.host.spillStorageSize`), host buffers will be spilled to disk until - the host spill store shrinks back below said configurable threshold. - - Tasks local to the producing executor will short-circuit read from the cache. - -- **Transport**: Handles block transfers between executors using various means like NVLink, -PCIe, Infiniband (IB), RDMA over Converged Ethernet (RoCE) or TCP, and as configured in UCX, -in these scenarios: - - GPU-to-GPU: Shuffle blocks that were able to fit in GPU memory. - - Host-to-GPU and Disk-to-GPU: Shuffle blocks that spilled to host (or disk) but will be manifested - in the GPU in the downstream Spark task. - -The RAPIDS Shuffle Manager uses the `spark.shuffle.manager` plugin interface in Spark and it relies -on fast connections between executors, where shuffle data is kept in a cache backed by GPU, host, or disk. -As such, it doesn't implement functionality to interact with the External Shuffle Service (ESS). -To enable the RAPIDS Shuffle Manager, users need to disable ESS using `spark.shuffle.service.enabled=false`. -Note that Spark's Dynamic Allocation feature requires ESS to be configured, and must also be -disabled with `spark.dynamicAllocation.enabled=false`. - -### System Setup - -In order to enable the RAPIDS Shuffle Manager, UCX user-space libraries and its dependencies must -be installed on the host and inside Docker containers (if not baremetal). A host has additional -requirements, like the MLNX_OFED driver and `nv_peer_mem` kernel module. - -The minimum UCX requirement for the RAPIDS Shuffle Manager is -[UCX 1.14.0](https://github.com/openucx/ucx/releases/tag/v1.14.0). - -#### Baremetal - -1. If you have Mellanox hardware, please ensure you have the [MLNX_OFED driver](https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/), and the -[`nv_peer_mem` kernel module](https://network.nvidia.com/products/GPUDirect-RDMA/) installed. UCX packages - are compatible with MLNX_OFED 5.0+. Please install the latest driver available. - - With `nv_peer_mem` (GPUDirectRDMA), IB/RoCE-based transfers can perform zero-copy transfers - directly from GPU memory. Note that GPUDirectRDMA is known to show - [performance and bugs](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems) - in machines that don't connect their GPUs and NICs to PCIe switches (i.e. directly to the - root-complex). - - Other considerations: - - - Please refer to [Mellanox documentation]( - https://support.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment) - on how to configure RoCE networks (lossless/lossy, QoS, and more) - - - We recommend that the `--without-ucx` option is passed when installing MLNX_OFED - (`mlnxofedinstall`). This is because the UCX included in MLNX_OFED does not have CUDA support, - and is likely older than what is available in the UCX repo (see Step 2 below). 
- - If you encounter issues or poor performance, GPUDirectRDMA can be controlled via the - UCX environment variable `UCX_IB_GPU_DIRECT_RDMA=no`, but please - [file a GitHub issue](https://github.com/NVIDIA/spark-rapids/issues) so we can investigate - further. - -2. Fetch and install the UCX package for your OS from: - [UCX 1.14.0](https://github.com/openucx/ucx/releases/tag/v1.14.0). - - RDMA packages have extra requirements that should be satisfied by MLNX_OFED. - -##### Rocky UCX RPM -The UCX packages for Rocky are divided into different RPMs. - -For example, UCX 1.14.0 available at -https://github.com/openucx/ucx/releases/download/v1.14.0/ucx-1.14.0-centos8-mofed5-cuda11.tar.bz2 -contains: - -``` -ucx-devel-1.14.0-1.el8.x86_64.rpm -ucx-debuginfo-1.14.0-1.el8.x86_64.rpm -ucx-1.14.0-1.el8.x86_64.rpm -ucx-cuda-1.14.0-1.el8.x86_64.rpm -ucx-rdmacm-1.14.0-1.el8.x86_64.rpm -ucx-cma-1.14.0-1.el8.x86_64.rpm -ucx-ib-1.14.0-1.el8.x86_64.rpm -``` - -For a setup without RoCE or Infiniband networking, the only packages required are: - -``` -ucx-1.14.0-1.el8.x86_64.rpm -ucx-cuda-1.14.0-1.el8.x86_64.rpm -``` - -If accelerated networking is available, the package list is: - -``` -ucx-1.14.0-1.el8.x86_64.rpm -ucx-cuda-1.14.0-1.el8.x86_64.rpm -ucx-rdmacm-1.14.0-1.el8.x86_64.rpm -ucx-ib-1.14.0-1.el8.x86_64.rpm -``` - ---- -**NOTE:** - -The Rocky RPM requires CUDA installed via RPMs to satisfy its dependencies. The CUDA runtime can be -downloaded from [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads) -(note the [Archive of Previous CUDA releases](https://developer.nvidia.com/cuda-toolkit-archive) -link to download prior versions of the runtime). - -For example, in order to download the CUDA RPM for Rocky 8 running on x86: -`Linux` > `x86_64` > `Rocky` > `8` > `rpm (local)` or `rpm (network)`. - ---- - -#### Docker containers - -Running with UCX in containers imposes certain requirements. In a multi-GPU system, all GPUs that -want to take advantage of PCIe peer-to-peer or NVLink need to be visible within the container. For -example, if two containers are trying to communicate and each have an isolated GPU, the link between -these GPUs will not be optimal, forcing UCX to stage buffers to the host or use TCP. -Additionally, if you want to use RoCE/Infiniband, the `/dev/infiniband` device should be exposed -in the container. Also, to avoid potential `failed: Cannot allocate memory`, -please consider raising the `memlock` ulimit in the container via `--ulimit memlock=[maximum]`. Note that setting `--ulimit memlock=-1` disables the limit. - -If UCX will be used to communicate between containers, the IPC (`--ipc`) and -PID namespaces (`--pid`) should also be shared. - -As of the writing of this document we have successfully tested `--privileged` containers, which -essentially turns off all isolation. We are also assuming `--network=host` is specified, allowing -the container to share the host's network. We will revise this document to include any new -configurations as we are able to test different scenarios. - -NOTE: A system administrator should have performed Step 1 in [Baremetal](#baremetal) in the host -system if you have RDMA capable hardware. - -The following are examples of Docker containers with UCX 1.14.0 and CUDA 11.8 support. 
- -| OS Type | RDMA | Dockerfile | -|---------| ---- |--------------------------------------------------------------------------------| -| Ubuntu | Yes | [Dockerfile.ubuntu_rdma](shuffle-docker-examples/Dockerfile.ubuntu_rdma) | -| Ubuntu | No | [Dockerfile.ubuntu_no_rdma](shuffle-docker-examples/Dockerfile.ubuntu_no_rdma) | -| Rocky | Yes | [Dockerfile.rocky_rdma](shuffle-docker-examples/Dockerfile.rocky_rdma) | -| Rocky | No | [Dockerfile.rocky_no_rdma](shuffle-docker-examples/Dockerfile.rocky_no_rdma) | - -### Validating UCX Environment - -After installing UCX you can utilize `ucx_info` and `ucx_perftest` to validate the installation. - -In this section, we are using a docker container built using the sample dockerfile above. - -1. Start the docker container with `--privileged` mode, which makes Mellanox devices available - for our test (this is only required if you are using RDMA), `pid=host` and `ipc=host` are - requirements for container communication within the same machine: - ``` - nvidia-docker run \ - --privileged \ - --pid=host \ - --ipc=host \ - --network=host \ - -it \ - ucx_container:latest \ - /bin/bash - ``` - If you are testing between different machines, please run the above command in each node. - -2. Test to check whether UCX can link against CUDA: - ``` - root@test-machine:/# ucx_info -d|grep cuda - # Memory domain: cuda_cpy - # Component: cuda_cpy - # Transport: cuda_copy - # Device: cuda - # Memory domain: cuda_ipc - # Component: cuda_ipc - # Transport: cuda_ipc - # Device: cuda - ``` - -3. Mellanox device seen by UCX, and what transports are enabled (i.e. `rc`) - ``` - root@test-machine:/# ucx_info -d|grep mlx5_3:1 -B1 - # Transport: rc_verbs - # Device: mlx5_3:1 - -- - # Transport: rc_mlx5 - # Device: mlx5_3:1 - -- - # Transport: dc_mlx5 - # Device: mlx5_3:1 - -- - # Transport: ud_verbs - # Device: mlx5_3:1 - -- - # Transport: ud_mlx5 - # Device: mlx5_3:1 - ``` - -4. You should be able to execute `ucx_perftest`, and get a good idea that things are working as - you expect. - - Example 1: GPU <-> GPU in the same host. Without NVLink you should expect PCIe speeds. In this - case this is PCIe3, and somewhere along the lines of ~10GB/sec is expected. It should also match - the performance seen in `p2pBandwidthLatencyTest`, which is included with the cuda toolkit. - - - On server container: - ``` - root@test-server:/# CUDA_VISIBLE_DEVICES=0 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda - ``` - - - On client container: - ``` - root@test-client:/# CUDA_VISIBLE_DEVICES=1 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda localhost - +--------------+--------------+-----------------------------+---------------------+-----------------------+ - | | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - | Stage | # iterations | typical | average | overall | average | overall | average | overall | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - Final: 1000 0.000 986.122 986.122 9670.96 9670.96 1014 1014 - ``` - - Example 2: GPU <-> GPU across the network, using GPUDirectRDMA. You will notice that in this - example we picked GPU 3. In our test machine, GPU 3 is closest (same root complex) to the NIC - we are using for RoCE, and yields better performance than GPU 0, for example, which is sitting - on a different socket. 
- - - On server container: - ``` - root@test-server: CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda - ``` - - - On client container: - ``` - root@test-client:/# CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server - +--------------+--------------+-----------------------------+---------------------+-----------------------+ - | | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - | Stage | # iterations | typical | average | overall | average | overall | average | overall | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - [thread 0] 498 0.000 2016.444 2016.444 4729.49 4729.49 496 496 - [thread 0] 978 0.000 2088.412 2051.766 4566.50 4648.07 479 487 - Final: 1000 0.000 3739.639 2088.899 2550.18 4565.44 267 479 - ``` - - Example 3: GPU <-> GPU across the network, without GPUDirectRDMA. You will notice that the - bandwidth achieved is higher than with GPUDirectRDMA on. This is expected, and a known issue in - machines where GPUs and NICs are connected directly to the root complex. - - - On server container: - ``` - root@test-server:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda - ``` - - - On client container: - ``` - root@test-client:/# UCX_IB_GPU_DIRECT_RDMA=no CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda test-server - +--------------+--------------+-----------------------------+---------------------+-----------------------+ - | | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - | Stage | # iterations | typical | average | overall | average | overall | average | overall | - +--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+ - [thread 0] 670 0.000 1497.859 1497.859 6366.91 6366.91 668 668 - Final: 1000 0.000 1718.843 1570.784 5548.35 6071.33 582 637 - ``` - -### Spark App Configuration - -1. Choose the version of the shuffle manager that matches your Spark version. Please refer to - the table at the top of this document for `spark.shuffle.manager` values. - -2. Settings for UCX 1.14.0+: - - Minimum configuration: - - ```shell - ... - --conf spark.shuffle.manager=com.nvidia.spark.rapids.[shim package].RapidsShuffleManager \ - --conf spark.rapids.shuffle.mode=UCX \ - --conf spark.shuffle.service.enabled=false \ - --conf spark.dynamicAllocation.enabled=false \ - --conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \ - --conf spark.executorEnv.UCX_ERROR_SIGNALS= \ - --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n - ``` - - Recommended configuration: - - ```shell - ... 
- --conf spark.shuffle.manager=com.nvidia.spark.rapids.[shim package].RapidsShuffleManager \ - --conf spark.rapids.shuffle.mode=UCX \ - --conf spark.shuffle.service.enabled=false \ - --conf spark.dynamicAllocation.enabled=false \ - --conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \ - --conf spark.executorEnv.UCX_ERROR_SIGNALS= \ - --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \ - --conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \ - --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \ - --conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \ - --conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 - ``` - -Please replace `[shim package]` with the appropriate value. For example, the full class name for -Apache Spark 3.1.3 is: `com.nvidia.spark.rapids.spark313.RapidsShuffleManager`. - -Please note `LD_LIBRARY_PATH` should optionally be set if the UCX library is installed in a -non-standard location. - -With the RAPIDS Shuffle Manager configured, the setting `spark.rapids.shuffle.enabled` (default on) -can be used to enable or disable the usage of RAPIDS Shuffle Manager during your application. - -#### Databricks - -Please make sure you follow the [Getting Started](../get-started/getting-started-databricks.md) -guide for Databricks. The following are extra steps required to enable UCX. - -1) Create and enable an additional "init script" that installs UCX: - -``` -#!/bin/bash -sudo apt install -y wget libnuma1 && -wget https://github.com/openucx/ucx/releases/download/v1.14.0/ucx-1.14.0-ubuntu20.04-mofed5-cuda11.tar.bz2 && -tar -xvf ucx-1.14.0-ubuntu20.04-mofed5-cuda11.tar.bz2 && -sudo dpkg -i ucx-1.14.0.deb ucx-cuda-1.14.0.deb && -rm ucx-1.14.0-ubuntu20.04-mofed5-cuda11.tar.bz2 ucx-1.14.0.deb ucx-cuda-1.14.0.deb -``` - -Save the script in Databricks workspace and add it to the "Init Scripts" list: - -![Init scripts panel showing UCX init script](../img/Databricks/initscript_ucx.png) - -2) Add the UCX minimum configuration for your Cluster. - -Databricks 10.4: - -``` -spark.shuffle.manager com.nvidia.spark.rapids.spark321db.RapidsShuffleManager -spark.rapids.shuffle.mode UCX -spark.shuffle.service.enabled false -spark.executorEnv.UCX_MEMTYPE_CACHE n -spark.executorEnv.UCX_ERROR_SIGNALS "" -``` - -Example of configuration panel with the new settings: - -![Configurations with UCX](../img/Databricks/sparkconfig_ucx.png) - -Please note that at this time, we have tested with Autoscaling off. It is not clear how an autoscaled -cluster will behave with the RAPIDS Shuffle Manager. - -#### UCX Environment Variables -- `UCX_TLS`: - - `cuda_copy`, and `cuda_ipc`: enables handling of CUDA memory in UCX, both for copy-based transport - and peer-to-peer communication between GPUs (NVLink/PCIe). - - `rc`: enables Infiniband and RoCE based transport in UCX. - - `tcp`: allows for TCP communication in cases where UCX deems necessary. -- `UCX_ERROR_SIGNALS=`: Disables UCX signal catching, as it can cause issues with the JVM. -- `UCX_MAX_RNDV_RAILS=1`: Set this to `1` to disable multi-rail transfers in UCX, where UCX splits - data to utilize various channels (e.g. two NICs). A value greater than `1` can cause a performance drop - for high-bandwidth transports between GPUs. -- `UCX_MEMTYPE_CACHE=n`: Disables a cache in UCX that can cause UCX to fail when running with CUDA buffers. 
-- `UCX_RNDV_SCHEME=put_zcopy`: By default, `UCX_RNDV_SCHEME=auto` will pick different schemes for - the RNDV protocol (`get_zcopy` or `put_zcopy`) depending on message size, and on other parameters - given the hardware, transports, and settings. We have found that `UCX_RNDV_SCHEME=put_zcopy` - is more reliable than automatic detection, or `get_zcopy` in our testing, especially in UCX 1.9.0. - The main difference between get and put is the direction of transfer. A send operation under - `get_zcopy` will really be `RDMA READ` from the receiver, whereas the same send will be - `RDMA_WRITE` from the sender if `put_zcopy` is utilized. -- `UCX_IB_RX_QUEUE_LEN=1024`: Length of receive queue for the Infiniband/RoCE transports. The - length change is recommended as it has shown better performance when there is memory pressure - and message sizes are relatively large (> few hundred Bytes) - -### Fine Tuning -Here are some settings that could be utilized to fine tune the _RAPIDS Shuffle Manager_: - -#### Bounce Buffers -The following configs control the number of bounce buffers, and the size. Please note that for -device buffers, two pools are created (for sending and receiving). Take this into account when -sizing your pools. - -The GPU buffers should be smaller than the [`PCI BAR -Size`](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#bar-sizes) for your GPU. Please verify -the [defaults](../configs.md) work in your case. - -- `spark.rapids.shuffle.ucx.bounceBuffers.device.count` -- `spark.rapids.shuffle.ucx.bounceBuffers.host.count` -- `spark.rapids.shuffle.ucx.bounceBuffers.size` - -#### Spillable Store -This setting controls the amount of host memory (RAM) that can be utilized to spill GPU blocks when -the GPU is out of memory, before going to disk. Please verify the [defaults](../configs.md). -- `spark.rapids.memory.host.spillStorageSize` - -##### Shuffle Garbage Collection -Shuffle buffers cached in the spillable store, whether they are in the GPU, host, or disk, will not -be removed even after all actions for your query complete. This is a design decision in Spark, where -shuffle temporary stores are cleaned when there is a garbage collection on the driver, and the -references to the RDDs supporting your query are not reachable. - -One of the issues with this is with large JVM footprints in the driver. The driver may not run a GC at -all between different parts of your application, causing output for shuffle to accumulate (output that -will not be reused), and eventually causing OOM or even filled disk. This is true for Spark even without -the RAPIDS Shuffle Manager, but in our case it's likely GPU memory that is being occupied, and performance -degrades given the churn due to spill to host memory or disk. As of this stage, there isn't a good solution -for this, other than to trigger a GC cycle on the driver. - -Spark has a configuration `spark.cleaner.periodicGC.interval` (defaults to 30 minutes), that -can be used to periodically cause garbage collection. If you are experiencing OOM situations, or -performance degradation with several Spark actions, consider tuning this setting in your jobs. - -#### Known Issues - -- UCX configures TCP keep-alive for TCP transports. We have seen [issues with keep-alive in multi-NIC environments](https://github.com/NVIDIA/spark-rapids/issues/7940) after long periods of inactivity (when the inactivity is greater - than the system's `tcp_keepalive_time`). 
If you see errors such as: - ``` - ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer) - ``` - Consider turning off keep-alive for UCX using: `spark.executorEnv.UCX_TCP_KEEPINTVL=inf`. NOTE: this would only mitigate the issue but not resolve it. diff --git a/docs/download.md b/docs/download.md index c1ed44b8c63..0367d9a3795 100644 --- a/docs/download.md +++ b/docs/download.md @@ -59,7 +59,7 @@ The plugin is tested on the following architectures: for your hardware's minimum driver version. *For Cloudera and EMR support, please refer to the -[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ. +[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ. ### Download v23.08.1 * Download the [RAPIDS @@ -100,4 +100,4 @@ For a detailed list of changes, please refer to the ## Archived releases -As new releases come out, previous ones will still be available in [archived releases](./archive.md). \ No newline at end of file +As new releases come out, previous ones will still be available in [archived releases](./archive.md). diff --git a/docs/examples.md b/docs/examples.md deleted file mode 100644 index 91cfbd726e5..00000000000 --- a/docs/examples.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -layout: page -title: Examples -nav_order: 14 ---- -# Examples - -Please visit [spark-rapids-examples](https://github.com/NVIDIA/spark-rapids-examples) repo for ETL, -ML/DL, UDF related examples using the RAPIDS Accelerator For Apache Spark. -It includes Scala/Python source code and related notebooks for different examples. - -# Benchmarks - -Please visit [spark-rapids-benchmarks](https://github.com/NVIDIA/spark-rapids-benchmarks) repo for -Spark related benchmark sets and utilities using the RAPIDS Accelerator For Apache Spark. \ No newline at end of file diff --git a/docs/get-started/Dockerfile.cuda b/docs/get-started/Dockerfile.cuda deleted file mode 100644 index 2c2324dabbd..00000000000 --- a/docs/get-started/Dockerfile.cuda +++ /dev/null @@ -1,89 +0,0 @@ -# -# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# - -FROM nvidia/cuda:11.8.0-devel-ubuntu20.04 -ARG spark_uid=185 - -# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771 -RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub - -# Install java dependencies -ENV DEBIAN_FRONTEND="noninteractive" -RUN apt-get update && apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre -ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64 -ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin - -# Before building the docker image, first either download Apache Spark 3.1+ from -# http://spark.apache.org/downloads.html or build and make a Spark distribution following the -# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see -# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions). If this -# docker file is being used in the context of building your images from a Spark distribution, the -# docker build command should be invoked from the top level directory of the Spark -# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile . - -RUN set -ex && \ - ln -s /lib /lib64 && \ - mkdir -p /opt/spark && \ - mkdir -p /opt/spark/jars && \ - mkdir -p /opt/spark/examples && \ - mkdir -p /opt/spark/work-dir && \ - mkdir -p /opt/sparkRapidsPlugin && \ - touch /opt/spark/RELEASE && \ - rm /bin/sh && \ - ln -sv /bin/bash /bin/sh && \ - echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \ - chgrp root /etc/passwd && chmod ug+rw /etc/passwd - -COPY spark/jars /opt/spark/jars -COPY spark/bin /opt/spark/bin -COPY spark/sbin /opt/spark/sbin -COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/ -COPY spark/examples /opt/spark/examples -COPY spark/kubernetes/tests /opt/spark/tests -COPY spark/data /opt/spark/data - -COPY rapids-4-spark_2.12-*.jar /opt/sparkRapidsPlugin -COPY getGpusResources.sh /opt/sparkRapidsPlugin - -RUN mkdir /opt/spark/python -# TODO: Investigate running both pip and pip3 via virtualenvs -RUN apt-get update && \ - apt install -y python wget && wget https://bootstrap.pypa.io/pip/2.7/get-pip.py && python get-pip.py && \ - apt install -y python3 python3-pip && \ - # We remove ensurepip since it adds no functionality since pip is - # installed on the image and it just takes up 1.6MB on the image - rm -r /usr/lib/python*/ensurepip && \ - pip install --upgrade pip setuptools && \ - # You may install with python3 packages by using pip3.6 - # Removed the .cache to save space - rm -r /root/.cache && rm -rf /var/cache/apt/* - -COPY spark/python/pyspark /opt/spark/python/pyspark -COPY spark/python/lib /opt/spark/python/lib - -ENV SPARK_HOME /opt/spark - -WORKDIR /opt/spark/work-dir -RUN chmod g+w /opt/spark/work-dir - -ENV TINI_VERSION v0.18.0 -ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini -RUN chmod +rx /usr/bin/tini - -ENTRYPOINT [ "/opt/entrypoint.sh" ] - -# Specify the User that the actual main process will run as -USER ${spark_uid} diff --git a/docs/get-started/best-practices.md b/docs/get-started/best-practices.md deleted file mode 100644 index 7ae131d28ec..00000000000 --- a/docs/get-started/best-practices.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -layout: page -title: Best Practices -nav_order: 15 -parent: Getting-Started ---- - -# Best Practices on the RAPIDS Accelerator for Apache Spark - -This article explains the most common best practices using 
the RAPIDS Accelerator especially for -performance tuning and troubleshooting. - -## Workload Qualification - -By following [Workload Qualification](./getting-started-workload-qualification.md) guide, you can -identify the best candidate Spark applications for the RAPIDS Accelerator and also the feature gaps. - -Based on [Qualification tool](../spark-qualification-tool.md)'s output, you can start with the top N -recommended CPU Spark jobs, especially if those jobs are computation heavy jobs (large joins, -hash aggregates, windowing, sorting). - -After those candidate jobs are run on GPU using the RAPIDS Accelerator, check the Spark driver -log to find the not-supported messages starting with `!`. Some missing features might be enabled by -turning on a configuration, but please understand the corresponding limitations detailed in -[the advanced configuration document](../additional-functionality/advanced_configs.md). -For other unsupported features, please file feature requests on -[spark-rapids Github repo](https://github.com/NVIDIA/spark-rapids/issues) with a minimum reproduce -and the not-supported messages if they are non-sensitive data. - -## Performance Tuning -Refer to [Tuning Guide](../tuning-guide.md) for more details. - -## How to handle GPU OOM issues - -Spark jobs running out of GPU memory is the most common issue because GPU memory is usually much -smaller than host memory. Below are some common tips to help avoid GPU OOM issues. - -### 1. Reduce the number of concurrent tasks per GPU - -This is controlled by `spark.rapids.sql.concurrentGpuTasks`. Try to decrease this value to 1. -If the job still hit OOMs, try the following steps. - -### 2. Install CUDA 11.5 or above version - -The default value for `spark.rapids.memory.gpu.pool` is changed to `ASYNC` from `ARENA` for CUDA -11.5+. For CUDA 11.4 and older, it will fall back to `ARENA`. -Using ASYNC allocator can avoid some memory fragmentation issues. - -The Spark executor log will contain a message like the following when using the ASYNC allocator: - -``` -INFO GpuDeviceManager: Initializing RMM ASYNC pool size = 17840.349609375 MB on gpuId 0 -``` - -### 3. Identify which SQL, job and stage is involved in the error - -The relationship between SQL/job/stage is: Stage belongs to a Job which belongs to SQL. -First check the Spark UI to identify the problematic SQL ID, Job ID, and Stage ID. - -Then find the failed stage in the `Stages` page in the Spark UI, and go into that stage to look at tasks. -If some tasks completed successfully while some tasks failed with OOM, check the amount of input -bytes or shuffle bytes read per task to see if there is any data skew. - -Check the DAG of the problematic stage to see if there are any suspicious operators which may -consume huge amounts of memory, such as windowing, collect_list/collect_set, explode, expand, etc. - -### 4. Increase the number of tasks/partitions based on the type of the problematic stage - -#### a. Table Scan Stage - -If it is a table scan stage on Parquet/ORC tables, then the number of tasks or partitions is normally -determined by `spark.sql.files.maxPartitionBytes`. We can decrease its value to increase the -number of tasks or partitions for this stage so that the memory pressure of each task is less. - -Iceberg or Delta tables may have different settings to control the concurrency in the table -scan stage. For example, Iceberg uses `read.split.target-size` as a table property or read option -to control the split size. - -#### b. 
Shuffle Stage - -If it is a shuffle stage, then the number of tasks or partitions is normally determined by -`spark.sql.shuffle.partitions` (default value=200), and also AQE's Coalescing Post Shuffle Partitions -feature (such as parameters `spark.sql.adaptive.coalescePartitions.minPartitionSize`, -`spark.sql.adaptive.advisoryPartitionSizeInBytes`, etc.). - -We can adjust the above parameters to increase the number of tasks or partitions for this shuffle -stage to reduce the memory pressure for each task. For example, we can start with increasing -`spark.sql.shuffle.partitions` by a factor of 2, then 4, then 8, etc. - -Even without an OOM error, if the SQL plan metrics show lots of spilling from the -Spark UI in this stage, increasing the number of tasks or partitions could decrease the -spilled data size to improve performance. - -Note: AQE's Coalescing Post Shuffle Partitions feature could have different behaviors in different -Spark 3.x versions. For example, in Spark 3.1.3, `spark.sql.adaptive.coalescePartitions.minPartitionNum` -by default is set to `spark.sql.shuffle.partitions`(default value=200). However in Spark 3.2 or 3.3, -`minPartitionNum` was removed. So always check the correct version of the Spark documentation. - -### 5. Reduce columnar batch size and file reader batch size - -Please refer to the [Tuning Guide](../tuning-guide.md#columnar-batch-size) for details on the following -RAPIDS parameters: -``` -spark.rapids.sql.batchSizeBytes -spark.rapids.sql.reader.batchSizeBytes -``` - -### 6. File an issue or ask a question on the GitHub repo - -If you are still getting an OOM exception, please get the Spark eventlog and stack trace from the -executor (the whole executor log ideally) and send to spark-rapids-support@nvidia.com , or file a -GitHub issue on [spark-rapids GitHub repo](https://github.com/NVIDIA/spark-rapids/issues) if it is -not sensitive. Or, open a [discussion thread](https://github.com/NVIDIA/spark-rapids/discussions). \ No newline at end of file diff --git a/docs/get-started/getting-started-OCI.md b/docs/get-started/getting-started-OCI.md deleted file mode 100644 index 8732d584a6c..00000000000 --- a/docs/get-started/getting-started-OCI.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -layout: page -title: Oracle Cloud Infrastructure -nav_order: 6 -parent: Getting-Started ---- -# Getting started with RAPIDS Accelerator on Oracle Cloud Infrastructure - [Oracle Cloud Infrastructure (OCI)](https://docs.oracle.com/en-us/iaas/Content/home.htm) is a set of complementary cloud services that - enable you to build and run a range of applications and services in a highly available hosted environment. - -Oracle now offers early access to NVIDIA GPU acceleration for workloads running on Oracle Cloud Infrastructure Data Flow. -Please refer the [Quickstart guides](https://docs.oracle.com/en-us/iaas/data-flow/using/GPU.htm) about how to use GPU Shapes for Spark Data Flow. \ No newline at end of file diff --git a/docs/get-started/getting-started-alluxio.md b/docs/get-started/getting-started-alluxio.md deleted file mode 100644 index e6e90c80cb6..00000000000 --- a/docs/get-started/getting-started-alluxio.md +++ /dev/null @@ -1,577 +0,0 @@ ---- -layout: page -title: Alluxio -nav_order: 8 -parent: Getting-Started ---- -# Getting Started with RAPIDS and Alluxio - -The RAPIDS plugin can remarkably accelerate the computing part of a SQL query by leveraging -GPUs, but it’s hard to accelerate the data reading process when the data is in a cloud -filesystem because of network overhead. 
- -[Alluxio](https://www.alluxio.io/) is an open source data orchestration platform that -brings your data closer to compute across clusters, regions, clouds, and countries for -reducing the network overhead. Compute applications talking to Alluxio can transparently -cache frequently accessed data from multiple sources, especially from remote locations. - -This guide will go through how to set up the RAPIDS Accelerator for Apache Spark with -Alluxio in an on-premise cluster. This guide sets up Alluxio to handle reads and -by default does not handle file updates in the remote blob store. - -## Prerequisites - -This guide assumes the user has successfully setup and run the RAPIDS Accelerator in an -on-premise cluster according to [this doc](getting-started-on-prem.md). - -This guide will go through deployment of Alluxio in a Yarn cluster with 2 NodeManagers and -1 ResourceManager, It will describe how to configure an S3 compatible filesystem as -Alluxio’s underlying storage system. - -We may want to put the Alluxio workers on the NodeManagers so they are on the same nodes as -the Spark tasks will run. The Alluxio master can go anywhere, we pick ResourceManager for -convenience. - -Let's assume the hostnames are: - -``` console -RM_hostname -NM_hostname_1 -NM_hostname_2 -``` - -## Alluxio setup - -1. Prerequisites - - - Download Alluxio binary file - - Download the latest Alluxio binary file **alluxio-${LATEST}-bin.tar.gz** from - [this site](https://www.alluxio.io/download/). - - - Copy `alluxio-${LATEST}-bin.tar.gz` to NodeManagers and ResourceManager - - - Extract `alluxio-${LATEST}-bin.tar.gz` to the directory specified by **ALLUXIO_HOME** in the NodeManagers and ResourceManager - - ``` shell - # Let's assume we extract alluxio to /opt - mkdir -p /opt - tar xvf alluxio-${LATEST}-bin.tar.gz -C /opt - export ALLUXIO_HOME=/opt/alluxio-${LATEST} - ``` - - For **SSH login wihtout password** and **Alluxio ports** problem, please refer to - [this site](https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-a-Cluster.html#prerequisites). - -2. Configure Alluxio - - - Alluxio master configuration - - On the master node, create `${ALLUXIO_HOME}/conf/alluxio-site.properties` configuration - file from the template. - - ```console - cp ${ALLUXIO_HOME}/conf/alluxio-site.properties.template ${ALLUXIO_HOME}/conf/alluxio-site.properties - ``` - - Add the recommended configuration below to `${ALLUXIO_HOME}/conf/alluxio-site.properties` . - - ``` xml - # set the hostname of the single master node - alluxio.master.hostname=RM_hostname - - ########################### worker properties ############################## - # The maximum number of storage tiers in Alluxio. Currently, Alluxio supports 1, - # 2, or 3 tiers. - alluxio.worker.tieredstore.levels=1 - - # The alias of top storage tier 0. Currently, there are 3 aliases, MEM, SSD, and HDD. - alluxio.worker.tieredstore.level0.alias=SSD - - # The paths of storage directories in top storage tier 0, delimited by comma. - # It is suggested to have one storage directory per hardware device for the - # SSD and HDD tiers. You need to create YOUR_CACHE_DIR first, - # For example, - # export YOUR_CACHE_DIR=/opt/alluxio/cache - # mkdir -p $YOUR_CACHE_DIR - alluxio.worker.tieredstore.level0.dirs.path=/YOUR_CACHE_DIR - - # The quotas for all storage directories in top storage tier 0 - # For example, set the quota to 100G. - alluxio.worker.tieredstore.level0.dirs.quota=100G - - # The path to the domain socket. 
Short-circuit reads make use of a UNIX domain - # socket when this is set (non-empty). This is a special path in the file system - # that allows the client and the AlluxioWorker to communicate. You will need to - # set a path to this socket. The AlluxioWorker needs to be able to create the - # path. If alluxio.worker.data.server.domain.socket.as.uuid is set, the path - # should be the home directory for the domain socket. The full path for the domain - # socket with be {path}/{uuid}. - # For example, - # export YOUR_DOMAIN_SOCKET_PATH=/opt/alluxio/domain_socket - # mkdir -p YOUR_DOMAIN_SOCKET_PATH - alluxio.worker.data.server.domain.socket.address=/YOUR_DOMAIN_SOCKET_PATH - alluxio.worker.data.server.domain.socket.as.uuid=true - - # Configure async cache manager - # When large amounts of data are expected to be asynchronously cached concurrently, - # it may be helpful to increase below async cache configuration to handle a higher - # workload. - - # The number of asynchronous threads used to finish reading partial blocks. - alluxio.worker.network.async.cache.manager.threads.max=64 - - # The maximum number of outstanding async caching requests to cache blocks in each - # data server. - alluxio.worker.network.async.cache.manager.queue.max=2000 - ############################################################################ - - ########################### Client properties ############################## - # When short circuit and domain socket both enabled, prefer to use short circuit. - alluxio.user.short.circuit.preferred=true - ############################################################################ - - # Running Alluxio locally with S3 - # Optionally, to reduce data latency or visit resources which are separated in - # different AWS regions, specify a regional endpoint to make AWS requests. - # An endpoint is a URL that is the entry point for a web service. - # - # For example, s3.cn-north-1.amazonaws.com.cn is an entry point for the Amazon S3 - # service in beijing region. - alluxio.underfs.s3.endpoint= - - # Optionally, specify to make all S3 requests path style - alluxio.underfs.s3.disable.dns.buckets=true - ``` - - For more explanations of each configuration, please refer to - [Alluxio Configuration](https://docs.alluxio.io/os/user/stable/en/reference/Properties-List.html) - and [Amazon AWS S3](https://docs.alluxio.io/os/user/stable/en/ufs/S3.html). - - Note, when preparing to mount S3 compatible file system to the root of Alluxio namespace, the user - needs to add below AWS credentials configuration to `${ALLUXIO_HOME}/conf/alluxio-site.properties` - in Alluxio master node. - - ``` xml - alluxio.master.mount.table.root.ufs=s3a:/// - alluxio.master.mount.table.root.option.aws.accessKeyId= - alluxio.master.mount.table.root.option.aws.secretKey= - ``` - - Instead, this guide demonstrates how to mount the S3 compatible file system with AWS credentials - to any path of Alluxio namespace, and please refer to [RAPIDS Configuration](#rapids-configuration). - For more explanations of AWS S3 credentials, please refer to - [Amazon AWS S3 Credentials setup](https://docs.alluxio.io/os/user/stable/en/ufs/S3.html#advanced-setup). 
- - Note, this guide demonstrates how to deploy Alluxio cluster in a insecure way, for the Alluxio security, - please refer to [this site](https://docs.alluxio.io/os/user/stable/en/operation/Security.html) - - - Add Alluxio worker hostnames into `${ALLUXIO_HOME}/conf/workers` - - ``` json - NM_hostname_1 - NM_hostname_2 - ``` - - - Copy configuration from Alluxio master to Alluxio workers - - ``` shell - ${ALLUXIO_HOME}/bin/alluxio copyDir ${ALLUXIO_HOME}/conf - ``` - - This command will copy the `conf/` directory to all the workers specified in the `conf/workers` file. - Once this command succeeds, all the Alluxio nodes will be correctly configured. - - - Alluxio worker configuration - - After copying configuration to every Alluxio worker from Alluxio master, User - needs to add below extra configuration for each Alluxio worker. - - ``` xml - # the hostname of Alluxio worker - alluxio.worker.hostname=NM_hostname_X - # The hostname to use for an Alluxio client - alluxio.user.hostname=NM_hostname_X - ``` - - Note that Alluxio can manage other storage media (e.g. MEM, HDD) in addition to SSD, - so local data access speed may vary depending on the local storage media. To learn - more about this topic, please refer to the - [tiered storage document](https://docs.alluxio.io/os/user/stable/en/core-services/Caching.html#multiple-tier-storage). -3. Create a link to ALLUXIO_HOME - Execute the following commands to create a link `/opt/alluxio` to actual Alluxio Home path: - ```bash - ln -s ${ALLUXIO_HOME} /opt/alluxio - ``` -4. Start Alluxio cluster - - - Format Alluxio - - Before Alluxio can be started **for the first time**, the journal must be formatted. Formatting the journal will delete all metadata - from Alluxio. However, the data in under storage will be untouched. - - Format the journal for the Alluxio master node with the following command: - - ``` bash - ${ALLUXIO_HOME}/bin/alluxio formatMasters - ``` - - - Launch Alluxio - - On the master node, start the Alluxio cluster with the following command: - - ``` bash - ${ALLUXIO_HOME}/bin/alluxio-start.sh all - ``` - - - Verify Alluxio - - To verify that Alluxio is running, visit `http://RM_hostname:19999` - to see the status page of the Alluxio master. - -5. Mount an existing data storage to Alluxio - - - Mount S3 bucket - - ``` bash - ${ALLUXIO_HOME}/bin/alluxio fs mount \ - --option aws.accessKeyId= \ - --option aws.secretKey= \ - alluxio://RM_hostname:19998/s3 s3a:/// - ``` - - - Mount Azure directory - - ``` bash - ${ALLUXIO_HOME}/bin/alluxio fs mount \ - --option fs.azure.account.key..blob.core.windows.net= \ - alluxio://master:port/azure wasb://@.blob.core.windows.net// - ``` - - For other filesystems, please refer to [this site](https://www.alluxio.io/). - We also provide auto mount feature for an easier usage. - Please refer to [Alluxio auto mount for AWS S3 buckets](#alluxio-auto-mount-for-aws-s3-buckets) - -## RAPIDS Configuration - -There are two ways to leverage Alluxio in RAPIDS. -We also provide an auto mount way for AWS S3 bucket if you install Alluxio in your Spark cluster. -Please refer to [Alluxio auto mount for AWS S3 buckets](#alluxio-auto-mount-for-aws-s3-buckets) - -1. Explicitly specify the Alluxio path - - This may require user to change code. For example, change - - ``` scala - val df = spark.read.parquet("s3a:////foo.parquet") - ``` - - to - - ``` scala - val df = spark.read.parquet("alluxio://RM_hostname:19998/s3/foo.parquet") - ``` - -2. 
Transparently replace in RAPIDS

   RAPIDS has added a configuration `spark.rapids.alluxio.pathsToReplace` which allows RAPIDS
   to replace the input file paths with Alluxio paths transparently at runtime, so no user code
   changes are required.

   For example, at startup:

   ``` shell
   --conf spark.rapids.alluxio.pathsToReplace="s3://foo->alluxio://RM_hostname:19998/foo,gs://bar->alluxio://RM_hostname:19998/bar"
   ```

   This configuration allows RAPIDS to replace any file paths prefixed `s3://foo` with
   `alluxio://RM_hostname:19998/foo` and `gs://bar` with `alluxio://RM_hostname:19998/bar`.

   Note that one side effect of using Alluxio in this way is that the SQL function
   **`input_file_name`** prints the `alluxio://` path rather than the original path.
   Below is an example of using input_file_name.

   ``` python
   spark.read.parquet(data_path)
     .filter(f.col('a') > 0)
     .selectExpr('a', 'input_file_name()', 'input_file_block_start()', 'input_file_block_length()')
   ```

3. Submit an application

   The Spark driver and tasks will parse the `alluxio://` scheme and access the Alluxio cluster
   using `alluxio-${LATEST}-client.jar`.

   The Alluxio client jar must be on the classpath of all Spark drivers and executors in order
   for Spark applications to access Alluxio.

   It can be specified via `spark.driver.extraClassPath` and `spark.executor.extraClassPath`,
   but the Alluxio client jar must be present on the Yarn nodes.

   The simplest alternative is to copy `alluxio-${LATEST}-client.jar` into the Spark jars directory.

   ``` shell
   cp ${ALLUXIO_HOME}/client/alluxio-${LATEST}-client.jar ${SPARK_HOME}/jars/
   ```

   ``` shell
   ${SPARK_HOME}/bin/spark-submit \
     ... \
     --conf spark.rapids.alluxio.pathsToReplace="REPLACEMENT_RULES" \
     --conf spark.executor.extraJavaOptions="-Dalluxio.conf.dir=${ALLUXIO_HOME}/conf" \
   ```

## Alluxio auto mount for AWS S3 buckets

There is a more user-friendly way to use Alluxio with RAPIDS when accessing S3 buckets.
Suppose a user has multiple buckets on AWS S3. Using `spark.rapids.alluxio.pathsToReplace`
requires mounting all of the buckets beforehand and listing the path replacements one by one
in that config, which becomes tedious when there are many buckets.

To solve this problem, the plugin provides an Alluxio auto mount feature, which mounts S3
buckets automatically when the Spark driver finds them in the input paths.
This feature requires that the node running the Spark driver has Alluxio installed,
which means the node is also the master of the Alluxio cluster. It uses the `alluxio fs mount`
command to mount the buckets in Alluxio, so the uid used to run the Spark application must be
able to run Alluxio commands. For example, the uid of the Spark application is the same as the
uid of the Alluxio service, or the Spark application can run Alluxio commands via `su alluxio_uid`.

The simplest way to enable the Alluxio auto mount feature is to set the config below without
setting `spark.rapids.alluxio.pathsToReplace`, which takes precedence over auto mount.
``` shell
--conf spark.rapids.alluxio.automount.enabled=true
```

Additional configs:
``` shell
--conf spark.rapids.alluxio.bucket.regex="^s3a{0,1}://.*"
```
The regex is used to match S3 URIs to decide which buckets should be auto mounted.
The default value matches all URIs that start with `s3://` or `s3a://`.
For example, `^s3a{1,1}://foo.*` will match only buckets whose names start with `foo`.
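Putting the pieces together, a rough spark-submit sketch with auto mount enabled might look like the one below. This assumes the Alluxio client jar has already been copied into the Spark jars directory as described above, and the bucket regex is purely illustrative.

```shell
# Sketch only: auto mount S3 buckets whose names start with "foo" into Alluxio at runtime.
# The application keeps reading s3a:// paths unchanged; matching buckets are mounted by the driver.
${SPARK_HOME}/bin/spark-submit \
  ... \
  --conf spark.rapids.alluxio.automount.enabled=true \
  --conf spark.rapids.alluxio.bucket.regex="^s3a{0,1}://foo.*" \
  --conf spark.executor.extraJavaOptions="-Dalluxio.conf.dir=${ALLUXIO_HOME}/conf"
```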
## Configure whether the disks used by Alluxio are fast
The config `spark.rapids.alluxio.slow.disk` defaults to true, indicating that the disks used by
Alluxio are slow. When true, an optimization is enabled that reads large files directly from S3,
which gives better performance when the files being read are large.
The config `spark.rapids.alluxio.large.file.threshold`, which defaults to 64MB, controls the file
size threshold used to trigger this optimization.
If the disks are fast, disable this feature by setting the config to false, as reading from
Alluxio will then be faster. As a rule of thumb, set it to false if disk throughput exceeds
300 MB/s.

## Alluxio Troubleshooting

This section provides links on configuring, tuning, and troubleshooting Alluxio.

- [Quick Start Guide](https://docs.alluxio.io/os/user/stable/en/overview/Getting-Started.html)
- [Amazon S3 as Alluxio’s under storage system](https://docs.alluxio.io/os/user/stable/en/ufs/S3.html)
- [Alluxio metrics](https://docs.alluxio.io/os/user/stable/en/reference/Metrics-List.html)
- [Alluxio configuration](https://docs.alluxio.io/os/user/stable/en/reference/Properties-List.html)
- [Running Spark on Alluxio](https://docs.alluxio.io/os/user/stable/en/compute/Spark.html)
- [Performance Tuning](https://docs.alluxio.io/os/user/stable/en/operation/Performance-Tuning.html)
- [Alluxio troubleshooting](https://docs.alluxio.io/os/user/stable/en/operation/Troubleshooting.html)

## Alluxio reliability
The properties mentioned in this section can be found in
[Alluxio configuration](https://docs.alluxio.io/os/user/stable/en/reference/Properties-List.html).

### Dealing with client-side delays in responses from the master or workers
If the master is not responding, possibly due to a crash or a GC pause,
`alluxio.user.rpc.retry.max.duration` will cause the client to retry for 2 minutes.
This is a very long time and can delay the running job, so we suggest lowering this value to
10 seconds.

If a worker is not responding, possibly due to a crash or a GC pause,
`alluxio.user.block.read.retry.max.duration` will cause the client to retry for 5 minutes.
This is a very long time and can delay the running job, so we suggest lowering this value to
1 minute.

See also the related configs:
```
alluxio.user.rpc.retry.max.duration
alluxio.user.rpc.retry.max.sleep
alluxio.user.rpc.retry.base.sleep

alluxio.user.block.read.retry.max.duration
alluxio.user.block.read.retry.sleep.max
alluxio.user.block.read.retry.sleep.base
```
The configurations above define the `ExponentialTimeBoundedRetry` retry policies and max
durations; adjust them to appropriate values.

Set these properties on Spark, because Spark invokes the Alluxio client.
```
$SPARK_HOME/bin/spark-shell \
......
--conf spark.driver.extraJavaOptions='-Dalluxio.user.rpc.retry.max.duration=10sec -Dalluxio.user.block.read.retry.max.duration=1min' \
--conf spark.executor.extraJavaOptions='-Dalluxio.user.rpc.retry.max.duration=10sec -Dalluxio.user.block.read.retry.max.duration=1min' \
......
```

### Worker server tunings to fail fast
By default, `alluxio.master.worker.timeout` is 5 minutes; this is the timeout after which the
master considers a worker lost.
If the worker holding the cache is killed but the elapsed time does not exceed the timeout,
the master still marks the worker as alive. The client will then connect to this dead worker to
pull data and will fail.
-If the worker holding cache is killed and the elapsed time exceeds the timeout, the master marks the worker as lost. -In this case, if cluster has one alive worker, the client will query an alive worker -and the alive worker will pull data from external file system if it is not holding the requested cache. - -To avoid failures when master marking an actual dead worker as alive, set the timeout to a reasonable value, like 1 minute. -vi $ALLUXIO_HOME/conf/alluxio-site.properties -``` -alluxio.master.worker.timeout=60sec -``` - -### The logs -By default, the log path is /logs. -See the master.log and worker.log in this path. - -### Auto start Alluxio the master and workers -After installing Alluxio master and workers, it's better to add a systemd service for each process of master and workers. -Systemd service can automatically restart a process if that process is terminated. - -## Alluxio limitations -### Alluxio does not sync metadata from UFS(e.g. S3) by default -Alluxio does not sync metadata from S3 by default, so it won't pick up any changed files. -For example: -If you update a file in the S3 from 1M size to 10M size and Alluxio already cached the 1M size file, -Alluxio cluster will always use the 1M file. -If you want to enable sync it has performance impact which will affect the read performance. -For details, please search `alluxio.user.file.metadata.sync.interval` in [Alluxio doc](https://docs.alluxio.io/ee/user/stable/en/reference/Properties-List.html). - -## Alluxio metrics -The following sections describes 3 methods to view Alluxio metrics GUI: -- Monitor Alluxio live metrics based on Alluxio Master Web: - When the Alluxio cluster is running, users can monitor current metrics based on Alluxio Master Web. -- Monitor Alluxio live metrics based on Grafana with Prometheus: - When the Alluxio cluster is running, users can monitor current metrics based on Grafana with Prometheus. -- View Alluxio historic metrics based on Grafana with Prometheus: - When the Alluxio cluster is shutdown, users can restore the saved historic metrics and view them locally. -### Monitor Alluxio live metrics based on Alluxio Master Web -Users can view the Alluxio metrics in the Web interface of Alluxio leading master: -http://:19999/metrics -For more details, please refer to section 3.1 of Alluxio doc: [Master Web UI Metrics](https://docs.alluxio.io/os/user/stable/en/operation/Metrics-System.html#default-http-json-sink) -The Alluxio Web UI is not available by default on Databricks, -the following provides a method to view the Web UI by SSH tunnel via jump server. -First forward the Alluxio port 19999 to a new port on a jump server, -then forward the new port on the jump server to a local port, -finally access the local port in the browser. -For example: -- Forward the Alluxio server 19999 port to the port 29999 on jump server. - ssh user@jump-server // login to jump server - ssh -L 29999:localhost:19999 alluxio_master_user@alluxio_master_host -p 2200 -i -- Forward the port 29999 on jump server to local port 39999 on your own machine. - ssh -L 39999:localhost:29999 user@jump-server -- Finally open http://localhost:39999/metrics on your own machine. 
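For reference, the two SSH hops above can also be written out as plain commands. The hostnames, SSH port, and identity file below are placeholders taken from the example and will differ in your environment.

```shell
# On the jump server (after logging in with `ssh user@jump-server`): forward its local port
# 29999 to the Alluxio master web port 19999 (the identity file path is a placeholder).
ssh -L 29999:localhost:19999 alluxio_master_user@alluxio_master_host -p 2200 -i <identity_file>

# On your own machine: forward local port 39999 to port 29999 on the jump server.
ssh -L 39999:localhost:29999 user@jump-server

# Then open http://localhost:39999/metrics in a local browser.
```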
-
-### Monitor Alluxio live metrics based on Grafana with Prometheus
-
-#### Config Prometheus when creating Databricks cluster
-When creating a Databricks cluster via the Docker container for Databricks,
-set the environment variables `ENABLE_ALLUXIO` and `PROMETHEUS_COPY_DATA_PATH`, for example:
-```
-ENABLE_ALLUXIO=1
-PROMETHEUS_COPY_DATA_PATH=/dbfs/user1/dblogs-prometheus
-```
-![img](../img/Databricks/save-prometheus.png)
-The cluster will install Prometheus, configure it to collect the metrics into its own storage,
-and also save the Prometheus-format metrics into the specified path for backup purposes.
-Note: if `ENABLE_ALLUXIO` is not set, `PROMETHEUS_COPY_DATA_PATH` has no effect.
-For more details, refer to [spark-rapids-Databricks-container](https://github.com/NVIDIA/spark-rapids-container/tree/dev/Databricks).
-
-#### Install and start Grafana locally
-For example, on a local Ubuntu machine:
-```bash
-sudo apt-get install -y adduser libfontconfig1
-wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.2.6_amd64.deb
-sudo dpkg -i grafana-enterprise_9.2.6_amd64.deb
-sudo systemctl start grafana-server
-sudo systemctl enable grafana-server
-```
-For more details, refer to the [Grafana download doc](https://grafana.com/grafana/download).
-
-#### Forward the Prometheus port 9090 to a local port
-In order to access the Prometheus-format metrics on the Databricks cluster from local Grafana,
-users may need to create an SSH tunnel to reach the Databricks internal port.
-- Forward port 9090 on the Alluxio master node (where Prometheus runs) to port 29090 on the jump server (log in to the jump server first):
-  `ssh user@jump-server`
-  `ssh -L 29090:localhost:9090 alluxio_master_user@alluxio_master_host -p 2200 -i <private-key-file>`
-- Forward port 29090 on the jump server to local port 39090 on your own machine:
-  `ssh -L 39090:localhost:29090 user@jump-server`
-
-This is the same tunnel method described in [the previous section](#monitor-alluxio-live-metrics-based-on-alluxio-master-web).
-
-#### Config local Grafana to monitor the live metrics
-The main steps are:
-1. Create a Prometheus datasource in Grafana.
-   The URL of the Prometheus datasource is http://localhost:39090 (note: the SSH tunnel port).
-   Refer to the [tutorial](https://grafana.com/docs/grafana/latest/datasources/add-a-data-source/#add-a-data-source) for help on adding a data source.
-   ![img](../img/Databricks/prometheus-datasource.png)
-2. [Download](https://grafana.com/grafana/dashboards/13467) the Grafana template JSON file for Alluxio.
-3. Import the template JSON file to create a dashboard. See this [example](https://grafana.com/docs/grafana/latest/dashboards/export-import/#importing-a-dashboard) for importing a dashboard.
-4. Add the Prometheus data source to Grafana.
-5. Modify the variables in the dashboard settings and save your dashboard.
-   - alluxio_datasource: the Prometheus datasource name used in step 1.
-   - masters: the Alluxio master. It is the master `job_name` configured in `prometheus.yml` on the Databricks cluster.
-   - workers: currently unused; Databricks does not collect worker metrics.
-   - alluxio_user: `ubuntu`, the user used to start Alluxio.
- ![img](../img/Databricks/grafana.png)
-
-For more details, refer to section 3.2 of the Alluxio doc: [Grafana Web UI with Prometheus](https://docs.alluxio.io/os/user/stable/en/operation/Metrics-System.html#default-http-json-sink).
-
-#### View a specific live Alluxio metric in the Prometheus Web UI
-The dashboard in the previous section may not show all the metrics users care about;
-the following describes how to view a specific live Alluxio metric.
-Open the Prometheus Web UI: http://localhost:39090/graph
-Click the `Open metrics explorer` button.
-![img](../img/Databricks/list-metrics.png)
-Then a list is shown:
-![img](../img/Databricks/list-metrics-result.png)
-Select a metric and then click the Graph tab; a graph is shown:
-![img](../img/Databricks/a-metric-graph.png)
-
-### View Alluxio historic metrics based on Grafana with Prometheus
-This section is similar to [the previous section](#monitor-alluxio-live-metrics-based-on-grafana-with-prometheus).
-After the Databricks cluster is shut down, the Web UI on Databricks can no longer be accessed,
-so this section describes how to view historic metrics.
-The differences are:
-- View historic metrics after the Databricks cluster is shut down
-- Install and start Prometheus locally
-- Restore the saved Prometheus data into the local Prometheus
-
-The steps are as follows:
-
-#### [Save Alluxio historic Prometheus-format metrics](#config-prometheus-when-creating-databricks-cluster)
-
-#### [Install and start Grafana locally](#install-and-start-grafana-locally)
-
-#### Install and start Prometheus locally
-For example, on a local Ubuntu machine:
-```bash
-wget https://github.com/prometheus/prometheus/releases/download/v2.37.3/prometheus-2.37.3.linux-amd64.tar.gz
-tar xvfz prometheus-*.tar.gz
-cd prometheus-*
-```
-For more details, refer to the [Prometheus installation doc](https://prometheus.io/docs/prometheus/latest/installation).
-
-#### Restore historic metrics into Prometheus
-Copy the data saved under `PROMETHEUS_COPY_DATA_PATH` into the Prometheus data path.
-```bash
-cd <prometheus-root-dir>
-mkdir data
-# Copy the saved files into the `data` directory.
-cp -r $PROMETHEUS_COPY_DATA_PATH/some/sub/path/* /path/to/prometheus-root-dir/data
-```
-`ls /path/to/prometheus-root-dir/data` should then show files like:
-`chunks_head lock queries.active wal`
-
-#### Start Prometheus
-```bash
-cd <prometheus-root-dir>
-./prometheus
-```
-
-#### View Alluxio historic metrics based on Grafana
-Refer to [Config local Grafana to monitor the live metrics](#config-local-grafana-to-monitor-the-live-metrics).
-The difference is that the Prometheus datasource is local (http://localhost:9090) instead of the remote Prometheus on the Databricks cluster.
-
-![img](../img/Databricks/grafana.png)
-
-#### View a specific historic Alluxio metric in the Prometheus Web UI
-Refer to [the section above](#view-a-specific-live-alluxio-metric-in-the-prometheus-web-ui).
-The difference is that the Prometheus datasource is local instead of the remote Prometheus on the Databricks cluster.
-Open the Prometheus Web UI: http://localhost:9090/graph.
diff --git a/docs/get-started/getting-started-aws-emr.md b/docs/get-started/getting-started-aws-emr.md
deleted file mode 100644
index 4d5dceacab0..00000000000
--- a/docs/get-started/getting-started-aws-emr.md
+++ /dev/null
@@ -1,385 +0,0 @@
----
-layout: page
-title: AWS-EMR
-nav_order: 2
-parent: Getting-Started
----
-# Get Started with RAPIDS on AWS EMR
-
-This is a getting started guide for the RAPIDS Accelerator for Apache Spark on AWS EMR. At the end
-of this guide, the user will be able to run a sample Apache Spark application on NVIDIA GPUs on AWS EMR.
- -Different versions of EMR ship with different versions of Spark, RAPIDS Accelerator, cuDF and xgboost4j-spark: - -| EMR | Spark | RAPIDS Accelerator jar | cuDF jar | xgboost4j-spark jar -| --- | --- | --- | ---| --- | -| 6.13 | 3.4.1 | rapids-4-spark_2.12-23.06.0-amzn-1.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.12 | 3.4.0 | rapids-4-spark_2.12-23.06.0-amzn-0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.11 | 3.3.2 | rapids-4-spark_2.12-23.02.0-amzn-0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.10 | 3.3.1 | rapids-4-spark_2.12-22.12.0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.9 | 3.3.0 | rapids-4-spark_2.12-22.08.0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.8 | 3.3.0 | rapids-4-spark_2.12-22.06.0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar | -| 6.7 | 3.2.1 | rapids-4-spark_2.12-22.02.0.jar | cudf-22.02.0-cuda11.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar | -| 6.6 | 3.2.0 | rapids-4-spark_2.12-22.02.0.jar | cudf-22.02.0-cuda11.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar | -| 6.5 | 3.1.2 | rapids-4-spark_2.12-0.4.1.jar | cudf-0.18.1-cuda10-1.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar | -| 6.4 | 3.1.2 | rapids-4-spark_2.12-0.4.1.jar | cudf-0.18.1-cuda10-1.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar | -| 6.3 | 3.1.1 | rapids-4-spark_2.12-0.4.1.jar | cudf-0.18.1-cuda10-1.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar | -| 6.2 | 3.0.1 | rapids-4-spark_2.12-0.2.0.jar | cudf-0.15-cuda10-1.jar | xgboost4j-spark_3.0-1.0.0-0.2.0.jar | - -For more details about each EMR release, please see the [EMR release -notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html). - -For more information on AWS EMR, please see the [AWS -documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). - -## Leveraging Spark RAPIDS User Tools for Qualification and Bootstrap - -To use the qualification and bootstrap tools for EMR, you will want to install the Spark RAPIDS user tools package. -Instructions for installing and setting up the Spark RAPIDS user tools package for EMR can be found here: -[link](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/user-tools-aws-emr.md). - -## Qualify CPU Workloads for GPU Acceleration - -The [qualification tool](https://nvidia.github.io/spark-rapids/docs/spark-qualification-tool.html) is launched to analyze CPU applications -that have already run. The tool will output the applications recommended for acceleration along with estimated speed-up -and cost saving metrics. Additionally, it will provide information on how to launch a GPU-accelerated cluster to take -advantage of the speed-up and cost savings. 
- -Usage: `spark_rapids_user_tools emr qualification --eventlogs --cpu_cluster ` - -Help (to see all options available): `spark_rapids_user_tools emr qualification --help` - -Example output: -``` -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ -| | App Name | App ID | Recommendation | Estimated GPU | Estimated GPU | App | Estimated GPU | -| | | | | Speedup | Duration(s) | Duration(s) | Savings(%) | -|----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------| -| 0 | query24 | application_1664888311321_0011 | Strongly Recommended | 3.49 | 257.18 | 897.68 | 59.70 | -| 1 | query78 | application_1664888311321_0009 | Strongly Recommended | 3.35 | 113.89 | 382.35 | 58.10 | -| 2 | query23 | application_1664888311321_0010 | Strongly Recommended | 3.08 | 325.77 | 1004.28 | 54.37 | -| 3 | query64 | application_1664888311321_0008 | Strongly Recommended | 2.91 | 150.81 | 440.30 | 51.82 | -| 4 | query50 | application_1664888311321_0003 | Recommended | 2.47 | 101.54 | 250.95 | 43.08 | -| 5 | query16 | application_1664888311321_0005 | Recommended | 2.36 | 106.33 | 251.95 | 40.63 | -| 6 | query38 | application_1664888311321_0004 | Recommended | 2.29 | 67.37 | 154.33 | 38.59 | -| 7 | query87 | application_1664888311321_0006 | Recommended | 2.25 | 75.67 | 170.69 | 37.64 | -| 8 | query51 | application_1664888311321_0002 | Recommended | 1.53 | 53.94 | 82.63 | 8.18 | -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ - -Instance types conversions: ------------ -- ------------ -m5d.8xlarge to g4dn.8xlarge ------------ -- ------------ -To support acceleration with T4 GPUs, switch the worker node instance types -``` - -## Configure and Launch AWS EMR with GPU Nodes - -Please follow AWS EMR document ["Using the NVIDIA Spark-RAPIDS Accelerator -for Spark"](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). -Below is an example. - -### Launch an EMR Cluster using AWS Console (GUI) - -Go to the AWS Management Console and select the `EMR` service from the "Analytics" section. Choose -the region you want to launch your cluster in, e.g. US West (Oregon), using the dropdown menu in the -top right corner. Click `Create cluster`, which will bring up a detailed cluster configuration page. - -#### Step 1: EMR Release and Application Bundle Selection - -Enter a custom "Cluster name" for your cluster. - -Select **emr-6.13.0** for the release and pick "Custom" for the "Application bundle". Uncheck all the -software options, and then check **Hadoop 3.3.3**, **Spark 3.4.1**, **Livy 0.7.1** and -**JupyterEnterpriseGateway 2.6.0**. - -Optionally pick Amazon Linux Release or configure a "Custom AMI". - -![Step 1: Software, Configuration and Steps](../img/AWS-EMR/name-and-applications.png) - -#### Step 2: Hardware - -Keep the default "Primary" node instance type of **m5.xlarge**. - -Change the "Core" node "Instance type" to **g4dn.xlarge**, **g4dn.2xlarge**, or -**p3.2xlarge** - -An optional step is to have "Task" nodes. These nodes can run a Spark executor but they do not run -the HDFS Data Node service. You can click on "Remove instance group" if you would like to only run -"Core" nodes with the Data Node and Spark executors. 
If you want to add extra "Task" nodes, make sure -that that instance type matches what you selected for "Core". - -Under "Cluster scaling and provisioning potion", verify that the instance count for the "Core" instance group -is at least 1. - -![Step 2: Cluster Configuration](../img/AWS-EMR/cluster-configuration.png) - -Under "Networking", select the desired VPC and subnet. You can also create a new VPC and subnet for the cluster. - -*Optionally* set custom security groups in the "EC2 security groups" tab. - -In the "EC2 security groups" section, confirm that the security group chosen for the "Primary" node -allows for SSH access. Follow these instructions to [allow inbound SSH -traffic](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) -if the security group does not allow it yet. - -![Step 2: Cluster Configuration](../img/AWS-EMR/networking.png) - -#### Step 3: General Cluster Settings - -Add a custom bootstrap action under "Bootstrap Actions" to allow cgroup permissions to YARN on your cluster. -An example bootstrap script is as follows: -```bash -#!/bin/bash - -set -ex - -sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct -sudo chmod a+rwx -R /sys/fs/cgroup/devices -``` - -![Step 3: General Cluster Settings](../img/AWS-EMR/bootstrap-action.png) - -#### Step 4: Edit Software Configuration - -In the "Software settings" field, copy and paste the configuration from the [EMR -document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html) in the textbox provided -under "Enter configuration". You can also create a JSON file on you own S3 bucket when selecting -"Load JSON from Amazon S3". - -For clusters with 2x g4dn.2xlarge GPU instances as worker nodes, we recommend the following -default settings: -```json -[ - { - "Classification":"spark", - "Properties":{ - "enableSparkRapids":"true" - } - }, - { - "Classification":"yarn-site", - "Properties":{ - "yarn.nodemanager.resource-plugins":"yarn.io/gpu", - "yarn.resource-types":"yarn.io/gpu", - "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", - "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", - "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", - "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", - "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", - "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" - } - }, - { - "Classification":"container-executor", - "Properties":{ - - }, - "Configurations":[ - { - "Classification":"gpu", - "Properties":{ - "module.enabled":"true" - } - }, - { - "Classification":"cgroups", - "Properties":{ - "root":"/sys/fs/cgroup", - "yarn-hierarchy":"yarn" - } - } - ] - }, - { - "Classification":"spark-defaults", - "Properties":{ - "spark.plugins":"com.nvidia.spark.SQLPlugin", - "spark.sql.sources.useV1SourceList":"", - "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", - "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar", - "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", - "spark.rapids.sql.concurrentGpuTasks":"2", - "spark.executor.resource.gpu.amount":"1", 
- "spark.executor.cores":"8", - "spark.task.cpus ":"1", - "spark.task.resource.gpu.amount":"0.125", - "spark.rapids.memory.pinnedPool.size":"2G", - "spark.executor.memoryOverhead":"2G", - "spark.sql.files.maxPartitionBytes":"256m", - "spark.sql.adaptive.enabled":"false" - } - }, - { - "Classification":"capacity-scheduler", - "Properties":{ - "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" - } - } -] - -``` -Adjust the settings as appropriate for your cluster. For example, setting the appropriate -number of cores based on the node type. The `spark.task.resource.gpu.amount` should be set to -1/(number of cores per executor) which will allow multiple tasks to run in parallel on the GPU. - -For example, for clusters with 2x g4dn.12xlarge as core nodes, use the following: - -```json - "spark.executor.cores":"12", - "spark.task.resource.gpu.amount":"0.0833", -``` - -More configuration details can be found in the [configuration](../configs.md) documentation. - -#### Step 5: Security - -Select an existing "EC2 key pair" that will be used to authenticate SSH access to the cluster's -nodes. If you do not have access to an EC2 key pair, follow these instructions to [create an EC2 key -pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). - -![Step 5: SSH Key Pair](../img/AWS-EMR/ssh-key-pair.png) - -#### Finish Cluster Configuration - -The EMR cluster management page displays the status of multiple clusters or detailed information -about a chosen cluster. In the detailed cluster view, the "Instances" and "Monitoring" tabs can be used -to monitor the status of the various cluster nodes. - -When the cluster is ready, a green-dot will appear next to the cluster name and the "Status" column -will display **Waiting, cluster ready**. - -In the cluster's "Summary" tab, find the "Primary node public DNS" field and click on -"Connect to the Primary Node using SSH". Follow the instructions to SSH to the new cluster's primary node. - -### Launch an EMR Cluster using AWS CLI - -In this example, we will use the AWS CLI to launch a cluster with one Primary node (m5.xlarge) and two -g4dn.2xlarge nodes. - -You will need: -- an SSH key-pair already registered in the AWS console -- a subnet and VPC configuration (default or a custom configuration) - -```bash -aws emr create-cluster \ ---release-label emr-6.13.0 \ ---applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \ ---service-role DemoServiceRole \ ---ec2-attributes KeyName=demo-key-pair,SubnetId=demo-subnet,InstanceProfile=DemoInstanceProfile \ ---instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \ - InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge ---configurations file://config.json \ ---bootstrap-actions Name='Setup cgroups bootstrap',Path=s3://demo-bucket/cgroup-bootstrap-action.sh -``` - -Please fill with actual value for `KeyName`, `SubnetId`, `service-role`, and `InstanceProfile`. -The service role and instance profile are AWS IAM roles associated with your cluster, which allow -the EMR cluster to access services provided by AWS. - -The `config.json` installs the spark-rapids plugin on your cluster, configures YARN to use -GPUs, configures Spark to use RAPIDS, and configures the YARN capacity scheduler. An [example JSON -configuration](#step-4--edit-software-configuration) can be found in the section on -launching in the GUI above. 
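-
-If you keep these files locally, the bootstrap script must be staged in S3 before the cluster is created,
-while `config.json` is read from the local path given with `file://`. A minimal sketch, assuming the
-`demo-bucket` bucket name used in the command above and that the script is saved locally as
-`cgroup-bootstrap-action.sh`:
-```bash
-# Upload the cgroup bootstrap script so EMR can fetch it at provision time.
-aws s3 cp ./cgroup-bootstrap-action.sh s3://demo-bucket/cgroup-bootstrap-action.sh
-# Verify the object exists before running `aws emr create-cluster`.
-aws s3 ls s3://demo-bucket/cgroup-bootstrap-action.sh
-```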
-
-The `cgroup-bootstrap-action.sh` script referenced in the `create-cluster` command above opens cgroup permissions to YARN
-on your cluster. You can find an example of
-the [cgroup bootstrap action](#step-3--general-cluster-settings) above.
-
-### Running the Spark RAPIDS User Tools Bootstrap for Optimal Cluster Spark Settings
-
-The bootstrap tool will generate optimized settings for the RAPIDS Accelerator on Apache Spark on a
-GPU cluster for EMR. The tool will fetch the characteristics of the cluster -- including the
-number of workers, worker cores, worker memory, and GPU accelerator type and count. It will use
-the cluster properties to then determine the optimal settings for running GPU-accelerated Spark
-applications.
-
-Usage: `spark_rapids_user_tools emr bootstrap --cluster <cluster-name>`
-
-Help (to see all options available): `spark_rapids_user_tools emr bootstrap --help`
-
-Example output:
-```
-##### BEGIN : RAPIDS bootstrap settings for gpu-cluster
-spark.executor.cores=16
-spark.executor.memory=32768m
-spark.executor.memoryOverhead=7372m
-spark.rapids.sql.concurrentGpuTasks=2
-spark.rapids.memory.pinnedPool.size=4096m
-spark.sql.files.maxPartitionBytes=512m
-spark.task.resource.gpu.amount=0.0625
-##### END : RAPIDS bootstrap settings for gpu-cluster
-```
-
-A detailed description of the bootstrap settings, with usage information, is available on the [RAPIDS Accelerator for Apache Spark Configuration](https://nvidia.github.io/spark-rapids/docs/configs.html) and [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) pages.
-
-### Running an Example Join Operation Using Spark Shell
-
-Please follow the EMR doc [Connect to the primary node using
-SSH](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html) to SSH
-to the EMR cluster's primary node. Then start the Spark shell and run the SQL join example below to verify
-GPU operation.
-
-Note: Use the `hadoop` user for SSH and for the command below.
-
-```bash
-spark-shell
-```
-
-Run the following Scala code in the Spark shell:
-
-```scala
-val data = 1 to 10000
-val df1 = sc.parallelize(data).toDF()
-val df2 = sc.parallelize(data).toDF()
-val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value")
-out.count()
-out.explain()
-```
-
-### Submit Spark jobs to an EMR Cluster Accelerated by GPUs
-
-Similar to spark-submit for on-prem clusters, AWS EMR supports submitting a Spark application job.
-The mortgage examples we use are also available as a Spark application. You can also use
-**spark shell** to run the Scala code or **pyspark** to run the Python code on the primary node through the
-CLI.
-In the Spark History Server UI, you can see that CPU operations have been replaced by GPU operations with the `GPU` prefix:
-![Spark History Server UI](../img/AWS-EMR/spark-history-server-UI.png)
-
-### Running GPU Accelerated Mortgage ETL Example using EMR Notebook
-
-An EMR Notebook is a "serverless" Jupyter notebook. Unlike a traditional notebook, the contents of
-an EMR Notebook itself—the equations, visualizations, queries, models, code, and narrative text—are
-saved in Amazon S3 separately from the cluster that runs the code. This provides an EMR Notebook
-with durable storage, efficient access, and flexibility.
-
-You can use the following step-by-step guide to run the example mortgage dataset using RAPIDS on
-Amazon EMR GPU clusters. 
For more examples, please refer to [NVIDIA/spark-rapids for -ETL](https://github.com/NVIDIA/spark-rapids/tree/main/docs/demo) - -![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_2.png) - -#### Create EMR Notebook and Connect to EMR GPU Cluster - -Go to the Amazon EMR page and select "Studios" under "EMR Studios". You can create a Studio if -you haven't already. - -Create a notebook by clicking on "Workspaces (Notebooks)" on the left column and then clicking -on the "Create Workspace" button. Select the studio you selected in the prior step. - -Enter a Workspace name, description and a location (which should be set by default to the studio -S3 path). Under "Advanced configuration", you can pick an EMR cluster that you have already -launched. - -![Create EMR Notebook](../img/AWS-EMR/notebook-workspace-creation.png) - -#### Run Mortgage ETL PySpark Notebook on EMR GPU Cluster - -Download [the Mortgate ETL PySpark Notebook](../demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb). Make sure -to use PySpark as kernel. This example use 1 year (year 2000) data for a two node g4dn GPU -cluster. You can adjust settings in the notebook for full mortgage dataset ETL. - -When executing the ETL code, you can also see the Spark Job Progress within the notebook and the -code will also display how long it takes to run the query - -![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_3.png) diff --git a/docs/get-started/getting-started-azure-synapse-analytics.md b/docs/get-started/getting-started-azure-synapse-analytics.md deleted file mode 100644 index 5826ed28b4e..00000000000 --- a/docs/get-started/getting-started-azure-synapse-analytics.md +++ /dev/null @@ -1,19 +0,0 @@ ---- -layout: page -title: Azure Synapse Analytics -nav_order: 5 -parent: Getting-Started ---- -# Getting started with RAPIDS Accelerator on Azure Synapse Analytics - [Azure Synapse Analytics](https://docs.microsoft.com/en-us/azure/synapse-analytics/) is an analytics service that brings - together enterprise data warehousing and Big Data analytics. - -Synapse now offers the ability to create Apache Spark pools that use GPUs on the backend to run your Spark workloads on -GPUs for accelerated processing. These are called -[GPU-accelerated Apache Spark pools](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-gpu-concept). -Currently it ships with the RAPIDS Accelerator for Apache Spark version 21.10. - -Please follow the Quickstart guides below to learn how to create and use GPU pools in Azure Synapse Analytics: -1. [Quickstart: Create an Apache Spark GPU-enabled Pool in Azure Synapse Analytics using the Azure portal](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-gpu-pool-portal) -2. [Quickstart: Create an Apache Spark notebook to run on a GPU pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-rapids-gpu) - diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md deleted file mode 100644 index 459a637153c..00000000000 --- a/docs/get-started/getting-started-databricks.md +++ /dev/null @@ -1,171 +0,0 @@ ---- -layout: page -title: Databricks -nav_order: 3 -parent: Getting-Started ---- - -# Getting started with RAPIDS Accelerator on Databricks -This guide will run through how to set up the RAPIDS Accelerator for Apache Spark 3.x on Databricks. -At the end of this guide, the reader will be able to run a sample Apache Spark application that runs -on NVIDIA GPUs on Databricks. 
- -## Supported runtime versions -Please see [Software Requirements](../download.md#software-requirements) section for complete list of -Databricks runtime versions supported by RAPIDS plugin. - -Databricks may do [maintenance -releases](https://docs.databricks.com/release-notes/runtime/maintenance-updates.html) for their -runtimes which may impact the behavior of the plugin. - -The number of GPUs per node dictates the number of Spark executors that can run in that node. - -## Limitations - -1. When selecting GPU nodes, Databricks UI requires the driver node to be a GPU node. However you - can use Databricks API to create a cluster with CPU driver node. - Outside of Databricks the plugin can operate with the driver as a CPU node and workers as GPU nodes. - -2. Cannot spin off multiple executors on a multi-GPU node. - - Even though it is possible to set `spark.executor.resource.gpu.amount=1` in the in Spark - Configuration tab, Databricks overrides this to `spark.executor.resource.gpu.amount=N` - (where N is the number of GPUs per node). This will result in failed executors when starting the - cluster. - -3. Parquet rebase mode is set to "LEGACY" by default. - - The following Spark configurations are set to `LEGACY` by default on Databricks: - - ``` - spark.sql.legacy.parquet.datetimeRebaseModeInWrite - spark.sql.legacy.parquet.int96RebaseModeInWrite - ``` - - These settings will cause a CPU fallback for Parquet writes involving dates and timestamps. - If you do not need `LEGACY` write semantics, set these configs to `EXCEPTION` which is - the default value in Apache Spark 3.0 and higher. - -4. Databricks makes changes to the runtime without notification. - - Databricks makes changes to existing runtimes, applying patches, without notification. - [Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this. We run - regular integration tests on the Databricks environment to catch these issues and fix them once - detected. - -5. In Databricks 11.3, an incorrect result is returned for window frames defined by a range in case - of DecimalTypes with precision greater than 38. There is a bug filed in Apache Spark for it - [here](https://issues.apache.org/jira/browse/SPARK-41793), whereas when using the plugin the - correct result will be returned. - -## Start a Databricks Cluster -Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/clusters/init-scripts.html) for the -cluster to install the RAPIDS jars. Databricks recommends storing all cluster-scoped init scripts using workspace files. -Each user has a Home directory configured under the /Users directory in the workspace. -Navigate to your home directory in the UI and select **Create** > **File** from the menu, -create an `init.sh` scripts with contents: - ```bash - #!/bin/bash - sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.08.1.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.1/rapids-4-spark_2.12-23.08.1.jar - ``` -Then create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the -cluster meets the prerequisites above by configuring it as follows: -1. Select the Databricks Runtime Version from one of the supported runtimes specified in the - Prerequisites section. -2. Choose the number of workers that matches the number of GPUs you want to use. -3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`. - For Azure, choose GPU nodes such as Standard_NC6s_v3. 
For GCP, choose N1 or A2 instance types with GPUs.
-4. Select the driver type. Generally this can be set to be the same as the worker.
-5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab in
-   the advanced options section, and paste the workspace path to the initialization script: `/Users/user@domain/init.sh`, then click “Add”.
-    ![Init Script](../img/Databricks/initscript.png)
-6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
-   Change the config values based on the workers you choose. See the Apache Spark
-   [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
-   for Apache Spark [descriptions](../configs.md) for each config.
-
-   The
-   [`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
-   configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an
-   executor with 1 GPU, which is limiting, especially for reads and writes from Parquet. Set
-   this to 1/(number of cores per executor), which will allow multiple tasks to run in parallel just
-   like on the CPU side. A smaller value is fine as well.
-   Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
-   cluster because Spark local mode does not support GPU scheduling.
-
-   ```bash
-   spark.plugins com.nvidia.spark.SQLPlugin
-   spark.task.resource.gpu.amount 0.1
-   spark.rapids.memory.pinnedPool.size 2G
-   spark.rapids.sql.concurrentGpuTasks 2
-   ```
-
-   ![Spark Config](../img/Databricks/sparkconfig.png)
-
-   If running Pandas UDFs with GPU support from the plugin, at least the three additional options
-   below are required. The `spark.python.daemon.module` option chooses the right Python daemon module
-   for Databricks. On Databricks, the Python runtime requires different parameters than the
-   Spark one, so a dedicated Python daemon module `rapids.daemon_databricks` is created and should
-   be specified here. Set the config
-   [`spark.rapids.sql.python.gpu.enabled`](../additional-functionality/advanced_configs.md#sql.python.gpu.enabled) to `true` to
-   enable GPU support for Python. Add the path of the plugin jar (supposing it is placed under
-   `/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details please see
-   [GPU Scheduling For Pandas UDF](../additional-functionality/rapids-udfs.md#gpu-support-for-pandas-udf).
-
-   ```bash
-   spark.rapids.sql.python.gpu.enabled true
-   spark.python.daemon.module rapids.daemon_databricks
-   spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.08.1.jar:/databricks/spark/python
-   ```
-   Note that the Python memory pool requires the cudf library, so you either need to install the cudf library on
-   each worker node (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python
-   memory pool with `spark.rapids.python.memory.gpu.pooling.enabled=false`.
-
-7. Click `Create Cluster`; the cluster is now enabled for GPU-accelerated Spark.
-
-## RAPIDS Accelerator for Apache Spark Docker container for Databricks
-
-The GitHub repo [spark-rapids-container](https://github.com/NVIDIA/spark-rapids-container) provides the
-Dockerfile and scripts to build custom Docker containers with the RAPIDS Accelerator for Apache Spark.
-
-Please refer to the [Databricks doc](https://github.com/NVIDIA/spark-rapids-container/tree/main/Databricks)
-for more details.
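-
-Whichever approach you use (init script or custom container), you can sanity-check from a notebook cell
-that the plugin jar is actually in place before running workloads. This is only a sketch and assumes the
-23.08.1 jar name and `/databricks/jars/` path used earlier in this guide:
-```bash
-%sh
-# List the RAPIDS Accelerator jar that the init script (or container) should have installed.
-ls -l /databricks/jars/rapids-4-spark_2.12-23.08.1.jar
-```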
- -## Import the GPU Mortgage Example Notebook -Import the example [notebook](../demo/Databricks/Mortgage-ETL-db.ipynb) from the repo into your -workspace, then open the notebook. Please find this [instruction](https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples/dataset/mortgage.md) -to download the dataset. - -```bash -%sh - -USER_ID= - -wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz -P /Users/${USER_ID}/ - -mkdir -p /dbfs/FileStore/tables/mortgage -mkdir -p /dbfs/FileStore/tables/mortgage_parquet_gpu/perf -mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/acq -mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/output - -tar xfvz /Users/${USER_ID}/mortgage_2000.tgz --directory /dbfs/FileStore/tables/mortgage -``` - -In Cell 3, update the data paths if necessary. The example notebook merges the columns and prepares -the data for XGBoost training. The temp and final output results are written back to the dbfs. - -```bash -orig_perf_path='dbfs:///FileStore/tables/mortgage/perf/*' -orig_acq_path='dbfs:///FileStore/tables/mortgage/acq/*' -tmp_perf_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/perf/' -tmp_acq_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/acq/' -output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/' -``` -Run the notebook by clicking “Run All”. - -## Hints -Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud -storage location using Databricks [cluster log -delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this -option before starting the cluster to capture the logs. - diff --git a/docs/get-started/getting-started-gcp.md b/docs/get-started/getting-started-gcp.md deleted file mode 100644 index 9c7537489d1..00000000000 --- a/docs/get-started/getting-started-gcp.md +++ /dev/null @@ -1,543 +0,0 @@ ---- -layout: page -title: GCP Dataproc -nav_order: 4 -parent: Getting-Started ---- - -# Getting Started with the RAPIDS Accelerator on GCP Dataproc - [Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache - Spark and Hadoop service. Please see [Software Requirements](../download.md#software-requirements) - section for complete list of Dataproc versions supported by RAPIDS plugin. 
- The quick start guide will go through: - -* [Create a Dataproc Cluster Accelerated by GPUs](#create-a-dataproc-cluster-accelerated-by-gpus) - * [Create a Dataproc Cluster using T4's](#create-a-dataproc-cluster-using-t4s) - * [Build custom Dataproc image to reduce cluster initialization time](#build-a-custom-dataproc-image-to-reduce-cluster-init-time) - * [Create a Dataproc Cluster using MIG with A100's](#create-a-dataproc-cluster-using-mig-with-a100s) - * [Cluster creation troubleshooting](#cluster-creation-troubleshooting) -* [Run Python or Scala Spark ETL and XGBoost training Notebook on a Dataproc Cluster Accelerated by -GPUs](#run-python-or-scala-spark-notebook-on-a-dataproc-cluster-accelerated-by-gpus) -* [Submit the same sample ETL application as a Spark job to a Dataproc Cluster Accelerated by -GPUs](#submit-spark-jobs-to-a-dataproc-cluster-accelerated-by-gpus) - -We provide some RAPIDS tools to analyze the clusters and the applications running on [Google Cloud Dataproc](https://cloud.google.com/dataproc) including: -* [Diagnosing a GPU Cluster](#diagnosing-a-gpu-cluster) -* [Bootstrap GPU cluster with optimized settings](#bootstrap-gpu-cluster-with-optimized-settings) -* [Qualify CPU workloads for GPU acceleration](#qualify-cpu-workloads-for-gpu-acceleration) -* [Tune applications on GPU cluster](#tune-applications-on-gpu-cluster) - -The Prerequisites of the RAPIDS tools including: -* gcloud CLI is installed: https://cloud.google.com/sdk/docs/install -* python 3.8+ -* `pip install spark-rapids-user-tools` - -## Create a Dataproc Cluster Accelerated by GPUs - -You can use [Cloud Shell](https://cloud.google.com/shell) to execute shell commands that will -create a Dataproc cluster. Cloud Shell contains command line tools for interacting with Google -Cloud Platform, including `gcloud` and `gsutil`. Alternatively, you can install -[GCloud SDK](https://cloud.google.com/sdk/install) on your machine. From the Cloud Shell, users -will need to enable services within your project. Enable the Compute and Dataproc APIs in order to -access Dataproc, and enable the Storage API as you’ll need a Google Cloud Storage bucket to house -your data. This may take several minutes. - -```bash -gcloud services enable compute.googleapis.com -gcloud services enable dataproc.googleapis.com -gcloud services enable storage-api.googleapis.com -``` - -After the command line environment is set up, log in to your GCP account. You can now create a -Dataproc cluster. Dataproc supports multiple different GPU types depending on your use case. -Generally, T4 is a good option for use with the RAPIDS Accelerator for Spark. We also support -MIG on the Ampere architecture GPUs like the A100. Using -[MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) you can request an A100 and split -it up into multiple different compute instances, and it runs like you have multiple separate GPUs. - -The example configurations below will allow users to run any of the [notebook -demos](../demo/GCP) on GCP. Adjust the sizes and -number of GPU based on your needs. 
- -The script below will initialize with the following: - -* [GPU Driver and RAPIDS Accelerator for Apache Spark](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/spark-rapids) -through initialization actions (please note it takes up to 1 week for the latest init script to be -merged into the GCP Dataproc public GCS bucket) - - To make changes to example configuration, make a copy of `spark-rapids.sh` and add the RAPIDS - Accelerator related parameters according to [tuning guide](../tuning-guide.md) and modify the - `--initialization-actions` parameter to point to the updated version. -* Configuration for [GPU scheduling and isolation](yarn-gpu.md) -* [Local SSDs](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds) are -recommended for Spark scratch space to improve IO -* Component gateway enabled for accessing Web UIs hosted on the cluster - -> NOTE: Dataproc `2.1` enables Secure Boot by default, causing issues with cluster creation on -> all operating systems due to GPU drivers that are not properly signed. Proper GPU driver signing -> is not currently supported on all operating systems and will cause cluster creation failures -> occur on all Dataproc `2.1` clusters that use the -> [RAPIDS Accelerator initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh). -> -> Add `--no-shielded-secure-boot` to `gcloud` cluster creation commands allow the cluster to boot -> correctly with unsigned GPU drivers. -> -> For more information: ->* [Shielded secure boot configuration](https://cloud.google.com/compute/shielded-vm/docs/modifying-shielded-vm#gcloud) ->* [Installing GPU drivers with secure boot](https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#secure-boot) - -### Create a Dataproc Cluster Using T4's -* One 16-core master node and 5 32-core worker nodes -* Two NVIDIA T4 for each worker node - -```bash - export REGION=[Your Preferred GCP Region] - export GCS_BUCKET=[Your GCS Bucket] - export CLUSTER_NAME=[Your Cluster Name] - # Number of GPUs to attach to each worker node in the cluster - export NUM_GPUS=2 - # Number of Spark worker nodes in the cluster - export NUM_WORKERS=5 - -gcloud dataproc clusters create $CLUSTER_NAME \ - --region=$REGION \ - --image-version=2.0-ubuntu18 \ - --master-machine-type=n1-standard-16 \ - --num-workers=$NUM_WORKERS \ - --worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \ - --worker-machine-type=n1-highmem-32 \ - --num-worker-local-ssds=4 \ - --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \ - --optional-components=JUPYTER,ZEPPELIN \ - --metadata=rapids-runtime=SPARK \ - --bucket=$GCS_BUCKET \ - --enable-component-gateway \ - --subnet=default -``` - -This takes around 10-15 minutes to complete. You can navigate to the -[Dataproc clusters tab](https://console.cloud.google.com/dataproc/clusters) in the Google Cloud -Console to see the progress. - -![Dataproc Cluster](../img/GCP/dataproc-cluster.png) - -To reduce initialization time to 4-5 minutes, create a custom Dataproc image using -[this guide](#build-a-custom-dataproc-image-to-reduce-cluster-init-time). - -### Build a Custom Dataproc Image to Reduce Cluster Init Time -Building a custom Dataproc image that already has NVIDIA drivers, the CUDA toolkit, and the RAPIDS -Accelerator for Apache Spark preinstalled and preconfigured will reduce cluster initialization -time to 3-4 minutes. 
The custom image can also be used in air gap environments. In this example, -we adapt [these instructions from GCP](https://cloud.google.com/dataproc/docs/guides/dataproc-images) -to create a RAPIDS Accelerator custom Dataproc image. - -Google provides the `generate_custom_image.py` -[script](https://github.com/GoogleCloudDataproc/custom-images/blob/master/generate_custom_image.py) -that: -- Launches a temporary Compute Engine VM instance with the specified Dataproc base image. -- Then runs the customization script inside the VM instance to install custom packages and/or -update configurations. -- After the customization script finishes, it shuts down the VM instance and creates a Dataproc -custom image from the disk of the VM instance. -- The temporary VM is deleted after the custom image is created. -- The custom image is saved and can be used to create Dataproc clusters. - - -The following example clones the -[Dataproc custom-images](https://github.com/GoogleCloudDataproc/custom-images) repository, downloads the -[`spark-rapids.sh`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/spark-rapids) -script locally, and then creates a custom Dataproc image with RAPIDS Accelerator resources using -the downloaded scripts. The `spark-rapids.sh` script is passed as the customization script and -installs the RAPIDS Accelerator for Apache Spark, NVIDIA drivers, and other dependencies. -Custom image generation may take 20-25 minutes to complete. - -```bash -git clone https://github.com/GoogleCloudDataproc/custom-images -cd custom-images -wget https://raw.githubusercontent.com/GoogleCloudDataproc/initialization-actions/master/spark-rapids/spark-rapids.sh - -export ZONE=[Your Preferred GCP Zone] -export GCS_BUCKET=[Your GCS Bucket] -export CUSTOMIZATION_SCRIPT=./spark-rapids.sh -export IMAGE_NAME=sample-20-ubuntu18-gpu-t4 -export DATAPROC_VERSION=2.0-ubuntu18 -export GPU_NAME=nvidia-tesla-t4 -export GPU_COUNT=1 - -python generate_custom_image.py \ - --image-name $IMAGE_NAME \ - --dataproc-version $DATAPROC_VERSION \ - --customization-script $CUSTOMIZATION_SCRIPT \ - --no-smoke-test \ - --zone $ZONE \ - --gcs-bucket $GCS_BUCKET \ - --machine-type n1-standard-4 \ - --accelerator type=$GPU_NAME,count=$GPU_COUNT \ - --disk-size 200 \ - --subnet default -``` - -See [here](https://cloud.google.com/dataproc/docs/guides/dataproc-images#running_the_code) for -details on `generate_custom_image.py` script configuration and -[here](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) for dataproc -version information. - -The custom `sample-20-ubuntu18-gpu-t4` image is now ready and can be viewed in the GCP console under -`Compute Engine > Storage > Images`. 
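-
-If you prefer the CLI over the console, you can verify that the image was created with `gcloud`. A minimal
-sketch, assuming the image name used above:
-```bash
-# Confirm the custom image exists and is READY before using it for cluster creation.
-gcloud compute images describe sample-20-ubuntu18-gpu-t4 --format="value(name,status)"
-```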
- -Let's launch a cluster using the `sample-20-ubuntu18-gpu-t4` custom image: - -```bash -export REGION=[Your Preferred GCP Region] -export GCS_BUCKET=[Your GCS Bucket] -export CLUSTER_NAME=[Your Cluster Name] -export IMAGE_NAME=sample-20-ubuntu18-gpu-t4 -export NUM_GPUS=1 -export NUM_WORKERS=2 - -gcloud dataproc clusters create $CLUSTER_NAME \ - --region=$REGION \ - --image=$IMAGE_NAME \ - --master-machine-type=n1-standard-4 \ - --num-workers=$NUM_WORKERS \ - --worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \ - --worker-machine-type=n1-standard-4 \ - --num-worker-local-ssds=1 \ - --optional-components=JUPYTER,ZEPPELIN \ - --metadata=rapids-runtime=SPARK \ - --bucket=$GCS_BUCKET \ - --enable-component-gateway \ - --subnet=default -``` - -There are no initialization actions that need to be configured because NVIDIA drivers and -RAPIDS Accelerator resources are already installed in the custom image. The new cluster -should be up and running within 3-4 minutes! - -### Create a Dataproc Cluster using MIG with A100's -* One 16-core master node and five 12-core worker nodes -* One NVIDIA A100 for each worker node, split into two MIG instances using -[instance profile 3g.20gb](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#a100-profiles). - -```bash - export REGION=[Your Preferred GCP Region] - export ZONE=[Your Preferred GCP Zone] - export GCS_BUCKET=[Your GCS Bucket] - export CLUSTER_NAME=[Your Cluster Name] - # Number of GPUs to attach to each worker node in the cluster - export NUM_GPUS=1 - # Number of Spark worker nodes in the cluster - export NUM_WORKERS=4 - -gcloud dataproc clusters create $CLUSTER_NAME \ - --region=$REGION \ - --zone=$ZONE \ - --image-version=2.0-ubuntu18 \ - --master-machine-type=n1-standard-16 \ - --num-workers=$NUM_WORKERS \ - --worker-accelerator=type=nvidia-tesla-a100,count=$NUM_GPUS \ - --worker-machine-type=a2-highgpu-1g \ - --num-worker-local-ssds=4 \ - --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \ - --metadata=startup-script-url=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/mig.sh \ - --optional-components=JUPYTER,ZEPPELIN \ - --metadata=rapids-runtime=SPARK \ - --bucket=$GCS_BUCKET \ - --enable-component-gateway \ - --subnet=default -``` - -To change the MIG instance profile you can specify either the profile id or profile name via the -metadata parameter `MIG_CGI`. Below is an example of using a profile name and a profile id. - -```bash - --metadata=^:^MIG_CGI='3g.20gb,9' -``` - -This may take around 10-15 minutes to complete. You can navigate to the Dataproc clusters tab in -the Google Cloud Console to see the progress. - -![Dataproc Cluster](../img/GCP/dataproc-cluster.png) - -To reduce initialization time to 4-5 minutes, create a custom Dataproc image using -[this](#build-a-custom-dataproc-image-to-reduce-cluster-init-time) guide. - -### Cluster Creation Troubleshooting -If you encounter an error related to GPUs not being available because of your account quotas, please -go to this page for updating your quotas: [Quotas and limits](https://cloud.google.com/compute/quotas). - -If you encounter an error related to GPUs not available in the specific region or zone, you will -need to update the REGION or ZONE parameter in the cluster creation command. 
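-
-One quick way to see which zones actually offer a given GPU type is to list the accelerator types from
-the CLI. A minimal sketch using the T4 as an example:
-```bash
-# List every zone that offers NVIDIA T4 GPUs; pick a REGION/ZONE from this list.
-gcloud compute accelerator-types list --filter="name=nvidia-tesla-t4"
-```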
- -## Run Python or Scala Spark Notebook on a Dataproc Cluster Accelerated by GPUs -To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab -and navigate to the `Web Interfaces` tab. Under `Web Interfaces`, click on the JupyterLab or -Jupyter link. Download the sample [Mortgage ETL on GPU Jupyter -Notebook](../demo/GCP/Mortgage-ETL.ipynb) and upload it to Jupyter. - -To get example data for the sample notebook, please refer to these -[instructions](https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples/dataset/mortgage.md). -Download the desired data, decompress it, and upload the csv files to a GCS bucket. - -![Dataproc Web Interfaces](../img/GCP/dataproc-service.png) - -The sample notebook will transcode the CSV files into Parquet files before running an ETL query -that prepares the dataset for training. The ETL query splits the data, saving 20% of the data in -a separate GCS location training for evaluation. Using the default notebook configuration the -first stage should take ~110 seconds (1/3 of CPU execution time with same config) and the second -stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook depends on the -[RAPIDS Accelerator for Apache Spark](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark), -which is pre-downloaded and pre-configured by the GCP Dataproc -[RAPIDS Accelerator init script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/spark-rapids). - -Once the data is prepared, we use the -[Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) in Dataproc's -Jupyter notebook to execute the training job on GPUs. Scala based XGBoost examples use -[DMLC XGBoost](https://github.com/dmlc/xgboost). For a PySpark based XGBoost example, please refer to -[RAPIDS Accelerator-examples](https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples/on-prem-cluster/yarn-python.md) -that makes sure the required libraries are installed. - -The training time should be around 680 seconds (1/7 of CPU execution time with same config). This -is shown under cell: - -```scala -// Start training -println("\n------ Training ------") -val (xgbClassificationModel, _) = benchmark("train") { - xgbClassifier.fit(trainSet) -} -``` - -## Submit Spark jobs to a Dataproc Cluster Accelerated by GPUs -Similar to `spark-submit` for on-prem clusters, Dataproc supports submitting Spark applications to -Dataproc clusters. The previous mortgage examples are also available as a [Spark -application](https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/XGBoost-Examples). - -Follow these -[instructions](https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples/building-sample-apps/scala.md) -to Build the -[xgboost-example](https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples) -jars. 
Upload the `sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar` to a GCS -bucket by dragging and dropping the jar file from your local machine into the GCS web console or by running: -``` -gsutil cp aggregator/target/sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar gs://${GCS_BUCKET}/scala/ -``` - -Submit the Spark XGBoost application to dataproc using the following command: -```bash -export REGION=[Your Preferred GCP Region] -export GCS_BUCKET=[Your GCS Bucket] -export CLUSTER_NAME=[Your Cluster Name] -export VERSION=[Your jar version] -export SPARK_NUM_EXECUTORS=20 -export SPARK_EXECUTOR_MEMORY=20G -export SPARK_EXECUTOR_MEMORYOVERHEAD=16G -export SPARK_NUM_CORES_PER_EXECUTOR=7 -export DATA_PATH=gs://${GCS_BUCKET}/mortgage_full - -gcloud dataproc jobs submit spark \ - --cluster=$CLUSTER_NAME \ - --region=$REGION \ - --class=com.nvidia.spark.examples.mortgage.Main \ - --jars=gs://${GCS_BUCKET}/scala/sample_xgboost_apps-${VERSION}-SNAPSHOT-jar-with-dependencies.jar \ - --properties=spark.executor.cores=${SPARK_NUM_CORES_PER_EXECUTOR},spark.task.cpus=${SPARK_NUM_CORES_PER_EXECUTOR},spark.executor.memory=${SPARK_EXECUTOR_MEMORY},spark.executor.memoryOverhead=${SPARK_EXECUTOR_MEMORYOVERHEAD},spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.rapids.sql.batchSizeBytes=512M,spark.rapids.sql.reader.batchSizeBytes=768M,spark.rapids.sql.variableFloatAgg.enabled=true,spark.rapids.memory.gpu.pooling.enabled=false,spark.dynamicAllocation.enabled=false \ - -- \ - -dataPath=train::${DATA_PATH}/train \ - -dataPath=trans::${DATA_PATH}/eval \ - -format=parquet \ - -numWorkers=${SPARK_NUM_EXECUTORS} \ - -treeMethod=gpu_hist \ - -numRound=100 \ - -maxDepth=8 -``` - -## Diagnosing a GPU Cluster - -The diagnostic tool can be run to check a GPU cluster with RAPIDS Accelerator for Apache Spark -is healthy and ready for Spark jobs, such as checking the version of installed NVIDIA driver, -cuda-toolkit, RAPIDS Accelerator and running Spark test jobs etc. This tool also can -be used by the front line support team for basic diagnostic and troubleshooting before escalating -to NVIDIA RAPIDS Accelerator for Apache Spark engineering team. - -Usage: `spark_rapids_dataproc diagnostic --cluster --region ` - -Help (to see all options available): `spark_rapids_dataproc diagnostic --help` - -Example output: - -```text -*** Running diagnostic function "nv_driver" *** -Warning: Permanently added 'compute.9009746126288801979' (ECDSA) to the list of known hosts. -Fri Oct 14 05:17:55 2022 -+-----------------------------------------------------------------------------+ -| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 | -|-------------------------------+----------------------+----------------------+ -| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | -| | | MIG M. 
| -|===============================+======================+======================| -| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 | -| N/A 48C P8 10W / 70W | 0MiB / 15109MiB | 0% Default | -| | | N/A | -+-------------------------------+----------------------+----------------------+ - -+-----------------------------------------------------------------------------+ -| Processes: | -| GPU GI CI PID Type Process name GPU Memory | -| ID ID Usage | -|=============================================================================| -| No running processes found | -+-----------------------------------------------------------------------------+ -NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.106.00 Tue Sep 28 12:05:58 UTC 2021 -GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) -Connection to 34.68.242.247 closed. -*** Check "nv_driver": PASS *** -*** Running diagnostic function "nv_driver" *** -Warning: Permanently added 'compute.6788823627063447738' (ECDSA) to the list of known hosts. -Fri Oct 14 05:18:02 2022 -+-----------------------------------------------------------------------------+ -| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 | -|-------------------------------+----------------------+----------------------+ -| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | -| | | MIG M. | -|===============================+======================+======================| -| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 | -| N/A 35C P8 9W / 70W | 0MiB / 15109MiB | 0% Default | -| | | N/A | -+-------------------------------+----------------------+----------------------+ - -+-----------------------------------------------------------------------------+ -| Processes: | -| GPU GI CI PID Type Process name GPU Memory | -| ID ID Usage | -|=============================================================================| -| No running processes found | -+-----------------------------------------------------------------------------+ -NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.106.00 Tue Sep 28 12:05:58 UTC 2021 -GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) -Connection to 34.123.223.104 closed. -*** Check "nv_driver": PASS *** -*** Running diagnostic function "cuda_version" *** -Connection to 34.68.242.247 closed. -found cuda major version: 11 -*** Check "cuda_version": PASS *** -*** Running diagnostic function "cuda_version" *** -Connection to 34.123.223.104 closed. -found cuda major version: 11 -*** Check "cuda_version": PASS *** -... -******************************************************************************** -Overall check result: PASS -``` - -Please note that the diagnostic tool supports the following: - -* Dataproc 2.0 with image of Debian 10 or Ubuntu 18.04 (Rocky8 support is coming soon) -* GPU clusters that must have 1 worker node at least. Single node cluster (1 master, 0 workers) is -not supported - -## Bootstrap GPU Cluster with Optimized Settings - -The bootstrap tool will apply optimized settings for the RAPIDS Accelerator on Apache Spark on a -GPU cluster for Dataproc. The tool will fetch the characteristics of the cluster -- including -number of workers, worker cores, worker memory, and GPU accelerator type and count. It will use -the cluster properties to then determine the optimal settings for running GPU-accelerated Spark -applications. 
- -Usage: `spark_rapids_dataproc bootstrap --cluster --region ` - -Help (to see all options available): `spark_rapids_dataproc bootstrap --help` - -Example output: -``` -##### BEGIN : RAPIDS bootstrap settings for gpu-cluster -spark.executor.cores=16 -spark.executor.memory=32768m -spark.executor.memoryOverhead=7372m -spark.rapids.sql.concurrentGpuTasks=2 -spark.rapids.memory.pinnedPool.size=4096m -spark.sql.files.maxPartitionBytes=512m -spark.task.resource.gpu.amount=0.0625 -##### END : RAPIDS bootstrap settings for gpu-cluster -``` - -A detailed description for bootstrap settings with usage information is available in the -[RAPIDS Accelerator for Apache Spark Configuration](https://nvidia.github.io/spark-rapids/docs/configs.html) -and [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) page. - -## Qualify CPU Workloads for GPU Acceleration - -The [qualification tool](https://pypi.org/project/spark-rapids-user-tools/) is launched on a -Dataproc cluster that has applications that have already run. -The tool will output the applications recommended for acceleration along with estimated speed-up -and cost saving metrics. Additionally, it will provide information on how to launch a -GPU-accelerated cluster to take advantage of the speed-up and cost savings. - -Usage: `spark_rapids_dataproc qualification --cluster --region ` - -Help (to see all options available): `spark_rapids_dataproc qualification --help` - -Example output: -``` -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ -| | App Name | App ID | Recommendation | Estimated GPU | Estimated GPU | App | Estimated GPU | -| | | | | Speedup | Duration(s) | Duration(s) | Savings(%) | -|----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------| -| 0 | query24 | application_1664888311321_0011 | Strongly Recommended | 3.49 | 257.18 | 897.68 | 59.70 | -| 1 | query78 | application_1664888311321_0009 | Strongly Recommended | 3.35 | 113.89 | 382.35 | 58.10 | -| 2 | query23 | application_1664888311321_0010 | Strongly Recommended | 3.08 | 325.77 | 1004.28 | 54.37 | -| 3 | query64 | application_1664888311321_0008 | Strongly Recommended | 2.91 | 150.81 | 440.30 | 51.82 | -| 4 | query50 | application_1664888311321_0003 | Recommended | 2.47 | 101.54 | 250.95 | 43.08 | -| 5 | query16 | application_1664888311321_0005 | Recommended | 2.36 | 106.33 | 251.95 | 40.63 | -| 6 | query38 | application_1664888311321_0004 | Recommended | 2.29 | 67.37 | 154.33 | 38.59 | -| 7 | query87 | application_1664888311321_0006 | Recommended | 2.25 | 75.67 | 170.69 | 37.64 | -| 8 | query51 | application_1664888311321_0002 | Recommended | 1.53 | 53.94 | 82.63 | 8.18 | -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ -To launch a GPU-accelerated cluster with Spark RAPIDS, add the following to your cluster creation script: - --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \ - --worker-accelerator type=nvidia-tesla-t4,count=2 \ - --metadata gpu-driver-provider="NVIDIA" \ - --metadata rapids-runtime=SPARK \ - --cuda-version=11.5 -``` - -Please refer [Qualification Tool](https://nvidia.github.io/spark-rapids/docs/spark-qualification-tool.html) -guide for running qualification tool on more environment. 
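-
-As a sketch of how flags like those printed above might be combined with ordinary cluster
-options into a full cluster creation command (the machine types, worker count and image version
-below are illustrative placeholders rather than recommendations from the tool; take the
-accelerator, initialization-action and metadata flags from the tool's own output):
-
-```bash
-# Illustrative cluster creation sketch; adjust sizes and image version for your workload
-gcloud dataproc clusters create $CLUSTER_NAME \
-    --region=$REGION \
-    --image-version=2.0-ubuntu18 \
-    --master-machine-type=n1-standard-16 \
-    --num-workers=2 \
-    --worker-machine-type=n1-standard-32 \
-    --worker-accelerator type=nvidia-tesla-t4,count=2 \
-    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
-    --metadata gpu-driver-provider="NVIDIA" \
-    --metadata rapids-runtime=SPARK
-```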
- -## Tune Applications on GPU Cluster - -Once Spark applications have been run on the GPU cluster, the -[profiling tool](https://nvidia.github.io/spark-rapids/docs/spark-profiling-tool.html) can be run -to analyze the event logs of the applications to determine if more optimal settings should be -configured. The tool will output a per-application set of config settings to be adjusted for -enhanced performance. - -Usage: `spark_rapids_dataproc profiling --cluster --region ` - -Help (to see all options available): `spark_rapids_dataproc profiling --help` - -Example output: -``` -+--------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------+ -| App ID | Recommendations | Comments | -+================================+==================================================+==================================================================================================+ -| application_1664894105643_0011 | --conf spark.executor.cores=16 | - 'spark.task.resource.gpu.amount' was not set. | -| | --conf spark.executor.memory=32768m | - 'spark.rapids.sql.concurrentGpuTasks' was not set. | -| | --conf spark.executor.memoryOverhead=7372m | - 'spark.rapids.memory.pinnedPool.size' was not set. | -| | --conf spark.rapids.memory.pinnedPool.size=4096m | - 'spark.executor.memoryOverhead' was not set. | -| | --conf spark.rapids.sql.concurrentGpuTasks=2 | - 'spark.sql.files.maxPartitionBytes' was not set. | -| | --conf spark.sql.files.maxPartitionBytes=1571m | - 'spark.sql.shuffle.partitions' was not set. | -| | --conf spark.sql.shuffle.partitions=200 | | -| | --conf spark.task.resource.gpu.amount=0.0625 | | -+--------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------+ -| application_1664894105643_0002 | --conf spark.executor.cores=16 | - 'spark.task.resource.gpu.amount' was not set. | -| | --conf spark.executor.memory=32768m | - 'spark.rapids.sql.concurrentGpuTasks' was not set. | -| | --conf spark.executor.memoryOverhead=7372m | - 'spark.rapids.memory.pinnedPool.size' was not set. | -| | --conf spark.rapids.memory.pinnedPool.size=4096m | - 'spark.executor.memoryOverhead' was not set. | -| | --conf spark.rapids.sql.concurrentGpuTasks=2 | - 'spark.sql.files.maxPartitionBytes' was not set. | -| | --conf spark.sql.files.maxPartitionBytes=3844m | - 'spark.sql.shuffle.partitions' was not set. | -| | --conf spark.sql.shuffle.partitions=200 | | -| | --conf spark.task.resource.gpu.amount=0.0625 | | -+--------------------------------+--------------------------------------------------+--------------------------------------------------------------------------------------------------+ -``` diff --git a/docs/get-started/getting-started-kubernetes.md b/docs/get-started/getting-started-kubernetes.md deleted file mode 100644 index 5af9b957845..00000000000 --- a/docs/get-started/getting-started-kubernetes.md +++ /dev/null @@ -1,386 +0,0 @@ ---- -layout: page -title: Kubernetes -nav_order: 7 -parent: Getting-Started ---- - -# Getting Started with RAPIDS and Kubernetes - -This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. -At the end of this guide, the reader will be able to run a sample Apache Spark application that runs -on NVIDIA GPUs in a Kubernetes cluster. 
- -This is a quick start guide which uses default settings which may be different from your cluster. - -Kubernetes requires a Docker image to run Spark. Generally everything needed is in the Docker -image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script. See this -[Dockerfile.cuda](Dockerfile.cuda) example. - -You can find other supported base CUDA images for from -[CUDA dockerhub](https://hub.docker.com/r/nvidia/cuda). Its source Dockerfile is inside -[GitLab repository](https://gitlab.com/nvidia/container-images/cuda/) which can be used to build -the docker images from OS base image from scratch. - -## Prerequisites - * Kubernetes cluster is up and running with NVIDIA GPU support - * Docker is installed on a client machine - * A Docker repository which is accessible by the Kubernetes cluster - -These instructions do not cover how to setup a Kubernetes cluster. - -Please refer to [Install Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html) on -how to install a Kubernetes cluster with NVIDIA GPU support. - -## Docker Image Preparation - -On a client machine which has access to the Kubernetes cluster: - -1. [Download Apache Spark](https://spark.apache.org/downloads.html). - Supported versions of Spark are listed on the [RAPIDS Accelerator download page](../download.md). Please note that only - Scala version 2.12 is currently supported by the accelerator. - - Note that you can download these into a local directory and untar the Spark `.tar.gz` as a directory named `spark`. - -2. Download the [RAPIDS Accelerator for Spark jars](getting-started-on-prem.md#download-the-rapids-accelerator-jar) and the - [GPU discovery script](getting-started-on-prem.md#install-the-gpu-discovery-script). - - Put `rapids-4-spark_.jar` and `getGpusResources.sh` in the same directory as `spark`. - - Note: If here you decide to put above jar in the `spark/jars` directory which will be copied into - `/opt/spark/jars` directory in Docker image, then in the future you do not need to - specify `spark.driver.extraClassPath` or `spark.executor.extraClassPath` using `cluster` mode. - This example just shows you a way to put customized jars or 3rd party jars. - -3. Download the sample [Dockerfile.cuda](Dockerfile.cuda) in the same directory as `spark`. - - The sample Dockerfile.cuda will copy the `spark` directory's several sub-directories into `/opt/spark/` - along with the RAPIDS Accelerator jars and `getGpusResources.sh` into `/opt/sparkRapidsPlugin` - inside the Docker image. - - You can modify the Dockerfile to copy your application into the docker image, i.e. `test.py`. - - Examine the Dockerfile.cuda file to ensure the file names are correct and modify if needed. - - Currently the directory in the local machine should look as below: - ```shell - $ ls - Dockerfile.cuda getGpusResources.sh rapids-4-spark_.jar spark - ``` - -4. Build the Docker image with a proper repository name and tag and push it to the repository - ```shell - export IMAGE_NAME=xxx/yyy:tag - docker build . -f Dockerfile.cuda -t $IMAGE_NAME - docker push $IMAGE_NAME - ``` - -## Running Spark Applications in the Kubernetes Cluster - -### Submitting a Simple Test Job - -This simple job will test if the RAPIDS plugin can be found. 
-`ClassNotFoundException` is a common error if the Spark driver can not -find the RAPIDS Accelerator jar, resulting in an exception like this: -``` -Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin -``` - -Here is an example job: - -```shell -export SPARK_HOME=~/spark -export IMAGE_NAME=xxx/yyy:tag -export K8SMASTER=k8s://https://: -export SPARK_NAMESPACE=default -export SPARK_DRIVER_NAME=exampledriver - -$SPARK_HOME/bin/spark-submit \ - --master $K8SMASTER \ - --deploy-mode cluster \ - --name examplejob \ - --class org.apache.spark.examples.SparkPi \ - --conf spark.executor.instances=1 \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=1 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=1 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.executor.memoryOverhead=3G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.sql.shuffle.partitions=10 \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \ - --conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \ - --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \ - --conf spark.executor.resource.gpu.vendor=nvidia.com \ - --conf spark.kubernetes.container.image=$IMAGE_NAME \ - --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_.jar \ - --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_.jar \ - --driver-memory 2G \ - local:///opt/spark/examples/jars/spark-examples_2.12-3.0.2.jar -``` - - Note: `local://` means the jar file location is inside the Docker image. - Since this is `cluster` mode, the Spark driver is running inside a pod in Kubernetes. 
- The driver and executor pods can be seen when the job is running: -```shell -$ kubectl get pods -NAME READY STATUS RESTARTS AGE -spark-pi-d11075782f399fd7-exec-1 1/1 Running 0 9s -exampledriver 1/1 Running 0 15s -``` - - To view the Spark driver log, use below command: -```shell -kubectl logs $SPARK_DRIVER_NAME -``` - - To view the Spark driver UI when the job is running first expose the driver UI port: -```shell -kubectl port-forward $SPARK_DRIVER_NAME 4040:4040 -``` - Then open a web browser to the Spark driver UI page on the exposed port: -```shell -http://localhost:4040 -``` - - To kill the Spark job: -```shell -$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME -``` - - To delete the driver POD: -```shell -kubectl delete pod $SPARK_DRIVER_NAME -``` - -### Running an Interactive Spark Shell - -If you need an interactive Spark shell with executor pods running inside the Kubernetes cluster: -```shell -$SPARK_HOME/bin/spark-shell \ - --master $K8SMASTER \ - --name mysparkshell \ - --deploy-mode client \ - --conf spark.executor.instances=1 \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=1 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=1 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.executor.memoryOverhead=3G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.sql.shuffle.partitions=10 \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \ - --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \ - --conf spark.executor.resource.gpu.vendor=nvidia.com \ - --conf spark.kubernetes.container.image=$IMAGE_NAME \ - --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_.jar \ - --driver-class-path=./rapids-4-spark_.jar \ - --driver-memory 2G -``` - -Only the `client` deploy mode should be used. If you specify the `cluster` deploy mode, you would see the following error: -```shell -Cluster deploy mode is not applicable to Spark shells. -``` -Also notice that `--conf spark.driver.extraClassPath` was removed but `--driver-class-path` was added. -This is because now the driver is running on the client machine, so the jar paths should be local filesystem paths. - -When running the shell you can see only the executor pods are running inside Kubernetes: -``` -$ kubectl get pods -NAME READY STATUS RESTARTS AGE -mysparkshell-bfe52e782f44841c-exec-1 1/1 Running 0 11s -``` - -The following Scala code can be run in the Spark shell to test if the RAPIDS Accelerator is enabled. -```shell -val df = spark.sparkContext.parallelize(Seq(1)).toDF() -df.createOrReplaceTempView("df") -spark.sql("SELECT value FROM df WHERE value <>1").show -spark.sql("SELECT value FROM df WHERE value <>1").explain -:quit -``` -The expected `explain` plan should contain the GPU related operators: -```shell -scala> spark.sql("SELECT value FROM df WHERE value <>1").explain -== Physical Plan == -GpuColumnarToRow false -+- GpuFilter NOT (value#2 = 1) - +- GpuRowToColumnar TargetSize(2147483647) - +- *(1) SerializeFromObject [input[0, int, false] AS value#2] - +- Scan[obj#1] -``` - -### Running PySpark in Client Mode - -Of course, you can `COPY` the Python code in the Docker image when building it -and submit it using the `cluster` deploy mode as showin in the previous example pi job. 
- -However if you do not want to re-build the Docker image each time and just want to submit the Python code -from the client machine, you can use the `client` deploy mode. - -```shell -$SPARK_HOME/bin/spark-submit \ - --master $K8SMASTER \ - --deploy-mode client \ - --name mypythonjob \ - --conf spark.executor.instances=1 \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=1 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=1 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.executor.memoryOverhead=3G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.sql.shuffle.partitions=10 \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \ - --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \ - --conf spark.executor.resource.gpu.vendor=nvidia.com \ - --conf spark.kubernetes.container.image=$IMAGE_NAME \ - --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_.jar \ - --driver-memory 2G \ - --driver-class-path=./rapids-4-spark_.jar \ - test.py -``` - -A sample `test.py` is as below: -```shell -from pyspark.sql import SQLContext -from pyspark import SparkConf -from pyspark import SparkContext -conf = SparkConf() -sc = SparkContext.getOrCreate() -sqlContext = SQLContext(sc) -df=sqlContext.createDataFrame([1,2,3], "int").toDF("value") -df.createOrReplaceTempView("df") -sqlContext.sql("SELECT * FROM df WHERE value<>1").explain() -sqlContext.sql("SELECT * FROM df WHERE value<>1").show() -sc.stop() -``` - -## Running Spark Applications using Spark Operator - -Using Spark Operator is another way to submit Spark Applications into a Kubernetes Cluster. - -1. Locate the Spark Application jars/files in the docker image when preparing docker image. - - For example, assume `/opt/sparkRapidsPlugin/test.py` is inside the docker image. - - This is because currently only `cluster` deployment mode is supported by Spark Operator. - -2. Create spark operator using `helm`. - - Follow [Spark Operator quick start guide](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md) - -3. 
Create a Spark Application YAML file - - For example, create a file named `testpython-rapids.yaml` with the following contents: - - ``` - apiVersion: "sparkoperator.k8s.io/v1beta2" - kind: SparkApplication - metadata: - name: testpython-rapids - namespace: default - spec: - sparkConf: - "spark.ui.port": "4045" - "spark.rapids.sql.concurrentGpuTasks": "1" - "spark.executor.resource.gpu.amount": "1" - "spark.task.resource.gpu.amount": "1" - "spark.executor.memory": "1g" - "spark.rapids.memory.pinnedPool.size": "2g" - "spark.executor.memoryOverhead": "3g" - "spark.sql.files.maxPartitionBytes": "512m" - "spark.sql.shuffle.partitions": "10" - "spark.plugins": "com.nvidia.spark.SQLPlugin" - "spark.executor.resource.gpu.discoveryScript": "/opt/sparkRapidsPlugin/getGpusResources.sh" - "spark.executor.resource.gpu.vendor": "nvidia.com" - "spark.executor.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar" - "spark.driver.extraClassPath": "/opt/sparkRapidsPlugin/rapids-4-spark.jar" - type: Python - pythonVersion: 3 - mode: cluster - image: "" - imagePullPolicy: Always - mainApplicationFile: "local:///opt/sparkRapidsPlugin/test.py" - sparkVersion: "3.1.1" - restartPolicy: - type: Never - volumes: - - name: "test-volume" - hostPath: - path: "/tmp" - type: Directory - driver: - cores: 1 - coreLimit: "1200m" - memory: "1024m" - labels: - version: 3.1.1 - serviceAccount: spark - volumeMounts: - - name: "test-volume" - mountPath: "/tmp" - executor: - cores: 1 - instances: 1 - memory: "5000m" - gpu: - name: "nvidia.com/gpu" - quantity: 1 - labels: - version: 3.1.1 - volumeMounts: - - name: "test-volume" - mountPath: "/tmp" - ``` - -4. Submit the Spark Application - - ``` - sparkctl create testpython-rapids.yaml - ``` - - Note: `sparkctl` can be built from the Spark Operator repo after [installing golang](https://golang.org/doc/install): - - ``` - cd sparkctl - go build -o sparkctl - ``` - -5. Check the driver log - - ``` - sparkctl log testpython-rapids - ``` - -6. Check the status of this Spark Application - - ``` - sparkctl status testpython-rapids - ``` - -7. Port forwarding when Spark driver is running - - ``` - sparkctl forward testpython-rapids --local-port 1234 --remote-port 4045 - ``` - - Then open browser with `http://localhost:1234/` to check Spark UI. - -8. Delete the Spark Application - - ``` - sparkctl delete testpython-rapids - ``` - -Please refer to [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) for more information. diff --git a/docs/get-started/getting-started-on-prem.md b/docs/get-started/getting-started-on-prem.md deleted file mode 100644 index 51e3205b02d..00000000000 --- a/docs/get-started/getting-started-on-prem.md +++ /dev/null @@ -1,437 +0,0 @@ ---- -layout: page -title: On-Prem -nav_order: 1 -parent: Getting-Started ---- - -# Getting Started with RAPIDS Accelerator with on premise cluster or local mode -## Spark Deployment Methods -The way you decide to deploy Spark affects the steps you must take to install and setup Spark and -the RAPIDS Accelerator for Apache Spark. Please see [Software Requirements](../download.md#software-requirements) -section for complete list of Spark versions supported by RAPIDS plugin. 
The primary methods to -deploy Spark are: -* [Local mode](#local-mode) - this is for dev/testing only, not for production -* [Standalone Mode](#spark-standalone-cluster) -* [On a YARN cluster](#running-on-yarn) -* [On a Kubernetes cluster](#running-on-kubernetes) - -## Apache Spark Setup for GPU -Each GPU node where you are running Spark needs to have the following installed. If you are running -with Docker on Kubernetes then skip these as you will do this as part of the docker build. -- Install Java 8 - - Ubuntu: `sudo apt install openjdk-8-jdk-headless` - - While JDK11 is supported by Spark, RAPIDS Spark is built and tested with JDK8, so JDK8 is - recommended. -- Install the GPU driver and CUDA toolkit - - [Download](https://developer.nvidia.com/cuda-11-6-1-download-archive) and install - GPU drivers and the CUDA Toolkit. A reboot will be required after installation. -```bash -wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin -sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 -wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb -sudo dpkg -i cuda-keyring_1.0-1_all.deb -sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" -sudo apt-get update -sudo apt-get -y install cuda -``` - -You can check if the GPU driver and CUDA toolkit is -installed successfully by running the [nvidia-smi](https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf) command. - -Below are sections on installing Spark and the RAPIDS Accelerator on a single node. You may want -to read the deployment method sections before doing any installations. - -## Install Spark -To install Apache Spark please follow the official -[instructions](https://spark.apache.org/docs/latest/#launching-on-a-cluster). Supported versions of -Spark are listed on the [download](../download.md) page. Please note that only -scala version 2.12 is currently supported by the accelerator. - -## Download the RAPIDS Accelerator jar -The [accelerator](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark_2.12) jar is -available in the [download](../download.md) section. - -Download the RAPIDS Accelerator for Apache Spark plugin jar. Each jar is for a specific version of -CUDA and will not run on other versions. The jars use a classifier to keep them separate. - -- CUDA 11.x => classifier cuda11 - -For example, here is a sample version of the jar with CUDA 11.x support: -- rapids-4-spark_2.12-23.10.0-SNAPSHOT-cuda11.jar - -For simplicity export the location to this jar. This example assumes the sample jar above has -been placed in the `/opt/sparkRapidsPlugin` directory: -```shell -export SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin -export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-23.10.0-SNAPSHOT-cuda11.jar -``` - -## Install the GPU Discovery Script -If you are using Apache Spark's GPU scheduling feature please be sure to follow what your cluster -administrator recommends. Often this will involve downloading a GPU discovery script and this -example will assume as such. -Download the -[getGpusResource.sh](https://github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh) -script and install it on all the nodes. Put it into a local folder. You may put it in the same -directory as the plugin jar (`/opt/sparkRapidsPlugin` in the example). - -## Local Mode -This is for testing/dev setup only. 
It is not to be used in production. In this mode Spark runs -everything in a single process on a single node. -- [Install Spark](#install-spark) -- [Install the RAPIDS Accelerator_jar](#download-the-rapids-accelerator-jar) -- Launch your Spark shell session. - -Default configs usually work fine in local mode. The required changes are setting the config -`spark.plugins` to `com.nvidia.spark.SQLPlugin` and including the jar as a dependency. All of the -other config settings and command line parameters are to try and better configure spark for GPU -execution. - -```shell -$SPARK_HOME/bin/spark-shell \ - --master local \ - --num-executors 1 \ - --conf spark.executor.cores=1 \ - --conf spark.rapids.sql.concurrentGpuTasks=1 \ - --driver-memory 10g \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --jars ${SPARK_RAPIDS_PLUGIN_JAR} -``` -You can run one of the examples below such as the [Example Join Operation](#example-join-operation) - -## Spark Standalone Cluster -For reference, the Spark documentation is -[here](http://spark.apache.org/docs/latest/spark-standalone.html). - -Spark Standalone mode requires starting the Spark master and worker(s). You can run it on a single -machine or multiple machines for distributed setup. - -The first step is to [Install Spark](#install-spark), the -[RAPIDS Accelerator for Spark jar](#download-the-rapids-accelerator-jar), and the -[GPU discovery script](#install-the-gpu-discovery-script) on all the nodes you want to use. -See the note at the end of this section if using Spark 3.1.1 or above. -After that choose one of the nodes to be your master node and start the master. Note that the -master process does **not** need a GPU to function properly. - -On the master node: - - Make sure `SPARK_HOME` is exported - - run `$SPARK_HOME/sbin/start-master.sh` - - This script will print a message saying starting Master and have a path to a log file. - Examine the log file to make sure there are no errors starting the Spark Master process. - - `export MASTER_HOST=`[_the hostname of the master_] - - Go to the Spark Master UI to verify it has started. The UI should be accessible at - `http://${MASTER_HOST}:8080` - - Find the Spark URL for the Spark Master. This can be found in the Spark Master logs or from - the Spark Master UI. It will likely be: `spark://${MASTER_HOST}:7077`. You will need this URL - for starting the workers and submitting applications. - -Now for each worker node: -- Setup worker configs on each node - - `cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh` - - Edit `$SPARK_HOME/conf/spark-env.sh` and add any worker options. The example below sets the - number of GPUs per worker to 4 and points to the discovery script. Change this for your setup. - - `SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=4 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"` -- Start the worker(s) - - For multiple workers: - - You can add each hostname to the file `$SPARK_HOME/conf/workers` and use the scripts provided - by Spark to start all of them. This requires password-less ssh to be setup. If you do not - have a password-less setup, you can set the environment variable `SPARK_SSH_FOREGROUND` and - serially provide a password for each worker. 
- - Run `$SPARK_HOME/sbin/start-workers.sh` -- For a single worker: - - `$SPARK_HOME/sbin/start-worker.sh spark://${MASTER_HOST}:7077` - -Now you can go to the master UI at `http://${MASTER_HOST}:8080` and verify all the workers have -started. - -Submitting a Spark application to a standalone mode cluster requires a few configs to be set. These -configs can be placed in the Spark default confs if you want all jobs to use the GPU. The plugin -requires its jar to be in the executor classpath. GPU scheduling also requires the Spark job to -ask for GPUs. The plugin cannot utilize more than one GPU per executor. - -In this case we are asking for 1 GPU per executor (the plugin cannot utilize more than one), and 4 -CPU tasks per executor (but only one task will be on the GPU at a time). This allows for -overlapping I/O and computation. - -```shell -$SPARK_HOME/bin/spark-shell \ - --master spark://${MASTER_HOST}:7077 \ - --conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \ - --conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \ - --conf spark.rapids.sql.concurrentGpuTasks=1 \ - --driver-memory 2G \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=4 \ - --conf spark.task.cpus=1 \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.task.resource.gpu.amount=0.25 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin -``` - -Please note that the RAPIDS Accelerator for Apache Spark plugin jar does not need to be installed -on all the nodes and the configs `spark.executor.extraClassPath` and `spark.driver.extraClassPath` -can be replaced in the above command with `--jars ${SPARK_RAPIDS_PLUGIN_JAR}`. -This will automatically distribute the jar to the nodes for you. - -## Running on YARN - -YARN requires you to [Install Spark](#install-spark), the -[RAPIDS Accelerator for Spark jar](#download-the-rapids-accelerator-jar), and the -[GPU discovery script](#install-the-gpu-discovery-script) on a launcher node. YARN handles -shipping them to the cluster nodes as needed. If you want to use the GPU scheduling feature in -Spark it requires YARN version >= 2.10 or >= 3.1.1 and ideally you would use >= 3.1.3 in order to -get support for nvidia-docker version 2. - -It is recommended to run your YARN cluster with Docker, cgroup isolation and GPU -scheduling enabled. This way your Spark containers run isolated and can only see the GPUs that -were requested. If you do not run in an isolated environment then you need to ensure you run on -hosts that have GPUs and there is a mechanism that allows you to allocate GPUs when the GPUs are -in process-exclusive mode. See the `nvidia-smi` documentation for more details on setting up -process-exclusive mode. If you have a pre-existing method for allocating GPUs and dealing with -multiple applications you could write your own custom discovery class to deal with that. - -This assumes you have [YARN](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html) -already installed and set up. Setting up a YARN cluster is not covered -in these instructions. Spark must have been built specifically for the Hadoop/YARN version you -use - either 3.x or 2.x. - -YARN GPU scheduling does not support [MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#introduction) -enabled GPUs by default, see section [MIG GPU on YARN](#mig-gpu-on-yarn) on how to add support. 
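-
-Before choosing one of the setups below, it can help to confirm that the launcher node has the
-required pieces in place and that YARN is advertising a GPU resource on its nodes. The commands
-below are only a rough sanity-check sketch (the report format varies with the YARN version, and
-the paths assume the layout used earlier in this guide):
-
-```shell
-# Plugin jar and discovery script expected on the launcher node
-ls ${SPARK_RAPIDS_DIR}/getGpusResources.sh ${SPARK_RAPIDS_PLUGIN_JAR}
-
-# GPU scheduling needs YARN >= 2.10 or >= 3.1.1
-hadoop version
-
-# With GPU scheduling configured, per-node resource reports should include yarn.io/gpu
-yarn node -list -showDetails | grep -i gpu
-```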
- -### YARN 3.1.3 with Isolation and GPU Scheduling Enabled -- Configure YARN to support - [GPU scheduling and isolation](https://hadoop.apache.org/docs/r3.1.3/hadoop-yarn/hadoop-yarn-site/UsingGpus.html). -- Install [Spark](#install-spark), the - [RAPIDS Accelerator for Spark jar](#download-the-rapids-accelerator-jar), and the - [GPU discovery script](#install-the-gpu-discovery-script) on the node from which you are - launching your Spark application. -- Use the following configuration settings when running Spark on YARN, changing the values as - necessary: -```shell -$SPARK_HOME/bin/spark-shell \ - --master yarn \ - --conf spark.rapids.sql.concurrentGpuTasks=1 \ - --driver-memory 2G \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=4 \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=0.25 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \ - --files ${SPARK_RAPIDS_DIR}/getGpusResources.sh \ - --jars ${SPARK_RAPIDS_PLUGIN_JAR} -``` - -### YARN 2.10 with Isolation and GPU Scheduling Enabled -- Configure YARN to support - [GPU scheduling and isolation](https://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/ResourceProfiles.html) -- Install [Spark](#install-spark), the - [RAPIDS Accelerator for Spark jar](#download-the-rapids-accelerator-jar), and the - [GPU discovery script](#install-the-gpu-discovery-script) on the node from which you are - launching your Spark application. -- Use the following configs when running Spark on YARN, changing the values as necessary: -```shell -$SPARK_HOME/bin/spark-shell \ - --master yarn \ - --conf spark.rapids.sql.concurrentGpuTasks=1 \ - --driver-memory 2G \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=4 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=0.25 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \ - --files ${SPARK_RAPIDS_DIR}/getGpusResources.sh \ - --jars ${SPARK_RAPIDS_PLUGIN_JAR} -``` - -### YARN without Isolation -If you run YARN without isolation then you can run the RAPIDS Accelerator for Spark as long as you -run your Spark application on nodes with GPUs and the GPUs are configured in `EXCLUSIVE_PROCESS` -mode. Without this, there would need to be a mechanism to ensure that only one executor is -accessing a GPU at once. Note it does not matter if GPU scheduling support is enabled. -- On all your YARN nodes, ensure the GPUs are in `EXCLUSIVE_PROCESS` mode: - - Run `nvidia-smi` to see how many GPUs and get the indexes of the GPUs - - Foreach GPU index set it to `EXCLUSIVE_PROCESS` mode: - - `nvidia-smi -c EXCLUSIVE_PROCESS -i $index` -- Install [Spark](#install-spark), the - [RAPIDS Accelerator for Spark jar](#download-the-rapids-accelerator-jar), and the - [GPU discovery script](#install-the-gpu-discovery-script) on the node from which you are - launching your Spark application. -- Use the following configs when running Spark on YARN. Note that we are configuring a resource -discovery plugin. 
Spark will first try to discover the GPUs using this plugin and then fall back -to the discovery script if this doesn’t work. This plugin knows how to atomically acquire a GPU in -process exclusive mode and expose it to the tasks. - -```shell -$SPARK_HOME/bin/spark-shell \ - --master yarn \ - --conf spark.rapids.sql.concurrentGpuTasks=1 \ - --driver-memory 2G \ - --conf spark.executor.memory=4G \ - --conf spark.executor.cores=4 \ - --conf spark.task.cpus=1 \ - --conf spark.task.resource.gpu.amount=0.25 \ - --conf spark.rapids.memory.pinnedPool.size=2G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.resources.discoveryPlugin=com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin \ - --conf spark.executor.resource.gpu.amount=1 \ - --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \ - --files ${SPARK_RAPIDS_DIR}/getGpusResources.sh \ - --jars ${SPARK_RAPIDS_PLUGIN_JAR} -``` - -### MIG GPU on YARN -Using [MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#introduction) -enabled GPUs on YARN requires enabling YARN GPU scheduling and using -NVIDIA Docker runtime v2. The way to set this up depends on the version of YARN and the version -of Spark. It is important to note that CUDA 11 only supports enumeration of a single MIG instance. -This means that using any MIG device on YARN means only 1 GPU per container is allowed. See the -limitations section in the documentation referred to below for the specific YARN version you -are using. - -#### YARN version 3.3.0+ -YARN version 3.3.0 and newer support a pluggable device framework which allows adding support for -MIG devices via a plugin. See -[NVIDIA GPU Plugin for YARN with MIG support for YARN 3.3.0+](https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/MIG-Support/device-plugins/gpu-mig). -If you are using that plugin with a Spark version older than 3.2.1 and/or specifying the resource -as `nvidia/miggpu` you will also need to specify the config: - -```shell ---conf spark.rapids.gpu.resourceName=nvidia/miggpu -``` - -This tells the RAPIDS Accelerator for Apache Spark plugin to look for the Spark GPU resource -assigned to it using the name `nvidia/miggpu`. If you are using the Spark config -`spark.yarn.resourceGpuDeviceName` and using the normal `gpu` Spark resource name, this is not -required. - -#### YARN version 3.1.2 until 3.3.0 -If you are using YARN version from 3.1.2 up until 3.3.0, it requires making modifications to YARN -and deploying a version that adds support for MIG to the built-in YARN GPU resource plugin. - -See [NVIDIA Support for GPU for YARN with MIG support for YARN 3.1.2 until YARN 3.3.0](https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/MIG-Support/resource-types/gpu-mig) -for details. - -## Running on Kubernetes -Please refer to [Getting Started with RAPIDS and Kubernetes](./getting-started-kubernetes.md). - -## RAPIDS Accelerator Configuration and Tuning -Most of what you need you can get from [tuning guide](../tuning-guide.md). - -The following configs will help you to get started but must be configured based on your cluster -and application. - -1. If you are using the KryoSerializer with Spark, e.g.: - `--conf spark.serializer=org.apache.spark.serializer.KryoSerializer`, you will have to register - the GpuKryoRegistrator class, e.g.: - `--conf spark.kryo.registrator=com.nvidia.spark.rapids.GpuKryoRegistrator`. -1. 
Configure the amount of executor memory like you would for a normal Spark application. If most - of the job will run on the GPU then often you can run with less executor heap memory than would - be needed for the corresponding Spark job on the CPU. - -In case of a "com.esotericsoftware.kryo.KryoException: Buffer overflow" error it is advisable to -increase the -[`spark.kryoserializer.buffer.max`](https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization) -setting to a value higher than the default. - -### Example Command Running on YARN -```shell -$SPARK_HOME/bin/spark-shell --master yarn \ - --num-executors 1 \ - --conf spark.plugins=com.nvidia.spark.SQLPlugin \ - --conf spark.executor.cores=6 \ - --conf spark.rapids.sql.concurrentGpuTasks=2 \ - --executor-memory 20g \ - --conf spark.executor.memoryOverhead=10g \ - --conf spark.rapids.memory.pinnedPool.size=8G \ - --conf spark.sql.files.maxPartitionBytes=512m \ - --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \ - --conf spark.task.resource.gpu.amount=0.166 \ - --conf spark.executor.resource.gpu.amount=1 \ - --files $SPARK_RAPIDS_DIR/getGpusResources.sh - --jars ${SPARK_RAPIDS_PLUGIN_JAR} -``` - -## Example Join Operation -Once you have started your Spark shell you can run the following commands to do a basic join and -look at the UI to see that it runs on the GPU. -```scala -val df = sc.makeRDD(1 to 10000000, 6).toDF -val df2 = sc.makeRDD(1 to 10000000, 6).toDF -df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count -``` -Go to the Spark UI and click on the application you ran and on the “SQL” tab. If you click the -operation “count at ...”, you should see the graph of Spark Execs and some of those should have -the label Gpu... For instance, in the screenshot below you will see `GpuRowToColumn`, `GpuProject`, -and `GpuColumnarExchange`. Those correspond to operations that run on the GPU. - -![Join Example on Spark SQL UI](../img/join-sql-ui-example.png) - -## Enabling RAPIDS Shuffle Manager - -The RAPIDS Shuffle Manager is an implementation of the `ShuffleManager` interface in Apache Spark -that allows custom mechanisms to exchange shuffle data, enabling Remote Direct Memory Access (RDMA) -and peer-to-peer communication between GPUs (NVLink/PCIe), by leveraging [Unified Communication X -(UCX)](https://www.openucx.org/). - -You can find out how to enable the accelerated shuffle in the -[RAPIDS Shuffle Manager documentation](../additional-functionality/rapids-shuffle.md). - - -## Advanced Configuration - -See the [RAPIDS Accelerator for Apache Spark Configuration Guide](../configs.md) for details on all -of the configuration settings specific to the RAPIDS Accelerator for Apache Spark. - -## Monitoring -Since the plugin runs without any API changes, the easiest way to see what is running on the GPU -is to look at the "SQL" tab in the Spark web UI. The SQL tab only shows up after you have actually -executed a query. Go to the SQL tab in the UI, click on the query you are interested in and it -shows a DAG picture with details. You can also scroll down and twisty the "Details" section to see -the text representation. - -If you want to look at the Spark plan via the code you can use the `explain()` function call. -For example: if `query` is the resulting DataFrame from the query then `query.explain()` will -print the physical plan from Spark. From the query's physical plan you can see what nodes were -replaced with GPU calls. 
- -The following is an example of a physical plan with operators running on the GPU: - -``` -== Physical Plan == -GpuColumnarToRow false -+- GpuProject [cast(c_customer_sk#0 as string) AS c_customer_sk#40] - +- GpuFileGpuScan parquet [c_customer_sk#0] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct -``` - - -## Debugging -For now, the best way to debug is how you would normally do it on Spark. Look at the UI and log -files to see what failed. If you got a segmentation fault from the GPU find the hs_err_pid.log -file. To make sure your hs_err_pid.log file goes into the YARN application log dir you can add in -the config: `--conf spark.executor.extraJavaOptions="-XX:ErrorFile=/hs_err_pid_%p.log"` - -If you want to see why an operation did not run on the GPU you can turn on the configuration: -[`--conf spark.rapids.sql.explain=NOT_ON_GPU`](../configs.md#sql.explain). A log message will then -be emitted to the driver log as to why a Spark operation is not able to run on the GPU. - -## Out of GPU Memory -GPU out of memory errors can show up in multiple ways. You can see an error that it out of memory -or it can also manifest as crashes. Generally this means your partition size is too big. In that -case go back to the [Configuration section](#rapids-accelerator-configuration-and-tuning) for the -partition size and/or the number of partitions. Possibly reduce the number of concurrent GPU tasks -to 1. The Spark UI may give you a hint at the size of the data. Look at either the input data or -the shuffle data size for the stage that failed. diff --git a/docs/get-started/getting-started-workload-qualification.md b/docs/get-started/getting-started-workload-qualification.md deleted file mode 100644 index be628f58b40..00000000000 --- a/docs/get-started/getting-started-workload-qualification.md +++ /dev/null @@ -1,264 +0,0 @@ ---- -layout: page -title: Workload Qualification -nav_order: 9 -parent: Getting-Started ---- -# Getting Started on Spark Workload Qualification - -The RAPIDS Accelerator for Apache Spark runs as many operations as possible on the GPU. If there -are operators which do not yet run on GPU, they will seamlessly fallback to the CPU. There may be -some performance overhead because of host memory to GPU memory transfer. When converting an -existing Spark workload from CPU to GPU, it is recommended to do an analysis to understand if there -are any features (functions, expressions, data types, data formats) that do not yet run on the GPU. -Understanding this will help prioritize workloads that are best suited to the GPU. - -Significant performance benefits can be gained even if all operations are not yet fully supported by -the GPU. It all depends on how critical the portion that is executing on the CPU is to the overall -performance of the query. - -This article describes the tools we provide and how to do gap analysis and workload qualification. - -## 1. Qualification and Profiling tool - -### Requirements - -- Spark event logs from Spark 2.x or 3.x -- Spark 3.0.1+ jars -- `rapids-4-spark-tools` jar - -### How to use - -If you have Spark event logs from prior runs of the applications on Spark 2.x or 3.x, you can use -the [Qualification tool](../spark-qualification-tool.md) and -[Profiling tool](../spark-profiling-tool.md) to analyze them. The Qualification tool outputs the score, rank -and some of the potentially not-supported features for each Spark application. 
For example, the CSV -output can print `Unsupported Read File Formats and Types`, `Unsupported Write Data Format` and -`Potential Problems` which are the indication of some not-supported features. Its output can help -you focus on the Spark applications which are best suited for the GPU. - -The profiling tool outputs SQL plan metrics and also prints out actual query plans to provide more -insights. In the following example the profiling tool output for a specific Spark application shows -that it has a query with a large (processing millions of rows) `HashAggregate` and `SortMergeJoin`. -Those are indicators for a good candidate application for the RAPIDS Accelerator. - -``` -+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+ -|appIndex|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType| -+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+ -|1 |88 |8 |SortMergeJoin |11111 |number of output rows |500000000 |sum | -|1 |88 |9 |HashAggregate |22222 |number of output rows |600000000 |sum | -``` - -Since the two tools are only analyzing Spark event logs they do not have the detail that can be -captured from a running Spark job. However it is very convenient because you can run the tools on -existing logs and do not need a GPU cluster to run the tools. - -## 2. Get the Explain Output - -This allows running queries on the CPU and the RAPIDS Accelerator will evaluate the queries as if it was -going to run on the GPU and tell you what would and wouldn't have been run on the GPU. -There are two ways to run this, one is running with the RAPIDS Accelerator set to explain only mode and -the other is to modify your existing Spark application code to call a function directly. - -Please note that if using adaptive execution in Spark the explain output may not be perfect -as the plan could have changed along the way in a way that we wouldn't see by looking at just -the CPU plan. The same applies if you are using an older version of Spark. Spark planning -may be slightly different when you go up to a newer version of Spark. One example where we have -seen Spark 2.4.X plan differently is in the use of the EqualNullSafe expression. We have seen Spark 2.4.X -use EqualNullSafe but in Spark 3.X it used other expressions to do the same thing. In this case -it shows up as GPU doesn't support EqualNullSafe in the Spark 2.X explain output but when you -go to Spark 3.X those parts would run on the GPU because it is using different operators. This -is something to keep in mind when doing the analysis. - -### Using the Configuration Flag for Explain Only Mode - -Starting with version 22.02, the RAPIDS Accelerator can be run in explain only mode. -This mode allows you to run on a CPU cluster and can help us understand the potential GPU plan and -if there are any unsupported features. Basically it will log the output which is the same as -the driver logs with `spark.rapids.sql.explain=all`. - -#### Requirements - -- A Spark 3.x CPU cluster -- The `rapids-4-spark` [jar](../download.md) - -#### Usage - -1. In `spark-shell`, add the `rapids-4-spark` jar into --jars option or put it in the - Spark classpath and enable the configs `spark.rapids.sql.mode=explainOnly` and - `spark.plugins=com.nvidia.spark.SQLPlugin`. 
- - For example: - - ```bash - spark-shell --jars /PathTo/rapids-4-spark_.jar --conf spark.rapids.sql.mode=explainOnly --conf spark.plugins=com.nvidia.spark.SQLPlugin - ``` -2. Enable optional RAPIDS Accelerator related parameters based on your setup. - - Enabling optional parameters may allow more operations to run on the GPU but please understand - the meaning and risk of above parameters before enabling it. Please refer to the - [configuration documentation](../configs.md) for details of RAPIDS Accelerator - parameters. - - For example, if your jobs Scala UDFs, you can set the following parameters: - - ```scala - spark.conf.set("spark.rapids.sql.udfCompiler.enabled",true) - ``` - -3. Run your query and check the driver logs for the explain output. - - Below are sample driver log messages starting with `!` which indicate the unsupported features in - this version: - - ``` - ! cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RowDataSourceScanExec - ``` - -This log can show you which operators (on what data type) can not run on GPU and the reason. -If it shows a specific RAPIDS Accelerator parameter which can be turned on to enable that feature, -you should first understand the risk and applicability of that parameter based on -[configs doc](../configs.md) and then enable that parameter and try the tool again. - -Since its output is directly based on specific version of `rapids-4-spark` jar, the gap analysis is -pretty accurate. - -### How to use the Function Call - -A function named `explainPotentialGpuPlan` is available which can help us understand the potential -GPU plan and if there are any unsupported features on a CPU cluster. Basically it can return output -which is the same as the driver logs with `spark.rapids.sql.explain=all`. - -#### Requirements with Spark 3.X - -- A Spark 3.X CPU cluster -- The `rapids-4-spark` [jar](../download.md) -- Ability to modify the existing Spark application code -- RAPIDS Accelerator for Apache Spark version 21.12 or newer - -#### Function Documentation - -```scala -explainPotentialGpuPlan(df: DataFrame, explain: String = "ALL") -``` - -Looks at the CPU plan associated with the dataframe and outputs information -about which parts of the query the RAPIDS Accelerator for Apache Spark -could place on the GPU. This only applies to the initial plan, so if running -with adaptive query execution enable, it will not be able to show any changes -in the plan due to that. - -This is very similar output you would get by running the query with the -RAPIDS Accelerator enabled and with the config `spark.rapids.sql.enabled` enabled. - -Requires the RAPIDS Accelerator for Apache Spark jar be included -in the classpath but the RAPIDS Accelerator for Apache Spark should be disabled. - -Calling from Scala: -```scala -val output = com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(df) -``` - -Calling from PySpark: -```python -output = sc._jvm.com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(df._jdf, "ALL") -``` - -Parameters: -`df` - The Spark DataFrame to get the query plan from -`explain` - If ALL returns all the explain data, otherwise just returns what does not - work on the GPU. Default is ALL. - -Returns: -String containing the explain output. 
- -Throws: -`java.lang.IllegalArgumentException` - if an argument is invalid or it is unable to determine the Spark version -`java.lang.IllegalStateException` - if the plugin gets into an invalid state while trying - to process the plan or there is an unexepected exception. - -#### Usage - -1. In `spark-shell`, add the necessary jar into --jars option or put it in the - Spark classpath. - - For example, on Spark 3.X: - - ```bash - spark-shell --jars /PathTo/rapids-4-spark_.jar - ``` - -2. Test if the class can be successfully loaded or not. - - ```scala - import com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan - ``` - -3. Enable optional RAPIDS Accelerator related parameters based on your setup. - - Enabling optional parameters may allow more operations to run on the GPU but please understand - the meaning and risk of above parameters before enabling it. Please refer to the - [configuration documentation](../configs.md) for details of RAPIDS Accelerator parameters. - - For example, if your jobs have Scala UDFs, you can set the following parameters: - - ```scala - spark.conf.set("spark.rapids.sql.udfCompiler.enabled",true) - ``` - -4. Run the function `explainPotentialGpuPlan` on the query DataFrame. - - For example: - - ```scala - val jdbcDF = spark.read.format("jdbc"). - option("driver", "com.mysql.jdbc.Driver"). - option("url", "jdbc:mysql://localhost:3306/hive?useSSL=false"). - option("dbtable", "TBLS").option("user", "xxx"). - option("password", "xxx"). - load() - jdbcDF.createOrReplaceTempView("t") - val mydf=spark.sql("select count(distinct TBL_ID) from t") - - val output=com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(mydf) - println(output) - ``` - - Below are sample driver log messages starting with `!` which indicate the unsupported features in - this version: - - ``` - ! cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RowDataSourceScanExec - ``` -The output will show you which operators (on what data type) can not run on GPU and the reason. -If it shows a specific RAPIDS Accelerator parameter which can be turned on to enable that feature, -you should first understand the risk and applicability of that parameter based on -[configs doc](../configs.md) and then enable that parameter and try the tool again. - -Since its output is directly based on specific version of `rapids-4-spark` jar, the gap analysis is -pretty accurate. - -## 3. Run Spark applications with Spark RAPIDS Accelerator on a GPU Spark Cluster - -### Requirements - -- A Spark 3.x GPU cluster -- The `rapids-4-spark` [jar](../download.md) - -### How to use - -Follow the getting-started guides to start a Spark 3+ GPU cluster and run the existing Spark -workloads on the GPU cluster with parameter `spark.rapids.sql.explain=all`. The Spark driver log -should be collected to check the not-supported messages. This is the most accurate way to do gap -analysis. - -For example, the log lines starting with `!` is the so-called not-supported messages: -``` -!Exec cannot run on GPU because not all expressions can be replaced - ! replicaterows(sum#99L, gender#76) cannot run on GPU because GPU does not currently support the operator ReplicateRows -``` -The indentation indicates the parent and child relationship for those expressions. -If not all of the children expressions can run on GPU, the parent can not run on GPU either. -So above example shows the missing feature is `ReplicateRows` expression. 
So we filed a feature request -[issue-4104](https://github.com/NVIDIA/spark-rapids/issues/4104) based on 21.12 version. diff --git a/docs/get-started/getting-started.md b/docs/get-started/getting-started.md deleted file mode 100644 index d523192456d..00000000000 --- a/docs/get-started/getting-started.md +++ /dev/null @@ -1,75 +0,0 @@ ---- -layout: page -title: Getting-Started -nav_order: 2 -has_children: true -permalink: /Getting-Started/ ---- -# Getting Started with the RAPIDS Accelerator for Apache Spark - -Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame -operations. This requires no API changes from the user. The plugin will replace SQL operations it -supports with GPU accelerated versions. If an operation is not supported it will fall back to using -the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs -directly. - -The accelerator library also provides an implementation of Spark's shuffle that can leverage -[UCX](https://www.openucx.org/) to optimize GPU data transfers keeping as much data on the GPU as -possible and bypassing the CPU to do GPU to GPU transfers. - -The GPU accelerated processing plugin does not require the accelerated shuffle implementation. -However, if accelerated SQL processing is not enabled, the shuffle implementation falls back to the -default `SortShuffleManager`. - -To enable GPU processing acceleration you will need: -- Apache Spark 3.1+ -- A Spark cluster configured with GPUs that comply with the - [requirements for RAPIDS](https://rapids.ai/start.html#prerequisites). - - One GPU per executor. -- The RAPIDS Accelerator for Apache Spark plugin jar. -- To set the config `spark.plugins` to `com.nvidia.spark.SQLPlugin` - -## Spark GPU Scheduling Overview -Apache Spark 3.0 now supports GPU scheduling as long as you are using a cluster manager that -supports it. You can have Spark request GPUs and assign them to tasks. The exact configs you use -will vary depending on your cluster manager. Here are some example configs: -- Request your executor to have GPUs: - - `--conf spark.executor.resource.gpu.amount=1` -- Specify the number of GPUs per task: - - `--conf spark.task.resource.gpu.amount=0.125` will allow up to 8 concurrent tasks per executor. - It is recommended to be 1/{executor core count} to get the best performance. -- Specify a GPU discovery script (required on YARN and K8S): - - `--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh` -- Explain why some operations of a query were not placed on a GPU or not: - - `--conf spark.rapids.sql.explain=ALL` will display whether each operation is placed on GPU. - - `--conf spark.rapids.sql.explain=NOT_ON_GPU` will display only parts that did not go on the GPU, - and it's the default setting. - - `--conf spark.rapids.sql.explain=NONE` will disable the log of `rapids.sql.explain`. - -See the deployment specific sections for more details and restrictions. Note that -`spark.task.resource.gpu.amount` can be a decimal amount, so if you want multiple tasks to be run -on an executor at the same time and assigned to the same GPU you can set this to a decimal value -less than 1. You would want this setting to correspond to the `spark.executor.cores` setting. For -instance, if you have `spark.executor.cores=2` which would allow 2 tasks to run on each executor -and you want those 2 tasks to run on the same GPU then you would set -`spark.task.resource.gpu.amount=0.5`. 
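-
-As a minimal sketch of how these settings fit together on a launch command (the jar path below
-is a placeholder for wherever you put the RAPIDS Accelerator plugin jar):
-
-```shell
-# 2 cores per executor with 0.5 GPUs per task: both tasks share the executor's single GPU
-$SPARK_HOME/bin/spark-shell \
-  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
-  --conf spark.executor.cores=2 \
-  --conf spark.executor.resource.gpu.amount=1 \
-  --conf spark.task.resource.gpu.amount=0.5 \
-  --jars /path/to/rapids-4-spark.jar
-```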
See the [Tuning Guide](../tuning-guide.md) for more details -on controlling the task concurrency for each executor. - -You can also refer to the official Apache Spark documentation. -- [Overview](https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview) -- [Kubernetes specific documentation](https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#resource-allocation-and-configuration-overview) -- [Yarn specific documentation](https://github.com/apache/spark/blob/master/docs/running-on-yarn.md#resource-allocation-and-configuration-overview) -- [Standalone specific documentation](https://github.com/apache/spark/blob/master/docs/spark-standalone.md#resource-allocation-and-configuration-overview) - -## Spark workload qualification - -If you plan to convert existing Spark workload from CPU to GPU, please refer to this -[Spark workload qualification](./getting-started-workload-qualification.md) to check if your Spark -Applications are good fit for the RAPIDS Accelerator for Apache Spark. - -## Spark benchmark - -Please visit [spark-rapids-benchmarks](https://github.com/NVIDIA/spark-rapids-benchmarks) repo for -benchmark tests using the RAPIDS Accelerator For Apache Spark, if you plan to compare the CPU and -GPU Spark jobs' performance. - diff --git a/docs/get-started/yarn-gpu.md b/docs/get-started/yarn-gpu.md deleted file mode 100644 index 07331e1b943..00000000000 --- a/docs/get-started/yarn-gpu.md +++ /dev/null @@ -1,158 +0,0 @@ ---- -layout: page -title: yarn-gpu -nav_exclude: true ---- - -## Spark3 GPU Configuration Guide on Yarn 3.2.1 - -Following files recommended to be configured to enable GPU scheduling on Yarn 3.2.1 and later. - -GPU resource discovery script - `/usr/lib/spark/scripts/gpu/getGpusResources.sh`: -```bash -mkdir -p /usr/lib/spark/scripts/gpu/ -cd /usr/lib/spark/scripts/gpu/ -wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh -chmod a+rwx -R /usr/lib/spark/scripts/gpu/ -``` - -Spark config - `/etc/spark/conf/spark-default.conf`: -```bash -spark.rapids.sql.concurrentGpuTasks=2 -spark.executor.resource.gpu.amount=1 -spark.executor.cores=8 -spark.task.cpus=1 -spark.task.resource.gpu.amount=0.125 -spark.rapids.memory.pinnedPool.size=2G -spark.executor.memoryOverhead=2G -spark.plugins=com.nvidia.spark.SQLPlugin -spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true' -spark.executor.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh # this match the location of discovery script -spark.sql.files.maxPartitionBytes=512m -``` - -Yarn Scheduler config - `/etc/hadoop/conf/capacity-scheduler.xml`: -```xml - - - yarn.scheduler.capacity.resource-calculator - org.apache.hadoop.yarn.util.resource.DominantResourceCalculator - - -``` - -Yarn config - `/etc/hadoop/conf/yarn-site.xml`: -```xml - - - yarn.nodemanager.resource-plugins - yarn.io/gpu - - - yarn.resource-types - yarn.io/gpu - - - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices - auto - - - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables - /usr/bin - - - yarn.nodemanager.linux-container-executor.cgroups.mount - true - - - yarn.nodemanager.linux-container-executor.cgroups.mount-path - /sys/fs/cgroup - - - yarn.nodemanager.linux-container-executor.cgroups.hierarchy - yarn - - - yarn.nodemanager.container-executor.class - org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor - - - 
yarn.nodemanager.linux-container-executor.group - yarn - - -``` - -`/etc/hadoop/conf/container-executor.cfg` - user yarn as service account: -```bash -yarn.nodemanager.linux-container-executor.group=yarn - -#--Original container-exectuor.cfg Content-- - -[gpu] -module.enabled=true -[cgroups] -root=/sys/fs/cgroup -yarn-hierarchy=yarn -``` - -Need to share node manager local dir to all user, run below in bash: -```bash -chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct -chmod a+rwx -R /sys/fs/cgroup/devices -local_dirs=$(bdconfig get_property_value \ - --configuration_file /etc/hadoop/conf/yarn-site.xml \ - --name yarn.nodemanager.local-dirs 2>/dev/null) -mod_local_dirs=${local_dirs//\,/ } -chmod a+rwx -R ${mod_local_dirs} -``` - -In the end, restart node manager and resource manager service: -On all workers: -```bash -sudo systemctl restart hadoop-yarn-nodemanager.service -``` -On all masters: -```bash -sudo systemctl restart hadoop-yarn-resourcemanager.service -``` - -Note: If `cgroup` is mounted on `tmpfs` and a node is rebooted, -the cgroup directory permission gets reverted. Please check the -cgroup documentation for your platform for more details. -Below is one example of how this can be handled: - -Update the cgroup permissions: -```bash -chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct -chmod a+rwx -R /sys/fs/cgroup/devices -``` -Or the operation can be added in the systemd scripts: - -Create mountCgroup scripts: -```bash -sudo bash -c "cat >/etc/systemd/system/mountCgroup.service" </etc/mountCgroup.sh" <RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. +The RAPIDS Accelerator for Apache Spark combines the power of the [RAPIDS cuDF](https://github.com/rapidsai/cudf/) library and +the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a +built-in accelerated shuffle based on [UCX](https://github.com/openucx/ucx/) that can be configured to leverage GPU-to-GPU +communication and RDMA capabilities. -## Performance & Cost Benefits -Rapids Accelerator for Apache Spark reaps the benefit of GPU performance while saving infrastructure costs. -![Perf-cost](/docs/img/perf-cost.png) -*ETL for FannieMae Mortgage Dataset (~200GB) as shown in our -[demo](https://www.youtube.com/watch?v=4MI_LYah900). Costs -based on Cloud T4 GPU instance market price. - -Please refer to [spark-rapids-examples repo](https://github.com/NVIDIA/spark-rapids-examples/tree/main/examples/XGBoost-Examples) -for details of this example job. - -## Ease of Use -Run your existing Apache Spark applications with no code change. Launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar and enable a configuration setting: - -`spark.conf.set('spark.rapids.sql.enabled','true')` - -The following is an example of a physical plan with operators running on the GPU: - -``` -== Physical Plan == -GpuColumnarToRow false -+- GpuProject [cast(c_customer_sk#0 as string) AS c_customer_sk#40] - +- GpuFileGpuScan parquet [c_customer_sk#0] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct -``` - -Learn more on how to [get started](get-started/getting-started.md). 
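As a point of reference, here is a minimal sketch of launching a Spark shell with the plugin enabled; the jar path is a placeholder. Running `explain()` on a supported query should then show `Gpu*` operators similar to the plan above.

```bash
# Illustrative only: the plugin jar location is a placeholder.
spark-shell \
  --jars /path/to/rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true
```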
- -## A Unified AI framework for ETL + ML/DL -A single pipeline, from ingest to data preparation to model training -![spark3cluster](/docs/img/spark3cluster.png) - -## Technical Support - -If you need any help or have questions on this product, please contact us : -spark-rapids-support@nvidia.com +If you are a customer looking for information on how to adopt RAPIDS Accelerator for Apache Spark +for your Spark workloads, please go to our User Guide for more information: [link](https://docs.nvidia.com/spark-rapids/user-guide/latest/index.html). diff --git a/docs/spark-profiling-tool.md b/docs/spark-profiling-tool.md deleted file mode 100644 index f164a03d9df..00000000000 --- a/docs/spark-profiling-tool.md +++ /dev/null @@ -1,715 +0,0 @@ ---- -layout: page -title: Profiling Tool -nav_order: 9 ---- -# Profiling Tool - -The Profiling tool analyzes both CPU or GPU generated event logs and generates information -which can be used for debugging and profiling Apache Spark applications. -The output information contains the Spark version, executor details, properties, etc. -Starting with release _22.10_, the Profiling tool optionally provides optimized RAPIDS -configurations based on the worker's information (see [Auto-Tuner support](#auto-tuner-support)). - -* TOC -{:toc} - -## How to use the Profiling tool - -### Prerequisites -- Java 8 or above, Spark 3.0.1+ jars -- Spark event log(s) from Spark 2.0 or above version. Supports both rolled and compressed event logs - with `.lz4`, `.lzf`, `.snappy` and `.zstd` suffixes as well as - Databricks-specific rolled and compressed(`.gz`) event logs. -- The tool does not support nested directories. - Event log files or event log directories should be at the top level when specifying a directory. - -Note: Spark event logs can be downloaded from Spark UI using a _Download_ button on the right side, -or can be found in the location specified by `spark.eventLog.dir`. See the -[Apache Spark Monitoring](http://spark.apache.org/docs/latest/monitoring.html) documentation for -more information. - -### Step 1a: Download the tools jar -- Download the latest RAPIDS Accelerator for Apache Spark tools jar from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/) - -If you want to compile the jar, please refer to the instructions [here](./spark-qualification-tool.md#How-to-compile-the-tools-jar). - -### Step 1b: Download the Apache Spark 3 distribution -The Profiling tool requires the Spark 3.x jars to be able to run but does not need an Apache Spark run time. -If you do not already have Spark 3.x installed, -you can download the Spark distribution to any machine and include the jars in the classpath. -- [Download Apache Spark 3.x](http://spark.apache.org/downloads.html) - -### Step 2 How to run the Profiling tool -The profiling tool parses the Spark CPU or GPU event log(s) and creates an output report. -If necessary, extract the Spark distribution into a local directory. To run the tool, please note the following: -- Either set `SPARK_HOME` to point to that local directory or add it to the -classpath `java -cp toolsJar:pathToSparkJars/*:...` when you run the Profiling tool. -- Acceptable input event log paths are files or directories containing spark events logs -in the local filesystem, HDFS, S3 or mixed. -- If you are processing a lot of event logs, then use combined or compare mode. Both these modes may need you to increase -the java heap size using `-Xmx` option. For instance, to specify 30 GB heap size `java -Xmx30g`. 
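Putting those notes together, the sketch below shows one way to launch the tool; the Spark location, tools jar version, and event log path are all placeholders, and the modes described next only change the flags passed to `ProfileMain`.

```bash
# Illustrative only: paths and the jar version are placeholders.
export SPARK_HOME=/opt/spark
java -Xmx30g \
  -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --output-directory ./profile-output \
  /path/to/eventlog-dir
```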
- -There are 3 modes of operation for the Profiling tool: - 1. Collection Mode: - Collection mode is the default mode when no other options are specified it simply collects information - on each application individually and outputs a file per application - - ```bash - Usage: java -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* - com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] - - ``` - Note that this is the only mode that supports the _Auto-Tuner_ option described in more details - in the [Auto-Tuner support](#auto-tuner-support) section. - - 2. Combined Mode: - Combined mode is collection mode but then combines all the applications - together and you get one file for all applications. - - ```bash - Usage: java -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* - com.nvidia.spark.rapids.tool.profiling.ProfileMain --combined - - ``` - 3. Compare Mode: - Compare mode will combine all the applications information in the same tables into a single file - and also adds in tables to compare stages and sql ids across all of those applications. - The Compare mode will use more memory if comparing lots of applications. - - ```bash - Usage: java -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* - com.nvidia.spark.rapids.tool.profiling.ProfileMain --compare - - ``` - Note that if you are on an HDFS cluster the default filesystem is likely HDFS for both the input and output - so if you want to point to the local filesystem be sure to include `file:` in the path. - - Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath) - - ```bash - java -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.profiling.ProfileMain /eventlogDir - ``` - -Run `--help` for more information. - -## Understanding Profiling tool detailed output and examples -The default output location is the current directory. -The output location can be changed using the `--output-directory` option. -The output goes into a sub-directory named `rapids_4_spark_profile/` inside that output location. -- If running in normal collect mode, it processes event logs individually and outputs files for each application under -a directory named `rapids_4_spark_profile/{APPLICATION_ID}`. It creates a summary text file named `profile.log`. -- If running combine mode the output is put under a directory named `rapids_4_spark_profile/combined/` and creates a summary -text file named `rapids_4_spark_tools_combined.log`. -- If running compare mode the output is put under a directory named `rapids_4_spark_profile/compare/` and creates a summary -text file named `rapids_4_spark_tools_compare.log`. -The output will go into your default filesystem and the tool supports local filesystem or HDFS. - -If you are on an HDFS cluster, then the default filesystem is likely HDFS for both the input and output -so if you want to point to the local filesystem be sure to include `file:` in the path. -There are separate files that are generated under the same sub-directory when using the options to generate query -visualizations or printing the SQL plans. -Optionally if the `--csv` option is specified then it creates a csv file for each table for each application in the -corresponding sub-directory. - -Additional notes: -- There is a 100 characters limit for each output column. If the result of the column exceeds this limit, it is suffixed with -`...` for that column. -- ResourceProfile ids are parsed for the event logs that are from Spark 3.1 or later. 
A ResourceProfile allows the user -to specify executor and task requirements for an RDD that will get applied during a stage. This allows the user to change -the resource requirements between stages. - -#### A. Collect Information or Compare Information(if more than 1 event logs are as input and option --compare is specified) -- Application information -- Application log path mapping -- Data Source information -- Executors information -- Job, stage and SQL ID information -- SQL to stage information -- Rapids related parameters -- Spark Properties -- Rapids Accelerator jar -- SQL Plan Metrics -- WholeStageCodeGen to node mappings (only applies to CPU plans) -- IO Metrics -- Compare Mode: Matching SQL IDs Across Applications -- Compare Mode: Matching Stage IDs Across Applications -- Optionally: SQL Plan for each SQL query -- Optionally: Generates DOT graphs for each SQL query -- Optionally: Generates timeline graph for application - -For example, GPU run vs CPU run performance comparison or different runs with different parameters. - -We can input multiple Spark event logs and this tool can compare environments, executors, Rapids related Spark parameters, - -- Compare the durations/versions/gpuMode on or off: - - -- Application information - -``` -Application Information: - -+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+ -|appIndex|appName |appId |sparkUser|startTime |endTime |duration|durationStr|sparkVersion|pluginEnabled| -+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+ -|1 |Spark shell|app-20210329165943-0103|user1 |1617037182848|1617037490515|307667 |5.1 min |3.0.1 |false | -|2 |Spark shell|app-20210329170243-0018|user1 |1617037362324|1617038578035|1215711 |20 min |3.0.1 |true | -+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+ -``` - -- Data Source information -The details of this output differ between using a Spark Data Source V1 and Data Source V2 reader. -The Data Source V2 truncates the schema, so if you see `...`, then -the full schema is not available. - -``` -Data Source Information: -+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+ -|appIndex|sqlID|format |location |pushedFilters |schema | -+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+ -|1 |0 |Text |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/resources/trucks-comments.csv]|[] |value:string | -|1 |1 |csv |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/re... 
|PushedFilters: []|_c0:string | -|1 |2 |parquet|Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotscolumnsout] |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...| -|1 |3 |parquet|Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotscolumnsout] |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...| -|1 |4 |orc |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/logscolumsout.orc] |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...| -|1 |5 |orc |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/logscolumsout.orc] |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...| -|1 |6 |json |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json] |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...| -|1 |7 |json |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json] |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...| -|1 |8 |json |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json] |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...| -|1 |9 |JDBC |unknown |unknown | | -+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+ -``` - -- Executor information: - -``` -Executor Information: -+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+ -|appIndex|resourceProfileId|numExecutors|executorCores|maxMem |maxOnHeapMem|maxOffHeapMem|executorMemory|numGpusPerExecutor|executorOffHeap|taskCpu|taskGpu| -+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+ -|1 |0 |1 |4 |11264537395|11264537395 |0 |20480 |1 |0 |1 |0.0 | -|1 |1 |2 |2 |3247335014 |3247335014 |0 |6144 |2 |0 |2 |2.0 | -+--------+-----------------+------------+-------------+-----------+------------+-------------+-------------+--------------+------------------+---------------+-------+-------+ -``` - -- Matching SQL IDs Across Applications: - -``` -Matching SQL IDs Across Applications: -+-----------------------+-----------------------+ -|app-20210329165943-0103|app-20210329170243-0018| -+-----------------------+-----------------------+ -|0 |0 | -|1 |1 | -|2 |2 | -|3 |3 | -|4 |4 | -+-----------------------+-----------------------+ -``` - -There is one column per application. There is a row per SQL ID. The SQL IDs are matched -primarily on the structure of the SQL query run, and then on the order in which they were -run. Be aware that this is truly the structure of the query. Two queries that do similar -things, but on different data are likely to match as the same. 
An effort is made to -also match between CPU plans and GPU plans so in most cases the same query run on the -CPU and on the GPU will match. - -- Matching Stage IDs Across Applications: - -``` -Matching Stage IDs Across Applications: -+-----------------------+-----------------------+ -|app-20210329165943-0103|app-20210329170243-0018| -+-----------------------+-----------------------+ -|31 |31 | -|32 |32 | -|33 |33 | -|39 |38 | -|40 |40 | -|41 |41 | -+-----------------------+-----------------------+ -``` - -There is one column per application. There is a row per stage ID. If a SQL query matches -between applications, see Matching SQL IDs Across Applications, then an attempt is made -to match stages within that application to each other. This has the same issues with -stages when generating a dot graph. This can be especially helpful when trying to compare -large queries and Spark happened to assign the stage IDs slightly differently, or in some -cases there are a different number of stages because of slight differences in the plan. This -is a best effort, and it is not guaranteed to match up all stages in a plan. - -- SQL to Stage Information (sorted by stage duration) - -Note that not all SQL nodes have a mapping to stage id so some nodes might be missing. - -``` -SQL to Stage Information: -+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -|appIndex|sqlID|jobID|stageId|stageAttemptId|Stage Duration|SQL Nodes(IDs) | -+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -|1 |0 |1 |1 |0 |8174 |Exchange(9),WholeStageCodegen (1)(10),Scan(13) | -|1 |0 |1 |2 |0 |8154 |Exchange(16),WholeStageCodegen (3)(17),Scan(20) | -|1 |0 |1 |3 |0 |2148 |Exchange(2),HashAggregate(4),SortMergeJoin(6),WholeStageCodegen (5)(3),Sort(8),WholeStageCodegen (2)(7),Exchange(9),Sort(15),WholeStageCodegen (4)(14),Exchange(16)| -|1 |0 |1 |4 |0 |126 |HashAggregate(1),WholeStageCodegen (6)(0),Exchange(2) | -+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -``` - -- Compare Rapids related Spark properties side-by-side: - -``` -Compare Rapids Properties which are set explicitly: -+-------------------------------------------+----------+----------+ -|propertyName |appIndex_1|appIndex_2| -+-------------------------------------------+----------+----------+ -|spark.rapids.memory.pinnedPool.size |null |2g | -|spark.rapids.sql.castFloatToDecimal.enabled|null |true | -|spark.rapids.sql.concurrentGpuTasks |null |2 | -|spark.rapids.sql.enabled |false |true | -|spark.rapids.sql.explain |null |NOT_ON_GPU| -|spark.rapids.sql.incompatibleOps.enabled |null |true | -+-------------------------------------------+----------+----------+ -``` - -- List rapids-4-spark jars based on classpath: - -``` -Rapids Accelerator jar: -+--------+------------------------------------------------------------+ -|appIndex|Rapids4Spark jars | -+--------+------------------------------------------------------------+ -|1 |spark://10.10.10.10:43445/jars/rapids-4-spark_2.12-0.5.0.jar| -|2 |spark://10.10.10.11:41319/jars/rapids-4-spark_2.12-0.5.0.jar| 
-+--------+------------------------------------------------------------+ -``` - -- Job, stage and SQL ID information(not in `compare` mode yet): - -``` -Job Information: -+--------+-----+---------+-----+-------------+-------------+ -|appIndex|jobID|stageIds |sqlID|startTime |endTime | -+--------+-----+---------+-----+-------------+-------------+ -|1 |0 |[0] |null |1622846402778|1622846410240| -|1 |1 |[1,2,3,4]|0 |1622846431114|1622846441591| -+--------+-----+---------+-----+-------------+-------------+ -``` - -- SQL Plan Metrics for Application for each SQL plan node in each SQL: - -These are also called accumulables in Spark. -Note that not all SQL nodes have a mapping to stage id. - -``` -SQL Plan Metrics for Application: -+--------+-----+------+-----------------------------------------------------------+-------------+-----------------------+-------------+----------+--------+ -|appIndex|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType|stageIds| -+--------+-----+------+-----------------------------------------------------------+-------------+-----------------------+-------------+----------+--------+ -|1 |0 |1 |GpuColumnarExchange |111 |output rows |1111111111 |sum |4,3 | -|1 |0 |1 |GpuColumnarExchange |112 |output columnar batches|222222 |sum |4,3 | -|1 |0 |1 |GpuColumnarExchange |113 |data size |333333333333 |size |4,3 | -|1 |0 |1 |GpuColumnarExchange |114 |shuffle bytes written |444444444444 |size |4,3 | -|1 |0 |1 |GpuColumnarExchange |115 |shuffle records written|555555 |sum |4,3 | -|1 |0 |1 |GpuColumnarExchange |116 |shuffle write time |666666666666 |nsTiming |4,3 | -``` - -- WholeStageCodeGen to Node Mapping (only for CPU logs): - -``` -WholeStageCodeGen Mapping: -+--------+-----+------+---------------------+-------------------+------------+ -|appIndex|sqlID|nodeID|SQL Node |Child Node |Child NodeID| -+--------+-----+------+---------------------+-------------------+------------+ -|1 |0 |0 |WholeStageCodegen (6)|HashAggregate |1 | -|1 |0 |3 |WholeStageCodegen (5)|HashAggregate |4 | -|1 |0 |3 |WholeStageCodegen (5)|Project |5 | -|1 |0 |3 |WholeStageCodegen (5)|SortMergeJoin |6 | -|1 |0 |7 |WholeStageCodegen (2)|Sort |8 | -``` - - -#### B. Analysis -- Job + Stage level aggregated task metrics -- SQL level aggregated task metrics -- SQL duration, application during, if it contains Dataset or RDD operation, potential problems, executor CPU time percent -- Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size) - -Below we will aggregate the task level metrics at different levels -to do some analysis such as detecting possible shuffle skew. 
- -- Job + Stage level aggregated task metrics: - -``` -Job + Stage level aggregated task metrics: -+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+ -|appIndex|ID |numTasks|Duration|diskBytesSpilled_sum|duration_sum|duration_max|duration_min|duration_avg|executorCPUTime_sum|executorDeserializeCPUTime_sum|executorDeserializeTime_sum|executorRunTime_sum|input_bytesRead_sum|input_recordsRead_sum|jvmGCTime_sum|memoryBytesSpilled_sum|output_bytesWritten_sum|output_recordsWritten_sum|peakExecutionMemory_max|resultSerializationTime_sum|resultSize_max|sr_fetchWaitTime_sum|sr_localBlocksFetched_sum|sr_localBytesRead_sum|sr_remoteBlocksFetched_sum|sr_remoteBytesRead_sum|sr_remoteBytesReadToDisk_sum|sr_totalBytesRead_sum|sw_bytesWritten_sum|sw_recordsWritten_sum|sw_writeTime_sum| -+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+ -|1 |job_0 |3333 |222222 |0 |11111111 |111111 |111 |1111.1 |6666666 |55555 |55555 |55555555 |222222222222 |22222222222 |111111 |0 |0 |0 |222222222 |1 |11111 |11111 |99999 |22222222222 |2222221 |222222222222 |0 |222222222222 |222222222222 |5555555 |444444 | -``` - -- SQL level aggregated task metrics: - -``` -SQL level aggregated task metrics: -+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+ -|appIndex|appID |sqlID|description 
|numTasks|Duration|executorCPUTime|executorRunTime|executorCPURatio|diskBytesSpilled_sum|duration_sum|duration_max|duration_min|duration_avg|executorCPUTime_sum|executorDeserializeCPUTime_sum|executorDeserializeTime_sum|executorRunTime_sum|input_bytesRead_sum|input_recordsRead_sum|jvmGCTime_sum|memoryBytesSpilled_sum|output_bytesWritten_sum|output_recordsWritten_sum|peakExecutionMemory_max|resultSerializationTime_sum|resultSize_max|sr_fetchWaitTime_sum|sr_localBlocksFetched_sum|sr_localBytesRead_sum|sr_remoteBlocksFetched_sum|sr_remoteBytesRead_sum|sr_remoteBytesReadToDisk_sum|sr_totalBytesRead_sum|sw_bytesWritten_sum|sw_recordsWritten_sum|sw_writeTime_sum| -+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+ -|1 |application_1111111111111_0001|0 |show at :11|1111 |222222 |6666666 |55555555 |55.55 |0 |13333333 |111111 |999 |3333.3 |6666666 |55555 |66666 |11111111 |111111111111 |11111111111 |111111 |0 |0 |0 |888888888 |8 |11111 |11111 |99999 |11111111111 |2222222 |222222222222 |0 |222222222222 |444444444444 |5555555 |444444 | -``` - -- SQL duration, application during, if it contains Dataset or RDD operation, potential problems, executor CPU time percent: - -``` -SQL Duration and Executor CPU Time Percent -+--------+-------------------+-----+------------+--------------------------+------------+---------------------------+-------------------------+ -|appIndex|App ID |sqlID|SQL Duration|Contains Dataset or RDD Op|App Duration|Potential Problems |Executor CPU Time Percent| -+--------+-------------------+-----+------------+--------------------------+------------+---------------------------+-------------------------+ -|1 |local-1626104300434|0 |1260 |false |131104 |NESTED COMPLEX TYPE |92.65 | -|1 |local-1626104300434|1 |259 |false |131104 |NESTED COMPLEX TYPE |76.79 | -``` - -- Shuffle Skew Check: - -``` -Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size) -+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+ -|appIndex|stageId|stageAttemptId|taskId|attempt|taskDurationSec|avgDurationSec|taskShuffleReadMB|avgShuffleReadMB|taskPeakMemoryMB|successful|reason | -+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+ -|1 |2 |0 |2222 |0 |111.11 |7.7 |2222.22 |111.11 |0.01 |false |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa| -|1 |2 |0 |2224 |1 |222.22 |8.8 |3333.33 |111.11 |0.01 |false |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: 
/dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa| -+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+ -``` - -#### C. Health Check -- List failed tasks, stages and jobs -- Removed BlockManagers and Executors -- SQL Plan HealthCheck - -Below are examples. -- Print failed tasks: - -``` -Failed tasks: -+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+ -|appIndex|stageId|stageAttemptId|taskId|attempt|failureReason | -+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+ -|3 |4 |0 |2842 |0 |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /home/jenkins/agent/workspace/jenkins| -|3 |4 |0 |2858 |0 |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(22000),None,false,true,None)| -|3 |4 |0 |2884 |0 |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(21148),None,false,true,None)| -|3 |4 |0 |2908 |0 |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(20420),None,false,true,None)| -|3 |4 |0 |3410 |1 |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /home/jenkins/agent/workspace/jenkins| -|4 |1 |0 |1948 |1 |TaskKilled(another attempt succeeded,List(AccumulableInfo(290,None,Some(1107),None,false,true,None),| -+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+ -``` - -- Print failed stages: - -``` -Failed stages: -+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+ -|appIndex|stageId|attemptId|name |numTasks|failureReason | -+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+ -|3 |4 |0 |attachTree at Spark300Shims.scala:624|1000 |Job 0 cancelled as part of cancellation of all jobs| -+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+ -``` - -- Print failed jobs: - -``` -Failed jobs: -+--------+-----+---------+------------------------------------------------------------------------+ -|appIndex|jobID|jobResult|failureReason | -+--------+-----+---------+------------------------------------------------------------------------+ -|3 |0 |JobFailed|java.lang.Exception: Job 0 cancelled as part of cancellation of all j...| -+--------+-----+---------+------------------------------------------------------------------------+ -``` - -- SQL Plan HealthCheck: - - Prints possibly unsupported query plan nodes such as `$Lambda` key word means dataset API. - -``` -+--------+-----+------+--------+---------------------------------------------------------------------------------------------------+ -|appIndex|sqlID|nodeID|nodeName|nodeDescription | -+--------+-----+------+--------+---------------------------------------------------------------------------------------------------+ -|3 |1 |8 |Filter |Filter $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$4578/0x00000008019f1840@4b63e04c.apply| -+--------+-----+------+--------+---------------------------------------------------------------------------------------------------+ -``` - -#### D. 
Recommended Configuration

The _Auto-Tuner_ output has two main sections:
1. _Spark Properties_: A list of Apache Spark configurations to tune the performance of the app.
   The list is the result of a `diff` between the existing app configurations and the recommended
   ones. Therefore, if a recommendation matches the existing app configuration, it will not show up in
   the list.
2. _Comments_: A list of messages to highlight properties that were missing in the app
   configurations, or the cause of failure to generate the recommendations.

**Examples**

- A successful run with missing _softwareProperties_:
  ```
  Spark Properties:
  --conf spark.executor.cores=16
  --conf spark.executor.instances=8
  --conf spark.executor.memory=32768m
  --conf spark.executor.memoryOverhead=7372m
  --conf spark.rapids.memory.pinnedPool.size=4096m
  --conf spark.rapids.sql.concurrentGpuTasks=2
  --conf spark.sql.files.maxPartitionBytes=512m
  --conf spark.sql.shuffle.partitions=200
  --conf spark.task.resource.gpu.amount=0.0625

  Comments:
  - 'spark.executor.instances' was not set.
  - 'spark.executor.cores' was not set.
  - 'spark.task.resource.gpu.amount' was not set.
  - 'spark.rapids.sql.concurrentGpuTasks' was not set.
  - 'spark.executor.memory' was not set.
  - 'spark.rapids.memory.pinnedPool.size' was not set.
  - 'spark.executor.memoryOverhead' was not set.
  - 'spark.sql.files.maxPartitionBytes' was not set.
  - 'spark.sql.shuffle.partitions' was not set.
  - 'spark.sql.adaptive.enabled' should be enabled for better performance.
  ```

- A successful run with defined _softwareProperties_. In this example, only
  two recommendations did not match the existing app configurations:
  ```
  Spark Properties:
  --conf spark.executor.instances=8
  --conf spark.sql.shuffle.partitions=200

  Comments:
  - 'spark.sql.shuffle.partitions' was not set.
  ```

- Failing to load the worker info:
  ```
  Cannot recommend properties. See Comments.

  Comments:
  - java.io.FileNotFoundException: File worker-info.yaml does not exist
  - 'spark.executor.memory' should be set to at least 2GB/core.
  - 'spark.executor.instances' should be set to (gpuCount * numWorkers).
  - 'spark.task.resource.gpu.amount' should be set to Max(1, (numCores / gpuCount)).
  - 'spark.rapids.sql.concurrentGpuTasks' should be set to Max(4, (gpuMemory / 8G)).
  - 'spark.rapids.memory.pinnedPool.size' should be set to 2048m.
  - 'spark.sql.adaptive.enabled' should be enabled for better performance.
  ```

### Generating Visualizations

- Print SQL Plans (--print-plans option):
  Prints the SQL plan as a text string to a file named `planDescriptions.log`.

- Generate DOT graph for each SQL (--generate-dot option):

```
Generated DOT graphs for app app-20210507103057-0000 to /path/. in 17 second(s)
```

A dot file will be generated for each query in the application.
Once the DOT file is generated, you can install [graphviz](http://www.graphviz.org) to convert the DOT file
to a graph in PDF format using the command below:

```bash
dot -Tpdf ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.pdf
```

Or to SVG using:
```bash
dot -Tsvg ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.svg
```

The PDF or SVG file has the SQL plan graph with metrics. The SVG file acts a little
more like the Spark UI and includes extra information for nodes when hovering over them with
a mouse.

As a part of this an effort is made to associate parts of the graph with the Spark stage it is a
This is not 100% accurate. Some parts of the plan like `TakeOrderedAndProject` may -be a part of multiple stages and only one of the stages will be selected. `Exchanges` are purposely -left out of the sections associated with a stage because they cover at least 2 stages and possibly -more. In other cases we may not be able to determine what stage something was a part of. In those -cases we mark it as `UNKNOWN STAGE`. This is because we rely on metrics to link a node to a stage. -If a stage has no metrics, like if the query crashed early, we cannot establish that link. - -- Generate timeline for application (--generate-timeline option): - -The output of this is an [svg](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) file -named `timeline.svg`. Most web browsers can display this file. It is a -timeline view similar to Apache Spark's -[event timeline](https://spark.apache.org/docs/latest/web-ui.html). - -This displays several data sections. - -1. **Tasks** This shows all tasks in the application divided by executor. Please note that this - tries to pack the tasks in the graph. It does not represent actual scheduling on CPU cores. - The tasks are labeled with the time it took for them to run. There is a breakdown of some metrics - per task in the lower half of the task block with different colors used to designate different - metrics. - 1. Yellow is the deserialization time for the task as reported by Spark. This works for both CPU - and GPU tasks. - 2. White is the read time for a task. This is a combination of the "buffer time" GPU SQL metric - and the shuffle read time as reported by Spark. The shuffle time works for both CPU and GPU - tasks, but "buffer time" only is reported for GPU accelerated file reads. - 3. Red is the semaphore wait time. This is the amount of time a task spent waiting to get access - to the GPU. When processing logs generated by versions of the spark rapids plugin prior to - 23.04 this would only show up on GPU tasks when DEBUG metrics are enabled. For logs generated - with 23.04 and above it is always on. It does not apply to CPU tasks, as they don't go through - the Semaphore. - 4. Green is the "op time" SQL metric along with a few other metrics that also indicate the amount - of time the GPU was being used to process data. This is GPU specific. - 5. Blue is the write time for a task. This is the "write time" SQL metric used when writing out - results as files using GPU acceleration, or it is the shuffle write time as reported by Spark. - The shuffle metrics work for both CPU and GPU tasks, but the "write time" metrics is GPU specific. - 6. Anything else is time that is not accounted for by these metrics. Typically, this is time - spent on the CPU, but could also include semaphore wait time as DEBUG metrics are not on by - default. -2. **STAGES** This shows the stages times reported by Spark. It starts with when the stage was - scheduled and ends when Spark considered the stage done. -3. **STAGE RANGES** This shows the time from the start of the first task to the end of the last - task. Often a stage is scheduled, but there are not enough resources in the cluster to run it. - This helps to show. How long it takes for a task to start running after it is scheduled, and in - many cases how long it took to run all of the tasks in the stage. This is not always true because - Spark can intermix tasks from different stages. -4. **JOBS** This shows the time range reported by Spark from when a job was scheduled to when it - completed. -5. 
**SQL** This shows the time range reported by Spark from when a SQL statement was scheduled to - when it completed. - -Tasks and stages all are color coordinated to help know what tasks are associated with a given -stage. Jobs and SQL are not color coordinated. - -## Profiling tool options - -``` -Profiling tool for the RAPIDS Accelerator and Apache Spark - -Usage: java -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* - com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] - - - -a, --auto-tuner Toggle AutoTuner module. - --combined Collect mode but combine all applications into - the same tables. - -c, --compare Compare Applications (Note this may require - more memory if comparing a large number of - applications). Default is false. - --csv Output each table to a CSV file as well - creating the summary text file. - -f, --filter-criteria Filter newest or oldest N eventlogs based on - application start timestamp for processing. - Filesystem based filtering happens before - application based filtering (see start-app-time). - eg: 100-newest-filesystem (for processing newest - 100 event logs). eg: 100-oldest-filesystem (for - processing oldest 100 event logs). - -g, --generate-dot Generate query visualizations in DOT format. - Default is false - --generate-timeline Write an SVG graph out for the full - application timeline. - -m, --match-event-logs Filter event logs whose filenames contain the - input string - -n, --num-output-rows Number of output rows for each Application. - Default is 1000 - --num-threads Number of thread to use for parallel - processing. The default is the number of cores - on host divided by 4. - -o, --output-directory Base output directory. Default is current - directory for the default filesystem. The - final output will go into a subdirectory - called rapids_4_spark_profile. It will - overwrite any existing files with the same - name. - -p, --print-plans Print the SQL plans to a file named - 'planDescriptions.log'. - Default is false. - -s, --start-app-time Filter event logs whose application start - occurred within the past specified time - period. Valid time periods are - min(minute),h(hours),d(days),w(weeks),m(months). - If a period is not specified it defaults to - days. - -t, --timeout Maximum time in seconds to wait for the event - logs to be processed. Default is 24 hours - (86400 seconds) and must be greater than 3 - seconds. If it times out, it will report what - it was able to process up until the timeout. - -w, --worker-info File path containing the system information of - a worker node. It is assumed that all workers - are homogenous. It requires the AutoTuner to - be enabled. Default is ./worker_info.yaml - -h, --help Show help message - - trailing arguments: - eventlog (required) Event log filenames(space separated) or directories - containing event logs. eg: s3a:///eventlog1 - /path/to/eventlog2 -``` - -### Auto-Tuner support - -Starting with release _22.10_, the Profiling tool a new _Auto-Tuner_ that aims at optimizing -Apache Spark applications by recommending a set of configurations to tune the performance of -Rapids accelerator. - -Currently, the _Auto-Tuner_ calculates a set of configurations that impact the performance of Apache -Spark apps executing on GPU. Those calculations can leverage cluster information -(e.g. memory, cores, Spark default configurations) as well as information processed in the -application event logs. 
Note that the tool also will recommend settings for the application assuming -that the job will be able to use all the cluster resources (CPU and GPU) when it is running. - -The values loaded from the app logs have higher precedence than the default configs. -Please refer to [Understanding the Profiling tool output](#d-recommended-configuration) for -more details on the output of the _Auto-Tuner_. - -Note the following _Auto-Tuner_ limitations: -- It is currently only supported in the _Collection Mode_ (see [the 3 different modes](#step-2-how-to-run-the-profiling-tool)), and -- It is assumed that all the _worker_ nodes on the cluster are homogenous. - -To run the _Auto-Tuner_, enable the `auto-tuner` flag and pass a valid `--worker-info `. -The _Auto-Tuner_ needs to learn the system properties of the _worker_ nodes that run application -code in the cluster. The argument `FILE_PATH` can either be local or remote file (i.e., HDFS). -A template of the worker information is shown below: - - ``` - system: - numCores: 32 - memory: 212992MiB - numWorkers: 5 - gpu: - memory: 15109MiB - count: 4 - name: T4 - softwareProperties: - spark.driver.maxResultSize: 7680m - spark.driver.memory: 15360m - spark.executor.cores: '8' - spark.executor.instances: '2' - spark.executor.memory: 47222m - spark.executorEnv.OPENBLAS_NUM_THREADS: '1' - spark.scheduler.mode: FAIR - spark.sql.cbo.enabled: 'true' - spark.ui.port: '0' - spark.yarn.am.memory: 640m - ``` - - -| Property | Optional | If Missing | -|--------------------|:--------:|:----------------------------------------------------------------------------------------------------------------------------:| -| system.numCores | No | _Auto-Tuner_ does not calculate recommendations | -| system.memory | No | _Auto-Tuner_ does not calculate any recommendations | -| system.numWorkers | Yes | Default: 1 | -| gpu.name | Yes | Default: T4 (Nvidia Tesla T4) | -| gpu.memory | Yes | Default: 16G | -| softwareProperties | Yes | This section is optional. The _Auto-Tuner_ reads the configs within the logs of the Apache Spark apps with higher precedence | - -## Profiling tool metrics definitions - -All the metrics definitions can be found in the -[executor task metrics doc](https://spark.apache.org/docs/latest/monitoring.html#executor-task-metrics) / -[executor metrics doc](https://spark.apache.org/docs/latest/monitoring.html#executor-metrics) or -the [SPARK webUI doc](https://spark.apache.org/docs/latest/web-ui.html#content). diff --git a/docs/spark-qualification-tool.md b/docs/spark-qualification-tool.md deleted file mode 100644 index 9b8199deda2..00000000000 --- a/docs/spark-qualification-tool.md +++ /dev/null @@ -1,1013 +0,0 @@ ---- -layout: page -title: Qualification Tool -nav_order: 8 ---- -# Qualification Tool - -The Qualification tool analyzes Spark events generated from CPU based Spark applications to help quantify -the expected acceleration of migrating a Spark application or query to GPU. - -The tool first analyzes the CPU event log and determine which operators are likely to run on the GPU. -The tool then uses estimates from historical queries and benchmarks to estimate a speed-up at an individual operator -level to calculate how much a specific operator would accelerate on GPU for the specific query or application. -It calculates an _"Estimated GPU App Duration"_ by adding up the accelerated operator durations along with durations -that could not run on GPU because they are unsupported operators or not SQL/Dataframe. 
- -This tool is intended to give the users a starting point and does not guarantee the -queries or applications with the highest _recommendation_ will actually be accelerated the most. Currently, -it reports by looking at the amount of time spent in tasks of SQL Dataframe operations. Note that the qualification -tool estimates assume that the application is run on a dedicated cluster where it can use all of the available -Spark resources. - -The estimations for GPU duration are available for different environments and are based on benchmarks run in the -applicable environments. Here are the cluster information for the ETL benchmarks used for the estimates: - -| Environment | CPU Cluster | GPU Cluster | -|------------------|-------------------|--------------------------------| -| On-prem | 8x 128-core | 8x 128-core + 8x A100 40 GB | -| Dataproc (T4) | 4x n1-standard-32 | 4x n1-standard-32 + 8x T4 16GB | -| Dataproc (L4) | 8x n1-standard-16 | 8x g2-standard-16 | -| EMR | 8x m5d.8xlarge | 4x g4dn.12xlarge | -| Databricks AWS | 8x m6gd.8xlage | 8x g5.8xlarge | -| Databricks Azure | 8x E8ds_v4 | 8x NC8as_T4_v3 | - -Note that all benchmarks were run using the [NDS benchmark](https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds) at SF3K (3 TB). - -> **Disclaimer!** -> Estimates provided by the Qualification tool are based on the currently supported "_SparkPlan_" or "_Executor Nodes_" -> used in the application. It currently does not handle all the expressions or datatypes used. -> Please refer to "[Understanding Execs report](#execs-report)" section and the -> "[Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md)" guide to check the types and expressions you are using are supported. - -This document covers below topics: - -* TOC -{:toc} - -## How to use the Qualification tool - -The Qualification tool can be run in three different ways. One is to run it as a standalone tool on the -Spark event logs after the application(s) have run, the second is to be integrated into a running Spark -application using explicit API calls, and the third is to install a Spark listener which can output -results on a per SQL query basis. - -In running the qualification tool standalone on Spark event logs, the tool can be run as a user tool command -via a [pip package](https://pypi.org/project/spark-rapids-user-tools/) for CSP environments (Google Dataproc, -AWS EMR, Databricks AWS) or as a java application for other environments. - -## Running the Qualification tool standalone for CSP environments on Spark event logs -### User Tools Prerequisites and Setup for CSP environments - -* [Dataproc](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/user-tools-dataproc.md) -* [EMR](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/user-tools-aws-emr.md) -* [Databricks AWS](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/user-tools-databricks-aws.md) - -### Qualify CPU Workloads for Potential Cost Savings and Acceleration with GPUs - -The qualification tool will run against logs from your CSP environment and then will output the applications -recommended for acceleration along with estimated speed-up and cost saving metrics. - -Usage: `spark_rapids_user_tools qualification --cpu_cluster --eventlogs ` - -The supported CSPs are *dataproc*, *emr*, and *databricks-aws*. The EVENTLOGS-PATH should be the storage location -for your eventlogs. For Dataproc, it should be set to the GCS path. 
For EMR and Databricks-AWS, it should be set to -the S3 path. THE CLUSTER can be a live cluster or a configuration file representing the cluster instances and size. -More details are in the above documentation links per CSP environment - -Help (to see all options available): `spark_rapids_user_tools qualification --help` - -Example output: -``` -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ -| | App Name | App ID | Recommendation | Estimated GPU | Estimated GPU | App | Estimated GPU | -| | | | | Speedup | Duration(s) | Duration(s) | Savings(%) | -|----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------| -| 0 | query24 | application_1664888311321_0011 | Strongly Recommended | 3.49 | 257.18 | 897.68 | 59.70 | -| 1 | query78 | application_1664888311321_0009 | Strongly Recommended | 3.35 | 113.89 | 382.35 | 58.10 | -| 2 | query23 | application_1664888311321_0010 | Strongly Recommended | 3.08 | 325.77 | 1004.28 | 54.37 | -| 3 | query64 | application_1664888311321_0008 | Strongly Recommended | 2.91 | 150.81 | 440.30 | 51.82 | -| 4 | query50 | application_1664888311321_0003 | Recommended | 2.47 | 101.54 | 250.95 | 43.08 | -| 5 | query16 | application_1664888311321_0005 | Recommended | 2.36 | 106.33 | 251.95 | 40.63 | -| 6 | query38 | application_1664888311321_0004 | Recommended | 2.29 | 67.37 | 154.33 | 38.59 | -| 7 | query87 | application_1664888311321_0006 | Recommended | 2.25 | 75.67 | 170.69 | 37.64 | -| 8 | query51 | application_1664888311321_0002 | Recommended | 1.53 | 53.94 | 82.63 | 8.18 | -+----+------------+--------------------------------+----------------------+-----------------+-----------------+---------------+-----------------+ -``` - -## Running the Qualification tool standalone on Spark event logs - -### Prerequisites -- Java 8 or above, Spark 3.0.1+ jars. -- Spark event log(s) from Spark 2.0 or above version. Supports both rolled and compressed event logs - with `.lz4`, `.lzf`, `.snappy` and `.zstd` suffixes as well as Databricks-specific rolled and compressed(.gz) event logs. -- The tool does not support nested directories. - Event log files or event log directories should be at the top level when specifying a directory. - -Note: Spark event logs can be downloaded from Spark UI using a "Download" button on the right side, -or can be found in the location specified by `spark.eventLog.dir`. See the -[Apache Spark Monitoring](http://spark.apache.org/docs/latest/monitoring.html) documentation for -more information. - -### Step 1 Download the tools jar and Apache Spark 3 Distribution - -The Qualification tool require the Spark 3.x jars to be able to run but do not need an Apache Spark run time. -If you do not already have Spark 3.x installed, you can download the Spark distribution to -any machine and include the jars in the classpath. -- Download the latest jar from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/) -- [Download Apache Spark 3.x](http://spark.apache.org/downloads.html) - Spark 3.1.1 for Apache Hadoop is recommended - -### Step 2 Run the Qualification tool - -1. The Qualification tool reads the log files and process them in-memory. So the heap memory should be increased when - processing large volume of events. It is recommended to pass VM options `-Xmx10g` and adjust according to the - number-of-apps / size-of-logs being processed. 
- ``` - export QUALIFICATION_HEAP=-Xmx10g - ``` - -2. Event logs stored on a local machine: - - Extract the Spark distribution into a local directory if necessary. - - Either set SPARK_HOME to point to that directory or just put the path inside of the classpath - `java -cp toolsJar:pathToSparkJars/*:...` when you run the Qualification tool. - - This tool parses the Spark CPU event log(s) and creates an output report. Acceptable inputs are either individual or - multiple event logs files or directories containing spark event logs in the local filesystem, HDFS, S3 or mixed. - - ```bash - Usage: java ${QUALIFICATION_HEAP} \ - -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] - - ``` - - ```bash - Sample: java ${QUALIFICATION_HEAP} \ - -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain /usr/logs/app-name1 - ``` - -3. Event logs stored on an on-premises HDFS cluster: - - Example running on files in HDFS: (include `$HADOOP_CONF_DIR` in classpath) - - ```bash - Usage: java ${QUALIFICATION_HEAP} \ - -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir - ``` - - Note, on an HDFS cluster, the default filesystem is likely HDFS for both the input and output - so if you want to point to the local filesystem be sure to include file: in the path. - -### Qualification tool options - Note: `--help` should be before the trailing event logs. - -```bash -java -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain --help - -RAPIDS Accelerator Qualification tool for Apache Spark - -Usage: java -cp rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/* - com.nvidia.spark.rapids.tool.qualification.QualificationMain [options] - - - --all Apply multiple event log filtering criteria - and process only logs for which all - conditions are satisfied.Example: - --all -> result is - AND AND . - Default is all=true - --any Apply multiple event log filtering criteria - and process only logs for which any condition - is satisfied.Example: - --any -> result is OR - OR - -a, --application-name Filter event logs by application name. The - string specified can be a regular expression, - substring, or exact match. For filtering - based on complement of application name, use - ~APPLICATION_NAME. i.e Select all event logs - except the ones which have application name - as the input string. - -f, --filter-criteria Filter newest or oldest N eventlogs based on - application start timestamp, unique - application name or filesystem timestamp. 
- Filesystem based filtering happens before any - application based filtering.For application - based filtering, the order in which filters - areapplied is: application-name, - start-app-time, filter-criteria.Application - based filter-criteria are:100-newest (for - processing newest 100 event logs based on - timestamp insidethe eventlog) i.e application - start time) 100-oldest (for processing - oldest 100 event logs based on timestamp - insidethe eventlog) i.e application start - time) 100-newest-per-app-name (select at - most 100 newest log files for each unique - application name) 100-oldest-per-app-name - (select at most 100 oldest log files for each - unique application name)Filesystem based - filter criteria are:100-newest-filesystem - (for processing newest 100 event logs based - on filesystem timestamp). - 100-oldest-filesystem (for processing oldest - 100 event logsbased on filesystem timestamp). - -h, --html-report Default is to generate an HTML report. - --no-html-report Disables generating the HTML report. - -m, --match-event-logs Filter event logs whose filenames contain the - input string. Filesystem based filtering - happens before any application based - filtering. - --max-sql-desc-length Maximum length of the SQL description - string output with the per sql output. - Default is 100. - --ml-functions Report if there are any SparkML or Spark XGBoost - functions in the eventlog. - -n, --num-output-rows Number of output rows in the summary report. - Default is 1000. - --num-threads Number of thread to use for parallel - processing. The default is the number of - cores on host divided by 4. - --order Specify the sort order of the report. desc or - asc, desc is the default. desc (descending) - would report applications most likely to be - accelerated at the top and asc (ascending) - would show the least likely to be accelerated - at the top. - -o, --output-directory Base output directory. Default is current - directory for the default filesystem. The - final output will go into a subdirectory - called rapids_4_spark_qualification_output. - It will overwrite any existing directory with - the same name. - -p, --per-sql Report at the individual SQL query level. - --platform Cluster platform where Spark CPU workloads were - executed. Options include onprem, dataproc-t4, - dataproc-l4, emr, databricks-aws, and - databricks-azure. - Default is onprem. - -r, --report-read-schema Whether to output the read formats and - datatypes to the CSV file. This can be very - long. Default is false. - --spark-property ... Filter applications based on certain Spark - properties that were set during launch of the - application. It can filter based on key:value - pair or just based on keys. Multiple configs - can be provided where the filtering is done - if any of theconfig is present in the - eventlog. filter on specific configuration: - --spark-property=spark.eventLog.enabled:truefilter - all eventlogs which has config: - --spark-property=spark.driver.portMultiple - configs: - --spark-property=spark.eventLog.enabled:true - --spark-property=spark.driver.port - -s, --start-app-time Filter event logs whose application start - occurred within the past specified time - period. Valid time periods are - min(minute),h(hours),d(days),w(weeks),m(months). - If a period is not specified it defaults to - days. - -t, --timeout Maximum time in seconds to wait for the event - logs to be processed. Default is 24 hours - (86400 seconds) and must be greater than 3 - seconds. 
If it times out, it will report what - it was able to process up until the timeout. - -u, --user-name Applications which a particular user has - submitted. - --help Show help message - - trailing arguments: - eventlog (required) Event log filenames(space separated) or directories - containing event logs. eg: s3a:///eventlog1 - /path/to/eventlog2 -``` - -Example commands: -- Process the 10 newest logs, and only output the top 3 in the output: - -```bash -java ${QUALIFICATION_HEAP} \ - -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir -``` - -- Process last 100 days' logs: - -```bash -java ${QUALIFICATION_HEAP} \ - -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir -``` - -- Process only the newest log with the same application name: - -```bash -java ${QUALIFICATION_HEAP} \ - -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir -``` - -- Parse ML functions from the eventlog: - -```bash -java ${QUALIFICATION_HEAP} \ - -cp ~/rapids-4-spark-tools_2.12-.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \ - com.nvidia.spark.rapids.tool.qualification.QualificationMain --ml-functions /eventlogDir -``` - -Note: the “regular expression” used by `-a` option is based on -[java.util.regex.Pattern](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). - -### The Qualification tool output - -After the above command is executed, the summary report goes to STDOUT and by default it outputs -log/CSV files under `./rapids_4_spark_qualification_output/` that contain the processed applications. -The output will go into your default filesystem and it supports both local filesystem and HDFS. -Note that if you are on an HDFS cluster the default filesystem is likely HDFS for both the input and output. -If you want to point to the local filesystem be sure to include `file:` in the path. - -The Qualification tool generates a brief summary on the STDOUT, which also gets saved as a text file. -The detailed report of the processed apps is saved as a set of CSV files that can be used for post-processing. -The CSV reports include the estimated performance if the app is run on the GPU for each of the following: -_app execution_; _stages_; and _execs_. - -Starting with release "_22.06_", the default is to generate the report into two different formats: -text files; and HTML. 
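-
-As one concrete illustration, the sketch below runs the tool against event logs stored in HDFS while sending the
-report to the local filesystem using the `--per-sql` and `--output-directory` options described above. The event
-log path, the output path, and the jar version are assumptions for illustration only:
-
-```bash
-# Sketch only: adjust the classpath, event log location and output location for your environment.
-java ${QUALIFICATION_HEAP} \
-  -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
-  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
-  --per-sql \
-  --output-directory file:/tmp/qual-report \
-  hdfs:///spark-events
-```
-
-The summary still goes to STDOUT, and the generated files land under
-`file:/tmp/qual-report/rapids_4_spark_qualification_output/`.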
- -The tree structure of the output directory `${OUTPUT_FOLDER}/rapids_4_spark_qualification_output` is as follows: - -```bash - rapids_4_spark_qualification_output - ├── rapids_4_spark_qualification_output.csv - ├── rapids_4_spark_qualification_output.log - ├── rapids_4_spark_qualification_output_persql.log - ├── rapids_4_spark_qualification_output_persql.csv - ├── rapids_4_spark_qualification_output_execs.csv - ├── rapids_4_spark_qualification_output_stages.csv - ├── rapids_4_spark_qualification_output_mlfunctions.csv - ├── rapids_4_spark_qualification_output_mlfunctions_totalduration.csv - └── ui - ├── assets - │   ├── bootstrap/ - │   ├── datatables/ - │   ├── jquery/ - │   ├── mustache-js/ - │   └── spur/ - ├── css - │   └── rapids-dashboard.css - ├── html - │   ├── application.html - │   ├── index.html - │   ├── raw.html - │   └── sql-recommendation.html - └── js - ├── app-report.js - ├── data-output.js - ├── per-sql-report.js - ├── qual-report.js - ├── raw-report.js - ├── ui-config.js - └── uiutils.js -``` - -For information on the files content and processing the Qualification report and the recommendation, please refer -to [Understanding the Qualification tool output](#understanding-the-qualification-tool-output) and -[Output Formats](#output-formats) sections below. - -## Running using a Spark Listener - -We provide a Spark Listener that can be installed at application start that will produce output -for each SQL queries in the running application and indicate if that query is a good fit to try -with the Rapids Accelerator for Spark. - -### Prerequisites -- Java 8 or above, Spark 3.0.1+ - -### Download the tools jar -- Download the latest jar from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/) - -### Configuration - -Add the RunningQualificationEventProcess to the spark listeners configuration: -`spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor` - -The user should specify the output directory if they want the output to go to separate -files, otherwise it will go to the Spark driver log. If the output directory is specified, it outputs -two different files, one csv and one pretty printed log file. The output directory can be a local directory -or point to a distributed file system or blobstore like S3. - - `spark.rapids.qualification.outputDir` - -By default, this will output results for 10 SQL queries per file and will -keep 100 files. This behavior is because many blob stores don't show files until -they are fully written so you wouldn't be able to see the results for a running -application until it finishes the number of SQL queries per file. This behavior -can be configured with the following configs. - - `spark.rapids.qualification.output.numSQLQueriesPerFile` - default 10 - - `spark.rapids.qualification.output.maxNumFiles` - default 100 - -### Run the Spark application - -Run the application and include the tools jar, `spark.extraListeners` config and optionally the other -configs to control the tools behavior. 
-
-For example:
-
-```bash
-$SPARK_HOME/bin/spark-shell \
---jars rapids-4-spark-tools_2.12-.jar \
---conf spark.extraListeners=org.apache.spark.sql.rapids.tool.qualification.RunningQualificationEventProcessor \
---conf spark.rapids.qualification.outputDir=/tmp/qualPerSqlOutput \
---conf spark.rapids.qualification.output.numSQLQueriesPerFile=5 \
---conf spark.rapids.qualification.output.maxNumFiles=10
-```
-
-After running some SQL queries you can look in the output directory and see files like:
-
-```
-rapids_4_spark_qualification_output_persql_0.csv
-rapids_4_spark_qualification_output_persql_0.log
-rapids_4_spark_qualification_output_persql_1.csv
-rapids_4_spark_qualification_output_persql_1.log
-rapids_4_spark_qualification_output_persql_2.csv
-rapids_4_spark_qualification_output_persql_2.log
-```
-
-See the [Understanding the Qualification tool output](#understanding-the-qualification-tool-output)
-section for details on the file contents.
-
-## Running the Qualification tool inside a running Spark application using the API
-
-### Prerequisites
-- Java 8 or above, Spark 3.0.1+
-
-### Download the tools jar
-- Download the latest jar from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/)
-
-### Modify your application code to call the APIs
-
-Currently only Scala APIs are supported. Note that this path does not currently support reporting at the per-SQL level.
-That can be approximated manually by wrapping individual queries and reporting around them instead of the entire application.
-
-Create the `RunningQualificationApp`:
-```
-val qualApp = new com.nvidia.spark.rapids.tool.qualification.RunningQualificationApp()
-```
-
-Get the event listener from it and install it as a Spark listener:
-```
-val listener = qualApp.getEventListener
-spark.sparkContext.addSparkListener(listener)
-```
-
-Run your queries and then get the summary or detailed output to see the results.
-
-The summary output API:
-```
-/**
- * Get the summary report for qualification.
- * @param delimiter The delimiter separating fields of the summary report.
- * @param prettyPrint Whether to include the separator at the start and end and
- * add spacing so the data rows align with the column headings.
- * @return String containing the summary report.
- */
-getSummary(delimiter: String = "|", prettyPrint: Boolean = true): String
-```
-
-The detailed output API:
-```
-/**
- * Get the detailed report for qualification.
- * @param delimiter The delimiter separating fields of the detailed report.
- * @param prettyPrint Whether to include the separator at the start and end and
- * add spacing so the data rows align with the column headings.
- * @return String containing the detailed report.
- */
-getDetailed(delimiter: String = "|", prettyPrint: Boolean = true, reportReadSchema: Boolean = false): String
-```
-
-Example:
-```
-// run your sql queries ...
-
-// To get the summary output:
-val summaryOutput = qualApp.getSummary()
-
-// To get the detailed output:
-val detailedOutput = qualApp.getDetailed()
-
-// print the output somewhere for the user to see
-println(summaryOutput)
-println(detailedOutput)
-```
-
-If you need to specify the tools jar as a maven dependency to compile the Spark application:
-```xml
-<dependency>
-   <groupId>com.nvidia</groupId>
-   <artifactId>rapids-4-spark-tools_2.12</artifactId>
-   <version>${version}</version>
-</dependency>
-```
-
-### Run the Spark application
-- Run your Spark application and include the tools jar you downloaded with the Spark `--jars` option and
-view the output wherever you had it printed.
- -For example, if running the spark-shell: -``` -$SPARK_HOME/bin/spark-shell --jars rapids-4-spark-tools_2.12-.jar -``` - -## Understanding the Qualification tool output - -For each processed Spark application, the Qualification tool generates two main fields to help quantify the expected -acceleration of migrating a Spark application or query to GPU. - -1. `Estimated GPU Duration`: predicted runtime of the app if it was run on GPU. It is the sum of the accelerated - operator durations and ML functions duration(if applicable) along with durations that could not run on GPU because - they are unsupported operators or not SQL/Dataframe. -2. `Estimated Speed-up`: the estimated speed-up is simply the original CPU duration of the app divided by the - estimated GPU duration. That will estimate how much faster the application would run on GPU. - -The lower the estimated GPU duration, the higher the "_Estimated Speed-up_". -The processed applications or queries are ranked by the "_Estimated Speed-up_". Based on how high the estimated speed-up, -the tool classifies the applications into the following different categories: - -- `Strongly Recommended` -- `Recommended` -- `Not Recommended` -- `Not Applicable`: indicates that the app has job or stage failures. - -As mentioned before, the tool does not guarantee the applications or queries with the highest _recommendation_ will actually be -accelerated the most. Please refer to [Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md) section. - -In addition to the _recommendation_, the Qualification tool reports a set of metrics in tasks of SQL Dataframe operations -within the scope of: "_Entire App_"; "_Stages_"; and "_Execs_". The report is divided into three main levels. The fields -of each level are described in details in the following sections: [Detailed App Report](#detailed-app-report), -[Stages report](#stages-report), and [Execs report](#execs-report). Then we describe the output formats and their file -locations in [Output Formats](#output-formats) section. - -There is an option to print a report at the SQL query level in addition to the application level. - -### Detailed App report - -The report represents the entire app execution, including unsupported operators and non-SQL operations. - -1. _App Name_ -2. _App ID_ -3. _Recommendation_: recommendation based on `Estimated Speed-up Factor`, where - an app can be "_Strongly Recommended_", "_Recommended_", "_Not Recommended_", - or "_Not Applicable_". The latter indicates that the app has job or stage failures. -4. _App Duration_: wall-Clock time measured since the application starts till it is completed. - If an app is not completed an estimated completion time would be computed. -5. _SQL DF duration_: wall-Clock time duration that includes only SQL-Dataframe queries. -6. _GPU Opportunity_: wall-Clock time that shows how much of the SQL duration and ML functions(if applicable) can be accelerated on the GPU. -7. _Estimated GPU Duration_: predicted runtime of the app if it was run on GPU. It is the sum of the accelerated - operator durations and ML functions durations(if applicable) along with durations that could not run on GPU because they are unsupported operators or not SQL/Dataframe. -8. _Estimated GPU Speed-up_: the speed-up is simply the original CPU duration of the app divided by the - estimated GPU duration. That will estimate how much faster the application would run on GPU. -9. 
_Estimated GPU Time Saved_: estimated wall-Clock time saved if it was run on the GPU. -10. _SQL Dataframe Task Duration_: amount of time spent in tasks of SQL Dataframe operations. -11. _Executor CPU Time Percent_: this is an estimate at how much time the tasks spent doing processing on the CPU vs waiting on IO. - This is not always a good indicator because sometimes the IO that is encrypted and the CPU has to do work to decrypt it, - so the environment you are running on needs to be taken into account. -12. _SQL Ids with Failures_: SQL Ids of queries with failed jobs. -13. _Unsupported Read File Formats and Types_: looks at the Read Schema and - reports the file formats along with types which may not be fully supported. - Example: `JDBC[*]`. Note that this is based on the current version of the plugin and - future versions may add support for more file formats and types. -14. _Unsupported Write Data Format_: reports the data format which we currently don’t support, i.e. - if the result is written in JSON or CSV format. -15. _Complex Types_: looks at the Read Schema and reports if there are any complex types(array, struct or maps) in the schema. -16. _Nested Complex Types_: nested complex types are complex types which - contain other complex types (Example: `array>`). - Note that it can read all the schemas for DataSource V1. The Data Source V2 truncates the schema, - so if you see "`...`", then the full schema is not available. - For such schemas we read until `...` and report if there are any complex types and nested complex types in that. -17. _Potential Problems_: some UDFs and nested complex types. Please keep in mind that the tool is only able to detect certain issues. -18. _Longest SQL Duration_: the maximum amount of time spent in a single task of SQL Dataframe operations. -19. _NONSQL Task Duration Plus Overhead_: Time duration that does not span any running SQL task. -20. _Unsupported Task Duration_: sum of task durations for any unsupported operators. -21. _Supported SQL DF Task Duration_: sum of task durations that are supported by RAPIDS GPU acceleration. -22. _Task Speedup Factor_: the average speed-up of all stages. -23. _App Duration Estimated_: True or False indicates if we had to estimate the application duration. - If we had to estimate it, the value will be `True` and it means the event log was missing the application finished - event, so we will use the last job or sql execution time we find as the end time used to calculate the duration. -24. _Unsupported Execs_: reports all the execs that are not supported by GPU in this application. Note that an Exec name may be - printed in this column if any of the expressions within this Exec is not supported by GPU. If the resultant string - exceeds maximum limit (25), then ... is suffixed to the STDOUT and full output can be found in the CSV file. -25. _Unsupported Expressions_: reports all expressions not supported by GPU in this application. -26. _Read Schema_: shows the datatypes and read formats. This field is only listed when the argument `--report-read-schema` - is passed to the CLI. -27. _Estimated Frequency_: application executions per month assuming uniform distribution, default frequency is daily (30 times per month) - and minimum frequency is monthly (1 time per month). For a given log set, determines a logging window using the earliest start time - and last end time of all logged applications. 
Counts the number of executions of a specific `App Name` over the logging window - and converts the frequency to per month (30 days). Applications that are only ran once are assigned the default frequency. - -**Note:** the Qualification tool won't catch all UDFs, and some of the UDFs can be handled with additional steps. -Please refer to [Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md) for more details on UDF. - -By default, the applications and queries are sorted in descending order by the following fields: -- _Recommendation_; -- _Estimated GPU Speed-up_; -- _Estimated GPU Time Saved_; and -- _End Time_. - -### Stages report - -For each stage used in SQL operations, the Qualification tool generates the following information: - -1. _App ID_ -2. _Stage ID_ -3. _Average Speedup Factor_: the average estimated speed-up of all the operators in the given stage. -4. _Stage Task Duration_: amount of time spent in tasks of SQL Dataframe operations for the given stage. -5. _Unsupported Task Duration_: sum of task durations for the unsupported operators. For more details, - see [Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md). -6. _Stage Estimated_: True or False indicates if we had to estimate the stage duration. - -### Execs report - -The Qualification tool generates a report of the "Exec" in the "_SparkPlan_" or "_Executor Nodes_" along with the estimated -acceleration on the GPU. Please refer to the [Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md) guide for more -details on limitations on UDFs and unsupported operators. - -1. _App ID_ -2. _SQL ID_ -3. _Exec Name_: example `Filter`, `HashAggregate` -4. _Expression Name_ -5. _Task Speedup Factor_: it is simply the average acceleration of the operators - based on the original CPU duration of the operator divided by the GPU duration. The tool uses historical queries and benchmarks to estimate a speed-up at - an individual operator level to calculate how much a specific operator would accelerate on GPU. -6. _Exec Duration_: wall-Clock time measured since the operator starts till it is completed. -7. _SQL Node Id_ -8. _Exec Is Supported_: whether the Exec is supported by RAPIDS or not. Please refer to the - [Supported Operators](https://github.com/NVIDIA/spark-rapids/blob/main/docs/supported_ops.md) section. -9. _Exec Stages_: an array of stage IDs -10. _Exec Children_ -11. _Exec Children Node Ids_ -12. _Exec Should Remove_: whether the Op is removed from the migrated plan. - -**Parsing Expressions within each Exec** - -The Qualification tool looks at the expressions in each _Exec_ to provide a fine-grained assessment of -RAPIDS' support. -Note that it is not possible to extract the expressions for each available _Exec_: -- some Execs do not take any expressions, and -- some execs may not show the expressions in the _eventlog_. - -The following table lists the exec's name and the status of parsing their expressions where: -- "_Expressions Unavailable_" marks the _Execs_ that do not show expressions in the _eventlog_; -- "_Fully Parsed_" marks the _Execs_ that have their expressions fully parsed by the Qualification tool; -- "_In Progress_" marks the _Execs_ that are still being investigated; therefore, a set of the - marked _Execs_ may be fully parsed in future releases. 
- -| **Exec** | **Expressions Unavailable** | **Fully Parsed** | **In Progress** | -|---------------------------------------|:---------------------------:|:----------------:|:---------------:| -| AggregateInPandasExec | - | - | x | -| AQEShuffleReadExec | - | - | x | -| ArrowEvalPythonExec | - | - | x | -| BatchScanExec | - | - | x | -| BroadcastExchangeExec | - | - | x | -| BroadcastHashJoinExec | - | - | x | -| BroadcastNestedLoopJoinExec | - | - | x | -| CartesianProductExec | - | - | x | -| CoalesceExec | - | - | x | -| CollectLimitExec | x | - | - | -| CreateDataSourceTableAsSelectCommand | - | - | x | -| CustomShuffleReaderExec | - | - | x | -| DataWritingCommandExec | - | - | x | -| ExpandExec | - | - | x | -| FileSourceScanExec | - | - | x | -| FilterExec | - | x | - | -| FlatMapGroupsInPandasExec | - | - | x | -| GenerateExec | - | - | x | -| GlobalLimitExec | x | - | - | -| HashAggregateExec | - | x | - | -| InMemoryTableScanExec | - | - | x | -| InsertIntoHadoopFsRelationCommand | - | - | x | -| LocalLimitExec | x | - | - | -| MapInPandasExec | - | - | x | -| ObjectHashAggregateExec | - | x | - | -| ProjectExec | - | x | - | -| RangeExec | x | - | - | -| SampleExec | - | - | x | -| ShuffledHashJoinExec | - | - | x | -| ShuffleExchangeExec | - | - | x | -| SortAggregateExec | - | x | - | -| SortExec | - | x | - | -| SortMergeJoinExec | - | - | x | -| SubqueryBroadcastExec | - | - | x | -| TakeOrderedAndProjectExec | - | - | x | -| UnionExec | x | - | - | -| WindowExec | - | x | - | -| WindowInPandasExec | - | - | x | - -### MLFunctions report -The Qualification tool generates a report if there are SparkML or Spark XGBoost functions used in the eventlog. -The functions in "*spark.ml.*" or "*spark.XGBoost.*" packages are displayed in the report. - -1. _App ID_ -2. _Stage ID_ -3. _ML Functions_: List of ML functions used in the corresponding stage. -4. _Stage Task Duration_: amount of time spent in tasks containing ML functions for the given stage. - -### MLFunctions total duration report -The Qualification tool generates a report of total duration across all stages for ML functions which -are supported on GPU. - -1. _App ID_ -2. _Stage_Ids : Stage Id's corresponding to the given ML function. -3. _ML Function Name_: ML function name supported on GPU. -4. _Total Duration_: total duration across all stages for the corresponding ML function. - -## Output Formats - -The Qualification tool generates the output as CSV/log files. Starting from "_22.06_", the default -is to generate the report into two different formats: CSV/log files; and HTML. - -### HTML Report - -Starting with release _"22.06"_, the HTML report is generated by default under the output directory -`${OUTPUT_FOLDER}/rapids_4_spark_qualification_output/ui`. -The HTML report is disabled by passing `--no-html-report` as described in the -[Qualification tool options](#Qualification-tool-options) section above. -To browse the content of the html report: - -1. For HDFS or remote node, copy the directory of `${OUTPUT_FOLDER}/rapids_4_spark_qualification_output/ui` to your local node. -2. Open `rapids_4_spark_qualification_output/ui/index.html` in your local machine's web-browser (Chrome/Firefox are recommended). - -The HTML view renders the detailed information into tables that allow following features: - -- searching -- ordering by specific column -- exporting table into CSV file -- interactive filter by recommendations and/or user-name. 
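-
-For example, if the output directory lives on HDFS, one way to fetch the UI and browse it locally is sketched
-below. The HDFS path, local directory, and port are assumptions, and any static file server (or simply opening
-`index.html` from disk) works just as well:
-
-```bash
-# Illustrative only: adjust the HDFS path and local directory for your environment.
-hdfs dfs -get /user/$(whoami)/rapids_4_spark_qualification_output/ui ./qual-ui
-# Serve the report and open http://localhost:8000/html/index.html in a web browser.
-python3 -m http.server 8000 --directory ./qual-ui
-```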
- -By default, all tables show 20 entries per page, which can be changed by selecting a different page-size in the table's navigation bar. - -The following sections describe the HTML views. - -#### Application Recommendations Summary - -`index.html` shows the summary of the estimated GPU performance. The "_GPU Recommendations Table_" -lists the processed applications ranked by the "_Estimated GPU Speed-up_" along with the ability to search, and filter -the results. By clicking the "_App ID_" link of a specific app, you navigate to the details view of that app which is -described in [App-Details View](#app-details-view) section. - -The summary report contains the following components: - -1. **Stats-Row**: statistics card summarizing the following information: - 1. "_Total Applications_": total number of applications analyzed by the Qualification tool and the total execution - time. - 2. "_RAPIDS Candidates_": marks the number applications that are either "_Recommended_", or "_Strongly Recommended_". - 3. "_GPU Opportunity_": shows the total of "_GPU Opportunity_" and "_SQL DF duration_" fields across all the apps. -2. **GPU Recommendations Table**: this table lists all the analyzed applications along with subset of fields that are - directly involved in calculating the GPU performance estimate. Each row expands showing more fields by clicking on - the control column. -3. The _searchPanes_ with the capability to search the app list by selecting rows in the panes. - The "_Recommendations_" and "_Spark User_" filters are cascaded which allows the panes to be filtered based on the - values selected in the other pane. -4. Text Search field that allows further filtering, removing data from the result set as keywords are entered. The - search box will match on multiple columns including: "_App ID_", "_App Name_", "_Recommendation_" -5. HTML5 export button saves the table to CSV file into the browser's default download folder. -6. The `Raw Data` link in the left navigation bar redirects to a detailed report. -7. The `Per-SQL Data` link in the left navigation bar redirects to a summary report that shows - the _per-SQL_ estimated GPU performance. - -![Qualification-HTML-Recommendation-View](img/Tools/qualification-tool-recommendation-indexview-with-persql.png) - -#### App-Details View - -When you click the "_App ID_" of a specific row in the "_GPU Recommendations Table_", the browser navigates to -this view which shows the metrics and estimated GPU performance for the given application. -It contains the following main components: - -1. **Card title**: contains the application name and the Recommendation. -2. **Stats-Row**: statistics card summarizing the following information: - 1. "_App Duration_": the total execution time of the app, marking the start and end time. - 2. "_GPU Opportunity_": the wall-Clock time that shows how much of the SQL duration can be accelerated on the GPU. It - shows the actual wall-Clock time duration that includes only SQL-Dataframe queries including non-supported ops, - dubbed "_SQL DF Duration_". This is followed by "_Task Speed-up Factor_" which represents the average speed-up - of all app stages. - 3. "_Estimated GPU Duration_": the predicted runtime of the app if it was run on GPU. For convenience, it calculates - the estimated wall-clock time difference between the CPU and GPU executions. The original CPU duration of the app - divided by the estimated GPU duration and displayed as "_App Speed-up_". -3. 
**Application Details**: this table lists all the fields described previously in - the [Detailed App report](#detailed-app-report) section. Note that this table has more columns than can fit in a - normal browser window. Therefore, the UI - application dynamically optimizes the layout of the table to fit the browser screen. By clicking on the control - column, the row expands to show the remaining hidden columns. - ![Qualification-HTML-App-Details-View-Header](img/Tools/qualification-tool-app-view-01.png) -4. **Stage Details Table**: lists all the app stages with set of columns listed in [Stages report](#stages-report) - section. The HTML5 export button saves the table to CSV file into the browser's default download folder. - ![Qualification-HTML-App-Details-View-Stages](img/Tools/qualification-tool-app-view-02.png) - The table has cascaded _searchPanes_, which means that the table allows the panes - to be filtered based on the values selected in the other panes. - There are three searchPanes: - 1. "_Is Stage Estimated_": it splits the stages into two groups based on whether the stage duration time was estimated - or not. - 2. "_Speed-up_": groups the stages by their "average speed-up". Each stage can belong to one of the following - predefined speed-up ranges: `1.0 (No Speed-up)`; `]1.0, 1.3[`; `[1.3, 2.5[`; `[2.5, 5[`; and `[5, _]`. The - search-pane does not show a range bucket if its count is 0. - 3. "_Tasks GPU Support_": this filter can be used to find stages having all their execs supported by the GPU. -5. **Execs Details Table**: lists all the app Execs with set of columns listed in [Execs report](#execs-report) - section. The HTML5 export button saves the table to CSV file into the browser's default - download folder. - ![Qualification-HTML-App-Details-View-Execs](img/Tools/qualification-tool-app-view-03.png) - The table has cascaded _searchPanes_, which means that the table allows the panes - to be filtered based on the values selected in the other panes. - There are three _searchPanes_: - 1. "_Exec_": filters the rows by exec name. This filter also allows text searching by typing into the filter-title as - a text input. - 2. "_Speed-up_": groups the stages by their "average speed-up". Each stage can belong to one of the following - predefined speed-up ranges: `1.0 (No Speed-up)`; `]1.0, 1.3[`; `[1.3, 2.5[`; `[2.5, 5[`; and `[5, _]`. The - search-pane does not show a range bucket if its count is 0. - 3. "_GPU Support_": filters the execs whether an exec is supported by GPU or not. - 4. "_Stage ID_": filters rows by the stage ID. It also allows text-searching by typing into the filter-title as a text - input. - 5. "_Is Exec Removed_": filters rows that were removed from the migrated plan. - 6. **SQL Details Table**: lists _Per-SQL_ GPU recommendation. The HTML5 export button saves the table to CSV file into - the browser's default download folder. The rows in the table can be filtered by "_SQL Description_", "_SQL ID_", - or "_Recommendation_". - -#### Raw Data - -`raw.html` displays all the fields listed in "_Detailed App Report_" in more readable format. -Columns representing "_time duration_" are rounded to nearest "ms", "seconds", "minutes", and "hours". -The search box will match on multiple columns including: "_App ID_", "_App Name_", "_Recommendation_", -"_User Name_", "_Unsupported Write Data Format_", "_Complex Types_", "_Nested Complex Types_", and "_Read Schema_". -The detailed table can also be exported as a CSV file into the browser's default download folder. 
- -Note that this table has more columns than can fit in a normal browser window. Therefore, the UI application dynamically -optimizes the layout of the table -to fit the browser screen. By clicking on the control column, the row expands to show the remaining hidden columns. - -#### Per-SQL Data - -`sql-recommendation.html` displays a summary of the estimate GPU performance for each query. Note that the -SQL queries across all the apps are combined in a single view; therefore, the "_SQL ID_" field may not be -unique. - -### Text and CSV files - -The Qualification tool generates a set of log/CSV files in the output folder -`${OUTPUT_FOLDER}/rapids_4_spark_qualification_output`. The content of each -file is summarized in the following two sections. - -#### Application Report Summary - -The Qualification tool generates a brief summary that includes the projected application's performance -if the application is run on the GPU. Beside sending the summary to `STDOUT`, the Qualification tool -generates _text_ as `rapids_4_spark_qualification_output.log` - -The summary report outputs the following information: "_App Name_", "_App ID_", "_App Duration_", "_SQL DF duration_", -"_GPU Opportunity_", "_Estimated GPU Duration_", "_Estimated GPU Speed-up_", "_Estimated GPU Time Saved_", and -"_Recommendation_". - -Note: the duration(s) reported are in milliseconds. -Sample output in text: - -``` -+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+-------------------------------------------------------+ -| App Name | App ID | App | SQL DF | GPU | Estimated | Estimated | Estimated | Recommendation | Unsupported Execs |Unsupported Expressions| -| | | Duration | Duration | Opportunity | GPU | GPU | GPU | | | | -| | | | | | Duration | Speedup | Time | | | | -| | | | | | | | Saved | | | | -+============+==============+==========+==========+=============+===========+===========+===========+====================+=======================================================+ -| appName-01 | app-ID-01-01 | 898429| 879422| 879422| 273911.92| 3.27| 624517.06|Strongly Recommended| | | -+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+-------------------------------------------------------+ -| appName-02 | app-ID-02-01 | 9684| 1353| 1353| 8890.09| 1.08| 793.9| Not Recommended|Filter;SerializeFromObject;S...| hex | -+------------+--------------+----------+----------+-------------+-----------+-----------+-----------+--------------------+-------------------------------------------------------+ -``` - -In the above example, two application event logs were analyzed. “app-ID-01-01” is "_Strongly Recommended_" -because `Estimated GPU Speedup` is ~3.27. On the other hand, the estimated acceleration running -“app-ID-02-01” on the GPU is not high enough; hence the app is not recommended. - -#### Per SQL Query Report Summary - -The Qualification tool has an option to generate a report at the per SQL query level. It generates a brief summary -that includes the projected queries performance if the query is run on the GPU. 
Beside sending the summary to `STDOUT`, -the Qualification tool generates _text_ as `rapids_4_spark_qualification_output_persql.log` - -The summary report outputs the following information: "_App Name_", "_App ID_", "_SQL ID_", "_SQL Description_", "_SQL DF duration_", -"_GPU Opportunity_", "_Estimated GPU Duration_", "_Estimated GPU Speed-up_", "_Estimated GPU Time Saved_", and -"_Recommendation_". - -Note: the duration(s) reported are in milliseconds. -Sample output in text: - -``` -+------------+--------------+----------+---------------+----------+-------------+-----------+-----------+-----------+--------------------+ -| App Name | App ID | SQL ID | SQL | SQL DF | GPU | Estimated | Estimated | Estimated | Recommendation | -| | | | Description | Duration | Opportunity | GPU | GPU | GPU | | -| | | | | | | Duration | Speedup | Time | | -| | | | | | | | | Saved | | -+============+==============+==========+===============+==========+=============+===========+===========+===========+====================+ -| appName-01 | app-ID-01-01 | 1| query41| 571| 571| 187.21| 3.05| 383.78|Strongly Recommended| -+------------+--------------+----------+---------------+----------+-------------+-----------+-----------+-----------+--------------------+ -| appName-02 | app-ID-02-01 | 3| query44| 1116| 0| 1115.98| 1.0| 0.01| Not Recommended| -+------------+--------------+----------+---------------+----------+-------------+-----------+-----------+-----------+--------------------+ -``` - -#### Detailed App Report - -**1. Entire App report** - -The first part of the detailed report is saved as `rapids_4_spark_qualification_output.csv`. -The apps are processed and ranked by the `Estimated GPU Speed-up`. -In addition to the fields listed in the "_Report Summary_", it shows all the app fields. -The duration(s) are reported are in milliseconds. - -**2. Per SQL report** - -The second file is saved as `rapids_4_spark_qualification_output_persql.csv`. This contains the -per SQL query report in CSV format. - -Sample output in text: -``` -+---------------+-----------------------+------+----------------------------------------------------------+---------------+---------------+----------------------+---------------------+------------------------+--------------------+ -| App Name| App ID|SQL ID| SQL Description|SQL DF Duration|GPU Opportunity|Estimated GPU Duration|Estimated GPU Speedup|Estimated GPU Time Saved| Recommendation| -+===============+=======================+======+==========================================================+===============+===============+======================+=====================+========================+====================+ -|NDS - Power Run|app-20220702220255-0008| 103| query87| 15871| 15871| 4496.03| 3.53| 11374.96|Strongly Recommended| -|NDS - Power Run|app-20220702220255-0008| 106| query38| 11077| 11077| 3137.96| 3.53| 7939.03|Strongly Recommended| -+---------------+-----------------------+------+----------------------------------------------------------+---------------+---------------+----------------------+---------------------+------------------------+--------------------+ -``` - - -**3. Stages report** - -The third file is saved as `rapids_4_spark_qualification_output_stages.csv`. 
- -Sample output in text: -``` -+--------------+----------+-----------------+------------+---------------+-----------+ -| App ID | Stage ID | Average Speedup | Stage Task | Unsupported | Stage | -| | | Factor | Duration | Task Duration | Estimated | -+==============+==========+=================+============+===============+===========+ -| app-ID-01-01 | 25 | 2.1 | 23 | 0 | false | -+--------------+----------+-----------------+------------+---------------+-----------+ -| app-ID-02-01 | 29 | 1.86 | 0 | 0 | true | -+--------------+----------+-----------------+------------+---------------+-----------+ -``` - -**4. Execs report** - -The last file is saved `rapids_4_spark_qualification_output_execs.csv`. Similar to the app and stage information, -the table shows estimated GPU performance of the SQL Dataframe operations. - -Sample output in text: -``` -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| App ID | SQL ID | Exec Name | Expression Name | Task Speedup | Exec | SQL Node | Exec Is | Exec | Exec Children | Exec Children | Exec Should | -| | | | | Factor | Duration | Id | Supported | Stages | | Node Ids | Remove | -+==============+========+===========================+=======================+==============+==========+==========+===========+========+============================+===============+=============+ -| app-ID-02-01 | 7 | Execute CreateViewCommand | | 1.0 | 0 | 0 | false | | | | false | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| app-ID-02-01 | 24 | Project | | 2.0 | 0 | 21 | true | | | | false | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| app-ID-02-01 | 24 | Scan parquet | | 2.0 | 260 | 36 | true | 24 | | | false | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| app-ID-02-01 | 15 | Execute CreateViewCommand | | 1.0 | 0 | 0 | false | | | | false | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| app-ID-02-01 | 24 | Project | | 2.0 | 0 | 14 | true | | | | false | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -| app-ID-02-01 | 24 | WholeStageCodegen (6) | WholeStageCodegen (6) | 2.8 | 272 | 2 | true | 30 | Project:BroadcastHashJoin: | 3:4:5 | false | -| | | | | | | | | | HashAggregate | | | -+--------------+--------+---------------------------+-----------------------+--------------+----------+----------+-----------+--------+----------------------------+---------------+-------------+ -``` - -## How to compile the tools jar - -See instructions here: https://github.com/NVIDIA/spark-rapids-tools/tree/main/core#build - -If any input is a S3 file path or directory path, 2 extra steps are needed to access S3 in Spark: -1. 
Download the jars that match your Hadoop version:
-   - `hadoop-aws-<version>.jar`
-   - `aws-java-sdk-<version>.jar`
-
-   Taking Hadoop 2.7.4 as an example, download and include the following jars in the `--jars` option to spark-shell or spark-submit:
-   [hadoop-aws-2.7.4.jar](https://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar) and
-   [aws-java-sdk-1.7.4.jar](https://repo.maven.apache.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar)
-
-2. In $SPARK_HOME/conf, create `hdfs-site.xml` with the following AWS S3 keys inside:
-
-```xml
-<?xml version="1.0"?>
-<configuration>
-<property>
-  <name>fs.s3a.access.key</name>
-  <value>xxx</value>
-</property>
-<property>
-  <name>fs.s3a.secret.key</name>
-  <value>xxx</value>
-</property>
-</configuration>
-```
-
-Please refer to this [doc](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) for
-more options on integrating the hadoop-aws module with S3.
diff --git a/docs/supported_ops.md b/docs/supported_ops.md
index 4e2be930b49..4da5ea4d2a2 100644
--- a/docs/supported_ops.md
+++ b/docs/supported_ops.md
@@ -19976,4 +19976,4 @@ dates or timestamps, or for a lack of type coercion support.
 ### Apache Iceberg Support
 
 Support for Apache Iceberg has additional limitations. See the
-[Apache Iceberg Support](additional-functionality/iceberg-support.md) document.
+[Apache Iceberg Support](https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/iceberg-support.html) document.
diff --git a/docs/tuning-guide.md b/docs/tuning-guide.md
deleted file mode 100644
index 2e61e72c425..00000000000
--- a/docs/tuning-guide.md
+++ /dev/null
@@ -1,542 +0,0 @@
----
-layout: page
-title: Tuning
-nav_order: 7
----
-# RAPIDS Accelerator for Apache Spark Tuning Guide
-Tuning a Spark job's configuration settings from the defaults can often improve job performance,
-and this remains true for jobs leveraging the RAPIDS Accelerator plugin for Apache Spark. This
-document provides guidelines on how to tune a Spark job's configuration settings for improved
-performance when using the RAPIDS Accelerator plugin.
-
-## Number of Executors
-The RAPIDS Accelerator plugin only supports a one-to-one mapping between GPUs and executors.
-
-## Number of Tasks per Executor
-Running multiple, concurrent tasks per executor is supported in the same manner as standard
-Apache Spark. For example, if the cluster nodes each have 24 CPU cores and 4 GPUs then setting
-`spark.executor.cores=6` will run each executor with 6 cores and 6 concurrent tasks per executor,
-assuming the default setting of one core per task, i.e.: `spark.task.cpus=1`.
-
-Note that when Apache Spark schedules GPU resources, the GPU resource amount per task,
-controlled by `spark.task.resource.gpu.amount`, can further limit the number of concurrent tasks.
-For example, if Apache Spark is scheduling for GPUs and `spark.task.resource.gpu.amount=1` then
-only one task will run concurrently per executor since the RAPIDS Accelerator only supports
-one GPU per executor. When Apache Spark is scheduling for GPUs, set
-`spark.task.resource.gpu.amount` to the reciprocal of the desired executor task concurrency, e.g.:
-`spark.task.resource.gpu.amount=0.125` will allow up to 8 concurrent tasks per executor.
-
-It is recommended to run more than one concurrent task per executor as this allows overlapping
-I/O and computation. For example, one task can be communicating with a distributed filesystem to
-fetch an input buffer while another task is decoding an input buffer on the GPU. Configuring too
-many concurrent tasks on an executor can lead to excessive I/O, overloaded host memory, and spilling
-from GPU memory.
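-
-Putting the settings in this section together, the following sketch configures an executor with 8 cores, one
-GPU, and up to 8 concurrent tasks (the reciprocal `0.125` from the example above). The executor sizing is
-illustrative only, and the executor-level GPU amount is the standard Apache Spark resource setting rather than
-something specific to the plugin:
-
-```bash
-# Illustrative sizing only: tune for your own nodes and workload.
-# The GPU resource settings apply when Apache Spark is scheduling GPUs as a resource.
-$SPARK_HOME/bin/spark-shell \
-  --conf spark.executor.cores=8 \
-  --conf spark.task.cpus=1 \
-  --conf spark.executor.resource.gpu.amount=1 \
-  --conf spark.task.resource.gpu.amount=0.125
-```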
-Counter-intuitively leaving some CPU cores idle may actually speed up your overall job. We -typically find that two times the number of concurrent GPU tasks is a good starting point. - -The [number of concurrent tasks running on a GPU](#number-of-concurrent-tasks-per-gpu) -is configured separately. - -Be aware that even if you restrict the concurrency of GPU tasks having many tasks -per executor can result in spilling from GPU memory. Because of I/O and the fairness of the -semaphore that restricts GPU access, tasks tend to run on the GPU in a round-robin fashion. -Some algorithms, in order to be able to support processing more data than can fit in GPU memory -like sort or join, can keep part of the data cached on the GPU in between outputting batches. -If there are too many tasks this can increase the memory pressure on the GPU and result in more -spilling. - -## Pooled Memory -Configuration key: [`spark.rapids.memory.gpu.pooling.enabled`](additional-functionality/advanced_configs.md#memory.gpu.pooling.enabled) - -Default value: `true` - -Configuration key: [`spark.rapids.memory.gpu.allocFraction`](additional-functionality/advanced_configs.md#memory.gpu.allocFraction) - -Default value: `1.0` - -Allocating memory on a GPU can be an expensive operation. RAPIDS uses a pooling allocator -called [RMM](https://github.com/rapidsai/rmm) to mitigate this overhead. By default, on startup -the plugin will allocate almost `100%` (`1.0`) of the _available_ memory on the GPU and keep it as a pool -that can be allocated from. We reserve 640 MiB by default for system use such as memory needed for kernels -and kernel launches. If the pool is exhausted more memory will be allocated and added to the pool. - -Most of the time this is a huge win, but if you need to share the GPU with other -[libraries](additional-functionality/ml-integration.md) that are not aware of RMM this can lead -to memory issues, and you may need to disable pooling. - -## Pinned Memory -Configuration key: [`spark.rapids.memory.pinnedPool.size`](configs.md#memory.pinnedPool.size) - -Default value: `0` - -Pinned memory refers to memory pages that the OS will keep in system RAM and will not relocate -or swap to disk. Using pinned memory significantly improves performance of data transfers between -the GPU and host memory as the transfer can be performed asynchronously from the CPU. Pinned -memory is relatively expensive to allocate and can cause system issues if too much memory is -pinned, so by default no pinned memory is allocated. - -It is recommended to use some amount of pinned memory when using the RAPIDS Accelerator. -Ideally the amount of pinned memory allocated would be sufficient to hold the input -partitions for the number of concurrent tasks that Spark can schedule for the executor. - -Note that the specified amount of pinned memory is allocated _per executor_. For example, if -each node in the cluster has 4 GPUs and therefore 4 executors per node, a configuration setting -of `spark.rapids.memory.pinnedPool.size=4G` will allocate a total of 16 GiB of memory on the -system. - -When running on YARN, make sure to account for the extra memory consumed by setting -`spark.executor.memoryOverhead` to a value at least as large as the amount of pinned memory -allocated by each executor. Note that pageable, off-heap host memory is used once the pinned -memory pool is exhausted, and that would also need to be accounted for in the memory overhead -setting. 
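-
-For instance, a 4 GiB pinned pool with matching overhead headroom on YARN might be configured as below. The
-sizes are illustrative only and should be adjusted to the concurrency and batch sizes you actually run:
-
-```bash
-# Illustrative sizes only: the overhead must cover the pinned pool plus any other off-heap use.
-$SPARK_HOME/bin/spark-shell \
-  --conf spark.rapids.memory.pinnedPool.size=4G \
-  --conf spark.executor.memoryOverhead=6G
-```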
-
-Pinned memory can also be used to speed up spilling from GPU memory if there is more data
-than can fit in GPU memory. [The spill storage](#spill-storage) is configured separately, but
-for best performance spill storage should be taken into account when allocating pinned memory.
-For example, if you have 4 concurrent tasks per executor, each processing 1 GiB batches, you will
-want about 4 GiB of pinned memory to handle normal input/output, but if you are going to sort a
-large amount of data you might want to increase the pinned memory pool to 8 GiB and give 4 GiB to
-the spill storage, so it can use pinned memory too.
-
-## Spill Storage
-
-Configuration key: [`spark.rapids.memory.host.spillStorageSize`](configs.md#memory.host.spillStorageSize)
-
-Default value: `-1`
-
-This is the amount of host memory that is used to cache spilled data before it is flushed to disk.
-The default value of `-1` means the spill storage size is the combined size of the [pinned](configs.md#memory.pinnedPool.size)
-and [pageable](additional-functionality/advanced_configs.md#memory.host.pageablePool.size) memory pools.
-If there is no spilling, the default value for the spill storage is fine, but it is recommended to
-use a few gigabytes of pinned memory whether or not spilling occurs.
-The GPU Accelerator employs different algorithms that allow it to process more data than can fit in
-the GPU's memory. We do not support this for all operations, and are constantly trying to add more.
-The way that this can work is by spilling parts of the data to host memory or to disk, and then
-reading them back in when needed. Spilling in general is more expensive than not spilling, but
-spilling to a slow disk can be especially expensive. If you see spilling to disk happening in
-your query, and you are not using GPUDirect Storage for spilling, you may want to add more
-spill storage.
-
-You can configure this to be larger than the amount of pinned memory. This number is just an upper limit.
-If the pinned memory is used up then it allocates and uses non-pinned memory.
-
-## Locality Wait
-Configuration key:
-[`spark.locality.wait`](http://spark.apache.org/docs/latest/configuration.html#scheduling)
-
-Default value: `3s`
-
-This configuration setting controls how long Spark should wait to obtain better locality for tasks.
-If your tasks are long and see poor locality, you can increase this value. If the data sets are small
-and the cost of waiting will have less impact on the job's overall completion time, you can reduce this
-value to get higher parallelization. In a cluster with high I/O bandwidth you can set it to 0, since not
-waiting is faster when the data can be fetched across the network quickly enough.
-
-## Number of Concurrent Tasks per GPU
-Configuration key: [`spark.rapids.sql.concurrentGpuTasks`](configs.md#sql.concurrentGpuTasks)
-
-Default value: `2`
-
-The RAPIDS Accelerator can further limit the number of tasks that are actively sharing the GPU.
-It does this using a semaphore. When metrics or documentation refers to the GPU semaphore it
-is referring to this. This restriction is useful for avoiding GPU out of memory errors while
-still allowing full concurrency for the portions of the job that are not executing on the GPU.
-Care is taken to avoid doing I/O or other CPU operations while the GPU semaphore is held, but in
-the case of a join, two batches are required for processing and it is not always possible to avoid this.
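-
-Like most plugin settings this can be set per job, for example (the value shown is just the default; see the
-guidance below for choosing one):
-
-```bash
-# Sketch: raise or lower per job based on the GPU size and the guidance below.
-$SPARK_HOME/bin/spark-shell \
-  --conf spark.rapids.sql.concurrentGpuTasks=2
-```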
- -Some queries benefit significantly from -setting this to a value between `2` and `4`, with `2` typically providing the most benefit, and -higher numbers giving diminishing returns, but a lot of it depends on the size of the GPU you have. -An 80 GiB A100 will be able to run a lot more in parallel without seeing degradation -compared to a 16 GiB T4. This is both because of the amount of memory available and also the -raw computing power. - -Setting this value too high can lead to GPU out of memory errors or poor runtime -performance. Running multiple tasks concurrently on the GPU will reduce the memory available -to each task as they will be sharing the GPU's total memory. As a result, some queries that fail -to run with a higher concurrent task setting may run successfully with a lower setting. - -As of the 23.04 release of the RAPIDS Accelerator for Apache Spark -many out of memory errors result in parts of the query being rolled back and retried instead -of a task failure. The fact that this is happening will show up in the task metrics. -These metrics include `gpuRetryCount` which is the number of times that a retry was attempted. -As a part of this the normal `OutOfMemoryError` is thrown much less. Instead a `RetryOOM` -or `SplitAndRetryOOM` exception is thrown. - -To mitigate the out of memory errors you can often reduce the batch size, which will keep less -data active in a batch at a time, but can increase the overall runtime as less data is being -processed per batch. - -Note that when Apache Spark is scheduling GPUs as a resource, the configured GPU resource amount -per task may be too low to achieve the desired concurrency. See the -[section on configuring the number of tasks per executor](#number-of-tasks-per-executor) for more -details. - -This value can be set on a per-job basis like most other configs. If multiple tasks are running -at the same time with different concurrency levels, then it is interpreted as how much of the -GPU each task is allowed to use (1/concurrentGpuTasks). For example if the concurrency is set -to 1, then each task is assumed to take the entire GPU, so only one of those tasks will -be allowed on the GPU at a time. If it is set to 2, then each task takes 1/2 of the GPU -and up to 2 of them could be on the GPU at once. This also works for mixing tasks with -different settings. For example 1 task with a concurrency of 2 could share the GPU with -2 tasks that have a concurrency of 4. In practice this is not likely to show up frequently. - -## Shuffle Partitions -Configuration key: -[`spark.sql.shuffle.partitions`](https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options) - -Default value: `200` - -The number of partitions produced between Spark stages can have a significant performance impact -on a job. Too few partitions and a task may run out of memory as some operations require all of -the data for a task to be in memory at once. Too many partitions and partition processing -overhead dominates the task runtime. - -Partitions have a higher incremental cost for GPU processing than CPU processing, so it is -recommended to keep the number of partitions as low as possible without running out of memory in a -task. This also has the benefit of increasing the amount of data each task will process which -reduces the overhead costs of GPU processing. Note that setting the partition count too low could -result in GPU out of memory errors. In that case the partition count will need to be increased. 
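-
-A worked sketch of the sizing rule described in the next paragraph, for a hypothetical cluster with 8 GPUs and
-2 concurrent tasks per GPU (both numbers are assumptions for illustration):
-
-```bash
-# Hypothetical cluster: 8 GPUs x 2 concurrent GPU tasks per GPU = 16 shuffle partitions to start from.
-$SPARK_HOME/bin/spark-shell \
-  --conf spark.rapids.sql.concurrentGpuTasks=2 \
-  --conf spark.sql.shuffle.partitions=16
-```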
- 
-Try setting the number of partitions to either the number of GPUs or the number of concurrent GPU
-tasks in the cluster. The number of concurrent GPU tasks is computed by multiplying the number of
-GPUs in the Spark cluster by the
-[number of concurrent tasks allowed per GPU](#number-of-concurrent-tasks-per-gpu). This
-sets the number of partitions to match the computational width of the Spark cluster, which
-provides work for all GPUs.
-
-## Input Files
-GPUs process data much more efficiently when they have a large amount of data to process in
-parallel. Loading data from fewer, large input files will perform better than loading data
-from many small input files. Ideally input files should be on the order of a few gigabytes
-rather than megabytes or smaller. The `spark.sql.files.openCostInBytes` config can be tuned to
-a larger value than the default (4 MB) to reduce the number of tasks in a data scan stage
-and improve performance if there are many small files in a table.
-
-Note that the GPU can encode Parquet and ORC data much faster than the CPU, so the cost of
-writing large files can be significantly lower.
-
-Use Hive Parquet or ORC tables instead of Hive Text tables as the intermediate tables for
-CTAS (`Create Table As Select`) queries. The suggested change is to add `stored as parquet` to the CTAS statement.
-Alternatively, you can set `hive.default.fileformat=Parquet` to create Parquet files by default.
-Refer to the
-[Hive documentation](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.default.fileformat)
-for more details.
-
-If the query scans Hive ORC tables, make sure `spark.sql.hive.convertMetastoreOrc=true` to avoid CPU
-fallback.
-
-## Input Files' Column Order
-When there are a large number of columns in file formats like Parquet and ORC, the size of the
-contiguous data for each individual column can be very small. This can result in lots of very
-small random reads to the file system to read the data for the subset of columns that are needed.
-
-We suggest reordering the columns needed by the queries and then rewriting the files to make those
-columns adjacent. This can help Spark on both the CPU and the GPU.
-
-## Input Partition Size
-
-Similar to the discussion on [input file size](#input-files), many queries can benefit from using
-a larger input partition size than the default setting. This allows the GPU to process more data
-at once, amortizing overhead costs across a larger set of data. Many queries perform better when
-this is set to 256MB or 512MB. Note that setting this value too high can cause tasks to fail with
-GPU out of memory errors.
-
-The configuration settings that control the input partition size depend upon the method used
-to read the input data.
-
-### Input Partition Size with DataSource API
-
-Configuration key:
-[`spark.sql.files.maxPartitionBytes`](https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options)
-
-Default value: `128MB`
-
-Reading data through the `SparkSession` methods (e.g.: `spark.read.`...) goes through the
-DataSource API.
-
-### Input Partition Size with Hive API
-
-Configuration keys:
-- `spark.hadoop.mapreduce.input.fileinputformat.split.minsize`
-- `spark.hadoop.mapred.min.split.size`
-
-Default value: `0`
-
-## Input File Caching
-
-If the Spark application accesses the same data multiple times, it may benefit from the
-RAPIDS Accelerator file cache. See the [filecache documentation](additional-functionality/filecache.md)
-for details.
-
-## Columnar Batch Size
-Configuration key: [`spark.rapids.sql.batchSizeBytes`](configs.md#sql.batchSizeBytes)
-
-Default value: `1073741824` (1 GiB)
-
-The RAPIDS Accelerator plugin processes data on the GPU in a columnar format. Data is processed
-in a series of columnar batches. During processing multiple batches may be concatenated
-into a single batch to make the GPU processing more efficient, or in other cases an operation such
-as a join or a sort will target that batch size for its output. This setting controls the upper
-limit on the batches that these tasks output. Setting this value too low can result in a
-large amount of GPU processing overhead and slower task execution. Setting this value too high
-can lead to GPU out of memory errors. If tasks fail due to GPU out of memory errors after the
-query input partitions have been read, try setting this to a lower value. The maximum size is just
-under 2 GiB. In general, we recommend setting the batch size to
-
-```
-min(2GiB - 1 byte, (gpu_memory - 1 GiB) / gpu_concurrency / 4)
-```
-
-Where `gpu_memory` is the amount of memory on the GPU, or the maximum pool size if pooling is used.
-We then subtract 1 GiB from that for overhead such as holding CUDA kernels, and divide by the
-[`gpu_concurrency`](#number-of-concurrent-tasks-per-gpu). Finally, we divide by 4. This factor comes
-from the amount of working memory that different algorithms need to succeed. Joins need a
-batch for the left-hand side, one for the right-hand side, one for working memory to do the join,
-and finally one for the output batch, which results in 4 times the target batch size. Not all
-algorithms fit this model exactly. Some require all of the data for a given key to be in memory
-at once, others require all of the data for a task to be in memory at once. We are working on
-getting as many algorithms as possible to stay under this 4 batch limit, but depending on your
-query you may need to adjust the batch size differently from this formula.
-
-As an example, when processing on a 16 GiB T4 with a concurrency of 1 it is recommended to set the batch
-size to `(16 GiB - 1 GiB) / 1 / 4`, which results in 3.75 GiB, but the maximum size is just
-under 2 GiB, so use that instead. If we are processing on a 16 GiB V100 with a concurrency
-of 2 we get `(16 GiB - 1 GiB) / 2 / 4`, which results in 1920 MiB. Finally, for an 80 GiB A100
-with a concurrency of 8, `(80 GiB - 1 GiB) / 8 / 4` gives about 2.5 GiB, which is over 2 GiB,
-so we stick with the maximum.
-
-In the future we may adjust the default value to follow this pattern once we have enough
-algorithms updated to keep within this 4 batch limit.
-
-### File Reader Batch Size
-Configuration key: [`spark.rapids.sql.reader.batchSizeRows`](configs.md#sql.reader.batchSizeRows)
-
-Default value: `2147483647`
-
-Configuration key: [`spark.rapids.sql.reader.batchSizeBytes`](configs.md#sql.reader.batchSizeBytes)
-
-Default value: `2147483647`
-
-When reading data from a file, these settings control the maximum batch size separately
-from the main [columnar batch size](#columnar-batch-size) setting. Some transcoding jobs (e.g.:
-loading CSV files then writing Parquet files) need to lower these settings when using large task input
-partition sizes to avoid GPU out of memory errors.
-
-## Metrics
-
-### SQL
-
-Custom Spark SQL Metrics are available which can help identify performance bottlenecks in a query.
- -| Key | Name | Description | -|-------------------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| bufferTime | buffer time | Time spent buffering input from file data sources. This buffering time happens on the CPU, typically with no GPU semaphore held. For Multi-threaded readers, `bufferTime` measures the amount of time we are blocked on while the threads that are buffering are busy. | -| readFsTime | time to read fs data | Time spent actually reading the data and writing it to on-heap memory. This is a part of `bufferTime` | -| writeBufferTime | time to write data to buffer | Time spent moving the on-heap buffered data read from the file system to off-heap memory so the GPU can access it. This is a part of `bufferTime` | -| buildDataSize | build side size | Size in bytes of the build-side of a join. | -| buildTime | build time | Time to load the build-side of a join. | -| collectTime | collect time | For a broadcast the amount of time it took to collect the broadcast data back to the driver before broadcasting it back out. | -| computeAggTime | aggregation time | Time computing an aggregation. | -| concatTime | concat batch time | Time to concatenate batches. Runs on CPU. | -| copyBufferTime | copy buffer time | Time spent on copying upstreaming data into Rapids buffers. | -| filterTime | filter time | Time spent applying filters within other operators, such as joins. | -| gpuDecodeTime | GPU decode time | Time spent on GPU decoding encrypted or compressed data. | -| joinOutputRows | join output rows | The number of rows produced by a join before any filter expression is applied. | -| joinTime | join time | Time doing a join operation. | -| numInputBatches | input columnar batches | Number of columnar batches that the operator received from its child operator(s). | -| numInputRows | input rows | Number of rows that the operator received from its child operator(s). | -| numOutputBatches | output columnar batches | Number of columnar batches that the operator outputs. | -| numOutputRows | output rows | Number of rows that the operator outputs. | -| numPartitions | partitions | Number of output partitions from a file scan or shuffle exchange. | -| opTime | op time | Time that an operator takes, exclusive of the time for executing or fetching results from child operators, and typically outside of the time it takes to acquire the GPU semaphore.
Note: Sometimes contains CPU times, e.g.: concatTime | -| partitionSize | partition data size | Total size in bytes of output partitions. | -| sortTime | sort time | Time spent in sort operations in GpuSortExec and GpuTopN. | | -| streamTime | stream time | Time spent reading data from a child. This generally happens for the stream side of a hash join or for columnar to row and row to columnar operations. | - -Not all metrics are enabled by default. The configuration setting `spark.rapids.sql.metrics.level` can be set -to `DEBUG`, `MODERATE`, or `ESSENTIAL`, with `MODERATE` being the default value. More information about this -configuration option is available in the [configuration documentation](configs.md#sql.metrics.level). - -Output row and batch counts show up for operators where the number of output rows or batches are -expected to change. For example a filter operation would show the number of rows that passed the -filter condition. These can be used to detect small batches. The GPU is much more efficient when the -batch size is large enough to offset the overhead of launching CUDA kernels. - -Input rows and batches are really only for debugging and can mostly be ignored. - -Many of the questions people really want to answer with the metrics are around how long various -operators take. Where is the bottleneck in my query? How much of my query is executing on the GPU? -How long does operator X take on the GPU vs the CPU? - -### Task - -Custom Task level accumulators are also included. These metrics are not for individual -operators in the SQL plan, but are per task and roll up to stages in the plan. Timing metrics -are reported in the format of HH:MM:SS.sss. It should be noted that spill metrics, -including the spill to memory and disk sizes, are not isolated to a single -task, or even a single stage in the plan. The amount of data spilled is the amount of -data that this particular task needed to spill in order to make room for the task to -allocate new memory. The spill time metric is how long it took that task to spill -that memory. It could have spilled memory associated with a different task, -or even a different stage or job in the plan. The spill read time metric is how -long it took to read back in the data it needed to complete the task. This does not -correspond to the data that was spilled by this task. - -| Name | Description | -|-------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| gpuSemaphoreWait | The time the task spent waiting on the GPU semaphore. | -| gpuSpillBlockTime | The time that this task was blocked spilling data from the GPU. | -| gpuSpillReadTime | The time that this task was blocked reading data to the GPU that was spilled previously. | -| gpuRetryCount | The number of times that a retry exception was thrown in an attempt to roll back processing to free memory. | -| gpuSplitAndRetryCount | The number of times that a split and retry exception was thrown in an attempt to roll back processing to free memory, and split the input to make more room. | -| gpuRetryBlockTime | The amount of time that this task was blocked either hoping that other tasks will free up more memory or after a retry exception was thrown to wait until the task can go on. | -| gpuRetryComputationTime | The amount of time that this task spent doing computation that arguably was lost when a retry happened. 
This does not include time that the task was blocked on retry.|
-
-The spill data sizes going to host/CPU memory and disk are the same as those used by the Spark task level
-metrics.
-
-### Time taken on the GPU
-
-`opTime` mainly conveys the GPU time.
-If a GPU operator also does some work on the CPU, the GPU time is `opTime` minus the CPU time, e.g.:
-`opTime` - `concatTime`.
-Nearly all GPU operators will have an `op time` metric. This metric times how long a given
-operation took to complete on the GPU, separate from anything upstream or downstream of the
-operator. By looking at the `op time` for each operator you should be able to get a feeling for
-how long each section of a query took in terms of time on the GPU.
-
-For many complex operations, like joins and aggregations, the time taken can be broken down further.
-These metrics typically only appear at the `DEBUG` metrics level, but they can provide extra
-information when trying to understand what is happening in a query without having to profile
-the GPU query execution.
-
-#### Spilling
-
-Some operators provide out-of-core algorithms, or algorithms that can process data that is larger
-than can fit in GPU memory. This is often done by breaking the problem up into smaller pieces and
-letting some of those pieces be moved out of GPU memory when not being worked on. Apache Spark does
-similar things when processing data on the CPU. When these types of algorithms are used
-the task level spill metrics will indicate that spilling happened. Be aware that
-the same metrics are used for both the GPU code and the original Spark CPU code. The
-GPU spills will always be timed and show up as `gpuSpillBlockTime` in the task level
-metrics.
-
-### Time taken on the CPU
-
-Operations that deal with the CPU as well as the GPU will often have multiple metrics broken out,
-like in the case of reading data from a Parquet file. There will be metrics for how long it took
-to read the data into CPU memory, `buffer time`, along with how much time was taken to transfer the
-data to the GPU and decode it, `GPU decode time`.
-
-There is also a metric for how long an operation was blocked waiting on the GPU semaphore before
-it got a chance to run on the GPU. This metric is only enabled at the `DEBUG` metrics level, mostly
-because it is not complete. Spark does not provide a way for us to accurately report that metric
-during a shuffle.
-
-Apache Spark provides a `duration` metric for code generation blocks that is intended to measure how
-long it took to do the given processing. However, `duration` is measured from the time that the
-first row is processed to the time that the operation is done. In most cases this is very close to
-the total runtime for the task, and ends up not being that useful in practice. Apache Spark does
-not try to measure it more accurately, because it processes data one row
-at a time with multiple different operators intermixed in the same generated code. In this case
-the overhead of measuring how long a single row took to process would likely be very large compared
-to the amount of time to actually process the data.
-
-But the RAPIDS Accelerator for Apache Spark does provide a workaround. When data is transferred
-from the CPU to the GPU or from the GPU to the CPU, the `stream time` is reported. When going from
-the CPU to the GPU it is the amount of time taken to collect a batch's worth of data on the CPU before
-sending it to the GPU. When going from the GPU to the CPU it is the amount of time taken to get the
-batch and put it into a format that the CPU can start to process one row at a time. This can allow
-you to get an approximate measurement of the amount of time taken to process a section of the query
-on the CPU by subtracting the `stream time` before going to the CPU from the `stream time` after
-coming back to the GPU. Please note that this is really only valid for showing the amount of time
-a section of a mixed GPU/CPU query took. It should not be used to indicate how long an operation on
-the CPU is likely to take in a pure CPU-only workload. This is because the memory access patterns
-when going from the GPU to the CPU and vice versa are very different from when the data stays on
-the CPU the entire time. This can result in very different timings between the different situations.
-
-## Window Operations
-
-Apache Spark supports a few optimizations for different window patterns. Generally Spark
-buffers all the data for a partition-by key in memory and then loops through the rows looking for
-boundaries. When it finds a boundary change, it will then calculate the aggregation on that window.
-This ends up being `O(N^2)` where N is the size of the window. In a few cases it can improve on
-that. These optimizations include:
-
- * Lead/Lag. In this case Spark keeps an offset pointer and can output the result from the
-   buffered data in linear time. Lead and Lag only support row based windows and set up the
-   row ranges automatically based on the lead/lag requested.
- * Unbounded Preceding to Unbounded Following. In this case Spark will do a single aggregation
-   and duplicate the result multiple times. This works for both row and range based windows.
-   There is no difference in the calculation in this case because the window is the size of
-   the partition by group.
- * Unbounded Preceding to some specific bound. For this case Spark keeps running state as it
-   walks through each row and outputs an updated result each time. This also works for both
-   row and range based windows. For row based queries it just adds a row at a time. For
-   range based queries it adds rows until the order by column changes.
- * Some specific bound to Unbounded Following. For this case Spark will still recalculate
-   aggregations for each window group. The complexity of this is `O(N^2)` but it only has to
-   check lower bounds when doing the aggregation. This also works for row or range based
-   windows, and for row based windows the performance improvement is very minimal because it
-   is just removing a row each time instead of doing a check for equality on the order by
-   columns.
-
-Some proprietary implementations have further optimizations. For example Databricks has a special
-case for running windows (rows between unbounded preceding and current row) which allows it to
-avoid caching the entire window in memory and just cache the running state in between rows.
-
-CUDF and the RAPIDS Accelerator do not have this same set of optimizations yet, so the
-performance can differ based on the window sizes and the aggregation operations. Most
-of the time the window size is small enough that the parallelism of the GPU can offset the
-difference in the complexity of the algorithm and beat the CPU. In the general case if `N` is
-the size of the window and `G` is the parallelism of the GPU then the complexity of a window
-operation is `O(N^2/G)`.
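
For reference, the running window pattern that the next paragraph discusses looks like the
following in the DataFrame API. This is just a sketch; the data and column names are made up:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // assumes an active SparkSession named `spark`, as in spark-shell

// Hypothetical input: (customer_id, order_date, amount)
val orders = Seq(("a", 1, 10.0), ("a", 2, 5.0), ("b", 1, 7.0))
  .toDF("customer_id", "order_date", "amount")

// Running window: rows between unbounded preceding and the current row.
val runningWindow = Window
  .partitionBy("customer_id")
  .orderBy("order_date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningTotal = orders
  .withColumn("running_total", sum("amount").over(runningWindow))
```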
The main optimization currently supported by the RAPIDS Accelerator
-is for running window (rows between unbounded preceding and current row). This is only for a
-specific set of aggregations:
-
- * MIN
- * MAX
- * SUM
- * COUNT
- * ROW_NUMBER
- * RANK
- * DENSE_RANK
-
-For these operations the GPU can use specialized hardware to do the computation in approximately
-`O(N/G * LOG(N))` time. The details of how this works are a bit complex, but it is described
-somewhat generally
-[here](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
-
-Some aggregations, such as `count` on a non-nullable column/value, `lead`, or `lag`, can be done in
-constant time. These allow us to compute the result in approximately `O(N/G)` time.
-In all other cases large windows, including skewed values in the partition by and order by data, can
-result in slow performance. If you do run into one of these situations please file an
-[issue](https://github.com/NVIDIA/spark-rapids/issues/new/choose) so we can properly prioritize
-our work to support more optimizations.
-
-## Shuffle Disks
-Dataproc: [Local SSD](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds)
-is recommended for Spark scratch space to improve IO. For example, when creating a Dataproc cluster,
-you can add the options below:
-
-```
---num-worker-local-ssds=2
---worker-local-ssd-interface=NVME
---num-secondary-worker-local-ssds=2
---secondary-worker-local-ssd-interface=NVME
-```
-
-Refer to [Getting Started on GCP Dataproc](./get-started/getting-started-gcp.md) for more details.
-
-On-Prem cluster: Try to use enough NVMe drives or SSDs as shuffle disks to avoid a local disk IO
-bottleneck.
-
-## Exclude bad nodes from YARN resource
-
-If there are bad nodes due to hardware failures (including GPU failures), we suggest
-setting `spark.yarn.executor.launch.excludeOnFailure.enabled=true` so that the problematic nodes can
-be excluded.
-
-Refer to [Running Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html) for more
-details.
diff --git a/integration_tests/README.md b/integration_tests/README.md
index 41d4c53861d..321bfa68c45 100644
--- a/integration_tests/README.md
+++ b/integration_tests/README.md
@@ -217,7 +217,7 @@ To run the tests separate from the build go to the `integration_tests` directory
 `pytest-xdist` you will need to submit it as a regular python application and have `findspark`
 installed. Be sure to include the necessary jars for the RAPIDS plugin either with
 `spark-submit` or with the cluster when it is
-[setup](../docs/get-started/getting-started-on-prem.md).
+[setup](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html).
 The command line arguments to `runtests.py` are the same as for
 [pytest](https://docs.pytest.org/en/latest/usage.html). The only reason we have a separate script
 is that `spark-submit` uses python if the file name ends with `.py`.
@@ -603,4 +603,4 @@ the valid boundaries for dates and timestamps.
 ## Scale Test
 Scale Test is a test suite to do stress test and estimate the stablity of the spark-rapids plugin when running in large
-scale data. For more information please refer to [Scale Test](./ScaleTest.md)
\ No newline at end of file
+scale data.
For more information please refer to [Scale Test](./ScaleTest.md) diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala index c502a5ae36f..5b69d7068ee 100644 --- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala +++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala @@ -1435,7 +1435,8 @@ object RapidsConf { val SHUFFLE_MANAGER_ENABLED = conf("spark.rapids.shuffle.enabled") .doc("Enable or disable the RAPIDS Shuffle Manager at runtime. " + - "The [RAPIDS Shuffle Manager](rapids-shuffle.md) must " + + "The [RAPIDS Shuffle Manager](https://docs.nvidia.com/spark-rapids/user-guide/latest" + + "/additional-functionality/rapids-shuffle.html) must " + "already be configured. When set to `false`, the built-in Spark shuffle will be used. ") .booleanConf .createWithDefault(true) diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala index b50029ab344..b785a0ccc2d 100644 --- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala +++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala @@ -2214,7 +2214,7 @@ object SupportedOpsDocs { println() println("### Apache Iceberg Support") println("Support for Apache Iceberg has additional limitations. See the") - println("[Apache Iceberg Support](additional-functionality/iceberg-support.md) document.") + println("[Apache Iceberg Support](https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/iceberg-support.html) document.") // scalastyle:on line.size.limit }