diff --git a/docs/additional-functionality/advanced_configs.md b/docs/additional-functionality/advanced_configs.md index 4e8a4b6e4f0..3231b7b3069 100644 --- a/docs/additional-functionality/advanced_configs.md +++ b/docs/additional-functionality/advanced_configs.md @@ -129,12 +129,12 @@ Name | Description | Default Value | Applicable at spark.rapids.sql.json.read.decimal.enabled|When reading a quoted string as a decimal Spark supports reading non-ascii unicode digits, and the RAPIDS Accelerator does not.|true|Runtime spark.rapids.sql.json.read.double.enabled|JSON reading is not 100% compatible when reading doubles.|true|Runtime spark.rapids.sql.json.read.float.enabled|JSON reading is not 100% compatible when reading floats.|true|Runtime -spark.rapids.sql.json.read.mixedTypesAsString.enabled|JSON reading is not 100% compatible when reading mixed types as string.|false|Runtime spark.rapids.sql.mode|Set the mode for the Rapids Accelerator. The supported modes are explainOnly and executeOnGPU. This config can not be changed at runtime, you must restart the application for it to take affect. The default mode is executeOnGPU, which means the RAPIDS Accelerator plugin convert the Spark operations and execute them on the GPU when possible. The explainOnly mode allows running queries on the CPU and the RAPIDS Accelerator will evaluate the queries as if it was going to run on the GPU. The explanations of what would have run on the GPU and why are output in log messages. When using explainOnly mode, the default explain output is ALL, this can be changed by setting spark.rapids.sql.explain. See that config for more details.|executeongpu|Startup spark.rapids.sql.optimizer.joinReorder.enabled|When enabled, joins may be reordered for improved query performance|true|Runtime spark.rapids.sql.python.gpu.enabled|This is an experimental feature and is likely to change in the future. Enable (true) or disable (false) support for scheduling Python Pandas UDFs with GPU resources. 
When enabled, pandas UDFs are assumed to share the same GPU that the RAPIDs accelerator uses and will honor the python GPU configs|false|Runtime -spark.rapids.sql.reader.chunked|Enable a chunked reader where possible. A chunked reader allows reading highly compressed data that could not be read otherwise, but at the expense of more GPU memory, and in some cases more GPU computation.|true|Runtime -spark.rapids.sql.reader.chunked.subPage|Enable a chunked reader where possible for reading data that is smaller than the typical row group/page limit. Currently this only works for parquet.|true|Runtime +spark.rapids.sql.reader.chunked|Enable a chunked reader where possible. A chunked reader allows reading highly compressed data that could not be read otherwise, but at the expense of more GPU memory, and in some cases more GPU computation. Currently this only supports ORC and Parquet formats.|true|Runtime +spark.rapids.sql.reader.chunked.limitMemoryUsage|Enable a soft limit on the internal memory usage of the chunked reader (if it is in use). The limit is calculated as the product of 'spark.rapids.sql.batchSizeBytes' and 'spark.rapids.sql.reader.chunked.memoryUsageRatio'. For example, if batchSizeBytes is set to 1GB and memoryUsageRatio is 4, the chunked reader will try to keep its memory usage under 4GB.|None|Runtime +spark.rapids.sql.reader.chunked.subPage|Enable a chunked reader where possible for reading data that is smaller than the typical row group/page limit. Currently deprecated and replaced by 'spark.rapids.sql.reader.chunked.limitMemoryUsage'.|None|Runtime spark.rapids.sql.reader.multithreaded.combine.sizeBytes|The target size in bytes to combine multiple small files together when using the MULTITHREADED parquet or orc reader. With combine disabled, the MULTITHREADED reader reads the files in parallel and sends individual files down to the GPU, but that can be inefficient for small files. 
When combine is enabled, files that are ready within spark.rapids.sql.reader.multithreaded.combine.waitTime together, up to this threshold size, are combined before sending down to GPU. This can be disabled by setting it to 0. Note that combine also will not go over the spark.rapids.sql.reader.batchSizeRows or spark.rapids.sql.reader.batchSizeBytes limits.|67108864|Runtime spark.rapids.sql.reader.multithreaded.combine.waitTime|When using the multithreaded parquet or orc reader with combine mode, how long to wait, in milliseconds, for more files to finish if haven't met the size threshold. Note that this will wait this amount of time from when the last file was available, so total wait time could be larger then this.|200|Runtime spark.rapids.sql.reader.multithreaded.read.keepOrder|When using the MULTITHREADED reader, if this is set to true we read the files in the same order Spark does, otherwise the order may not be the same. Now it is supported only for parquet and orc.|true|Runtime @@ -184,6 +184,7 @@ Name | SQL Function(s) | Description | Default Value | Notes spark.rapids.sql.expression.ArrayContains|`array_contains`|Returns a boolean if the array contains the passed in key|true|None| spark.rapids.sql.expression.ArrayExcept|`array_except`|Returns an array of the elements in array1 but not in array2, without duplicates|true|This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. 
We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+| spark.rapids.sql.expression.ArrayExists|`exists`|Return true if any element satisfies the predicate LambdaFunction|true|None| +spark.rapids.sql.expression.ArrayFilter|`filter`|Filter an input array using a given predicate|true|None| spark.rapids.sql.expression.ArrayIntersect|`array_intersect`|Returns an array of the elements in the intersection of array1 and array2, without duplicates|true|This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+| spark.rapids.sql.expression.ArrayMax|`array_max`|Returns the maximum value in the array|true|None| spark.rapids.sql.expression.ArrayMin|`array_min`|Returns the minimum value in the array|true|None| @@ -206,6 +207,7 @@ Name | SQL Function(s) | Description | Default Value | Notes spark.rapids.sql.expression.BitwiseNot|`~`|Returns the bitwise NOT of the operands|true|None| spark.rapids.sql.expression.BitwiseOr|`\|`|Returns the bitwise OR of the operands|true|None| spark.rapids.sql.expression.BitwiseXor|`^`|Returns the bitwise XOR of the operands|true|None| +spark.rapids.sql.expression.BoundReference| |Reference to a bound variable|true|None| spark.rapids.sql.expression.CaseWhen|`when`|CASE WHEN expression|true|None| spark.rapids.sql.expression.Cast|`bigint`, `binary`, `boolean`, `cast`, `date`, `decimal`, `double`, `float`, `int`, `smallint`, `string`, `timestamp`, `tinyint`|Convert a column of one type of data into another type|true|None| spark.rapids.sql.expression.Cbrt|`cbrt`|Cube root|true|None| @@ -269,7 +271,7 @@ Name | SQL 
Function(s) | Description | Default Value | Notes spark.rapids.sql.expression.IsNotNull|`isnotnull`|Checks if a value is not null|true|None| spark.rapids.sql.expression.IsNull|`isnull`|Checks if a value is null|true|None| spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|false|This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case| -spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because JsonTuple on the GPU does not support all of the normalization that the CPU supports.| +spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because it is an experimental feature that could be unstable or have performance issues.| spark.rapids.sql.expression.KnownFloatingPointNormalized| |Tag to prevent redundant normalization|true|None| spark.rapids.sql.expression.KnownNotNull| |Tag an expression as known to not be null|true|None| spark.rapids.sql.expression.Lag|`lag`|Window function that returns N entries behind this one|true|None| diff --git a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma index adf28f5fea2..fe5c64b1dfc 100644 --- a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma +++ b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_no_rdma @@ -1,5 +1,5 @@ # -# Copyright (c) 2022-2023, NVIDIA CORPORATION. 
All rights reserved. +# Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -24,7 +24,7 @@ # - ROCKY_VER: Rocky Linux OS version ARG CUDA_VER=11.8.0 -ARG UCX_VER=1.15.0 +ARG UCX_VER=1.16.0 ARG UCX_CUDA_VER=11 ARG UCX_ARCH=x86_64 ARG ROCKY_VER=8 @@ -38,6 +38,5 @@ RUN ls /usr/lib RUN mkdir /tmp/ucx_install && cd /tmp/ucx_install && \ wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-$UCX_VER-centos8-mofed5-cuda$UCX_CUDA_VER-$UCX_ARCH.tar.bz2 && \ tar -xvf *.bz2 && \ - rpm -i ucx-$UCX_VER*.rpm && \ - rpm -i ucx-cuda-$UCX_VER*.rpm --nodeps && \ + rpm -i `ls ucx-[0-9]*.rpm ucx-cuda-[0-9]*.rpm` --nodeps && \ rm -rf /tmp/ucx_install diff --git a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_rdma b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_rdma index 9083e1561b5..f88c4212a92 100644 --- a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_rdma +++ b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.rocky_rdma @@ -1,5 +1,5 @@ # -# Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -24,7 +24,7 @@ # - ROCKY_VER: Rocky Linux OS version ARG CUDA_VER=11.8.0 -ARG UCX_VER=1.15.0 +ARG UCX_VER=1.16.0 ARG UCX_CUDA_VER=11 ARG UCX_ARCH=x86_64 ARG ROCKY_VER=8 @@ -37,7 +37,5 @@ RUN yum update -y && yum install -y wget bzip2 rdma-core numactl-libs libgomp li RUN mkdir /tmp/ucx_install && cd /tmp/ucx_install && \ wget https://github.com/openucx/ucx/releases/download/v$UCX_VER/ucx-$UCX_VER-centos8-mofed5-cuda$UCX_CUDA_VER-$UCX_ARCH.tar.bz2 && \ tar -xvf *.bz2 && \ - rpm -i ucx-$UCX_VER*.rpm && \ - rpm -i ucx-cuda-$UCX_VER*.rpm --nodeps && \ - rpm -i ucx-ib-$UCX_VER-1.el8.x86_64.rpm ucx-rdmacm-$UCX_VER-1.el8.x86_64.rpm && \ + rpm -i `ls ucx-[0-9]*.rpm ucx-cuda-[0-9]*.rpm ucx-ib-[0-9]*.rpm ucx-rdmacm-[0-9]*.rpm` --nodeps && \ rm -rf /tmp/ucx_install diff --git a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_no_rdma b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_no_rdma index e0318a0de60..792e7848e56 100644 --- a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_no_rdma +++ b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_no_rdma @@ -1,5 +1,5 @@ # -# Copyright (c) 2021-2023, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2024, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -25,7 +25,7 @@ # ARG CUDA_VER=11.8.0 -ARG UCX_VER=1.15.0 +ARG UCX_VER=1.16.0 ARG UCX_CUDA_VER=11 ARG UCX_ARCH=x86_64 ARG UBUNTU_VER=20.04 diff --git a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_rdma b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_rdma index 55281fc4b1b..42014c67251 100644 --- a/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_rdma +++ b/docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_rdma @@ -1,5 +1,5 @@ # -# Copyright (c) 2021-2023, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2024, NVIDIA CORPORATION. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -35,7 +35,7 @@ ARG RDMA_CORE_VERSION=32.1 ARG CUDA_VER=11.8.0 -ARG UCX_VER=1.15.0 +ARG UCX_VER=1.16.0 ARG UCX_CUDA_VER=11 ARG UCX_ARCH=x86_64 ARG UBUNTU_VER=20.04 diff --git a/docs/archive.md b/docs/archive.md index 6cce30557f4..f4eeab11a40 100644 --- a/docs/archive.md +++ b/docs/archive.md @@ -5,11 +5,143 @@ nav_order: 15 --- Below are archived releases for RAPIDS Accelerator for Apache Spark. +## Release v24.04.1 +### Hardware Requirements: + +The plugin is tested on the following architectures: + + GPU Models: NVIDIA V100, T4, A10/A100, L4 and H100 GPUs + +### Software Requirements: + + OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8 + + NVIDIA Driver*: R470+ + + Runtime: + Scala 2.12, 2.13 + Python, Java Virtual Machine (JVM) compatible with your spark-version. + + * Check the Spark documentation for Python and Java version compatibility with your specific + Spark version. For instance, visit `https://spark.apache.org/docs/3.4.1` for Spark 3.4.1. 
+ + Supported Spark versions: + Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4 + Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4 + Apache Spark 3.4.0, 3.4.1, 3.4.2 + Apache Spark 3.5.0, 3.5.1 + + Supported Databricks runtime versions for Azure and AWS: + Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0) + Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2) + Databricks 13.3 ML LTS (GPU, Scala 2.12, Spark 3.4.1) + + Supported Dataproc versions (Debian/Ubuntu): + GCP Dataproc 2.0 + GCP Dataproc 2.1 + + Supported Dataproc Serverless versions: + Spark runtime 1.1 LTS + Spark runtime 2.0 + Spark runtime 2.1 + +*Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet +for your hardware's minimum driver version. + +*For Cloudera and EMR support, please refer to the +[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ. + +### RAPIDS Accelerator's Support Policy for Apache Spark +The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html) + +### Download RAPIDS Accelerator for Apache Spark v24.04.1 + +| Processor | Scala Version | Download Jar | Download Signature | +|-----------|---------------|--------------|--------------------| +| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1.jar.asc) | +| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1.jar.asc) | +| arm64 | Scala 2.12 | [RAPIDS Accelerator 
v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1-cuda11-arm64.jar.asc) | +| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1-cuda11-arm64.jar.asc) | + +This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with +CUDA 11.8 through CUDA 12.0. + +### Verify signature +* Download the [PUB_KEY](https://keys.openpgp.org/search?q=sw-spark@nvidia.com). +* Import the public key: `gpg --import PUB_KEY` +* Verify the signature for Scala 2.12 jar: + `gpg --verify rapids-4-spark_2.12-24.04.1.jar.asc rapids-4-spark_2.12-24.04.1.jar` +* Verify the signature for Scala 2.13 jar: + `gpg --verify rapids-4-spark_2.13-24.04.1.jar.asc rapids-4-spark_2.13-24.04.1.jar` + +The output of the signature verification: + + gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) " + +### Release Notes +* New functionality and performance improvements for this release include: +* Performance improvements for S3 reading. +Refer to perfio.s3.enabled in [advanced_configs](./additional-functionality/advanced_configs.md) for more details. +* Performance improvements when doing joins on unique keys. +* Enhanced decompression kernels for zstd and snappy. +* Enhanced Parquet reading performance with modular kernels. +* Added compatibility with Spark version 3.5.1. +* Deprecated support for Databricks 10.4 ML LTS. +* For updates on RAPIDS Accelerator Tools, please visit [this link](https://github.com/NVIDIA/spark-rapids-tools/releases). 
+ +For a detailed list of changes, please refer to the +[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md). + +## Archived releases + +As new releases come out, previous ones will still be available in [archived releases](./archive.md). + ## Release v24.04.0 ### Hardware Requirements: The plugin is tested on the following architectures: - @@ -67,14 +67,14 @@ for your hardware's minimum driver version. + + GPU Models: NVIDIA V100, T4, A10/A100, L4 and H100 GPUs + +### Software Requirements: + + OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8 + + NVIDIA Driver*: R470+ + + Runtime: + Scala 2.12, 2.13 + Python, Java Virtual Machine (JVM) compatible with your spark-version. + + * Check the Spark documentation for Python and Java version compatibility with your specific + Spark version. For instance, visit `https://spark.apache.org/docs/3.4.1` for Spark 3.4.1. + + Supported Spark versions: + Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4 + Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4 + Apache Spark 3.4.0, 3.4.1, 3.4.2 + Apache Spark 3.5.0, 3.5.1 + + Supported Databricks runtime versions for Azure and AWS: + Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0) + Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2) + Databricks 13.3 ML LTS (GPU, Scala 2.12, Spark 3.4.1) + + Supported Dataproc versions (Debian/Ubuntu): + GCP Dataproc 2.0 + GCP Dataproc 2.1 + + Supported Dataproc Serverless versions: + Spark runtime 1.1 LTS + Spark runtime 2.0 + Spark runtime 2.1 + +*Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet +for your hardware's minimum driver version. + +*For Cloudera and EMR support, please refer to the +[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ. 
+ ### RAPIDS Accelerator's Support Policy for Apache Spark The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html) @@ -74,7 +206,7 @@ The plugin is tested on the following architectures: Supported Spark versions: Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4 Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3 - Apache Spark 3.4.0, 3.4.1 + Apache Spark 3.4.0, 3.4.1, 3.4.2 Apache Spark 3.5.0 Supported Databricks runtime versions for Azure and AWS: diff --git a/docs/compatibility.md b/docs/compatibility.md index 9975f48b43d..f9af6764498 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -368,10 +368,8 @@ In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON no matter what version of Spark is used. If the nesting level is over this the JSON is considered invalid and all values will be returned as nulls. -Only structs are supported for nested types. There are also some issues with arrays of structs. If -your data includes this, even if you are not reading it, you might get an exception. You can -try to set `spark.rapids.sql.json.read.mixedTypesAsString.enabled` to true to work around this, -but it also has some issues with it. +Mixed types can have some problems. If an item being read could have some lines that are arrays +and others that are structs/dictionaries it is possible an error will be thrown. Dates and Timestamps have some issues and may return values for technically invalid inputs. @@ -497,6 +495,7 @@ The following regular expression patterns are not yet supported on the GPU and w - Character classes that use union, intersection, or subtraction semantics, such as `[a-d[m-p]]`, `[a-z&&[def]]`, or `[a-z&&[^bc]]` - Empty groups: `()` +- Empty pattern: `""` Work is ongoing to increase the range of regular expressions that can run on the GPU. 
diff --git a/docs/configs.md b/docs/configs.md index f1dd3bc335b..f2785d352c6 100644 --- a/docs/configs.md +++ b/docs/configs.md @@ -10,7 +10,7 @@ The following is the list of options that `rapids-plugin-4-spark` supports. On startup use: `--conf [conf key]=[conf value]`. For example: ``` -${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.04.0-cuda11.jar \ +${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.06.0-cuda11.jar \ --conf spark.plugins=com.nvidia.spark.SQLPlugin \ --conf spark.rapids.sql.concurrentGpuTasks=2 ``` diff --git a/docs/dev/get-json-object-dump-tool.md b/docs/dev/get-json-object-dump-tool.md new file mode 100644 index 00000000000..6cbf5ad7c9f --- /dev/null +++ b/docs/dev/get-json-object-dump-tool.md @@ -0,0 +1,99 @@ +--- +layout: page +title: Dump tool for get_json_object +nav_order: 12 +parent: Developer Overview +--- + +# Dump tool for get_json_object + +## Overview +In order to help debug the issues with the `get_json_object` function, the RAPIDS Accelerator provides a +dump tool to save debug information to try and reproduce the issues. Note, the dumped data will be masked +to protect the customer data. + +## How to enable +This assumes that the RAPIDS Accelerator has already been enabled. + +The `get_json_object` expression may be off by default, so enable it first: +``` +'spark.rapids.sql.expression.GetJsonObject': 'true' +``` + +To enable debugging just set the path to dump the data to. Note that this +path is interpreted using the Hadoop FileSystem APIs. This means that +a path with no scheme will go to the default file system. + +``` +'spark.rapids.sql.expression.GetJsonObject.debugPath': '/tmp/DEBUG_JSON_DUMP/' +``` + +This path should be a directory or someplace that we can create a directory to +store files in. Multiple files may be written out. Note that each instance of +`get_json_object` will mask the data in different ways, but the same +instance should mask the data in the same way. 
+ +You may also set the max number of rows for each file/batch. Each time a new +batch of data comes into the `get_json_object` expression a new file is written +and this controls the maximum number of rows that may be written out. +``` +'spark.rapids.sql.test.get_json_object.saveRows': '1024' +``` +This config can be skipped, because the default value works. + +## Masking +Please note that this cannot currently be disabled. +This tool should not dump the original input data. +The goal is to find out what types of issues are showing up, and ideally +give the RAPIDS team enough information to reproduce them. + +Digits `[0-9]` will be remapped to `[0-9]`; the mapping is chosen +randomly for each instance of the expression. This is done to preserve +the format of the numbers, even if they are not 100% the same. + +The characters `[a-zA-Z]` are also randomly remapped to `[a-zA-Z]` similar +to the numbers. But many of these are preserved because they are part of +special cases. + +The letters that are preserved are `a, b, c, d, e, f, l, n, r, s, t, u, A, B, C, D, E, F` + +These are preserved because they could be used as a part of + * special keywords like `true`, `false`, or `null` + * number formatting like `1.0E-3` + * escape characters defined in the JSON standard `\b\f\n\r\t\u` + * or hexadecimal numbers that are a part of the `\u` escape sequence + +All other characters are mapped to the letter `s` unless they are one of the following. + + * ASCII `[0 to 31]` are considered to be control characters in the JSON spec and in some cases are not allowed + * `-` for negative numbers + * `{ } [ ] , : " '` are part of the structure of JSON, or at least are considered that way + * `\` for escape sequences + * `$ [ ] . * '` which are part of JSON paths + * `?` which Spark has as a special case for JSON path, but no one else does. + +## Stored Data +The dumped data is stored in a CSV file that should be compatible with Spark, +and most other CSV readers. 
CSV is a format that is not great at storing complex +data like JSON, so there are likely to be some small compatibility issues. +The following shows you how to read the stored data using Spark with Scala. + +Spark wants the data to be stored with no line separators, but JSON can contain them. +So we replace `\r` and `\n` with character sequences that are not likely to show up +in practice. JSON data can also conflict with CSV escape handling, especially if the +input data is not valid JSON. As such we also replace double quotes and commas just in +case. + +```scala +// Replace this with the actual path to read from +val readPath = "/data/tmp/DEBUG_JSON_DUMP" + +val df = spark.read. + schema("isLegacy boolean, path string, originalInput string, cpuOutput string, gpuOutput string"). + csv(readPath) + +val strUnescape = Seq("isLegacy") ++ Seq("path", "originalInput", "cpuOutput", "gpuOutput"). + map(c => s"""replace(replace(replace(replace($c, '**CR**', '\r'), '**LF**', '\n'), '**QT**', '"'), '**COMMA**', ',') as $c""") + +val data = df.selectExpr(strUnescape : _*) +``` \ No newline at end of file diff --git a/docs/dev/shims.md b/docs/dev/shims.md index 0315e5bd963..9a8e09d8295 100644 --- a/docs/dev/shims.md +++ b/docs/dev/shims.md @@ -68,17 +68,17 @@ Using JarURLConnection URLs we create a Parallel World of the current version wi Spark 3.0.2's URLs: ```text -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/ -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/spark3xx-common/ -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/spark302/ +jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/ +jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/spark3xx-common/ +jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/spark302/ ``` Spark 3.2.0's URLs : ```text -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/ -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/spark3xx-common/ -jar:file:/home/spark/rapids-4-spark_2.12-24.04.0.jar!/spark320/ 
+jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/ +jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/spark3xx-common/ +jar:file:/home/spark/rapids-4-spark_2.12-24.06.0.jar!/spark320/ ``` ### Late Inheritance in Public Classes diff --git a/docs/dev/testing.md b/docs/dev/testing.md index ed68f08392d..6a6c6a378eb 100644 --- a/docs/dev/testing.md +++ b/docs/dev/testing.md @@ -5,5 +5,5 @@ nav_order: 2 parent: Developer Overview --- An overview of testing can be found within the repository at: -* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-24.04/tests#readme) -* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-24.04/integration_tests#readme) +* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-24.06/tests#readme) +* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-24.06/integration_tests#readme) diff --git a/docs/download.md b/docs/download.md index e993b99bd01..f786f5a217d 100644 --- a/docs/download.md +++ b/docs/download.md @@ -18,7 +18,7 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or sub that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html) for more details. 
-## Release v24.04.1
+## Release v24.06.0
 ### Hardware Requirements:
 The plugin is tested on the following architectures:
@@ -41,7 +41,7 @@ The plugin is tested on the following architectures:
 Supported Spark versions:
  Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
  Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4
- Apache Spark 3.4.0, 3.4.1, 3.4.2
+ Apache Spark 3.4.0, 3.4.1, 3.4.2, 3.4.3
  Apache Spark 3.5.0, 3.5.1
 Supported Databricks runtime versions for Azure and AWS:
@@ -57,6 +57,7 @@ The plugin is tested on the following architectures:
  Spark runtime 1.1 LTS
  Spark runtime 2.0
  Spark runtime 2.1
+ Spark runtime 2.2
 *Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet
 for your hardware's minimum driver version.
@@ -67,14 +68,14 @@ for your hardware's minimum driver version.
 ### RAPIDS Accelerator's Support Policy for Apache Spark
 The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html)
-### Download RAPIDS Accelerator for Apache Spark v24.04.1
+### Download RAPIDS Accelerator for Apache Spark v24.06.0
 | Processor | Scala Version | Download Jar | Download Signature |
 |-----------|---------------|--------------|--------------------|
-| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1.jar.asc) |
-| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1.jar.asc) |
-| arm64 | Scala 2.12 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.1/rapids-4-spark_2.12-24.04.1-cuda11-arm64.jar.asc) |
-| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.04.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.04.1/rapids-4-spark_2.13-24.04.1-cuda11-arm64.jar.asc) |
+| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar.asc) |
+| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.06.0/rapids-4-spark_2.13-24.06.0.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.06.0/rapids-4-spark_2.13-24.06.0.jar.asc) |
+| arm64 | Scala 2.12 | [RAPIDS Accelerator v24.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0-cuda11-arm64.jar.asc) |
+| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.06.0/rapids-4-spark_2.13-24.06.0-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.06.0/rapids-4-spark_2.13-24.06.0-cuda11-arm64.jar.asc) |
 This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with
 CUDA 11.8 through CUDA 12.0.
@@ -83,28 +84,24 @@ CUDA 11.8 through CUDA 12.0.
 * Download the [PUB_KEY](https://keys.openpgp.org/search?q=sw-spark@nvidia.com).
 * Import the public key: `gpg --import PUB_KEY`
 * Verify the signature for Scala 2.12 jar:
- `gpg --verify rapids-4-spark_2.12-24.04.1.jar.asc rapids-4-spark_2.12-24.04.1.jar`
+ `gpg --verify rapids-4-spark_2.12-24.06.0.jar.asc rapids-4-spark_2.12-24.06.0.jar`
 * Verify the signature for Scala 2.13 jar:
- `gpg --verify rapids-4-spark_2.13-24.04.1.jar.asc rapids-4-spark_2.13-24.04.1.jar`
+ `gpg --verify rapids-4-spark_2.13-24.06.0.jar.asc rapids-4-spark_2.13-24.06.0.jar`
 The output of signature verify:
  gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) "
 ### Release Notes
-* New functionality and performance improvements for this release include:
-* Performance improvements for S3 reading.
-Refer to perfio.s3.enabled in [advanced_configs](./additional-functionality/advanced_configs.md) for more details.
-* Performance improvements when doing a joins on unique keys.
-* Enhanced decompression kernels for zstd and snappy.
-* Enhanced Parquet reading performance with modular kernels.
-* Added compatibility with Spark version 3.5.1.
-* Deprecated support for Databricks 10.4 ML LTS.
-* For updates on RAPIDS Accelerator Tools, please visit [this link](https://github.com/NVIDIA/spark-rapids-tools/releases).
+* Improve support for Unity Catalog on Databricks
+* Added support for parse_url PATH
+* Added support for array_filter
+* Added support for Spark 3.4.3
+* For updates on RAPIDS Accelerator Tools, please visit [this link](https://github.com/NVIDIA/spark-rapids-tools/releases)
 For a detailed list of changes, please refer to the
 [CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).
 ## Archived releases
-As new releases come out, previous ones will still be available in [archived releases](./archive.md).
\ No newline at end of file
+As new releases come out, previous ones will still be available in [archived releases](./archive.md).
diff --git a/docs/supported_ops.md b/docs/supported_ops.md index a9cd9ec13cb..fbafcfbf81d 100644 --- a/docs/supported_ops.md +++ b/docs/supported_ops.md @@ -2288,6 +2288,74 @@ are limited. +ArrayFilter +`filter` +Filter an input array using a given predicate +None +project +argument + + + + + + + + + + + + + + +PS
UTC is only supported TZ for child TIMESTAMP;
unsupported child types BINARY, CALENDAR, UDT
+ + + + + +function +S + + + + + + + + + + + + + + + + + + + +result + + + + + + + + + + + + + + +PS
UTC is only supported TZ for child TIMESTAMP;
unsupported child types BINARY, CALENDAR, UDT
+ + + + + ArrayIntersect `array_intersect` Returns an array of the elements in the intersection of array1 and array2, without duplicates @@ -2518,6 +2586,32 @@ are limited. +Expression +SQL Functions(s) +Description +Notes +Context +Param/Output +BOOLEAN +BYTE +SHORT +INT +LONG +FLOAT +DOUBLE +DATE +TIMESTAMP +STRING +DECIMAL +NULL +BINARY +CALENDAR +ARRAY +MAP +STRUCT +UDT + + ArrayRepeat `array_repeat` Returns the array containing the given input value (left) count (right) times @@ -2586,32 +2680,6 @@ are limited. -Expression -SQL Functions(s) -Description -Notes -Context -Param/Output -BOOLEAN -BYTE -SHORT -INT -LONG -FLOAT -DOUBLE -DATE -TIMESTAMP -STRING -DECIMAL -NULL -BINARY -CALENDAR -ARRAY -MAP -STRUCT -UDT - - ArrayTransform `transform` Transform elements in an array using the transform function. This is similar to a `map` in functional programming @@ -2910,6 +2978,32 @@ are limited. +Expression +SQL Functions(s) +Description +Notes +Context +Param/Output +BOOLEAN +BYTE +SHORT +INT +LONG +FLOAT +DOUBLE +DATE +TIMESTAMP +STRING +DECIMAL +NULL +BINARY +CALENDAR +ARRAY +MAP +STRUCT +UDT + + Asin `asin` Inverse sine @@ -3000,32 +3094,6 @@ are limited. -Expression -SQL Functions(s) -Description -Notes -Context -Param/Output -BOOLEAN -BYTE -SHORT -INT -LONG -FLOAT -DOUBLE -DATE -TIMESTAMP -STRING -DECIMAL -NULL -BINARY -CALENDAR -ARRAY -MAP -STRUCT -UDT - - Asinh `asinh` Inverse hyperbolic sine @@ -3343,6 +3411,32 @@ are limited. +Expression +SQL Functions(s) +Description +Notes +Context +Param/Output +BOOLEAN +BYTE +SHORT +INT +LONG +FLOAT +DOUBLE +DATE +TIMESTAMP +STRING +DECIMAL +NULL +BINARY +CALENDAR +ARRAY +MAP +STRUCT +UDT + + AttributeReference References an input column @@ -3391,32 +3485,6 @@ are limited. 
NS -Expression -SQL Functions(s) -Description -Notes -Context -Param/Output -BOOLEAN -BYTE -SHORT -INT -LONG -FLOAT -DOUBLE -DATE -TIMESTAMP -STRING -DECIMAL -NULL -BINARY -CALENDAR -ARRAY -MAP -STRUCT -UDT - - BRound `bround` Round an expression to d decimal places using HALF_EVEN rounding mode @@ -4044,6 +4112,54 @@ are limited. +BoundReference + +Reference to a bound variable +None +project +result +S +S +S +S +S +S +S +S +PS
UTC is only supported TZ for TIMESTAMP
+S +S +S +S +NS +PS
UTC is only supported TZ for child TIMESTAMP;
unsupported child types CALENDAR, UDT
+PS
UTC is only supported TZ for child TIMESTAMP;
unsupported child types CALENDAR, UDT
+PS
UTC is only supported TZ for child TIMESTAMP;
unsupported child types CALENDAR, UDT
+NS + + +AST +result +S +S +S +S +S +S +S +S +PS
UTC is only supported TZ for TIMESTAMP
+S +NS +NS +NS +NS +NS +NS +NS +NS + + CaseWhen `when` CASE WHEN expression @@ -8222,7 +8338,7 @@ are limited. JsonTuple `json_tuple` Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string. -This is disabled by default because JsonTuple on the GPU does not support all of the normalization that the CPU supports. +This is disabled by default because Experimental feature that could be unstable or have performance issues. project json @@ -10817,7 +10933,7 @@ are limited. -PS
only support partToExtract = PROTOCOL | HOST | QUERY;
Literal value only
+PS
only support partToExtract = PROTOCOL | HOST | QUERY | PATH;
Literal value only