merge main to ghpage (#11616)
Signed-off-by: liyuan <[email protected]>
nvliyuan authored Oct 22, 2024
1 parent 8d0ce56 commit 4f941b9
Showing 9 changed files with 1,051 additions and 512 deletions.
5 changes: 4 additions & 1 deletion docs/additional-functionality/advanced_configs.md
@@ -191,6 +191,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.ArrayExists"></a>spark.rapids.sql.expression.ArrayExists|`exists`|Return true if any element satisfies the predicate LambdaFunction|true|None|
<a name="sql.expression.ArrayFilter"></a>spark.rapids.sql.expression.ArrayFilter|`filter`|Filter an input array using a given predicate|true|None|
<a name="sql.expression.ArrayIntersect"></a>spark.rapids.sql.expression.ArrayIntersect|`array_intersect`|Returns an array of the elements in the intersection of array1 and array2, without duplicates|true|This is not 100% compatible with the Spark version because the GPU implementation treats -0.0 and 0.0 as equal, but the CPU implementation currently does not (see SPARK-39845). Also, Apache Spark 3.1.3 fixed issue SPARK-36741 where NaNs in these set like operators were not treated as being equal. We have chosen to break with compatibility for the older versions of Spark in this instance and handle NaNs the same as 3.1.3+|
<a name="sql.expression.ArrayJoin"></a>spark.rapids.sql.expression.ArrayJoin|`array_join`|Concatenates the elements of the given array using the delimiter and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.|true|None|
<a name="sql.expression.ArrayMax"></a>spark.rapids.sql.expression.ArrayMax|`array_max`|Returns the maximum value in the array|true|None|
<a name="sql.expression.ArrayMin"></a>spark.rapids.sql.expression.ArrayMin|`array_min`|Returns the minimum value in the array|true|None|
<a name="sql.expression.ArrayRemove"></a>spark.rapids.sql.expression.ArrayRemove|`array_remove`|Returns the array after removing all elements that equal to the input element (right) from the input array (left)|true|None|
@@ -255,7 +256,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.FromUnixTime"></a>spark.rapids.sql.expression.FromUnixTime|`from_unixtime`|Get the string from a unix timestamp|true|None|
<a name="sql.expression.GetArrayItem"></a>spark.rapids.sql.expression.GetArrayItem| |Gets the field at `ordinal` in the Array|true|None|
<a name="sql.expression.GetArrayStructFields"></a>spark.rapids.sql.expression.GetArrayStructFields| |Extracts the `ordinal`-th fields of all array elements for the data with the type of array of struct|true|None|
<a name="sql.expression.GetJsonObject"></a>spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|false|This is disabled by default because Experimental feature that could be unstable or have performance issues.|
<a name="sql.expression.GetJsonObject"></a>spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|true|None|
<a name="sql.expression.GetMapValue"></a>spark.rapids.sql.expression.GetMapValue| |Gets Value from a Map based on a key|true|None|
<a name="sql.expression.GetStructField"></a>spark.rapids.sql.expression.GetStructField| |Gets the named field of the struct|true|None|
<a name="sql.expression.GetTimestamp"></a>spark.rapids.sql.expression.GetTimestamp| |Gets timestamps from strings using given pattern.|true|None|
@@ -403,7 +404,9 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.First"></a>spark.rapids.sql.expression.First|`first_value`, `first`|first aggregate operator|true|None|
<a name="sql.expression.Last"></a>spark.rapids.sql.expression.Last|`last_value`, `last`|last aggregate operator|true|None|
<a name="sql.expression.Max"></a>spark.rapids.sql.expression.Max|`max`|Max aggregate operator|true|None|
<a name="sql.expression.MaxBy"></a>spark.rapids.sql.expression.MaxBy|`max_by`|MaxBy aggregate operator. It may produce different results than CPU when multiple rows in a group have same minimum value in the ordering column and different associated values in the value column.|true|None|
<a name="sql.expression.Min"></a>spark.rapids.sql.expression.Min|`min`|Min aggregate operator|true|None|
<a name="sql.expression.MinBy"></a>spark.rapids.sql.expression.MinBy|`min_by`|MinBy aggregate operator. It may produce different results than CPU when multiple rows in a group have same minimum value in the ordering column and different associated values in the value column.|true|None|
<a name="sql.expression.Percentile"></a>spark.rapids.sql.expression.Percentile|`percentile`|Aggregation computing exact percentile|true|None|
<a name="sql.expression.PivotFirst"></a>spark.rapids.sql.expression.PivotFirst| |PivotFirst operator|true|None|
<a name="sql.expression.StddevPop"></a>spark.rapids.sql.expression.StddevPop|`stddev_pop`|Aggregation computing population standard deviation|true|None|
93 changes: 89 additions & 4 deletions docs/archive.md
@@ -5,6 +5,95 @@ nav_order: 15
---
Below are archived releases for RAPIDS Accelerator for Apache Spark.

## Release v24.08.1
### Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA V100, T4, A10/A100, L4 and H100 GPUs

### Software Requirements:

OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8

NVIDIA Driver*: R470+

Runtime:
Scala 2.12, 2.13
Python and a Java Virtual Machine (JVM) compatible with your Spark version.

* Check the Spark documentation for Python and Java version compatibility with your specific
Spark version. For instance, visit `https://spark.apache.org/docs/3.4.1` for Spark 3.4.1.

Supported Spark versions:
Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4
Apache Spark 3.4.0, 3.4.1, 3.4.2, 3.4.3
Apache Spark 3.5.0, 3.5.1

Supported Databricks runtime versions for Azure and AWS:
Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0)
Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2)
Databricks 13.3 ML LTS (GPU, Scala 2.12, Spark 3.4.1)

Supported Dataproc versions (Debian/Ubuntu/Rocky):
GCP Dataproc 2.1
GCP Dataproc 2.2

Supported Dataproc Serverless versions:
Spark runtime 1.1 LTS
Spark runtime 2.0
Spark runtime 2.1
Spark runtime 2.2

*Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ.

### RAPIDS Accelerator's Support Policy for Apache Spark
The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html).

### Download RAPIDS Accelerator for Apache Spark v24.08.1

| Processor | Scala Version | Download Jar | Download Signature |
|-----------|---------------|--------------|--------------------|
| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar.asc) |
| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1.jar.asc) |
| arm64 | Scala 2.12 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1-cuda11-arm64.jar.asc) |
| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1-cuda11-arm64.jar.asc) |

This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with
CUDA 11.8 through CUDA 12.0.

### Verify signature
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature for Scala 2.12 jar:
`gpg --verify rapids-4-spark_2.12-24.08.1.jar.asc rapids-4-spark_2.12-24.08.1.jar`
* Verify the signature for Scala 2.13 jar:
`gpg --verify rapids-4-spark_2.13-24.08.1.jar.asc rapids-4-spark_2.13-24.08.1.jar`

The output of the signature verification:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"
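
Taken together, a minimal sketch of fetching and verifying the Scala 2.12 jar (using `wget` here is an assumption; any download tool works):

```
# Fetch the jar and its detached signature from Maven Central
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar.asc

# Import the NVIDIA Spark public key, then verify the signature
gpg --import PUB_KEY
gpg --verify rapids-4-spark_2.12-24.08.1.jar.asc rapids-4-spark_2.12-24.08.1.jar
```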

### Release Notes
* Support timezones with daylight savings shifts
* Improve metrics in Spark UI
* Refactor Parquet decode microkernels and support load balancing RLE runs
* Improve get_json performance
* Support dynamic scan filtering
* Improve UCX shuffle
* For updates on RAPIDS Accelerator Tools, please visit [this link](https://github.com/NVIDIA/spark-rapids-tools/releases)

Note: There is a known issue in the 24.08.1 release when decompressing gzip files on H100 GPUs.
Please find more details in [issue-16661](https://github.com/rapidsai/cudf/issues/16661).

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v24.06.1
### Hardware Requirements:

@@ -260,10 +349,6 @@ Refer to perfio.s3.enabled in [advanced_configs](./additional-functionality/adva
For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Archived releases

As new releases come out, previous ones will still be available in [archived releases](./archive.md).

## Release v24.04.0
### Hardware Requirements:

3 changes: 3 additions & 0 deletions docs/compatibility.md
@@ -651,6 +651,7 @@ guaranteed to produce the same results as the CPU:
- `dd/MM/yyyy`
- `yyyy/MM/dd`
- `yyyy-MM-dd`
- `yyyyMMdd`
- `yyyy/MM/dd HH:mm:ss`
- `yyyy-MM-dd HH:mm:ss`

@@ -659,6 +660,8 @@ LEGACY timeParserPolicy support has the following limitations when running on th
- Only 4 digit years are supported
- The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar
that Spark uses in legacy mode
- When the format is `yyyyMMdd`, the GPU only supports 8-digit strings; Spark accepts 7-digit
strings such as `2024101`, but the GPU does not (see the example below). Only the `UTC` and
`Asia/Shanghai` time zones have been tested.
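
A minimal sketch of exercising the `yyyyMMdd` pattern from the `spark-sql` CLI, assuming the plugin jar is already on the classpath (the invocation is illustrative; any SQL entry point works):

```
${SPARK_HOME}/bin/spark-sql \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  -e "SELECT to_date('20241022', 'yyyyMMdd')"   # 8-digit string: supported on GPU
```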

## Formatting dates and timestamps as strings

2 changes: 1 addition & 1 deletion docs/configs.md
@@ -10,7 +10,7 @@ The following is the list of options that `rapids-plugin-4-spark` supports.
On startup use: `--conf [conf key]=[conf value]`. For example:

```
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.08.1-cuda11.jar \
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.10.0-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.concurrentGpuTasks=2
```
13 changes: 10 additions & 3 deletions docs/dev/lore.md
@@ -29,16 +29,19 @@ By default, LORE id will always be generated for operators, but user could disab
by setting `spark.rapids.sql.lore.tag.enabled` to `false`.

To tell LORE the LORE ids of the operators you are interested in, you need to set
`spark.rapids.sql.lore.idsToDump`. For example, you could set it to "1[*], 2[*], 3[*]" to tell
`spark.rapids.sql.lore.idsToDump`. For example, you could set it to "1[\*], 2[\*], 3[\*]" to tell
LORE to dump all partitions of input data of operators with id 1, 2, or 3. You can also only dump
some partition of the operator's input by appending partition numbers to lore ids. For example,
"1[0 4-6 7], 2[*]" tell LORE to dump operator with LORE id 1, but only dump partition 0, 4, 5,
"1[0 4-6 7], 2[\*]" tell LORE to dump operator with LORE id 1, but only dump partition 0, 4, 5,
and 7, e.g. the end of the range is exclusive. But for operator with LORE id 2, it will dump all
partitions.

You also need to set `spark.rapids.sql.lore.dumpPath` to tell LORE where to dump the data, the
value of which should point to a directory. All dumped data of a query will live in this
directory. A typical directory hierarchy would look like this:
directory. Note that the directory must either not exist, in which case it will be created, or be empty.
If the directory exists and contains files, an `IllegalArgumentException` will be thrown to prevent overwriting existing data.

A typical directory hierarchy would look like this:

```console
+ loreId-10/
@@ -68,4 +71,8 @@ directory. A typical directory hierarchy would look like this:
- batch-0.parquet
```
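
A minimal sketch of turning the settings above into a session configuration (the dump path is an illustrative placeholder):

```
${SPARK_HOME}/bin/spark-shell \
  --conf spark.rapids.sql.lore.idsToDump="1[*], 2[*]" \
  --conf spark.rapids.sql.lore.dumpPath=/tmp/lore-dump   # must not exist yet, or be empty
```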

# Limitations

1. Currently, the LORE id is missing when the RDD of a `DataFrame` is used directly.
2. Not all operators are supported by LORE. For example, shuffle-related operators (e.g.
`GpuShuffleExchangeExec`) and leaf operators (e.g. `GpuFileSourceScanExec`) are not supported.
12 changes: 6 additions & 6 deletions docs/dev/shims.md
@@ -68,17 +68,17 @@ Using JarURLConnection URLs we create a Parallel World of the current version wi
Spark 3.0.2's URLs:

```text
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/spark-shared/
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/spark302/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/spark-shared/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/spark302/
```

Spark 3.2.0's URLs:

```text
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/spark-shared/
jar:file:/home/spark/rapids-4-spark_2.12-24.08.1.jar!/spark320/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/spark-shared/
jar:file:/home/spark/rapids-4-spark_2.12-24.10.0.jar!/spark320/
```
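
The parallel worlds are plain directories inside the released jar, so the layout behind these URLs can be inspected directly; a minimal sketch (the exact set of shim directories in a given release is an assumption):

```
# List the shim directory entries packaged in the jar
unzip -l rapids-4-spark_2.12-24.10.0.jar | grep -E 'spark(-shared|302|320)/' | head
```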

### Late Inheritance in Public Classes
4 changes: 2 additions & 2 deletions docs/dev/testing.md
@@ -5,5 +5,5 @@ nav_order: 2
parent: Developer Overview
---
An overview of testing can be found within the repository at:
* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-24.08/tests#readme)
* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-24.08/integration_tests#readme)
* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-24.10/tests#readme)
* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-24.10/integration_tests#readme)
32 changes: 16 additions & 16 deletions docs/download.md
@@ -18,7 +18,7 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or sub
that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started
guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html) for more details.

## Release v24.08.1
## Release v24.10.0
### Hardware Requirements:

The plugin is tested on the following architectures:
@@ -42,7 +42,7 @@ The plugin is tested on the following architectures:
Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4
Apache Spark 3.4.0, 3.4.1, 3.4.2, 3.4.3
Apache Spark 3.5.0, 3.5.1
Apache Spark 3.5.0, 3.5.1, 3.5.2

Supported Databricks runtime versions for Azure and AWS:
Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0)
@@ -68,14 +68,14 @@ for your hardware's minimum driver version.
### RAPIDS Accelerator's Support Policy for Apache Spark
The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html).

### Download RAPIDS Accelerator for Apache Spark v24.08.1
### Download RAPIDS Accelerator for Apache Spark v24.10.0

| Processor | Scala Version | Download Jar | Download Signature |
|-----------|---------------|--------------|--------------------|
| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar.asc) |
| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1.jar.asc) |
| arm64 | Scala 2.12 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1-cuda11-arm64.jar.asc) |
| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.08.1](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.08.1/rapids-4-spark_2.13-24.08.1-cuda11-arm64.jar.asc) |
| x86_64 | Scala 2.12 | [RAPIDS Accelerator v24.10.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0.jar.asc) |
| x86_64 | Scala 2.13 | [RAPIDS Accelerator v24.10.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.10.0/rapids-4-spark_2.13-24.10.0.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.10.0/rapids-4-spark_2.13-24.10.0.jar.asc) |
| arm64 | Scala 2.12 | [RAPIDS Accelerator v24.10.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.10.0/rapids-4-spark_2.12-24.10.0-cuda11-arm64.jar.asc) |
| arm64 | Scala 2.13 | [RAPIDS Accelerator v24.10.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.10.0/rapids-4-spark_2.13-24.10.0-cuda11-arm64.jar) | [Signature](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/24.10.0/rapids-4-spark_2.13-24.10.0-cuda11-arm64.jar.asc) |

This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with
CUDA 11.8 through CUDA 12.0.
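
As an alternative to downloading manually, the jar can be resolved from Maven Central at launch time; a minimal sketch using the coordinates behind the table above:

```
${SPARK_HOME}/bin/spark-shell \
  --packages com.nvidia:rapids-4-spark_2.12:24.10.0 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
```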
@@ -84,24 +84,24 @@ CUDA 11.8 through CUDA 12.0.
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature for Scala 2.12 jar:
`gpg --verify rapids-4-spark_2.12-24.08.1.jar.asc rapids-4-spark_2.12-24.08.1.jar`
`gpg --verify rapids-4-spark_2.12-24.10.0.jar.asc rapids-4-spark_2.12-24.10.0.jar`
* Verify the signature for Scala 2.13 jar:
`gpg --verify rapids-4-spark_2.13-24.08.1.jar.asc rapids-4-spark_2.13-24.08.1.jar`
`gpg --verify rapids-4-spark_2.13-24.10.0.jar.asc rapids-4-spark_2.13-24.10.0.jar`

The output of the signature verification:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"

### Release Notes
* Support timezones with daylight savings shifts
* Improve metrics in Spark UI
* Refactor Parquet decode microkernels and support load balancing RLE runs
* Improve get_json performance
* Support dynamic scan filtering
* Improve UCX shuffle
* Optimize scheduling policy for GPU Semaphore
* Support distinct join for right outer joins
* Support MinBy and MaxBy for non-float ordering
* Support ArrayJoin expression
* Optimize Expand and Aggregate expression performance
* Improve JSON related expressions
* For updates on RAPIDS Accelerator Tools, please visit [this link](https://github.com/NVIDIA/spark-rapids-tools/releases)

Note: There is a known issue in the 24.08.1 release when decompressing gzip files on H100 GPUs.
Note: There is a known issue in the 24.10.0 release when decompressing gzip files on H100 GPUs.
Please find more details in [issue-16661](https://github.com/rapidsai/cudf/issues/16661).

For a detailed list of changes, please refer to the