---
layout: page
title: Compatibility
nav_order: 5
---
The SQL plugin tries to produce results that are bit for bit identical with Apache Spark, but there are a number of cases where the results differ. In most cases operators that produce different results are off by default, and you can look at the configs for more information on how to enable them. In some cases we felt that enabling the incompatibility by default was worth the performance gain. All of those operators can be disabled through configs if it becomes a problem. Please also look at the current list of bugs, which are typically incompatibilities that we have not yet addressed.
There are some operators where Spark does not guarantee the order of the output. These are typically things like aggregates and joins that may use a hash to distribute the workload among downstream tasks. In these cases the plugin does not guarantee that it will produce the same output order that Spark does. In cases such as an `order by` operation, where the ordering is explicit, the plugin will produce an ordering that is compatible with Spark's guarantee. It may not be 100% identical if the ordering is ambiguous.
In versions of Spark prior to 3.1.0 `-0.0` is always < `0.0`, but in 3.1.0 and above this is not true for sorting. For all versions of the plugin `-0.0` == `0.0` for sorting.
Spark's sorting is typically a stable sort. Sort stability cannot be guaranteed in distributed workloads because the order in which upstream data arrives at a task is not guaranteed. Sort stability is only guaranteed in one situation: reading and sorting data from a file using a single task/partition. The RAPIDS Accelerator does an unstable out-of-core sort by default. This simply means that the sort algorithm allows for spilling parts of the data if it is larger than can fit in the GPU's memory, but it does not guarantee ordering of rows when the ordering of the keys is ambiguous. If you do rely on a stable sort in your processing you can request this by setting `spark.rapids.sql.stableSort.enabled` to `true`, and RAPIDS will try to sort all the data for a given task/partition at once on the GPU. This may change in the future to allow for a spillable stable sort.
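For example, a minimal sketch of opting in to the stable sort (it assumes the RAPIDS Accelerator plugin itself is already on the classpath and enabled for the session):

```scala
// Sketch only: ask RAPIDS to sort each task/partition's data in one pass on the GPU.
// Assumes the RAPIDS Accelerator jar is on the classpath and registered via spark.plugins.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.stableSort.enabled", "true")
  .getOrCreate()
```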
For most basic floating-point operations like addition, subtraction, multiplication, and division the plugin will produce a bit for bit identical result as Spark does. For other functions like `sin`, `cos`, etc. the output may be different, but within the rounding error inherent in floating-point calculations. The ordering of operations to calculate the value may differ between the underlying JVM implementation used by the CPU and the C++ standard library implementation used by the GPU.
In the case of `round` and `bround` the results can be off by more because they can enlarge the difference. This happens in cases where a binary floating-point representation cannot exactly capture a decimal value. For example `1.025` cannot exactly be represented and ends up being closer to `1.02499`. The Spark implementation of `round` converts it first to a decimal value with complex logic to make it `1.025` and then does the rounding. This results in `round(1.025, 2)` under pure Spark getting a value of `1.03`, but under the RAPIDS Accelerator it produces `1.02`. As a side note, Python will produce `1.02`, and Java does not have the ability to do a round like this built in, but if you do the simple operation of `Math.round(1.025 * 100.0) / 100.0` you also get `1.02`.
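A small Scala illustration of why the results diverge; the exact digits printed depend on the platform, but the direction of the rounding does not:

```scala
// The double literal 1.025 is stored as a binary value slightly below 1.025,
// so rounding to two decimal places without Spark's decimal conversion yields 1.02, not 1.03.
println(new java.math.BigDecimal(1.025))   // prints the exact stored value, slightly less than 1.025
println(Math.round(1.025 * 100.0) / 100.0) // prints 1.02
```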
For aggregations the underlying implementation is doing the aggregations in parallel and due to race conditions within the computation itself the result may not be the same each time the query is run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a query joins on a floating point value, which is not wise to do anyway, and the value is the result of a floating point aggregation, then the join may fail to work properly with the plugin but would have worked with plain Spark. Because of this most floating point aggregations are off by default but can be enabled with the config `spark.rapids.sql.variableFloatAgg.enabled`.
Additionally, some aggregations on floating point columns that contain `NaN` can produce results different from Spark in versions prior to Spark 3.1.0. If it is known with certainty that the floating point columns do not contain `NaN`, set `spark.rapids.sql.hasNans` to `false` to run GPU-enabled aggregations on them.
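A hedged sketch of opting in to both behaviors described above, assuming you accept run-to-run variation and know the data contains no `NaN` values:

```scala
// Allow GPU floating point aggregations even though results may vary slightly between runs
spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", "true")
// Assert that floating point columns in this workload never contain NaN values
spark.conf.set("spark.rapids.sql.hasNans", "false")
```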
In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue only shows up if you have different `NaN` values. There are several different binary values that are all considered to be `NaN` by floating point. The plugin treats all of these as the same value, whereas Spark treats them all as different values. Because this is considered to be rare we do not disable distinct count for floating point values even if `spark.rapids.sql.hasNans` is `true`.
Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they should be interpreted as the same. Most databases normalize these values to always be `0.0`. Spark does this in some cases but not all, as is documented here. The underlying implementation of this plugin treats them as the same for essentially all processing. This can result in some differences with Spark for operations, prior to Spark 3.1.0, like sorting and distinct count. There are still differences with joins and comparisons even after Spark 3.1.0. We do not disable operations that produce different results due to `-0.0` in the data because it is considered to be a rare occurrence.
Spark delegates Unicode operations to the underlying JVM. Each version of Java complies with a specific version of the Unicode standard. The SQL plugin does not use the JVM for Unicode support and is compatible with Unicode version 12.1. Because of this there may be corner cases where Spark will produce a different result compared to the plugin.
Due to inconsistencies in how CSV data is parsed, CSV parsing is off by default. Each data type can be enabled or disabled independently using the following configs.
- `spark.rapids.sql.csv.read.bool.enabled`
- `spark.rapids.sql.csv.read.byte.enabled`
- `spark.rapids.sql.csv.read.date.enabled`
- `spark.rapids.sql.csv.read.double.enabled`
- `spark.rapids.sql.csv.read.float.enabled`
- `spark.rapids.sql.csv.read.integer.enabled`
- `spark.rapids.sql.csv.read.long.enabled`
- `spark.rapids.sql.csv.read.short.enabled`
- `spark.rapids.sql.csvTimestamps.enabled`
If you know that your particular data type will be parsed correctly enough, you may enable each type you expect to use. Often the performance improvement is so good that it is worth checking if it is parsed correctly.
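For example, a sketch that enables only the column types a particular job reads, after verifying they parse correctly on your data:

```scala
// Opt in to GPU CSV parsing for just the types this job needs; other types stay on the CPU
spark.conf.set("spark.rapids.sql.csv.read.integer.enabled", "true")
spark.conf.set("spark.rapids.sql.csv.read.long.enabled", "true")
spark.conf.set("spark.rapids.sql.csv.read.double.enabled", "true")
```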
Spark is generally very strict when reading CSV and if the data does not conform with the expected format exactly it will result in a `null` value. The underlying parser that the RAPIDS Accelerator uses is much more lenient. If you have badly formatted CSV data you may get data back instead of nulls.
Spark allows for stripping leading and trailing white space using various options that are off by default. The plugin will strip leading and trailing space for all values except strings.
There are also discrepancies/issues with specific types that are detailed below.
Invalid values like `BAD` show up as `true` as described by this issue. This is the same for all other types, but because that is the only issue with boolean parsing we have called it out specifically here.
Writing strings to a CSV file in general for Spark can be problematic unless you can ensure that your data does not have any line delimiters in it. The GPU accelerated CSV parser handles quoted line delimiters similar to `multiLine` mode, but there are still a number of issues surrounding it and they should be avoided.
Escaped quote characters `'\"'` are not supported well, as described by this issue.
Parsing a `timestamp` as a `date` does not work. The details are documented in this issue.
Only a limited set of formats are supported when parsing dates.
"yyyy-MM-dd"
"yyyy/MM/dd"
"yyyy-MM"
"yyyy/MM"
"MM-yyyy"
"MM/yyyy"
"MM-dd-yyyy"
"MM/dd/yyyy"
The reality is that all of these formats are supported at the same time. The plugin will only disable itself if you set a format that it does not support.
As a workaround you can parse the column as a timestamp and then cast it to a date.
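A minimal sketch of that workaround; the file path, schema, column names, and timestamp format are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Read the column as a timestamp (a supported parse path), then cast it down to a date
val events = spark.read
  .schema("event_time TIMESTAMP")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .csv("/data/events.csv")
  .withColumn("event_date", col("event_time").cast("date"))
```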
In Spark, invalid dates (values that have the correct format, but whose numbers produce an invalid date) can result in an exception by default, and how they are parsed can be controlled through a config. The RAPIDS Accelerator does not support any of this and will produce an incorrect date, typically one that overflowed.
The CSV parser does not support time zones. It will ignore any trailing time zone information, despite the format asking for a `XXX` or `[XXX]`. As such, CSV timestamp parsing is off by default and you can enable it by setting `spark.rapids.sql.csvTimestamps.enabled` to `true`.
The formats supported for timestamps are limited similar to dates. The first part of the format must be a supported date format. The second part must start with a `'T'` to separate the time portion, followed by one of the following formats:

- `HH:mm:ss.SSSXXX`
- `HH:mm:ss[.SSS][XXX]`
- `HH:mm`
- `HH:mm:ss`
- `HH:mm[:ss]`
- `HH:mm:ss.SSS`
- `HH:mm:ss[.SSS]`
Just like with dates all timestamp formats are actually supported at the same time. The plugin will disable itself if it sees a format it cannot support.
In Spark, invalid timestamps (ones that have the correct format, but whose numbers produce an invalid date or time) can result in an exception by default, and how they are parsed can be controlled through a config. The RAPIDS Accelerator does not support any of this and will produce an incorrect date, typically one that overflowed.
The CSV parser is not able to parse `NaN` values. These are likely to be turned into null values, as described in this issue. Some floating-point values also appear to overflow but do not for the CPU, as described in this issue. Any number that overflows will not be turned into a null value. Also, parsing of some values will not produce bit for bit identical results to what the CPU does. They are within round-off errors except when they are close enough to overflow to Inf or -Inf, which then results in a number being returned when the CPU would have returned null.
The ORC format has fairly complete support for both reads and writes. There are only a few known issues. The first is for reading timestamps and dates around the transition between the Julian and Gregorian calendars, as described here. A similar issue exists for writing dates, as described here. Writing timestamps, however, only appears to work for dates after the epoch, as described here.
The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed` and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the CPU when reading an unsupported compression format, and will error out in that case.
The Parquet format has more configs because there are multiple versions with some compatibility issues between them. Dates and timestamps are where the known issues exist. For reads, when `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED`, timestamps before the transition between the Julian and Gregorian calendars are wrong, but dates are fine. When `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `LEGACY`, the read may fail for values occurring before the transition between the Julian and Gregorian calendars, i.e. date <= 1582-10-04. When writing, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently ignored, as described here.
When `spark.sql.parquet.outputTimestampType` is set to `INT96`, the timestamps will overflow and result in an `IllegalArgumentException` being thrown if any value is before September 21, 1677 12:12:43 AM or after April 11, 2262 11:47:17 PM. To get around this issue, turn off the ParquetWriter acceleration for timestamp columns by setting `spark.rapids.sql.format.parquet.writer.int96.enabled` to `false`, or set `spark.sql.parquet.outputTimestampType` to `TIMESTAMP_MICROS` or `TIMESTAMP_MILLIS` to bypass the issue entirely.
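A sketch of the two alternatives described above:

```scala
// Option 1: write TIMESTAMP_MICROS (or TIMESTAMP_MILLIS) instead of INT96
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

// Option 2: keep INT96 output but disable the accelerated Parquet write of timestamp columns
spark.conf.set("spark.rapids.sql.format.parquet.writer.int96.enabled", "false")
```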
The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing `uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to fall back to the CPU when reading an unsupported compression format, and will error out in that case.
The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard matches.
If a null char `'\0'` is in a string that is being matched by a regular expression, `LIKE` sees it as the end of the string. This will be fixed in a future release. The issue is here.
Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp between time zones is not currently supported on the GPU. Therefore operations involving timestamps will only be GPU-accelerated if the time zone used by the JVM is UTC.
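A sketch of keeping everything in UTC; the driver and executor JVMs must also default to UTC (for example via `-Duser.timezone=UTC`) for timestamp operations to be eligible:

```scala
// Keep Spark's session time zone in UTC so timestamp expressions remain GPU-eligible
spark.conf.set("spark.sql.session.timeZone", "UTC")
```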
Because of ordering differences between the CPU and the GPU, window functions, especially row-based window functions like `row_number`, `lead`, and `lag`, can produce different results if the ordering includes both `-0.0` and `0.0`, or if the ordering is ambiguous. Spark can also produce different results from one run to another if the ordering of a window function is ambiguous.
When the order-by column of a range-based window is a numeric type like byte/short/int/long and the range boundary calculated for a value overflows, the CPU and GPU will get different results. For example, consider the following dataset:
```
+------+---------+
| id   | dollars |
+------+---------+
| 1    | NULL    |
| 1    | 13      |
| 1    | 14      |
| 1    | 15      |
| 1    | 15      |
| 1    | 17      |
| 1    | 18      |
| 1    | 52      |
| 1    | 53      |
| 1    | 61      |
| 1    | 65      |
| 1    | 72      |
| 1    | 73      |
| 1    | 75      |
| 1    | 78      |
| 1    | 84      |
| 1    | 85      |
| 1    | 86      |
| 1    | 92      |
| 1    | 98      |
+------+---------+
```
After executing the SQL statement:
```sql
SELECT
  COUNT(dollars) OVER
    (PARTITION BY id
     ORDER BY CAST (dollars AS Byte) ASC
     RANGE BETWEEN 127 PRECEDING AND 127 FOLLOWING)
FROM table
```
The results will differ between the CPU and GPU due to overflow handling.
```
CPU: WrappedArray([0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0])
GPU: WrappedArray([0], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19])
```
To enable byte-range windowing on the GPU, set `spark.rapids.sql.window.range.byte.enabled` to `true`.
We also provide configurations for other integral range types:

- `spark.rapids.sql.window.range.short.enabled`
- `spark.rapids.sql.window.range.int.enabled`
- `spark.rapids.sql.window.range.long.enabled`
The reason we default the configurations to false for byte/short and to true for int/long is that we think most real-world queries are based on int or long.
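For example, a sketch that opts in to the smaller range types when you know the order-by values cannot overflow the range boundaries:

```scala
// Byte and short range windows are off by default; int and long are on by default
spark.conf.set("spark.rapids.sql.window.range.byte.enabled", "true")
spark.conf.set("spark.rapids.sql.window.range.short.enabled", "true")
```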
When converting strings to dates or timestamps using functions like `to_date` and `unix_timestamp`, the specified format string will fall into one of three categories:
- Supported on GPU and 100% compatible with Spark
- Supported on GPU but may produce different results to Spark
- Unsupported on GPU
The formats which are supported on the GPU and 100% compatible with Spark are:

- `dd/MM/yyyy`
- `yyyy/MM`
- `yyyy/MM/dd`
- `yyyy-MM`
- `yyyy-MM-dd`
- `yyyy-MM-dd HH:mm:ss`
- `MM-dd`
- `MM/dd`
- `dd-MM`
- `dd/MM`
Examples of supported formats that may produce different results are:
- Trailing characters (including whitespace) may return a non-null value on the GPU where Spark would return null
To attempt to use other formats on the GPU, set `spark.rapids.sql.incompatibleDateFormats.enabled` to `true`.
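For example, a sketch contrasting the two categories; the example DataFrame and column name are hypothetical:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical example data
val df = Seq("2024-01-15", "2023-12-31").toDF("date_str")

// "yyyy-MM-dd" is in the fully compatible list above, so no extra configuration is needed
val withDate = df.withColumn("d", to_date(col("date_str"), "yyyy-MM-dd"))

// Other GPU-supported formats require the opt-in and may differ from Spark in corner cases
spark.conf.set("spark.rapids.sql.incompatibleDateFormats.enabled", "true")
```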
Formats that contain any of the following characters are unsupported and will fall back to CPU:
```
'k', 'K','z', 'V', 'c', 'F', 'W', 'Q', 'q', 'G', 'A', 'n', 'N',
'O', 'X', 'p', '\'', '[', ']', '#', '{', '}', 'Z', 'w', 'e', 'E', 'x', 'Z', 'Y'
```
Formats that contain any of the following words are unsupported and will fall back to CPU:
"u", "uu", "uuu", "uuuu", "uuuuu", "uuuuuu", "uuuuuuu", "uuuuuuuu", "uuuuuuuuu", "uuuuuuuuuu",
"y", "yy", yyy", "yyyyy", "yyyyyy", "yyyyyyy", "yyyyyyyy", "yyyyyyyyy", "yyyyyyyyyy",
"D", "DD", "DDD", "s", "m", "H", "h", "M", "MMM", "MMMM", "MMMMM", "L", "LLL", "LLLL", "LLLLL",
"d", "S", "SS", "SSS", "SSSS", "SSSSS", "SSSSSSSSS", "SSSSSSS", "SSSSSSSS"
When formatting dates and timestamps as strings using functions such as `from_unixtime`, only a subset of valid format strings are supported on the GPU.
Formats that contain any of the following characters are unsupported and will fall back to CPU:
```
'k', 'K','z', 'V', 'c', 'F', 'W', 'Q', 'q', 'G', 'A', 'n', 'N',
'O', 'X', 'p', '\'', '[', ']', '#', '{', '}', 'Z', 'w', 'e', 'E', 'x', 'Z', 'Y'
```
Formats that contain any of the following words are unsupported and will fall back to CPU:
"u", "uu", "uuu", "uuuu", "uuuuu", "uuuuuu", "uuuuuuu", "uuuuuuuu", "uuuuuuuuu", "uuuuuuuuuu",
"y", yyy", "yyyyy", "yyyyyy", "yyyyyyy", "yyyyyyyy", "yyyyyyyyy", "yyyyyyyyyy",
"D", "DD", "DDD", "s", "m", "H", "h", "M", "MMM", "MMMM", "MMMMM", "L", "LLL", "LLLL", "LLLLL",
"d", "S", "SS", "SSS", "SSSS", "SSSSS", "SSSSSSSSS", "SSSSSSS", "SSSSSSSS"
Note that this list differs very slightly from the list given in the previous section for parsing strings to dates, because the two-digit year format `"yy"` is supported when formatting dates as strings but not when parsing strings to dates.
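A sketch using only supported pattern letters with `from_unixtime`; the example DataFrame and column name are hypothetical:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_unixtime}

// Hypothetical example data: seconds since the epoch
val df = Seq(1706000000L, 1706086400L).toDF("epoch_secs")

// None of these pattern letters appear in the unsupported lists above, so this can stay on the GPU
val formatted = df.withColumn("ts_str", from_unixtime(col("epoch_secs"), "yyyy-MM-dd HH:mm:ss"))
```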
In general, performing `cast` and `ansi_cast` operations on the GPU is compatible with the same operations on the CPU. However, there are some exceptions. For this reason, certain casts are disabled on the GPU by default and require configuration options to be specified to enable them.
The GPU will use a different strategy from Java's BigDecimal to handle/store decimal values, which leads to restrictions:

- It is only available when `ansiMode` is on.
- Float values cannot be larger than `1e18` or smaller than `-1e18` after conversion.
- The results produced by the GPU may differ slightly from the default results of Spark.
To enable this operation on the GPU, set `spark.rapids.sql.castFloatToDecimal.enabled` to `true` and set `spark.sql.ansi.enabled` to `true`.
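A sketch of setting both configs together:

```scala
// Casting floating point values to decimal on the GPU requires ANSI mode plus the explicit opt-in
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.rapids.sql.castFloatToDecimal.enabled", "true")
```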
With both `cast` and `ansi_cast`, Spark uses the expression `Math.floor(x) <= MAX && Math.ceil(x) >= MIN` to determine whether a floating-point value can be converted to an integral type. Prior to Spark 3.1.0 the MIN and MAX values were floating-point values such as `Int.MaxValue.toFloat`, but starting with 3.1.0 these are now integral types such as `Int.MaxValue`, so this has slightly affected the valid range of values and now differs slightly from the behavior on the GPU in some cases.
To enable this operation on the GPU when using Spark 3.1.0 or later, set `spark.rapids.sql.castFloatToIntegralTypes.enabled` to `true`. This configuration setting is ignored when using Spark versions prior to 3.1.0.
The GPU will use different precision than Java's toString method when converting floating-point data types to strings and this can produce results that differ from the default behavior in Spark.
To enable this operation on the GPU, set `spark.rapids.sql.castFloatToString.enabled` to `true`.
Casting from string to floating-point types on the GPU returns incorrect results when the string represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.

- `1.7976931348623158E308 <= x < 1.7976931348623159E308`
- `-1.7976931348623159E308 < x <= -1.7976931348623158E308`
Also, the GPU does not support casting from strings containing hex values.
To enable this operation on the GPU, set `spark.rapids.sql.castStringToFloat.enabled` to `true`.
The following formats/patterns are supported on the GPU. Timezone of UTC is assumed.
Format or Pattern | Supported on GPU? |
---|---|
"yyyy" | Yes |
"yyyy-[M]M" | Yes |
"yyyy-[M]M " | Yes |
"yyyy-[M]M-[d]d" | Yes |
"yyyy-[M]M-[d]d " | Yes |
"yyyy-[M]M-[d]d *" | Yes |
"yyyy-[M]M-[d]d T*" | Yes |
"epoch" | Yes |
"now" | Yes |
"today" | Yes |
"tomorrow" | Yes |
"yesterday" | Yes |
To allow casts from string to timestamp on the GPU, enable the configuration property `spark.rapids.sql.castStringToTimestamp.enabled`.
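A sketch of opting in and performing the cast; the example DataFrame and column name are hypothetical:

```scala
import spark.implicits._

// Opt in to GPU string-to-timestamp casts, then cast a column whose values match the formats below
spark.conf.set("spark.rapids.sql.castStringToTimestamp.enabled", "true")
val withTs = Seq("2024-01-15 12:34:56.000000Z").toDF("ts_str")
  .selectExpr("CAST(ts_str AS TIMESTAMP) AS ts")
```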
Casting from string to timestamp currently has the following limitations.
Format or Pattern | Supported on GPU? |
---|---|
"yyyy" | Yes |
"yyyy-[M]M" | Yes |
"yyyy-[M]M " | Yes |
"yyyy-[M]M-[d]d" | Yes |
"yyyy-[M]M-[d]d " | Yes |
"yyyy-[M]M-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]" | Partial [1] |
"yyyy-[M]M-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]" | Partial [1] |
"[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]" | Partial [1] |
"T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]" | Partial [1] |
"epoch" | Yes |
"now" | Yes |
"today" | Yes |
"tomorrow" | Yes |
"yesterday" | Yes |
- [1] The timestamp portion must be complete in terms of hours, minutes, seconds, and milliseconds, with 2 digits each for hours, minutes, and seconds, and 6 digits for milliseconds. Only timezone 'Z' (UTC) is supported. Casting unsupported formats will result in null values.
`ConstantFolding` is an operator optimization rule in Catalyst that replaces expressions that can be statically evaluated with their equivalent literal values. The RAPIDS Accelerator relies on constant folding, and parts of the query will not be accelerated if `org.apache.spark.sql.catalyst.optimizer.ConstantFolding` is excluded as a rule.
The 0.5 release introduces the `get_json_object` operation. The JSON specification only allows double quotes around strings in JSON data, whereas Spark allows single quotes around strings in JSON data. The RAPIDS Spark `get_json_object` operation on the GPU will return `None` in PySpark or `Null` in Scala when trying to match a string surrounded by single quotes. This behavior will be updated in a future release to more closely match Spark.