Change log

Generated on 2024-06-13

Release 24.06

Features

#10850 [FEA] Refine the test framework introduced in #10745
#6969 [FEA] Support parse_url
#10496 [FEA] Drop support for CentOS7
#10760 [FEA] Support ArrayFilter
#10721 [FEA] Dump the complete set of build-info properties to the Spark eventLog
#10666 [FEA] Create Spark 3.4.3 shim

Performance

#8963 [FEA] Use custom kernel for parse_url
#10817 [FOLLOW ON] Combining regex parsing in transpiling and regex rewrite in rlike
#10821 Rewrite pattern[A-B]{X,Y} (a pattern string followed by X to Y chars in range A - B) in RLIKE to a custom kernel

Bugs Fixed

#10928 [BUG] 24.06 test_conditional_with_side_effects_case_when test failed on Scala 2.13 with DATAGEN_SEED=1716656294
#10941 [BUG] Failed to build on databricks due to GpuOverrides.scala:4264: not found: type GpuSubqueryBroadcastMeta
#10902 Spark UT failed: SPARK-37360: Timestamp type inference for a mix of TIMESTAMP_NTZ and TIMESTAMP_LTZ
#10899 [BUG] format_number Spark UT failed because Type conversion is not allowed
#10913 [BUG] rlike with empty pattern failed with 'NoSuchElementException' when enabling regex rewrite
#10774 [BUG] Issues found by Spark UT Framework on RapidsRegexpExpressionsSuite
#10606 [BUG] Update Plugin to use the new getPartitionedFile method
#10806 [BUG] orc_write_test.py::test_write_round_trip_corner failed with DATAGEN_SEED=1715517863
#10831 [BUG] Failed to read data from iceberg
#10810 [BUG] NPE when running ParseUrl tests in RapidsStringExpressionsSuite
#10797 [BUG] udf_test test_single_aggregate_udf, test_group_aggregate_udf and test_group_apply_udf_more_types failed on DB 13.3
#10719 [BUG] test_exact_percentile_groupby FAILED: hash_aggregate_test.py::test_exact_percentile_groupby with DATAGEN seed 1713362217
#10738 [BUG] test_exact_percentile_groupby_partial_fallback_to_cpu failed with DATAGEN_SEED=1713928179
#10768 [DOC] Dead links with tools pages
#10751 [BUG] Cascaded Pandas UDFs not working as expected on Databricks when plugin is enabled
#10318 [BUG] fs.azure.account.keyInvalid configuration issue while reading from Unity Catalog Tables on Azure DB
#10722 [BUG] "Could not find any rapids-4-spark jars in classpath" error when debugging UT in IDEA
#10724 [BUG] Failed to convert string with invisible characters to float
#10633 [BUG] ScanJson and JsonToStructs can give almost random errors
#10659 [BUG] from_json ArrayIndexOutOfBoundsException in 24.02
#10656 [BUG] Databricks cache tests failing with host memory OOM

PRs

#11052 Add spark343 shim for scala2.13 dist jar
#10981 Update latest changelog [skip ci]
#10984 [DOC] Update docs for 24.06.0 release [skip ci]
#10974 Update rapids JNI and private dependency to 24.06.0
#10947 Prevent contains-PrefixRange optimization if not preceded by wildcards
#10934 Revert "Add Support for Multiple Filtering Keys for Subquery Broadcast "
#10870 Add support for self-contained profiling
#10903 Use upper case for LEGACY_TIME_PARSER_POLICY to fix a spark UT
#10900 Fix type convert error in format_number scalar input
#10868 Disable default cuDF pinned pool
#10914 Fix NoSuchElementException when rlike with empty pattern
#10858 Add Support for Multiple Filtering Keys for Subquery Broadcast
#10861 refine ut framework including Part 1 and Part 2
#10872 [DOC] ignore released plugin links to reduce the bother info [skip ci]
#10839 Replace anonymous classes for SortOrder and FilterExec overrides
#10873 Auto merge PRs to branch-24.08 from branch-24.06 [skip ci]
#10860 [Spark 4.0] Account for PartitionedFileUtil.getPartitionedFile signature change.
#10822 Rewrite regex pattern literal[a-b]{x} to custom kernel in rlike
#10833 Filter out unused json_path tokens
#10855 Fix auto merge conflict 10845 [skip ci]
#10826 Add NVTX ranges to identify Spark stages and tasks
#10846 Update latest changelog [skip ci]
#10836 Catch exceptions when trying to examine Iceberg scan for metadata queries
#10824 Support zstd for GPU shuffle compression
#10828 Added DateTimeUtilsShims [Databricks]
#10829 Fix Inheritance Shadowing to add support for Spark 4.0.0
#10811 Fix NPE in GpuParseUrl for null keys.
#10723 Implement chunked ORC reader
#10715 Rewrite some rlike expression to StartsWith/Contains
#10820 workaround #10801 temporarily
#10812 Replace ThreadPoolExecutor creation with ThreadUtils API
#10816 Fix a test error for DB13.3
#10813 Fix the errors for Pandas UDF tests on DB13.3
#10795 Remove fixed seed for exact percentile integration tests
#10805 Drop Support for CentOS 7
#10800 Add number normalization test and address followup for getJsonObject
#10796 fixing build break on DBR
#10791 Fix auto merge conflict 10779 [skip ci]
#10636 Update actions version [skip ci]
#10743 initial PR for the framework reusing Vanilla Spark's unit tests
#10767 Add rows-only batches support to RebatchingRoundoffIterator
#10763 Add in the GpuArrayFilter command
#10766 Fix dead links related to tools documentation [skip ci]
#10644 Add logging to Integration test runs in local and local-cluster mode
#10756 Fix Authorization Failure While Reading Tables From Unity Catalog
#10752 Add SparkRapidsBuildInfoEvent to the event log
#10754 Substitute whoami for $USER
#10755 [DOC] Update README for prioritize-commits script [skip ci]
#10728 Let big data gen set nullability recursively
#10740 Use parse_url kernel for PATH parsing
#10734 Add short circuit path for get-json-object when there is separate wildcard path
#10725 Initial definition for Spark 4.0.0 shim
#10635 Use new getJsonObject kernel for json_tuple
#10739 Use fixed seed for some random failed tests
#10720 Add Shims for Spark 3.4.3
#10716 Remove the mixedType config for JSON as it has no downsides any longer
#10733 Fix "Could not find any rapids-4-spark jars in classpath" error when debugging UT in IDEA
#10718 Change parameters for memory limit in Parquet chunked reader
#10292 Upgrade to UCX 1.16.0
#10709 Removing some authorizations for departed users [skip ci]
#10726 Append new authorized user to blossom-ci whitelist [skip ci]
#10708 Updated dump tool to verify get_json_object
#10706 Fix auto merge conflict 10704 [skip ci]
#10675 Fix merge conflict with branch-24.04 [skip ci]
#10678 Append new authorized user to blossom-ci whitelist [skip ci]
#10662 Audit script - Check commits from shuffle and storage directories [skip ci]
#10655 Update rapids jni/private dependency to 24.06
#10652 Substitute murmurHash32 for spark32BitMurmurHash3

Release 24.04

Features

#10263 [FEA] Add support for reading JSON containing structs where rows are not consistent
#10436 [FEA] Move Spark 3.5.1 out of snapshot once released
#10430 [FEA] Error out when running on an unsupported GPU architecture
#9750 [FEA] Review JsonToStruct and JsonScan and consolidate some testing and implementation
#8680 [AUDIT][SPARK-42779][SQL] Allow V2 writes to indicate advisory shuffle partition size
#10429 [FEA] Drop support for Databricks 10.4 ML LTS
#10334 [FEA] Turn on memory limits for parquet reader
#10344 [FEA] support barrier mode for mapInPandas/mapInArrow

Performance

#10578 [FEA] Support project expression rewrite for the case stringinstr(str_col, substr) > 0 to contains(str_col, substr)
#10570 [FEA] See if we can optimize sort for a single batch
#10531 [FEA] Support "WindowGroupLimit" optimization on GPU for Databricks 13.3 ML LTS+
#5553 [FEA][Audit] - Push down StringEndsWith/Contains to Parquet
#8208 [FEA][AUDIT][SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation
#10249 [FEA] Support common subexpression elimination for expand operator
#10301 [FEA] Improve performance of from_json

Bugs Fixed

#10700 [BUG] get_json_object cannot handle ints or boolean values
#10645 [BUG] java.lang.IllegalStateException: Expected to only receive a single batch
#10665 [BUG] Need to update private jar's version to v24.04.1 for spark-rapids v24.04.0 release
#10589 [BUG] ZSTD version mismatch in integration tests
#10255 [BUG] parquet_tests are skipped on Dataproc CI
#10624 [BUG] Deploy script "gpg:sign-and-deploy-file failed: 401 Unauthorized
#10631 [BUG] pending BlockState leaks blocks if the shuffle read doesn't finish successfully
#10349 [BUG] Test in json_test.py failed: test_from_json_struct_decimal
#9033 [BUG] GpuGetJsonObject does not expand escaped characters
#10216 [BUG] GetJsonObject fails at spark unit test $.store.book[*].reader
#10217 [BUG] GetJsonObject fails at spark unit test $.store.basket[0][*].b
#10537 [BUG] GetJsonObject throws exception when json path contains a name starting with '
#10194 [BUG] GetJsonObject does not validate the input is JSON in the same way as Spark
#10196 [BUG] GetJsonObject does not process escape sequences in returned strings or queries
#10212 [BUG] GetJsonObject should return null for invalid query instead of throwing an exception
#10218 [BUG] GetJsonObject does not normalize non-string output
#10591 [BUG] test_column_add_after_partition failed on EGX Standalone cluster
#10277 Add monitoring for GH action deprecations
#10627 [BUG] Integration tests FAILED on: "nvCOMP 2.3/2.4 or newer is required for Zstandard compression"
#10585 [BUG] Test simple pinned blocking alloc Failed nightly tests
#10586 [BUG] YARN EGX IT build failing parquet_testing_test can't find file
#10133 [BUG] test_hash_reduction_collect_set_on_nested_array_type failed in a distributed environment
#10378 [BUG] test_range_running_window_float_decimal_sum_runs_batched fails intermittently
#10486 [BUG] StructsToJson does not fall back to the CPU for unsupported timeZone options
#10484 [BUG] JsonToStructs does not fallback when columnNameOfCorruptRecord is set
#10460 [BUG] JsonToStructs should reject float numbers for integer types
#10468 [BUG] JsonToStructs and ScanJson should not treat quoted strings as valid integers
#10470 [BUG] ScanJson and JsonToStructs should support parsing quoted decimal strings that are formatted by locale (at least for en-US)
#10494 [BUG] JsonToStructs parses INF wrong when nonNumericNumbers is enabled
#10456 [BUG] allowNonNumericNumbers OFF supported for JSON Scan, but not JsonToStructs
#10467 [BUG] JsonToStructs should reject 1. as a valid number
#10469 [BUG] ScanJson should accept "1." as a valid Decimal
#10559 [BUG] test_spark_from_json_date_with_format FAILED on : Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
#10209 [BUG] Test failure hash_aggregate_test.py::test_hash_reduction_collect_set_on_nested_array_type DATAGEN_SEED=1705515231
#10319 [BUG] Shuffled join OOM with 4GB of GPU memory
#10507 [BUG] regexp_test.py FAILED test_regexp_extract_all_idx_positive[DATAGEN_SEED=1709054829, INJECT_OOM]
#10527 [BUG] Build on Databricks failed with GpuGetJsonObject.scala:19: object parsing is not a member of package util
#10509 [BUG] scalar leaks when running nds query51
#10214 [BUG] GetJsonObject does not support unquoted array like notation
#10215 [BUG] GetJsonObject removes leading space characters
#10213 [BUG] GetJsonObject supports array index notation without a root
#10452 [BUG] JsonScan and from_json share fallback checks, but have hard coded names in the results
#10455 [BUG] JsonToStructs and ScanJson do not fall back/support it properly if single quotes are disabled
#10219 [BUG] GetJsonObject sees a double quote in a single quoted string as invalid
#10431 [BUG] test_casting_from_overflow_double_to_timestamp DID NOT RAISE <class 'Exception'>
#10499 [BUG] Unit tests core dump as below
#9325 [BUG] test_csv_infer_schema_timestamp_ntz fails
#10422 [BUG] test_get_json_object_single_quotes failure
#10411 [BUG] Some fast parquet tests fail if the time zone is not UTC
#10410 [BUG] delta_lake_update_test.py::test_delta_update_partitions[['a', 'b']-False] failed by DATAGEN_SEED=1707683137
#10404 [BUG] GpuJsonTuple memory leak
#10382 [BUG] Compile failed on branch-24.04 : literals.scala:32: object codec is not a member of package org.apache.commons

PRs

#10844 Update rapids private dependency to 24.04.3
#10788 [DOC] Update archive page for v24.04.1 [skip ci]
#10784 Update latest changelog [skip ci]
#10782 Update latest changelog [skip ci]
#10780 [DOC] Update download page for v24.04.1 [skip ci]
#10778 Update version to 24.04.1-SNAPSHOT
#10777 Update rapids JNI dependency: private to 24.04.2
#10683 Update latest changelog [skip ci]
#10681 Update rapids JNI dependency to 24.04.0, private to 24.04.1
#10660 Ensure an executor broadcast is in a single batch
#10676 [DOC] Update docs for 24.04.0 release [skip ci]
#10654 Add a config to switch back to old impl for getJsonObject
#10667 Update rapids private dependency to 24.04.1
#10664 Remove build link from the premerge-CI workflow
#10657 Revert "Host Memory OOM handling for RowToColumnarIterator (#10617)"
#10625 Pin to 3.1.0 maven-gpg-plugin in deploy script [skip ci]
#10637 Cleanup async state when multi-threaded shuffle readers fail
#10617 Host Memory OOM handling for RowToColumnarIterator
#10614 Use random seed for test_from_json_struct_decimal
#10581 Use new jni kernel for getJsonObject
#10630 Fix removal of internal metadata information in 350 shim
#10623 Auto merge PRs to branch-24.06 from branch-24.04 [skip ci]
#10616 Pass metadata extractors to FileScanRDD
#10620 Remove unused shared lib in Jenkins files
#10615 Turn off state logging in HostAllocSuite
#10610 Do not replace TableCacheQueryStageExec
#10599 Call globStatus directly via PY4J in hdfs_glob to avoid calling hadoop command
#10602 Remove InMemoryTableScanExec support for Spark 3.5+
#10608 Update perfio.s3.enabled doc to fix build failure [skip ci]
#10598 Update CI script to build and deploy using the same CUDA classifier [skip ci]
#10575 Update JsonToStructs and ScanJson to have white space normalization
#10597 add guardword to hide cloud info
#10540 Handle minimum GPU architecture supported
#10584 Add in small optimization for instr comparison
#10590 Turn on transition logging in HostAllocSuite
#10572 Improve performance of Sort for the common single batch use case
#10568 Add configuration to share JNI pinned pool with cuIO
#10550 Enable window-group-limit optimization on
#10542 Make JSON parsing common between JsonToStructs and ScanJson
#10562 Fix test_spark_from_json_date_with_format when run in a non-UTC TZ
#10564 Enable specifying specific integration test methods via TESTS environment
#10563 Append new authorized user to blossom-ci safelist [skip ci]
#10520 Distinct left join
#10538 Move K8s cloud name into common lib for Jenkins CI
#10552 Fix issues when no value can be extracted from a regular expression
#10522 Fix missing scala-parser-combinators dependency on Databricks
#10549 Update to latest branch-24.02 [skip ci]
#10544 Fix merge conflict from branch-24.02
#10503 Distinct inner join
#10512 Move to parsing from_json input preserving quoted strings.
#10528 Fix auto merge conflict 10523
#10519 Replicate HostColumnVector.ColumnBuilder in plugin to enable host memory oom work
#10521 Fix Spark 3.5.1 build
#10516 One more metric for expand
#10500 Support "WindowGroupLimit" optimization on GPU
#10508 Move 351 shims into noSnapshot buildvers
#10510 Fix scalar leak in SumBinaryFixer
#10466 Use parser from spark to normalize json path in GetJsonObject
#10490 Start working on a more complete json test matrix json
#10497 Add minValue overflow check in ORC double-to-timestamp cast
#10501 Fix scalar leak in WindowRetrySuite
#10474 Remove Support for Databricks 10.4
#10418 Enable GpuShuffledSymmetricHashJoin by default
#10450 Improve internal row to columnar host memory by using a combined spillable buffer
#10440 Generate CSV data per Spark version for tools
#10449 [DOC] Fix table rendering issue in github.io download UI page [skip ci]
#10438 Integrate perfio.s3 reader
#10423 Disable Integration Test:test_get_json_object_single_quotes on DB 10.4
#10419 Export TZ in tests when default TZ is used
#10426 Fix auto merge conflict 10425 [skip ci]
#10427 Update test doc for 24.04 [skip ci]
#10396 Remove inactive user from github workflow [skip ci]
#10421 Use withRetry when manifesting spillable batch in GpuShuffledHashJoinExec
#10420 Disable JsonTuple by default
#10407 Enable Single Quote Support in getJSONObject API with GetJsonObjectOptions
#10415 Avoid comparing Delta logs when writing partitioned tables
#10247 Improve GpuExpand by pre-projecting some columns
#10248 Group-by aggregation based optimization for UNBOUNDED collect_set window function
#10406 Enabled subPage chunking by default
#10361 Add in basic support for JSON generation in BigDataGen and improve performance of from_json
#10158 Add in framework for unbounded to unbounded window agg optimization
#10394 Fix auto merge conflict 10393 [skip ci]
#10375 Support barrier mode for mapInPandas/mapInArrow
#10356 Update locate_parquet_testing_files function to support hdfs input path for dataproc CI
#10369 Revert "Support barrier mode for mapInPandas/mapInArrow (#10364)"
#10358 Disable Spark UI by default for integration tests
#10360 Fix a memory leak in json tuple
#10364 Support barrier mode for mapInPandas/mapInArrow
#10348 Remove redundant joinOutputRows metric
#10321 Bump up dependency version to 24.04.0-SNAPSHOT
#10330 Add tryAcquire to GpuSemaphore
#10258 Init project version 24.04.0-SNAPSHOT

Older Releases

Changelog of older releases can be found at docs/archives