diff --git a/docs/additional-functionality/advanced_configs.md b/docs/additional-functionality/advanced_configs.md
index f577daaf10f..3644b09951e 100644
--- a/docs/additional-functionality/advanced_configs.md
+++ b/docs/additional-functionality/advanced_configs.md
@@ -245,7 +245,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
 <a name="sql.expression.FromUnixTime"></a>spark.rapids.sql.expression.FromUnixTime|`from_unixtime`|Get the string from a unix timestamp|true|None|
 <a name="sql.expression.GetArrayItem"></a>spark.rapids.sql.expression.GetArrayItem| |Gets the field at `ordinal` in the Array|true|None|
 <a name="sql.expression.GetArrayStructFields"></a>spark.rapids.sql.expression.GetArrayStructFields| |Extracts the `ordinal`-th fields of all array elements for the data with the type of array of struct|true|None|
-<a name="sql.expression.GetJsonObject"></a>spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|true|None|
+<a name="sql.expression.GetJsonObject"></a>spark.rapids.sql.expression.GetJsonObject|`get_json_object`|Extracts a json object from path|false|This is disabled by default because escape sequences are not processed correctly, the input is not validated, and the output is not normalized the same as Spark|
 <a name="sql.expression.GetMapValue"></a>spark.rapids.sql.expression.GetMapValue| |Gets Value from a Map based on a key|true|None|
 <a name="sql.expression.GetStructField"></a>spark.rapids.sql.expression.GetStructField| |Gets the named field of the struct|true|None|
 <a name="sql.expression.GetTimestamp"></a>spark.rapids.sql.expression.GetTimestamp| |Gets timestamps from strings using given pattern.|true|None|
diff --git a/docs/compatibility.md b/docs/compatibility.md
index 8060866dc3b..2644c873e98 100644
--- a/docs/compatibility.md
+++ b/docs/compatibility.md
@@ -441,6 +441,44 @@ parse some variants of `NaN` and `Infinity` even when this option is disabled
 ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
 Spark version 3.3.0 and later.
 
+### get_json_object
+
+The `GetJsonObject` operator takes a JSON-formatted string and a JSON path string as input. The
+code base for this is currently separate from GPU parsing of JSON for files and `FromJsonObject`,
+so the results can differ from each other. Because of several incompatibilities and bugs in the
+GPU version of `GetJsonObject`, it runs on the CPU by default. If you are aware of the current
+limitations of the GPU version, you might see a significant performance speedup by enabling it:
+set `spark.rapids.sql.expression.GetJsonObject` to `true`.
+
+The following is a list of known differences.
+  * [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
+    is not valid JSON, Apache Spark returns a null result, but the GPU still tries to find a match.
+  * [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
+    When returning a result for a quoted string Apache Spark will remove the quotes and replace
+    any escape sequences with the proper characters. The escape sequence processing does not happen
+    on the GPU.
+  * [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
+    If a JSON path is not valid, Apache Spark returns a null result, but the GPU may throw an
+    exception and fail the query.
+  * [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
+    When returning a result for anything other than a string, Apache Spark normalizes a number of
+    things that the GPU does not, such as removing unnecessary white space, parsing and then
+    serializing floating point numbers, turning single quotes into double quotes, and removing
+    unneeded escapes for single quotes.
+
+The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
+   * https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings
+   * https://github.com/NVIDIA/spark-rapids/issues/10213 array index notation works without root
+   * https://github.com/NVIDIA/spark-rapids/issues/10214 unquoted array index notation is not
+     supported
+   * https://github.com/NVIDIA/spark-rapids/issues/10215 leading spaces can be stripped from named
+     keys
+   * https://github.com/NVIDIA/spark-rapids/issues/10216 Spark appears to flatten some output,
+     which is different from other implementations, including the GPU version
+   * https://github.com/NVIDIA/spark-rapids/issues/10217 a JSON path execution bug
+   * https://issues.apache.org/jira/browse/SPARK-46761 Apache Spark does not allow the `?` character in
+     a quoted JSON path string
+
 ## Avro
 
 The Avro format read is a very experimental feature which is expected to have some issues, so we disable
diff --git a/docs/supported_ops.md b/docs/supported_ops.md
index 5af0f356627..c23349467b9 100644
--- a/docs/supported_ops.md
+++ b/docs/supported_ops.md
@@ -6856,7 +6856,7 @@ are limited.
 <td rowSpan="3">GetJsonObject</td>
 <td rowSpan="3">`get_json_object`</td>
 <td rowSpan="3">Extracts a json object from path</td>
-<td rowSpan="3">None</td>
+<td rowSpan="3">This is disabled by default because escape sequences are not processed correctly, the input is not validated, and the output is not normalized the same as Spark</td>
 <td rowSpan="3">project</td>
 <td>json</td>
 <td> </td>
diff --git a/integration_tests/pom.xml b/integration_tests/pom.xml
index 21432f5161b..5f4c41a1b9b 100644
--- a/integration_tests/pom.xml
+++ b/integration_tests/pom.xml
@@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <!--
-  Copyright (c) 2020-2023, NVIDIA CORPORATION.
+  Copyright (c) 2020-2024, NVIDIA CORPORATION.
 
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
@@ -80,7 +80,9 @@
         <plugins>
             <plugin>
                 <artifactId>maven-assembly-plugin</artifactId>
+                <version>3.6.0</version>
                 <configuration>
+                    <tarLongFileMode>posix</tarLongFileMode>
                     <finalName>rapids-4-spark-integration-tests_${scala.binary.version}-${project.version}-${spark.version.classifier}</finalName>
                     <descriptorRefs>
                         <descriptorRef>jar-with-dependencies</descriptorRef>
diff --git a/integration_tests/run_pyspark_from_build.sh b/integration_tests/run_pyspark_from_build.sh
index f6e32c72161..cc983d49b3c 100755
--- a/integration_tests/run_pyspark_from_build.sh
+++ b/integration_tests/run_pyspark_from_build.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2020-2023, NVIDIA CORPORATION.
+# Copyright (c) 2020-2024, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -333,10 +333,15 @@ EOF
                 --driver-class-path "${PYSP_TEST_spark_driver_extraClassPath}"
                 --conf spark.executor.extraClassPath="${PYSP_TEST_spark_driver_extraClassPath}"
             )
+        elif [[ -n "$PYSP_TEST_spark_jars_packages" ]]; then
+            SPARK_SHELL_ARGS_ARR+=(--packages "${PYSP_TEST_spark_jars_packages}")
         else
             SPARK_SHELL_ARGS_ARR+=(--jars "${PYSP_TEST_spark_jars}")
         fi
 
+        if [[ -n "$PYSP_TEST_spark_jars_repositories" ]]; then
+            SPARK_SHELL_ARGS_ARR+=(--repositories "${PYSP_TEST_spark_jars_repositories}")
+        fi
         # NOTE grep is used not only for checking the output but also
         # to workaround the fact that spark-shell catches all failures.
         # In this test it exits not because of the failure but because it encounters
diff --git a/integration_tests/src/main/python/aqe_test.py b/integration_tests/src/main/python/aqe_test.py
index 06759954631..b7968f8e902 100755
--- a/integration_tests/src/main/python/aqe_test.py
+++ b/integration_tests/src/main/python/aqe_test.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2023, NVIDIA CORPORATION.
+# Copyright (c) 2022-2024, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -16,7 +16,7 @@
 from pyspark.sql.functions import when, col, current_date, current_timestamp
 from pyspark.sql.types import *
 from asserts import assert_gpu_and_cpu_are_equal_collect, assert_cpu_and_gpu_are_equal_collect_with_capture
-from conftest import is_not_utc
+from conftest import is_databricks_runtime, is_not_utc
 from data_gen import *
 from marks import ignore_order, allow_non_gpu
 from spark_session import with_cpu_session, is_databricks113_or_later
@@ -243,3 +243,58 @@ def do_it(spark):
 
     assert_gpu_and_cpu_are_equal_collect(do_it, conf=_adaptive_conf)
 
+
+# This specifically reproduces the issue found in
+# https://github.com/NVIDIA/spark-rapids/issues/10165, where there is an executor broadcast
+# but the exchange feeding the BroadcastHashJoin has multiple partitions and goes into an
+# AQEShuffleRead that uses CoalescePartitions to go down to a single partition
+db_113_cpu_bnlj_join_allow=["ShuffleExchangeExec"] if is_databricks113_or_later() else []
+@ignore_order(local=True)
+@pytest.mark.skipif(not is_databricks_runtime(),
+    reason="Executor side broadcast only supported on Databricks")
+@allow_non_gpu('BroadcastHashJoinExec', 'ColumnarToRowExec', *db_113_cpu_bnlj_join_allow)
+def test_aqe_join_executor_broadcast_not_single_partition(spark_tmp_path):
+    data_path = spark_tmp_path + '/PARQUET_DATA'
+    bhj_disable_conf = copy_and_update(_adaptive_conf,
+        {"spark.rapids.sql.exec.BroadcastHashJoinExec": "false"})
+
+    def prep(spark):
+        data = [
+            (("Adam ", "", "Green"), "1", "M", 1000),
+            (("Bob ", "Middle", "Green"), "2", "M", 2000),
+            (("Cathy ", "", "Green"), "3", "F", 3000)
+        ]
+        schema = (StructType()
+                  .add("name", StructType()
+                       .add("firstname", StringType())
+                       .add("middlename", StringType())
+                       .add("lastname", StringType()))
+                  .add("id", StringType())
+                  .add("gender", StringType())
+                  .add("salary", IntegerType()))
+        df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
+        df.write.format("parquet").mode("overwrite").save(data_path)
+        data_school= [
+            ("1", "school1"),
+            ("2", "school1"),
+            ("3", "school2")
+        ]
+        schema_school = (StructType()
+                  .add("id", StringType())
+                  .add("school", StringType()))
+        df_school = spark.createDataFrame(spark.sparkContext.parallelize(data_school),schema_school)
+        df_school.createOrReplaceTempView("df_school")
+
+    with_cpu_session(prep)
+
+    def do_it(spark):
+        newdf = spark.read.parquet(data_path)
+        newdf.createOrReplaceTempView("df")
+        return spark.sql(
+            """
+                select /*+ BROADCAST(df_school) */ * from df a left outer join df_school b on a.id == b.id
+            """
+        )
+
+    assert_gpu_and_cpu_are_equal_collect(do_it, conf=bhj_disable_conf)
+
diff --git a/integration_tests/src/main/python/date_time_test.py b/integration_tests/src/main/python/date_time_test.py
index 99651750f3e..9e2e98006ab 100644
--- a/integration_tests/src/main/python/date_time_test.py
+++ b/integration_tests/src/main/python/date_time_test.py
@@ -291,7 +291,6 @@ def test_unsupported_fallback_to_unix_timestamp(data_gen):
 unsupported_timezones = ["PST", "NST", "AST", "America/Los_Angeles", "America/New_York", "America/Chicago"]
 
 @pytest.mark.parametrize('time_zone', supported_timezones, ids=idfn)
-@allow_non_gpu(*non_utc_allow)
 def test_from_utc_timestamp(time_zone):
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: unary_op_df(spark, timestamp_gen).select(f.from_utc_timestamp(f.col('a'), time_zone)))
@@ -311,7 +310,6 @@ def test_unsupported_fallback_from_utc_timestamp():
             "from_utc_timestamp(a, tzone)"),
         'FromUTCTimestamp')
 
-@allow_non_gpu(*non_utc_allow)
 @pytest.mark.parametrize('time_zone', supported_timezones, ids=idfn)
 def test_to_utc_timestamp(time_zone):
     assert_gpu_and_cpu_are_equal_collect(
@@ -413,7 +411,7 @@ def invalid_date_string_df(spark):
 
 @pytest.mark.parametrize('ansi_enabled', [True, False], ids=['ANSI_ON', 'ANSI_OFF'])
 @pytest.mark.parametrize('data_gen,date_form', str_date_and_format_gen, ids=idfn)
-@allow_non_gpu(*non_utc_tz_allow)
+@allow_non_gpu(*non_supported_tz_allow)
 def test_string_to_unix_timestamp(data_gen, date_form, ansi_enabled):
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark : unary_op_df(spark, data_gen, seed=1).selectExpr("to_unix_timestamp(a, '{}')".format(date_form)),
@@ -427,7 +425,7 @@ def test_string_to_unix_timestamp_ansi_exception():
 
 @pytest.mark.parametrize('ansi_enabled', [True, False], ids=['ANSI_ON', 'ANSI_OFF'])
 @pytest.mark.parametrize('data_gen,date_form', str_date_and_format_gen, ids=idfn)
-@allow_non_gpu(*non_utc_tz_allow)
+@allow_non_gpu(*non_supported_tz_allow)
 def test_string_unix_timestamp(data_gen, date_form, ansi_enabled):
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark : unary_op_df(spark, data_gen, seed=1).select(f.unix_timestamp(f.col('a'), date_form)),
diff --git a/integration_tests/src/main/python/dpp_test.py b/integration_tests/src/main/python/dpp_test.py
index 4e967262c14..6554e54f965 100644
--- a/integration_tests/src/main/python/dpp_test.py
+++ b/integration_tests/src/main/python/dpp_test.py
@@ -42,7 +42,7 @@ def fn(spark):
         df.write.format(table_format) \
             .mode("overwrite") \
             .saveAsTable(table_name)
-        return df.select('filter').first()[0]
+        return df.select('filter').where("value > 0").first()[0]
 
     return with_cpu_session(fn)
 
diff --git a/integration_tests/src/main/python/get_json_test.py b/integration_tests/src/main/python/get_json_test.py
index a6c0e00db0b..970617709a8 100644
--- a/integration_tests/src/main/python/get_json_test.py
+++ b/integration_tests/src/main/python/get_json_test.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021-2023, NVIDIA CORPORATION.
+# Copyright (c) 2021-2024, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -36,7 +36,234 @@ def test_get_json_object(json_str_pattern):
             'get_json_object(a, "$.store.fruit[0]")',
             'get_json_object(\'%s\', "$.store.fruit[0]")' % scalar_json,
             ),
-        conf={'spark.sql.parser.escapedStringLiterals': 'true'})
+        conf={'spark.sql.parser.escapedStringLiterals': 'true',
+            'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+def test_get_json_object_quoted_index():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [[r'{"a":"A"}'],
+            [r'{"b":"B"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+        f.get_json_object('jsonStr',r'''$['a']''').alias('sub_a'),
+        f.get_json_object('jsonStr',r'''$['b']''').alias('sub_b')),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.parametrize('query',["$.store.bicycle",
+    "$['store'].bicycle",
+    "$.store['bicycle']",
+    "$['store']['bicycle']",
+    "$['key with spaces']",
+    "$.store.book",
+    "$.store.book[0]",
+    "$.store.book[*]",
+    pytest.param("$",marks=[
+        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10218'),
+        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10196'),
+        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10194')]),
+    "$.store.book[0].category",
+    "$.store.book[*].category",
+    "$.store.book[*].isbn",
+    pytest.param("$.store.book[*].reader",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10216')),
+    "$.store.basket[0][1]",
+    "$.store.basket[*]",
+    "$.store.basket[*][0]",
+    "$.store.basket[0][*]",
+    "$.store.basket[*][*]",
+    "$.store.basket[0][2].b",
+    pytest.param("$.store.basket[0][*].b",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10217')),
+    "$.zip code",
+    "$.fb:testid",
+    pytest.param("$.a",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10196')),
+    "$.non_exist_key",
+    pytest.param("$..no_recursive", marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10212')),
+    "$.store.book[0].non_exist_key",
+    "$.store.basket[*].non_exist_key"])
+def test_get_json_object_spark_unit_tests(query):
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [
+            ['''{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}'''],
+            ['''{ "key with spaces": "it works" }'''],
+            ['''{"a":"b\nc"}'''],
+            ['''{"a":"b\"c"}'''],
+            ["\u0000\u0000\u0000A\u0001AAA"],
+            ['{"big": "' + ('x' * 3000) + '"}']]
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.get_json_object('jsonStr', query)),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")
+def test_get_json_object_normalize_non_string_output():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [[' { "a": "A" } '],
+            ['''{'a':'A"'}'''],
+            [r'''{'a':"B\'"}'''],
+            ['''['a','b','"C"']'''],
+            ['[100.0,200.000,351.980]'],
+            ['[12345678900000000000.0]'],
+            ['[12345678900000000000]'],
+            ['[1' + '0'* 400 + ']'],
+            ['[1E308]'],
+            ['[1.0E309,-1E309,1E5000]'],
+            ['[true,false]'],
+            ['[100,null,10]'],
+            ['{"a":"A","b":null}']]
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '$')),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://issues.apache.org/jira/browse/SPARK-46761")
+def test_get_json_object_quoted_question():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [[r'{"?":"QUESTION"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.get_json_object('jsonStr',r'''$['?']''').alias('question')),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10196")
+def test_get_json_object_escaped_string_data():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [[r'{"a":"A\"B"}'],
+            [r'''{"a":"A\'B"}'''],
+            [r'{"a":"A\/B"}'],
+            [r'{"a":"A\\B"}'],
+            [r'{"a":"A\bB"}'],
+            [r'{"a":"A\fB"}'],
+            [r'{"a":"A\nB"}'],
+            [r'{"a":"A\tB"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).selectExpr('get_json_object(jsonStr,"$.a")'),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10196")
+def test_get_json_object_escaped_key():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [
+            [r'{"a\"":"Aq"}'],
+            [r'''{"\'a":"sqA1"}'''],
+            [r'''{"'a":"sqA2"}'''],
+            [r'{"a\/":"Afs"}'],
+            [r'{"a\\":"Abs"}'],
+            [r'{"a\b":"Ab1"}'],
+            ['{"a\b":"Ab2"}'],
+            [r'{"a\f":"Af1"}'],
+            ['{"a\f":"Af2"}'],
+            [r'{"a\n":"An1"}'],
+            ['{"a\n":"An2"}'],
+            [r'{"a\t":"At1"}'],
+            ['{"a\t":"At2"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', r'$.a\"').alias('qaq1'),
+            f.get_json_object('jsonStr', '$.a"').alias('qaq2'),
+            f.get_json_object('jsonStr', r'''$.\'a''').alias('qsqa1'),
+            f.get_json_object('jsonStr', r'$.a\/').alias('qafs1'),
+            f.get_json_object('jsonStr', '$.a/').alias('qafs2'),
+            f.get_json_object('jsonStr', r'''$['a\/']''').alias('qafs3'),
+            f.get_json_object('jsonStr', r'$.a\\').alias('qabs1'),
+            f.get_json_object('jsonStr', r'$.a\b').alias('qab1'),
+            f.get_json_object('jsonStr','$.a\b').alias('qab2'),
+            f.get_json_object('jsonStr', r'$.a\f').alias('qaf1'),
+            f.get_json_object('jsonStr','$.a\f').alias('qaf2'),
+            f.get_json_object('jsonStr', r'$.a\n').alias('qan1'),
+            f.get_json_object('jsonStr','$.a\n').alias('qan2'),
+            f.get_json_object('jsonStr', r'$.a\t').alias('qat1'),
+            f.get_json_object('jsonStr','$.a\t').alias('qat2')
+            ),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10212")
+def test_get_json_object_invalid_path():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['{"a":"A"}'],
+            [r'{"a\"":"A"}'],
+            [r'''{"'a":"A"}'''],
+            ['{"b":"B"}'],
+            ['["A","B"]'],
+            ['{"c":["A","B"]}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '''$ ['a']''').alias('with_space'),
+            f.get_json_object('jsonStr', r'''$['\'a']''').alias('qsqa2'),
+            f.get_json_object('jsonStr', '''$.'a''').alias('qsqa2'),
+            f.get_json_object('jsonStr', r'''$.['a\"']''').alias('qaq3'),
+            f.get_json_object('jsonStr', '''$['a]''').alias('qsqa2'), # jsonpath.com thinks it is fine and ignores uncompleted ' and ], but not Spark
+            f.get_json_object('jsonStr', 'a').alias('just_a'),
+            f.get_json_object('jsonStr', '[-1]').alias('neg_one_index'),
+            f.get_json_object('jsonStr', '$.c[-1]').alias('c_neg_one_index'),
+            ),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10213")
+def test_get_json_object_top_level_array_notation():
+    # This is a special version of invalid path. It is something that the GPU supports
+    # but the CPU thinks is invalid
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['["A","B"]'],
+            ['{"a":"A","b":"B"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '[0]').alias('zero_index'),
+            f.get_json_object('jsonStr', '$[1]').alias('one_index'),
+            f.get_json_object('jsonStr', '''['a']''').alias('sub_a'),
+            f.get_json_object('jsonStr', '''$['b']''').alias('sub_b'),
+            ),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10214")
+def test_get_json_object_unquoted_array_notation():
+    # This is a special version of invalid path. It is something that the GPU supports
+    # but the CPU thinks is invalid
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['{"a":"A","b":"B"}'],
+            ['{"1":"ONE","a1":"A_ONE"}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '$[a]').alias('a_index'),
+            f.get_json_object('jsonStr', '$[1]').alias('one_index'),
+            f.get_json_object('jsonStr', '''$['1']''').alias('quoted_one_index'),
+            f.get_json_object('jsonStr', '$[a1]').alias('a_one_index')),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+
+
+@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10215")
+def test_get_json_object_white_space_removal():
+    # Spaces around key names in the JSON path are not handled the same way on the CPU and the
+    # GPU: leading spaces can be stripped from named keys.
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['{" a":" A"," b":" B"}'],
+            ['{"a":"A","b":"B"}'],
+            ['{"a ":"A ","b ":"B "}'],
+            ['{" a ":" A "," b ":" B "}']]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '$.a').alias('dot_a'),
+            f.get_json_object('jsonStr', '$. a').alias('dot_space_a'),
+            f.get_json_object('jsonStr', '$.a ').alias('dot_a_space'),
+            f.get_json_object('jsonStr', '$. a ').alias('dot_space_a_space'),
+            f.get_json_object('jsonStr', "$['b']").alias('dot_b'),
+            f.get_json_object('jsonStr', "$[' b']").alias('dot_space_b'),
+            f.get_json_object('jsonStr', "$['b ']").alias('dot_b_space'),
+            f.get_json_object('jsonStr', "$[' b ']").alias('dot_space_b_space'),
+            ),
+        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
 
 
 @allow_non_gpu('ProjectExec')
@@ -52,7 +279,8 @@ def assert_gpu_did_fallback(sql_text):
         assert_gpu_fallback_collect(lambda spark:
             gen_df(spark, [('a', gen), ('b', pattern)], length=10).selectExpr(sql_text),
         'GetJsonObject',
-        conf={'spark.sql.parser.escapedStringLiterals': 'true'})
+        conf={'spark.sql.parser.escapedStringLiterals': 'true',
+            'spark.rapids.sql.expression.GetJsonObject': 'true'})
 
     assert_gpu_did_fallback('get_json_object(a, b)')
     assert_gpu_did_fallback('get_json_object(\'%s\', b)' % scalar_json)
diff --git a/integration_tests/src/main/python/join_test.py b/integration_tests/src/main/python/join_test.py
index 6660e663c92..74d277fb905 100644
--- a/integration_tests/src/main/python/join_test.py
+++ b/integration_tests/src/main/python/join_test.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020-2023, NVIDIA CORPORATION.
+# Copyright (c) 2020-2024, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -1151,3 +1151,67 @@ def do_join(spark):
         conf={"spark.sql.autoBroadcastJoinThreshold": "-1",
               "spark.sql.sources.useV1SourceList": "",
               "spark.rapids.sql.input." + scan_name: False})
+
+@ignore_order(local=True)
+@pytest.mark.parametrize("is_left_host_shuffle", [False, True], ids=idfn)
+@pytest.mark.parametrize("is_right_host_shuffle", [False, True], ids=idfn)
+@pytest.mark.parametrize("is_left_smaller", [False, True], ids=idfn)
+@pytest.mark.parametrize("batch_size", ["1024", "1g"], ids=idfn)
+def test_new_inner_join(is_left_host_shuffle, is_right_host_shuffle, is_left_smaller, batch_size):
+    join_conf = {
+        "spark.rapids.sql.join.useShuffledSymmetricHashJoin": "true",
+        "spark.sql.autoBroadcastJoinThreshold": "1",
+        "spark.rapids.sql.batchSizeBytes": batch_size
+    }
+    left_size, right_size = (1024, 2048) if is_left_smaller else (2048, 1024)
+    def do_join(spark):
+        left_df = gen_df(spark, [
+            ("key1", RepeatSeqGen([1, 2, 3, 4, None], data_type=IntegerType())),
+            ("ints", int_gen),
+            ("key2", RepeatSeqGen([5, 6, 7, None], data_type=LongType())),
+            ("floats", float_gen)], left_size)
+        right_df = gen_df(spark, [
+            ("doubles", double_gen),
+            ("key2", RepeatSeqGen([5, 7, None, 8], data_type=LongType())),
+            ("shorts", short_gen),
+            ("key1", RepeatSeqGen([1, 2, 3, 5, 7, None], data_type=IntegerType()))], right_size)
+        # The symmetric join code handles inputs differently based on whether they are coming from
+        # host memory or GPU memory. Simple joins produce inputs directly from a shuffle which
+        # covers the host memory case. For GPU memory cases, we insert an aggregation to force the
+        # respective join input to be from a prior GPU operation in the same stage.
+        if not is_left_host_shuffle:
+            left_df = left_df.groupBy("key1", "key2").max("ints", "floats")
+        if not is_right_host_shuffle:
+            right_df = right_df.groupBy("key1", "key2").max("doubles", "shorts")
+        return left_df.join(right_df, ["key1", "key2"], "inner")
+    assert_gpu_and_cpu_are_equal_collect(do_join, conf=join_conf)
+
+@ignore_order(local=True)
+@pytest.mark.parametrize("is_left_smaller", [False, True], ids=idfn)
+@pytest.mark.parametrize("is_ast_supported", [False, True], ids=idfn)
+@pytest.mark.parametrize("batch_size", ["1024", "1g"], ids=idfn)
+def test_new_inner_join_conditional(is_ast_supported, is_left_smaller, batch_size):
+    join_conf = {
+        "spark.rapids.sql.join.useShuffledSymmetricHashJoin": "true",
+        "spark.sql.autoBroadcastJoinThreshold": "1",
+        "spark.rapids.sql.batchSizeBytes": batch_size
+    }
+    left_size, right_size = (1024, 2048) if is_left_smaller else (2048, 1024)
+    def do_join(spark):
+        left_df = gen_df(spark, [
+            ("key1", RepeatSeqGen([1, 2, 3, 4, None], data_type=IntegerType())),
+            ("ints", RepeatSeqGen(IntegerGen(), length = 5)),
+            ("key2", RepeatSeqGen([5, 6, 7, None], data_type=LongType())),
+            ("floats", float_gen)], left_size)
+        right_df = gen_df(spark, [
+            ("key2", RepeatSeqGen([5, 7, None, 8], data_type=LongType())),
+            ("ints", RepeatSeqGen(IntegerGen(), length = 3)),
+            ("key1", RepeatSeqGen([1, 2, 3, 5, 7, None], data_type=IntegerType()))], right_size)
+        cond = [left_df.key1 == right_df.key1, left_df.key2 == right_df.key2]
+        if is_ast_supported:
+            cond.append(left_df.ints >= right_df.ints)
+        else:
+            # AST does not support logarithm yet
+            cond.append(left_df.ints >= f.log(right_df.ints))
+        return left_df.join(right_df, cond, "inner")
+    assert_gpu_and_cpu_are_equal_collect(do_join, conf=join_conf)
diff --git a/integration_tests/src/main/python/window_function_test.py b/integration_tests/src/main/python/window_function_test.py
index 371b31ab316..78393c51609 100644
--- a/integration_tests/src/main/python/window_function_test.py
+++ b/integration_tests/src/main/python/window_function_test.py
@@ -27,22 +27,22 @@
 _grpkey_longs_with_no_nulls = [
     ('a', RepeatSeqGen(LongGen(nullable=False), length=20)),
     ('b', IntegerGen()),
-    ('c', IntegerGen())]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_nulls = [
     ('a', RepeatSeqGen(LongGen(nullable=(True, 10.0)), length=20)),
     ('b', IntegerGen()),
-    ('c', IntegerGen())]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_dates = [
     ('a', RepeatSeqGen(LongGen(), length=2048)),
     ('b', DateGen(nullable=False, start=date(year=2020, month=1, day=1), end=date(year=2020, month=12, day=31))),
-    ('c', IntegerGen())]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_nullable_dates = [
     ('a', RepeatSeqGen(LongGen(nullable=False), length=20)),
     ('b', DateGen(nullable=(True, 5.0), start=date(year=2020, month=1, day=1), end=date(year=2020, month=12, day=31))),
-    ('c', IntegerGen())]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_timestamps = [
     ('a', RepeatSeqGen(LongGen(), length=2048)),
@@ -57,17 +57,17 @@
 _grpkey_longs_with_decimals = [
     ('a', RepeatSeqGen(LongGen(nullable=False), length=20)),
     ('b', DecimalGen(precision=18, scale=3, nullable=False)),
-    ('c', DecimalGen(precision=18, scale=3))]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_nullable_decimals = [
     ('a', RepeatSeqGen(LongGen(nullable=(True, 10.0)), length=20)),
     ('b', DecimalGen(precision=18, scale=10, nullable=True)),
-    ('c', DecimalGen(precision=18, scale=10, nullable=True))]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_nullable_larger_decimals = [
     ('a', RepeatSeqGen(LongGen(nullable=(True, 10.0)), length=20)),
     ('b', DecimalGen(precision=23, scale=10, nullable=True)),
-    ('c', DecimalGen(precision=23, scale=10, nullable=True))]
+    ('c', UniqueLongGen())]
 
 _grpkey_longs_with_nullable_largest_decimals = [
     ('a', RepeatSeqGen(LongGen(nullable=(True, 10.0)), length=20)),
diff --git a/jenkins/databricks/build.sh b/jenkins/databricks/build.sh
index 29be3bc43c0..79cc4586a16 100755
--- a/jenkins/databricks/build.sh
+++ b/jenkins/databricks/build.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 #
-# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2020-2024, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -52,7 +52,13 @@ declare -A artifacts
 initialize()
 {
     # install rsync to be used for copying onto the databricks nodes
-    sudo apt install -y maven rsync
+    sudo apt install -y rsync
+
+    if [[ ! -d $HOME/apache-maven-3.6.3 ]]; then
+        wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz -P /tmp
+        tar xf /tmp/apache-maven-3.6.3-bin.tar.gz -C $HOME
+        sudo ln -s $HOME/apache-maven-3.6.3/bin/mvn /usr/local/bin/mvn
+    fi
 
     # Archive file location of the plugin repository
     SPARKSRCTGZ=${SPARKSRCTGZ:-''}
diff --git a/jenkins/dependency-check.sh b/jenkins/dependency-check.sh
new file mode 100755
index 00000000000..4239c40c664
--- /dev/null
+++ b/jenkins/dependency-check.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+#
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# This file checks whether all the dependency jar or pom files for the specified
+# artifacts defined in the file "$ARTIFACT_FILE" are available
+# in the "$SERVER_ID::default::$SERVER_URL" maven repo
+
+
+# Argument(s):
+#   ARTIFACT_FILE :  Artifact(groupId:artifactId:version:[[packaging]:classifier]) list file
+#
+# Used environment(s):
+#   SERVER_ID:      The repository id for this deployment.
+#   SERVER_URL:     The url where to deploy artifacts.
+#   M2_CACHE:       Maven local repo
+###
+
+set -ex
+
+ARTIFACT_FILE=${1:-"/tmp/artifacts-list"}
+SERVER_ID=${SERVER_ID:-"snapshots"}
+SERVER_URL=${SERVER_URL:-"file:/tmp/local-release-repo"}
+M2_CACHE=${M2_CACHE:-"/tmp/m2-cache"}
+
+remote_maven_repo=$SERVER_ID::default::$SERVER_URL
+# Get the spark-rapids-jni and spark-rapids-private jars from OSS Snapshot maven repo
+if [ "$SERVER_ID" == "snapshots" ]; then
+    oss_snapshot_url="https://oss.sonatype.org/content/repositories/snapshots"
+    remote_maven_repo="$remote_maven_repo,$SERVER_ID::default::$oss_snapshot_url"
+fi
+while read -r line; do
+    artifact=$line # artifact=groupId:artifactId:version:[[packaging]:classifier]
+    mvn dependency:get -DremoteRepositories="$remote_maven_repo" -Dmaven.repo.local="$M2_CACHE" -Dartifact="$artifact"
+done < "$ARTIFACT_FILE"
diff --git a/jenkins/deploy.sh b/jenkins/deploy.sh
index 16428e121dc..15bca120cb0 100755
--- a/jenkins/deploy.sh
+++ b/jenkins/deploy.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 #
-# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2020-2024, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -50,6 +50,12 @@ ART_VER=$(mvnEval $DIST_PL project.version)
 DEFAULT_CUDA_CLASSIFIER=$(mvnEval $DIST_PL cuda.version)
 CUDA_CLASSIFIERS=${CUDA_CLASSIFIERS:-"$DEFAULT_CUDA_CLASSIFIER"}
 CLASSIFIERS=${CLASSIFIERS:-"$CUDA_CLASSIFIERS"} # default as CUDA_CLASSIFIERS for compatibility
+SERVER_ID=${SERVER_ID:-"snapshots"}
+SERVER_URL=${SERVER_URL:-"file:/tmp/local-release-repo"}
+# File in which to save the list of artifacts to be deployed
+ARTIFACT_FILE=${ARTIFACT_FILE:-"/tmp/artifact-file"}
+# Clean the artifact list file before saving
+rm -rf $ARTIFACT_FILE
 
 SQL_PL=${SQL_PL:-"sql-plugin"}
 POM_FILE=${POM_FILE:-"$DIST_PL/target/parallel-world/META-INF/maven/${ART_GROUP_ID}/${ART_ID}/pom.xml"}
@@ -63,7 +69,7 @@ DEPLOY_TYPES=$(echo $CLASSIFIERS | sed -e 's;[^,]*;jar;g')
 DEPLOY_FILES=$(echo $CLASSIFIERS | sed -e "s;\([^,]*\);${FPATH}-\1.jar;g")
 
 # dist does not have javadoc and sources jars, use 'sql-plugin' instead
-source jenkins/version-def.sh >/dev/null 2&>1
+source jenkins/version-def.sh >/dev/null 2>&1
 echo $SPARK_BASE_SHIM_VERSION
 SQL_ART_ID=$(mvnEval $SQL_PL project.artifactId)
 SQL_ART_VER=$(mvnEval $SQL_PL project.version)
@@ -97,6 +103,10 @@ echo "Deploy CMD: $DEPLOY_CMD"
 ###### Deploy the parent pom file ######
 $DEPLOY_CMD -Dfile=./pom.xml -DpomFile=./pom.xml
 
+###### Deploy the jdk-profile pom file ######
+JDK_PROFILES=${JDK_PROFILES:-"jdk-profiles"}
+$DEPLOY_CMD -Dfile=$JDK_PROFILES/pom.xml -DpomFile=$JDK_PROFILES/pom.xml
+
 ###### Deploy the artifact jar(s) ######
 $DEPLOY_CMD -DpomFile=$POM_FILE \
             -Dfile=$FPATH-$DEFAULT_CUDA_CLASSIFIER.jar \
@@ -105,3 +115,10 @@ $DEPLOY_CMD -DpomFile=$POM_FILE \
             -Dfiles=$DEPLOY_FILES \
             -Dtypes=$DEPLOY_TYPES \
             -Dclassifiers=$CLASSIFIERS
+
+echo "$ART_GROUP_ID:$ART_ID:$ART_VER:jar" >> $ARTIFACT_FILE
+CLASSLIST="$CLASSIFIERS,sources,javadoc"
+CLASSLIST=(${CLASSLIST//','/' '})
+for class in ${CLASSLIST[@]}; do
+    echo "$ART_GROUP_ID:$ART_ID:$ART_VER:jar:$class" >> $ARTIFACT_FILE
+done
diff --git a/jenkins/spark-nightly-build.sh b/jenkins/spark-nightly-build.sh
index b038cdc1c08..c5ef53da47d 100755
--- a/jenkins/spark-nightly-build.sh
+++ b/jenkins/spark-nightly-build.sh
@@ -30,7 +30,7 @@ WORKSPACE=${WORKSPACE:-$(pwd)}
 export M2DIR=${M2DIR:-"$WORKSPACE/.m2"}
 
 ## MVN_OPT : maven options environment, e.g. MVN_OPT='-Dspark-rapids-jni.version=xxx' to specify spark-rapids-jni dependency's version.
-MVN="mvn -Dmaven.wagon.http.retryHandler.count=3 -DretryFailedDeploymentCount=3 ${MVN_OPT}"
+MVN="mvn -Dmaven.wagon.http.retryHandler.count=3 -DretryFailedDeploymentCount=3 ${MVN_OPT} -Psource-javadoc"
 
 DIST_PL="dist"
 function mvnEval {
diff --git a/jenkins/spark-tests.sh b/jenkins/spark-tests.sh
index 368a62ac1e8..0a455afcb10 100755
--- a/jenkins/spark-tests.sh
+++ b/jenkins/spark-tests.sh
@@ -304,6 +304,11 @@ if [[ $TEST_MODE == "DEFAULT" ]]; then
   PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.${SHUFFLE_SPARK_SHIM}.RapidsShuffleManager \
     ./run_pyspark_from_build.sh
 
+  SPARK_SHELL_SMOKE_TEST=1 \
+  PYSP_TEST_spark_jars_packages=com.nvidia:rapids-4-spark_${SCALA_BINARY_VER}:${PROJECT_VER} \
+  PYSP_TEST_spark_jars_repositories=${PROJECT_REPO} \
+    ./run_pyspark_from_build.sh
+
   # ParquetCachedBatchSerializer cache_test
   PYSP_TEST_spark_sql_cache_serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
     ./run_pyspark_from_build.sh -k cache_test
diff --git a/scala2.13/integration_tests/pom.xml b/scala2.13/integration_tests/pom.xml
index 3157095dca2..fddcdfcb3f2 100644
--- a/scala2.13/integration_tests/pom.xml
+++ b/scala2.13/integration_tests/pom.xml
@@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <!--
-  Copyright (c) 2020-2023, NVIDIA CORPORATION.
+  Copyright (c) 2020-2024, NVIDIA CORPORATION.
 
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
@@ -80,7 +80,9 @@
         <plugins>
             <plugin>
                 <artifactId>maven-assembly-plugin</artifactId>
+                <version>3.6.0</version>
                 <configuration>
+                    <tarLongFileMode>posix</tarLongFileMode>
                     <finalName>rapids-4-spark-integration-tests_${scala.binary.version}-${project.version}-${spark.version.classifier}</finalName>
                     <descriptorRefs>
                         <descriptorRef>jar-with-dependencies</descriptorRef>
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Arm.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Arm.scala
index a9dd4f4787c..926f770a683 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Arm.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Arm.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -15,6 +15,7 @@
  */
 package com.nvidia.spark.rapids
 
+import scala.collection.mutable
 import scala.collection.mutable.ArrayBuffer
 import scala.util.control.ControlThrowable
 
@@ -68,6 +69,15 @@ object Arm extends ArmScalaSpecificImpl {
     }
   }
 
+  /** Executes the provided code block and then closes the queue of resources */
+  def withResource[T <: AutoCloseable, V](r: mutable.Queue[T])(block: mutable.Queue[T] => V): V = {
+    try {
+      block(r)
+    } finally {
+      r.safeClose()
+    }
+  }
+
   /** Executes the provided code block and then closes the value if it is AutoCloseable */
   def withResourceIfAllowed[T, V](r: T)(block: T => V): V = {
     try {
@@ -124,6 +134,21 @@ object Arm extends ArmScalaSpecificImpl {
     }
   }
 
+
+  /** Executes the provided code block, closing the resources only if an exception occurs */
+  def closeOnExcept[T <: AutoCloseable, V](r: mutable.Queue[T])(block: mutable.Queue[T] => V): V = {
+    try {
+      block(r)
+    } catch {
+      case t: ControlThrowable =>
+        // Don't close for these cases.
+        throw t
+      case t: Throwable =>
+        r.safeClose(t)
+        throw t
+    }
+  }
+
   /** Executes the provided code block, closing the resources only if an exception occurs */
   def closeOnExcept[T <: AutoCloseable, V](r: Option[T])(block: Option[T] => V): V = {
     try {
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala
index 2bec8bc581a..5ef1f2125ab 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -168,17 +168,26 @@ final case class WrappedGpuMetric(sqlMetric: SQLMetric) extends GpuMetric {
   override def value: Long = sqlMetric.value
 }
 
-class CollectTimeIterator(
+/** A GPU metric class that just accumulates into a variable without implicit publishing. */
+final class LocalGpuMetric extends GpuMetric {
+  private var lval = 0L
+  override def value: Long = lval
+  override def set(v: Long): Unit = { lval = v }
+  override def +=(v: Long): Unit = { lval += v }
+  override def add(v: Long): Unit = { lval += v }
+}
+
+class CollectTimeIterator[T](
     nvtxName: String,
-    it: Iterator[ColumnarBatch],
-    collectTime: GpuMetric) extends Iterator[ColumnarBatch] {
+    it: Iterator[T],
+    collectTime: GpuMetric) extends Iterator[T] {
   override def hasNext: Boolean = {
     withResource(new NvtxWithMetrics(nvtxName, NvtxColor.BLUE, collectTime)) { _ =>
       it.hasNext
     }
   }
 
-  override def next(): ColumnarBatch = {
+  override def next(): T = {
     withResource(new NvtxWithMetrics(nvtxName, NvtxColor.BLUE, collectTime)) { _ =>
       it.next
     }
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
index 47a99006b58..b9451b51606 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
@@ -1703,13 +1703,7 @@ object GpuOverrides extends Logging {
             TypeSig.STRING)),
       (a, conf, p, r) => new UnixTimeExprMeta[ToUnixTimestamp](a, conf, p, r) {
         // String type is not supported yet for non-UTC timezone.
-        override def isTimeZoneSupported: Boolean = a.timeZoneId.forall { zoneID =>
-          a.left.dataType match {
-            case _: StringType => GpuOverrides.isUTCTimezone(zoneID)
-            case _ => true
-          }
-        }
-
+        override def isTimeZoneSupported = true
         override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression = {
           GpuToUnixTimestamp(lhs, rhs, sparkFormat, strfFormat, a.timeZoneId)
         }
@@ -1724,14 +1718,7 @@ object GpuOverrides extends Logging {
             .withPsNote(TypeEnum.STRING, "A limited number of formats are supported"),
             TypeSig.STRING)),
       (a, conf, p, r) => new UnixTimeExprMeta[UnixTimestamp](a, conf, p, r) {
-        // String type is not supported yet for non-UTC timezone.
-        override def isTimeZoneSupported: Boolean = a.timeZoneId.forall { zoneID =>
-            a.left.dataType match {
-              case _: StringType => GpuOverrides.isUTCTimezone(zoneID)
-              case _ => true
-            }
-        }
-
+        override def isTimeZoneSupported = true
         override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression = {
           GpuUnixTimestamp(lhs, rhs, sparkFormat, strfFormat, a.timeZoneId)
         }
@@ -3647,7 +3634,8 @@ object GpuOverrides extends Logging {
         override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression =
           GpuGetJsonObject(lhs, rhs)
       }
-    ),
+    ).disabledByDefault("escape sequences are not processed correctly, the input is not " +
+        "validated, and the output is not normalized the same as Spark"),
     expr[JsonToStructs](
       "Returns a struct value with the given `jsonStr` and `schema`",
       ExprChecks.projectOnly(
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala
index 97b49d20fcc..d9a5d5b2f3e 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinExec.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -28,7 +28,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
-import org.apache.spark.sql.catalyst.plans.{InnerLike, JoinType, LeftAnti, LeftSemi}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike, JoinType, LeftAnti, LeftSemi}
 import org.apache.spark.sql.catalyst.plans.physical.Distribution
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.joins.ShuffledHashJoinExec
@@ -69,17 +69,31 @@ class GpuShuffledHashJoinMeta(
       (None, condition)
     }
     val Seq(left, right) = childPlans.map(_.convertIfNeeded())
-    val joinExec = GpuShuffledHashJoinExec(
-      leftKeys.map(_.convertToGpu()),
-      rightKeys.map(_.convertToGpu()),
-      join.joinType,
-      buildSide,
-      joinCondition,
-      left,
-      right,
-      isSkewJoin = false)(
-      join.leftKeys,
-      join.rightKeys)
+    val joinExec = join.joinType match {
+      case Inner if conf.useShuffledSymmetricHashJoin =>
+        GpuShuffledSymmetricHashJoinExec(
+          leftKeys.map(_.convertToGpu()),
+          rightKeys.map(_.convertToGpu()),
+          joinCondition,
+          left,
+          right,
+          conf.isGPUShuffle,
+          conf.gpuTargetBatchSizeBytes)(
+          join.leftKeys,
+          join.rightKeys)
+      case _ =>
+        GpuShuffledHashJoinExec(
+          leftKeys.map(_.convertToGpu()),
+          rightKeys.map(_.convertToGpu()),
+          join.joinType,
+          buildSide,
+          joinCondition,
+          left,
+          right,
+          isSkewJoin = false)(
+          join.leftKeys,
+          join.rightKeys)
+    }
     // For inner joins we can apply a post-join condition for any conditions that cannot be
     // evaluated directly in a mixed join that leverages a cudf AST expression
     filterCondition.map(c => GpuFilterExec(c,
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSymmetricHashJoinExec.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSymmetricHashJoinExec.scala
new file mode 100644
index 00000000000..fa4885f848a
--- /dev/null
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSymmetricHashJoinExec.scala
@@ -0,0 +1,1109 @@
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.nvidia.spark.rapids
+
+import scala.collection.{mutable, BitSet}
+
+import ai.rapids.cudf.{ContiguousTable, HostMemoryBuffer}
+import ai.rapids.cudf.JCudfSerialization.SerializedTableHeader
+import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
+import com.nvidia.spark.rapids.GpuMetric._
+import com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec.JoinInfo
+import com.nvidia.spark.rapids.RapidsPluginImplicits._
+import com.nvidia.spark.rapids.RmmRapidsRetryIterator.withRetryNoSplit
+import com.nvidia.spark.rapids.ScalableTaskCompletion.onTaskCompletion
+import com.nvidia.spark.rapids.shims.{GpuHashPartitioning, ShimBinaryExecNode}
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.plans.Inner
+import org.apache.spark.sql.catalyst.plans.physical.Distribution
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec
+import org.apache.spark.sql.rapids.execution.{ConditionalHashJoinIterator, GpuCustomShuffleReaderExec, GpuHashJoin, GpuShuffleExchangeExecBase, HashJoinIterator}
+import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.vectorized.ColumnarBatch
+
+object GpuShuffledSymmetricHashJoinExec {
+  /** Utility class to track bound expressions and expression metadata related to a join. */
+  case class BoundJoinExprs(
+      boundBuildKeys: Seq[GpuExpression],
+      buildTypes: Array[DataType],
+      boundStreamKeys: Seq[GpuExpression],
+      streamTypes: Array[DataType],
+      streamOutput: Seq[Attribute],
+      boundCondition: Option[GpuExpression],
+      numFirstConditionTableColumns: Int,
+      compareNullsEqual: Boolean,
+      buildSideNeedsNullFilter: Boolean)
+
+  object BoundJoinExprs {
+    /**
+     * Utility to bind join expressions and produce a BoundJoinExprs result. Note that this should
+     * be called with the build side that was dynamically determined after probing the join inputs.
+     */
+    def bind(
+        leftKeys: Seq[Expression],
+        leftOutput: Seq[Attribute],
+        rightKeys: Seq[Expression],
+        rightOutput: Seq[Attribute],
+        condition: Option[Expression],
+        buildSide: GpuBuildSide): BoundJoinExprs = {
+      val leftTypes = leftOutput.map(_.dataType).toArray
+      val rightTypes = rightOutput.map(_.dataType).toArray
+      val boundLeftKeys = GpuBindReferences.bindGpuReferences(leftKeys, leftOutput)
+      val boundRightKeys = GpuBindReferences.bindGpuReferences(rightKeys, rightOutput)
+      val boundCondition = condition.map { c =>
+        GpuBindReferences.bindGpuReference(c, leftOutput ++ rightOutput)
+      }
+      val (boundBuildKeys, buildTypes, boundStreamKeys, streamTypes, streamOutput) =
+        buildSide match {
+          case GpuBuildRight => (boundRightKeys, rightTypes, boundLeftKeys, leftTypes, leftOutput)
+          case GpuBuildLeft => (boundLeftKeys, leftTypes, boundRightKeys, rightTypes, rightOutput)
+      }
+      val compareNullsEqual = GpuHashJoin.anyNullableStructChild(boundBuildKeys)
+      val needNullFilter = compareNullsEqual && boundBuildKeys.exists(_.nullable)
+      BoundJoinExprs(boundBuildKeys, buildTypes, boundStreamKeys, streamTypes, streamOutput,
+        boundCondition, leftOutput.size, compareNullsEqual, needNullFilter)
+    }
+  }
+
+  /** Utility class to track information related to a join. */
+  class JoinInfo(
+      val buildSide: GpuBuildSide,
+      val buildIter: Iterator[ColumnarBatch],
+      val buildSize: Long,
+      val streamIter: Iterator[ColumnarBatch],
+      val exprs: BoundJoinExprs)
+
+  /**
+   * Trait to house common code for determining the ideal build/stream
+   * assignments for symmetric joins.
+   */
+  trait SymmetricJoinSizer[T <: AutoCloseable] {
+    /** Wrap, if necessary, an iterator in preparation for probing the size before a join. */
+    def setupForProbe(iter: Iterator[ColumnarBatch]): Iterator[T]
+
+    /**
+     * Build an iterator in preparation for using it for sub-joins.
+     *
+     * @param queue a possibly empty queue of data that has already been fetched from the underlying
+     *              iterator as part of probing sizes of the join inputs
+     * @param remainingIter the data remaining to be fetched from the original iterator. Iterating
+     *                      the queue followed by this iterator reconstructs the iteration order of
+     *                      the original input iterator.
+     * @param batchTypes the schema of the data
+     * @param gpuBatchSizeBytes target GPU batch size in bytes
+     * @param metrics metrics to update (e.g.: if coalescing batches)
+     * @return iterator of columnar batches to use in sub-joins
+     */
+    def setupForJoin(
+        queue: mutable.Queue[T],
+        remainingIter: Iterator[ColumnarBatch],
+        batchTypes: Array[DataType],
+        gpuBatchSizeBytes: Long,
+        metrics: Map[String, GpuMetric]): Iterator[ColumnarBatch]
+
+    /** Get the row count of a batch of data */
+    def getProbeBatchRowCount(batch: T): Long
+
+    /** Get the data size in bytes of a batch of data */
+    def getProbeBatchDataSize(batch: T): Long
+
+    /**
+     * Whether to start pulling from the left or right input iterator when probing for data sizes.
+     * This helps avoid grabbing the GPU semaphore too early when probing.
+     */
+    val startWithLeftSide: Boolean
+
+    /**
+     * Probe the left and right join inputs to determine which side should be used as the build
+     * side and which should be used as the stream side.
+     *
+     * @param leftKeys join keys for the left table
+     * @param leftOutput schema of the left table
+     * @param rawLeftIter iterator of batches for the left table
+     * @param rightKeys join keys for the right table
+     * @param rightOutput schema of the right table
+     * @param rawRightIter iterator of batches for the right table
+     * @param condition inequality portions of the join condition
+     * @param gpuBatchSizeBytes target GPU batch size
+     * @param metrics map of metrics to update
+     * @return join information including build side, bound expressions, etc.
+     */
+    def getJoinInfo(
+        leftKeys: Seq[Expression],
+        leftOutput: Seq[Attribute],
+        rawLeftIter: Iterator[ColumnarBatch],
+        rightKeys: Seq[Expression],
+        rightOutput: Seq[Attribute],
+        rawRightIter: Iterator[ColumnarBatch],
+        condition: Option[Expression],
+        gpuBatchSizeBytes: Long,
+        metrics: Map[String, GpuMetric]): JoinInfo = {
+      val leftTime = new LocalGpuMetric
+      val rightTime = new LocalGpuMetric
+      val buildTime = metrics(BUILD_TIME)
+      val streamTime = metrics(STREAM_TIME)
+      val leftIter = new CollectTimeIterator("probe left", setupForProbe(rawLeftIter), leftTime)
+      val rightIter = new CollectTimeIterator("probe right", setupForProbe(rawRightIter), rightTime)
+      closeOnExcept(mutable.Queue.empty[T]) { leftQueue =>
+        closeOnExcept(mutable.Queue.empty[T]) { rightQueue =>
+          var leftSize = 0L
+          var rightSize = 0L
+          var buildSide: GpuBuildSide = null
+          while (buildSide == null) {
+            if (leftSize < rightSize || (startWithLeftSide && leftSize == rightSize)) {
+              if (leftIter.hasNext) {
+                val leftBatch = leftIter.next()
+                if (getProbeBatchRowCount(leftBatch) > 0) {
+                  leftQueue += leftBatch
+                  leftSize += getProbeBatchDataSize(leftBatch)
+                }
+              } else {
+                buildSide = GpuBuildLeft
+              }
+            } else {
+              if (rightIter.hasNext) {
+                val rightBatch = rightIter.next()
+                if (getProbeBatchRowCount(rightBatch) > 0) {
+                  rightQueue += rightBatch
+                  rightSize += getProbeBatchDataSize(rightBatch)
+                }
+              } else {
+                buildSide = GpuBuildRight
+              }
+            }
+          }
+          val exprs = BoundJoinExprs.bind(leftKeys, leftOutput, rightKeys, rightOutput,
+            condition, buildSide)
+          val (buildQueue, buildSize, streamQueue, rawStreamIter) = buildSide match {
+            case GpuBuildRight =>
+              buildTime += rightTime.value
+              streamTime += leftTime.value
+              (rightQueue, rightSize, leftQueue, rawLeftIter)
+            case GpuBuildLeft =>
+              buildTime += leftTime.value
+              streamTime += rightTime.value
+              (leftQueue, leftSize, rightQueue, rawRightIter)
+          }
+          metrics(BUILD_DATA_SIZE).set(buildSize)
+          val baseBuildIter = setupForJoin(buildQueue, Iterator.empty, exprs.buildTypes,
+            gpuBatchSizeBytes, metrics)
+          val buildIter = if (exprs.buildSideNeedsNullFilter) {
+            new NullFilteredBatchIterator(baseBuildIter, exprs.boundBuildKeys, metrics(OP_TIME))
+          } else {
+            baseBuildIter
+          }
+          val streamIter = new CollectTimeIterator("fetch join stream",
+            setupForJoin(streamQueue, rawStreamIter, exprs.streamTypes, gpuBatchSizeBytes, metrics),
+            streamTime)
+          new JoinInfo(buildSide, buildIter, buildSize, streamIter, exprs)
+        }
+      }
+    }
+  }
+
+  /**
+   * Join sizer to use when both the left and right table are coming directly from a shuffle and
+   * the data will be on the host. Caches shuffle batches in host memory while probing without
+   * grabbing the GPU semaphore.
+   */
+  class HostHostJoinSizer extends SymmetricJoinSizer[SpillableHostConcatResult] {
+
+    override def setupForProbe(
+        iter: Iterator[ColumnarBatch]): Iterator[SpillableHostConcatResult] = {
+      new SpillableHostConcatResultFromColumnarBatchIterator(iter)
+    }
+
+    override def setupForJoin(
+        queue: mutable.Queue[SpillableHostConcatResult],
+        remainingIter: Iterator[ColumnarBatch],
+        batchTypes: Array[DataType],
+        gpuBatchSizeBytes: Long,
+        metrics: Map[String, GpuMetric]): Iterator[ColumnarBatch] = {
+      val concatMetrics = getConcatMetrics(metrics)
+      val bufferedCoalesceIter = new CloseableBufferedIterator(
+        new HostShuffleCoalesceIterator(
+          new HostQueueBatchIterator(queue, remainingIter),
+          gpuBatchSizeBytes,
+          concatMetrics))
+      // Force a coalesce of the first batch before we grab the GPU semaphore
+      bufferedCoalesceIter.headOption
+      new GpuShuffleCoalesceIterator(bufferedCoalesceIter, batchTypes, concatMetrics)
+    }
+
+    override def getProbeBatchRowCount(batch: SpillableHostConcatResult): Long = {
+      batch.header.getNumRows
+    }
+
+    override def getProbeBatchDataSize(batch: SpillableHostConcatResult): Long = {
+      batch.header.getDataLen
+    }
+
+    override val startWithLeftSide: Boolean = true
+  }
+
+  /**
+   * Join sizer to use when at least one side of the join is coming from another GPU exec node
+   * such that the GPU semaphore is already held. Caches input batches on the GPU.
+   *
+   * @param startWithLeftSide whether to prefer fetching from the left or right side first
+   *                          when probing for table sizes.
+   */
+  class SpillableColumnarBatchJoinSizer(
+      override val startWithLeftSide: Boolean) extends SymmetricJoinSizer[SpillableColumnarBatch] {
+
+    override def setupForProbe(iter: Iterator[ColumnarBatch]): Iterator[SpillableColumnarBatch] = {
+      iter.map(batch => SpillableColumnarBatch(batch, SpillPriorities.ACTIVE_BATCHING_PRIORITY))
+    }
+
+    override def setupForJoin(
+        queue: mutable.Queue[SpillableColumnarBatch],
+        remainingIter: Iterator[ColumnarBatch],
+        batchTypes: Array[DataType],
+        gpuBatchSizeBytes: Long,
+        metrics: Map[String, GpuMetric]): Iterator[ColumnarBatch] = {
+      new SpillableColumnarBatchQueueIterator(queue, remainingIter)
+    }
+
+    override def getProbeBatchRowCount(batch: SpillableColumnarBatch): Long = batch.numRows()
+
+    override def getProbeBatchDataSize(batch: SpillableColumnarBatch): Long = batch.sizeInBytes
+  }
+
+  def getConcatMetrics(metrics: Map[String, GpuMetric]): Map[String, GpuMetric] = {
+    // Use a filtered metrics map to avoid output batch counts and other unrelated metric updates
+    Map(
+      OP_TIME -> metrics(OP_TIME),
+      CONCAT_TIME -> metrics(CONCAT_TIME)).withDefaultValue(NoopMetric)
+  }
+
+  def createJoinIterator(
+      info: JoinInfo,
+      spillableBuiltBatch: LazySpillableColumnarBatch,
+      lazyStream: Iterator[LazySpillableColumnarBatch],
+      gpuBatchSizeBytes: Long,
+      opTime: GpuMetric,
+      joinTime: GpuMetric): Iterator[ColumnarBatch] = {
+    if (info.exprs.boundCondition.isDefined) {
+      // ConditionalHashJoinIterator will close the compiled condition
+      val compiledCondition = info.exprs.boundCondition.get.convertToAst(
+        info.exprs.numFirstConditionTableColumns).compile()
+      new ConditionalHashJoinIterator(spillableBuiltBatch, info.exprs.boundBuildKeys,
+        lazyStream, info.exprs.boundStreamKeys, info.exprs.streamOutput, compiledCondition,
+        gpuBatchSizeBytes, Inner, info.buildSide, info.exprs.compareNullsEqual,
+        opTime, joinTime)
+    } else {
+      new HashJoinIterator(spillableBuiltBatch, info.exprs.boundBuildKeys,
+        lazyStream, info.exprs.boundStreamKeys, info.exprs.streamOutput,
+        gpuBatchSizeBytes, Inner, info.buildSide, info.exprs.compareNullsEqual,
+        opTime, joinTime)
+    }
+  }
+}
+
+/**
+ * A GPU shuffled hash join optimized to handle inner joins. Probes the sizes of the input tables
+ * before performing the join to determine which to use as the build side.
+ *
+ * @param leftKeys join keys for the left table
+ * @param rightKeys join keys for the right table
+ * @param condition inequality portions of the join condition
+ * @param left plan for the left table
+ * @param right plan for the right table
+ * @param isGpuShuffle whether the shuffle is GPU-centric (e.g.: UCX-based)
+ * @param gpuBatchSizeBytes target GPU batch size
+ * @param cpuLeftKeys original CPU expressions for the left join keys
+ * @param cpuRightKeys original CPU expressions for the right join keys
+ */
+case class GpuShuffledSymmetricHashJoinExec(
+    leftKeys: Seq[Expression],
+    rightKeys: Seq[Expression],
+    condition: Option[Expression],
+    left: SparkPlan,
+    right: SparkPlan,
+    isGpuShuffle: Boolean,
+    gpuBatchSizeBytes: Long)(
+    cpuLeftKeys: Seq[Expression],
+    cpuRightKeys: Seq[Expression]) extends ShimBinaryExecNode with GpuExec {
+  import GpuShuffledSymmetricHashJoinExec._
+
+  override def otherCopyArgs: Seq[AnyRef] = Seq(cpuLeftKeys, cpuRightKeys)
+
+  override val outputRowsLevel: MetricsLevel = ESSENTIAL_LEVEL
+  override val outputBatchesLevel: MetricsLevel = MODERATE_LEVEL
+  override lazy val additionalMetrics: Map[String, GpuMetric] = Map(
+    OP_TIME -> createNanoTimingMetric(MODERATE_LEVEL, DESCRIPTION_OP_TIME),
+    CONCAT_TIME -> createNanoTimingMetric(DEBUG_LEVEL, DESCRIPTION_CONCAT_TIME),
+    BUILD_DATA_SIZE -> createSizeMetric(ESSENTIAL_LEVEL, DESCRIPTION_BUILD_DATA_SIZE),
+    BUILD_TIME -> createNanoTimingMetric(ESSENTIAL_LEVEL, DESCRIPTION_BUILD_TIME),
+    STREAM_TIME -> createNanoTimingMetric(DEBUG_LEVEL, DESCRIPTION_STREAM_TIME),
+    JOIN_TIME -> createNanoTimingMetric(DEBUG_LEVEL, DESCRIPTION_JOIN_TIME))
+
+  override def requiredChildDistribution: Seq[Distribution] =
+    Seq(GpuHashPartitioning.getDistribution(cpuLeftKeys),
+      GpuHashPartitioning.getDistribution(cpuRightKeys))
+
+  override def output: Seq[Attribute] = left.output ++ right.output
+
+  override def doExecute(): RDD[InternalRow] = {
+    throw new IllegalStateException(s"${this.getClass} does not support row-based execution")
+  }
+
+  override def internalDoExecuteColumnar(): RDD[ColumnarBatch] = {
+    val localLeftKeys = leftKeys
+    val leftOutput = left.output
+    val isLeftHost = isHostBatchProducer(left)
+    val localRightKeys = rightKeys
+    val rightOutput = right.output
+    val isRightHost = isHostBatchProducer(right)
+    val localCondition = condition
+    val localGpuBatchSizeBytes = gpuBatchSizeBytes
+    val localMetrics = allMetrics.withDefaultValue(NoopMetric)
+    left.executeColumnar().zipPartitions(right.executeColumnar()) { case (leftIter, rightIter) =>
+      val joinInfo = (isLeftHost, isRightHost) match {
+        case (true, true) =>
+          getHostHostJoinInfo(localLeftKeys, leftOutput, leftIter,
+            localRightKeys, rightOutput, rightIter,
+            localCondition, localGpuBatchSizeBytes, localMetrics)
+        case (true, false) =>
+          getHostGpuJoinInfo(localLeftKeys, leftOutput, leftIter,
+            localRightKeys, rightOutput, rightIter,
+            localCondition, localGpuBatchSizeBytes, localMetrics)
+        case (false, true) =>
+          getGpuHostJoinInfo(localLeftKeys, leftOutput, leftIter,
+            localRightKeys, rightOutput, rightIter,
+            localCondition, localGpuBatchSizeBytes, localMetrics)
+        case (false, false) =>
+          getGpuGpuJoinInfo(localLeftKeys, leftOutput, leftIter,
+            localRightKeys, rightOutput, rightIter,
+            localCondition, localGpuBatchSizeBytes, localMetrics)
+      }
+      val joinIterator = if (joinInfo.buildSize <= localGpuBatchSizeBytes) {
+        if (joinInfo.buildSize == 0) {
+          Iterator.empty
+        } else {
+          doSmallBuildJoin(joinInfo, localGpuBatchSizeBytes, localMetrics)
+        }
+      } else {
+        doBigBuildJoin(joinInfo, localGpuBatchSizeBytes, localMetrics)
+      }
+      val numOutputRows = localMetrics(NUM_OUTPUT_ROWS)
+      val numOutputBatches = localMetrics(NUM_OUTPUT_BATCHES)
+      joinIterator.map { cb =>
+        numOutputRows += cb.numRows()
+        numOutputBatches += 1
+        cb
+      }
+    }
+  }
+
+  /**
+   * Perform a join where the build side fits in a single GPU batch.
+   *
+   * @param info join information from the probing phase
+   * @param gpuBatchSizeBytes target GPU batch size
+   * @param metricsMap metrics to update
+   * @return iterator to produce the results of the join
+   */
+  private def doSmallBuildJoin(
+      info: JoinInfo,
+      gpuBatchSizeBytes: Long,
+      metricsMap: Map[String, GpuMetric]): Iterator[ColumnarBatch] = {
+    val opTime = metricsMap(OP_TIME)
+    val lazyStream = new Iterator[LazySpillableColumnarBatch]() {
+      override def hasNext: Boolean = info.streamIter.hasNext
+
+      override def next(): LazySpillableColumnarBatch = {
+        withResource(info.streamIter.next()) { batch =>
+          LazySpillableColumnarBatch(batch, "stream_batch")
+        }
+      }
+    }
+    val buildIter = new GpuCoalesceIterator(
+      info.buildIter,
+      info.exprs.buildTypes,
+      RequireSingleBatch,
+      numInputRows = NoopMetric,
+      numInputBatches = NoopMetric,
+      numOutputRows = NoopMetric,
+      numOutputBatches = NoopMetric,
+      collectTime = NoopMetric,
+      concatTime = metricsMap(CONCAT_TIME),
+      opTime = opTime,
+      opName = "build batch")
+    assert(buildIter.hasNext, "build side should not be empty")
+    val spillableBuiltBatch = withResource(buildIter.next()) { batch =>
+      assert(!buildIter.hasNext, "build side should have a single batch")
+      LazySpillableColumnarBatch(batch, "built")
+    }
+    createJoinIterator(info, spillableBuiltBatch, lazyStream, gpuBatchSizeBytes, opTime,
+      metricsMap(JOIN_TIME))
+  }
+
+  /**
+   * Perform a join where the build side does not fit in a single GPU batch.
+   *
+   * @param info join information from the probing phase
+   * @param gpuBatchSizeBytes target GPU batch size
+   * @param metricsMap metrics to update
+   * @return iterator to produce the results of the join
+   */
+  private def doBigBuildJoin(
+      info: JoinInfo,
+      gpuBatchSizeBytes: Long,
+      metricsMap: Map[String, GpuMetric]): Iterator[ColumnarBatch] = {
+    new BigInnerJoinIterator(info, gpuBatchSizeBytes, metricsMap)
+  }
+
+  /**
+   * Probe for join information when both inputs are coming from host memory (i.e.: both
+   * inputs are coming from a shuffle when not using a GPU-centered shuffle manager).
+   */
+  private def getHostHostJoinInfo(
+      leftKeys: Seq[Expression],
+      leftOutput: Seq[Attribute],
+      leftIter: Iterator[ColumnarBatch],
+      rightKeys: Seq[Expression],
+      rightOutput: Seq[Attribute],
+      rightIter: Iterator[ColumnarBatch],
+      condition: Option[Expression],
+      gpuBatchSizeBytes: Long,
+      metrics: Map[String, GpuMetric]): JoinInfo = {
+    val sizer = new HostHostJoinSizer()
+    sizer.getJoinInfo(leftKeys, leftOutput, leftIter, rightKeys, rightOutput, rightIter,
+      condition, gpuBatchSizeBytes, metrics)
+  }
+
+  /**
+   * Probe for join information when the left input is coming from host memory and the
+   * right table is coming from GPU memory.
+   */
+  private def getHostGpuJoinInfo(
+      leftKeys: Seq[Expression],
+      leftOutput: Seq[Attribute],
+      rawLeftIter: Iterator[ColumnarBatch],
+      rightKeys: Seq[Expression],
+      rightOutput: Seq[Attribute],
+      rightIter: Iterator[ColumnarBatch],
+      condition: Option[Expression],
+      gpuBatchSizeBytes: Long,
+      metrics: Map[String, GpuMetric]): JoinInfo = {
+    val sizer = new SpillableColumnarBatchJoinSizer(startWithLeftSide = true)
+    val concatMetrics = getConcatMetrics(metrics)
+    val leftIter = new GpuShuffleCoalesceIterator(
+      new HostShuffleCoalesceIterator(rawLeftIter, gpuBatchSizeBytes, concatMetrics),
+      leftOutput.map(_.dataType).toArray,
+      concatMetrics)
+    sizer.getJoinInfo(leftKeys, leftOutput, leftIter, rightKeys, rightOutput, rightIter,
+      condition, gpuBatchSizeBytes, metrics)
+  }
+
+  /**
+   * Probe for the join information when the left input is coming from GPU memory and the
+   * right table is coming from host memory.
+   */
+  private def getGpuHostJoinInfo(
+      leftKeys: Seq[Expression],
+      leftOutput: Seq[Attribute],
+      leftIter: Iterator[ColumnarBatch],
+      rightKeys: Seq[Expression],
+      rightOutput: Seq[Attribute],
+      rawRightIter: Iterator[ColumnarBatch],
+      condition: Option[Expression],
+      gpuBatchSizeBytes: Long,
+      metrics: Map[String, GpuMetric]): JoinInfo = {
+    val sizer = new SpillableColumnarBatchJoinSizer(startWithLeftSide = false)
+    val concatMetrics = getConcatMetrics(metrics)
+    val rightIter = new GpuShuffleCoalesceIterator(
+      new HostShuffleCoalesceIterator(rawRightIter, gpuBatchSizeBytes, concatMetrics),
+      rightOutput.map(_.dataType).toArray,
+      concatMetrics)
+    sizer.getJoinInfo(leftKeys, leftOutput, leftIter, rightKeys, rightOutput, rightIter,
+      condition, gpuBatchSizeBytes, metrics)
+  }
+
+  /**
+   * Probe for the join information when both inputs are coming from GPU memory.
+   */
+  private def getGpuGpuJoinInfo(
+      leftKeys: Seq[Expression],
+      leftOutput: Seq[Attribute],
+      leftIter: Iterator[ColumnarBatch],
+      rightKeys: Seq[Expression],
+      rightOutput: Seq[Attribute],
+      rightIter: Iterator[ColumnarBatch],
+      condition: Option[Expression],
+      gpuBatchSizeBytes: Long,
+      metrics: Map[String, GpuMetric]): JoinInfo = {
+    val sizer = new SpillableColumnarBatchJoinSizer(startWithLeftSide = true)
+    sizer.getJoinInfo(leftKeys, leftOutput, leftIter, rightKeys, rightOutput, rightIter,
+      condition, gpuBatchSizeBytes, metrics)
+  }
+
+  /**
+   * Determines if a plan produces data in host memory.
+   *
+   * @param plan the plan to check
+   * @return true if the plan produces batches in host memory, false otherwise
+   */
+  private def isHostBatchProducer(plan: SparkPlan): Boolean = {
+    if (isGpuShuffle) {
+      false
+    } else {
+      plan match {
+        case _: GpuShuffleCoalesceExec =>
+          throw new IllegalStateException("Should not have shuffle coalesce before this node")
+        case _: GpuShuffleExchangeExecBase | _: GpuCustomShuffleReaderExec => true
+        case _: ShuffleQueryStageExec => true
+        case _ => false
+      }
+    }
+  }
+}
+
+/**
+ * A spillable form of a HostConcatResult. Takes ownership of the specified host buffer.
+ */
+class SpillableHostConcatResult(
+    val header: SerializedTableHeader,
+    hmb: HostMemoryBuffer) extends AutoCloseable {
+  private var buffer = {
+    SpillableHostBuffer(hmb, hmb.getLength, SpillPriorities.ACTIVE_BATCHING_PRIORITY)
+  }
+
+  def getHostMemoryBufferAndClose(): HostMemoryBuffer = {
+    val hostBuffer = buffer.getHostBuffer()
+    closeOnExcept(hostBuffer) { _ =>
+      close()
+    }
+    hostBuffer
+  }
+
+  override def close(): Unit = {
+    buffer.close()
+    buffer = null
+  }
+}
+
+/**
+ * Converts an iterator of shuffle batches in host memory into an iterator of spillable
+ * host memory batches.
+ */
+class SpillableHostConcatResultFromColumnarBatchIterator(
+    iter: Iterator[ColumnarBatch]) extends Iterator[SpillableHostConcatResult] {
+  override def hasNext: Boolean = iter.hasNext
+
+  override def next(): SpillableHostConcatResult = {
+    withResource(iter.next()) { batch =>
+      require(batch.numCols() > 0, "Batch must have at least 1 column")
+      batch.column(0) match {
+        case col: SerializedTableColumn =>
+          val buffer = col.hostBuffer
+          buffer.incRefCount()
+          new SpillableHostConcatResult(col.header, buffer)
+        case c =>
+          throw new IllegalStateException(s"Expected SerializedTableColumn, got ${c.getClass}")
+      }
+    }
+  }
+}
+
+/**
+ * Iterator that produces SerializedTableColumn batches from a queue of spillable host memory
+ * batches that were fetched first during probing and the (possibly empty) remaining iterator of
+ * un-probed host memory batches. The iterator returns the queue elements first, followed by
+ * the elements of the remaining iterator.
+ *
+ * @param spillableQueue queue of spillable host memory batches
+ * @param batchIter iterator of remaining host memory batches
+ */
+class HostQueueBatchIterator(
+    spillableQueue: mutable.Queue[SpillableHostConcatResult],
+    batchIter: Iterator[ColumnarBatch]) extends GpuColumnarBatchIterator(true) {
+  override def hasNext: Boolean = spillableQueue.nonEmpty || batchIter.hasNext
+
+  override def next(): ColumnarBatch = {
+    if (spillableQueue.nonEmpty) {
+      val shcr = spillableQueue.dequeue()
+      closeOnExcept(shcr.getHostMemoryBufferAndClose()) { hostBuffer =>
+        SerializedTableColumn.from(shcr.header, hostBuffer)
+      }
+    } else {
+      batchIter.next()
+    }
+  }
+
+  override def doClose(): Unit = {
+    spillableQueue.safeClose()
+  }
+}
+
+/**
+ * Iterator that produces columnar batches from a queue of spillable batches that were fetched
+ * first during probing and the (possibly empty) remaining iterator of un-probed batches. The
+ * iterator returns the queue elements first, followed by the elements of the remaining iterator.
+ */
+class SpillableColumnarBatchQueueIterator(
+    queue: mutable.Queue[SpillableColumnarBatch],
+    batchIter: Iterator[ColumnarBatch]) extends GpuColumnarBatchIterator(true) {
+
+  override def hasNext: Boolean = queue.nonEmpty || batchIter.hasNext
+
+  override def next(): ColumnarBatch = {
+    if (queue.nonEmpty) {
+      withResource(queue.dequeue()) { spillable =>
+        spillable.getColumnarBatch()
+      }
+    } else {
+      batchIter.next()
+    }
+  }
+
+  override def doClose(): Unit = {
+    queue.safeClose()
+  }
+}
+
+/**
+ * Iterator that filters out rows with null keys.
+ *
+ * @param iter iterator of batches to filter
+ * @param boundKeys expressions to produce the keys
+ * @param opTime metric to update for time taken during the filter operation
+ */
+class NullFilteredBatchIterator(
+    iter: Iterator[ColumnarBatch],
+    boundKeys: Seq[Expression],
+    opTime: GpuMetric) extends Iterator[ColumnarBatch] with AutoCloseable {
+  private var onDeck: Option[ColumnarBatch] = None
+
+  onTaskCompletion(close())
+
+  override def hasNext: Boolean = {
+    while (onDeck.isEmpty && iter.hasNext) {
+      val batch = withResource(iter.next()) { batch =>
+        opTime.ns {
+          val spillable = SpillableColumnarBatch(batch, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
+          GpuHashJoin.filterNullsWithRetryAndClose(spillable, boundKeys)
+        }
+      }
+      if (batch.numRows > 0) {
+        onDeck = Some(batch)
+      } else {
+        batch.close()
+      }
+    }
+    onDeck.nonEmpty
+  }
+
+  override def next(): ColumnarBatch = {
+    if (!hasNext) {
+      throw new NoSuchElementException("no more batches")
+    }
+    val batch = onDeck.get
+    onDeck = None
+    batch
+  }
+
+  override def close(): Unit = {
+    onDeck.foreach(_.close())
+    onDeck = None
+  }
+}
+
+/** Tracks a collection of batches associated with a partition in a large join */
+class JoinPartition extends AutoCloseable {
+  private var totalSize = 0L
+  private val batches = new mutable.ArrayBuffer[SpillableColumnarBatch]()
+
+  def addPartitionBatch(part: SpillableColumnarBatch, dataSize: Long): Unit = {
+    batches.append(part)
+    totalSize += dataSize
+  }
+
+  def getTotalSize: Long = totalSize
+
+  def releaseBatches(): Array[SpillableColumnarBatch] = {
+    val result = batches.toArray
+    batches.clear()
+    totalSize = 0
+    result
+  }
+
+  override def close(): Unit = {
+    batches.safeClose()
+    batches.clear()
+    totalSize = 0L
+  }
+}
+
+/**
+ * Base class for a partitioner in a large join.
+ *
+ * @param numPartitions number of partitions being used in the join
+ * @param batchTypes schema of the batches
+ * @param boundJoinKeys bound keys used in the join
+ * @param metrics metrics to update
+ */
+abstract class JoinPartitioner(
+    numPartitions: Int,
+    batchTypes: Array[DataType],
+    boundJoinKeys: Seq[Expression],
+    metrics: Map[String, GpuMetric]) extends AutoCloseable {
+  protected val partitions: Array[JoinPartition] =
+    (0 until numPartitions).map(_ => new JoinPartition).toArray
+  protected val opTime = metrics(OP_TIME)
+
+  /**
+   * Hash partitions a batch in preparation for performing a sub-join. The input batch will
+   * be closed by this method.
+   */
+  protected def partitionBatch(inputBatch: ColumnarBatch): Unit = {
+    val spillableBatch = SpillableColumnarBatch(inputBatch, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
+    withRetryNoSplit(spillableBatch) { _ =>
+      opTime.ns {
+        val partsTable = GpuHashPartitioningBase.hashPartitionAndClose(
+          spillableBatch.getColumnarBatch(), boundJoinKeys, numPartitions, "partition for join",
+          JoinPartitioner.HASH_SEED)
+        val contigTables = withResource(partsTable) { _ =>
+          partsTable.getTable.contiguousSplit(partsTable.getPartitions.tail: _*)
+        }
+        withResource(contigTables) { _ =>
+          contigTables.zipWithIndex.foreach { case (ct, i) =>
+            if (ct.getRowCount > 0) {
+              contigTables(i) = null
+              addPartition(i, ct)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  /**
+   * Adds a batch associated with a partition.
+   *
+   * @param partIndex index of the partition
+   * @param ct contiguous table to add to the specified partition
+   */
+  protected def addPartition(partIndex: Int, ct: ContiguousTable): Unit = {
+    val dataSize = ct.getBuffer.getLength
+    partitions(partIndex).addPartitionBatch(SpillableColumnarBatch(ct, batchTypes,
+      SpillPriorities.ACTIVE_BATCHING_PRIORITY), dataSize)
+  }
+
+  override def close(): Unit = {
+    partitions.safeClose()
+  }
+}
+
+object JoinPartitioner {
+  /**
+   * A seed to use for the hash partitioner when sub-partitioning. Needs to be different than
+   * the seed used by Spark for partitioning a hash join (i.e.: 42)
+   */
+  val HASH_SEED: Int = 100
+}
+
+/**
+ * Join partitioner for the build side of a large join where the build side of the join does not
+ * fit in a single GPU batch.
+ *
+ * @param numPartitions number of partitions being used in the join
+ * @param buildSideIter iterator of build side batches
+ * @param buildSideTypes schema of the build side batches
+ * @param boundBuildKeys bound join key expressions for the build side
+ * @param gpuBatchSizeBytes target GPU batch size
+ * @param metrics metrics to update
+ */
+class BuildSidePartitioner(
+    val numPartitions: Int,
+    buildSideIter: Iterator[ColumnarBatch],
+    buildSideTypes: Array[DataType],
+    boundBuildKeys: Seq[Expression],
+    gpuBatchSizeBytes: Long,
+    metrics: Map[String, GpuMetric])
+  extends JoinPartitioner(numPartitions, buildSideTypes, boundBuildKeys, metrics) {
+
+  // partition all of the build-side batches
+  closeOnExcept(partitions) { _ =>
+    while (buildSideIter.hasNext) {
+      partitionBatch(buildSideIter.next())
+    }
+  }
+
+  private val (emptyPartitions, joinGroups) = findEmptyAndJoinGroups()
+  private val partitionBatches = new Array[LazySpillableColumnarBatch](joinGroups.length)
+  private val concatTime = metrics(CONCAT_TIME)
+
+  /** Returns a BitSet where a set bit corresponds to an empty partition at that index. */
+  def getEmptyPartitions: BitSet = emptyPartitions
+
+  /**
+   * Returns a BitSet array where each BitSet corresponds to a "join group," a group of partitions
+   * that can together be used as a single build batch for a sub-join. A set bit in a join group's
+   * BitSet corresponds to a partition index that is part of the join group.
+   */
+  def getJoinGroups: Array[BitSet] = joinGroups
+
+  /**
+   * Returns the batch of data for the specified join group index. All of the data across the
+   * partitions comprising the join group are concatenated together to produce the batch.
+   * The concatenated batches are lazily cached, so the cost of concatenation is only incurred
+   * by the first caller for a particular join group.
+   *
+   * @param partitionGroupIndex the index of the join group for which to produce the batch
+   * @return the batch of data for the join group
+   */
+  def getBuildBatch(partitionGroupIndex: Int): LazySpillableColumnarBatch = {
+    var batch = partitionBatches(partitionGroupIndex)
+    if (batch == null) {
+      val spillBatchesBuffer = new mutable.ArrayBuffer[SpillableColumnarBatch]()
+      closeOnExcept(spillBatchesBuffer) { _ =>
+        joinGroups(partitionGroupIndex).foreach { i =>
+          val batches = partitions(i).releaseBatches()
+          assert(batches.nonEmpty)
+          spillBatchesBuffer ++= batches
+        }
+      }
+      val concatBatch = withRetryNoSplit(spillBatchesBuffer.toSeq) { spillBatches =>
+        val batchesToConcat = spillBatches.safeMap(_.getColumnarBatch()).toArray
+        opTime.ns {
+          concatTime.ns {
+            ConcatAndConsumeAll.buildNonEmptyBatchFromTypes(batchesToConcat, buildSideTypes)
+          }
+        }
+      }
+      withResource(concatBatch) { _ =>
+        batch = LazySpillableColumnarBatch(concatBatch, "build subtable")
+        partitionBatches(partitionGroupIndex) = batch
+      }
+    }
+    LazySpillableColumnarBatch.spillOnly(batch)
+  }
+
+  override def close(): Unit = {
+    partitions.safeClose()
+    partitionBatches.safeClose()
+  }
+
+  /**
+   * After partitioning the build-side table, find the set of partition indices
+   * that are empty partitions along with an array of partition index sets where
+   * each set of indices identifies a build-side sub-table that is ideally
+   * within the GPU batch size limits. If there is a partition that is larger than
+   * the target size, it will be in its own partition index set.
+   *
+   * @return the set of empty partition indices and an array of partition index
+   *         sets where each set identifies partitions that can be processed together
+   *         in a join pass.
+   */
+  private def findEmptyAndJoinGroups(): (BitSet, Array[BitSet]) = {
+    val emptyPartitions = new mutable.BitSet(numPartitions)
+    val joinGroups = new mutable.ArrayBuffer[BitSet]()
+    val sortedIndices = (0 until numPartitions).sortBy(i => partitions(i).getTotalSize)
+    val (emptyIndices, nonEmptyIndices) = sortedIndices.partition { i =>
+      partitions(i).getTotalSize == 0
+    }
+    emptyPartitions ++= emptyIndices
+    var group = new mutable.BitSet(numPartitions)
+    var groupSize = 0L
+    nonEmptyIndices.foreach { i =>
+      val newSize = groupSize + partitions(i).getTotalSize
+      if (newSize > gpuBatchSizeBytes) {
+        if (group.nonEmpty) {
+          joinGroups.append(group)
+        }
+        group = new mutable.BitSet(numPartitions)
+        groupSize = partitions(i).getTotalSize
+      } else {
+        groupSize = newSize
+      }
+      group.add(i)
+    }
+    if (group.nonEmpty) {
+      joinGroups.append(group)
+    }
+    (emptyPartitions, joinGroups.toArray)
+  }
+}
+
+/**
+ * Join partitioner for the stream side of a large join.
+ *
+ * @param numPartitions number of partitions being used in the join
+ * @param emptyBuildPartitions BitSet indicating which build side partitions are empty
+ * @param iter iterator of stream side batches
+ * @param streamTypes schema of the stream side batches
+ * @param boundStreamKeys bound join key expressions for the stream side
+ * @param metrics metrics to update
+ */
+class StreamSidePartitioner(
+    numPartitions: Int,
+    emptyBuildPartitions: BitSet,
+    iter: Iterator[ColumnarBatch],
+    streamTypes: Array[DataType],
+    boundStreamKeys: Seq[Expression],
+    metrics: Map[String, GpuMetric])
+  extends JoinPartitioner(numPartitions, streamTypes, boundStreamKeys, metrics) {
+
+  override protected def addPartition(partIndex: Int, ct: ContiguousTable): Unit = {
+    // Ignore partitions that correspond to empty build-side partitions, since
+    // no stream-side keys in this partition will match anything on the build-side.
+    if (emptyBuildPartitions.contains(partIndex)) {
+      ct.close()
+    } else {
+      super.addPartition(partIndex, ct)
+    }
+  }
+
+  def hasInputBatches: Boolean = iter.hasNext
+
+  def partitionNextBatch(): Unit = {
+    assert(partitions.forall(_.getTotalSize == 0), "leftover partitions from previous batch")
+    partitionBatch(iter.next())
+  }
+
+  def releasePartitions(partIndices: BitSet): Array[SpillableColumnarBatch] = {
+    partIndices.iterator.flatMap(i => partitions(i).releaseBatches().toSeq).toArray
+  }
+
+  override def close(): Unit = {
+    partitions.safeClose()
+  }
+}
+
+/**
+ * Iterator that produces the result of a large inner join where the build side of the join is
+ * too large for a single GPU batch. The prior join input probing phase has sized the build side
+ * of the join, so this partitions both the build side and stream side into N+1 partitions, where
+ * N is the size of the build side divided by the target GPU batch size.
+ *
+ * Once the build side is partitioned completely, the partitions are placed into "join groups"
+ * where all the build side data of a join group fits in the GPU target batch size. If the input
+ * data is skewed, a single build partition could be larger than the target GPU batch size.
+ * Currently such oversized partitions are placed in separate join groups consisting just of one
+ * partition each in the hopes that there will be enough GPU memory to proceed with the join
+ * despite the skew. We will need to revisit this for very large, skewed build side data arriving
+ * at a single task.
+ *
+ * Once the build side join groups are identified, each stream batch is partitioned into the same
+ * number of partitions as the build side using the same hash seed as the build side. The
+ * partitions from the batch are grouped into join groups matching the partition grouping from
+ * the build side, and each join group is processed as a sub-join. Once all the join groups for
+ * a stream batch have been processed, the next stream batch is fetched, partitioned, and sub-joins
+ * are processed against the build side join groups. Repeat until the stream side is exhausted.
+ *
+ * @param info join information from input probing phase
+ * @param gpuBatchSizeBytes target GPU batch size
+ * @param metrics metrics to update
+ */
+class BigInnerJoinIterator(
+    info: JoinInfo,
+    gpuBatchSizeBytes: Long,
+    metrics: Map[String, GpuMetric])
+  extends Iterator[ColumnarBatch] with TaskAutoCloseableResource {
+
+  private val buildPartitioner = {
+    val numPartitions = (info.buildSize / gpuBatchSizeBytes) + 1
+    require(numPartitions <= Int.MaxValue, "too many build partitions")
+    new BuildSidePartitioner(numPartitions.toInt, info.buildIter, info.exprs.buildTypes,
+      info.exprs.boundBuildKeys, gpuBatchSizeBytes, metrics)
+  }
+  use(buildPartitioner)
+
+  private val joinGroups = buildPartitioner.getJoinGroups
+  private var nextJoinGroupIndex = joinGroups.length
+
+  private val streamPartitioner = new StreamSidePartitioner(buildPartitioner.numPartitions,
+    buildPartitioner.getEmptyPartitions, info.streamIter, info.exprs.streamTypes,
+    info.exprs.boundStreamKeys, metrics)
+  use(streamPartitioner)
+
+  private var subIter: Option[Iterator[ColumnarBatch]] = None
+  private var isExhausted = joinGroups.isEmpty
+
+  override def hasNext: Boolean = {
+    if (isExhausted) {
+      false
+    } else if (subIter.exists(_.hasNext)) {
+      true
+    } else {
+      setupNextJoinIterator()
+      val result = subIter.exists(_.hasNext)
+      if (!result) {
+        isExhausted = true
+        close()
+      }
+      result
+    }
+  }
+
+  override def next(): ColumnarBatch = {
+    if (!hasNext) {
+      throw new NoSuchElementException("join batches exhausted")
+    }
+    subIter.get.next()
+  }
+
+  private def setupNextJoinIterator(): Unit = {
+    while (!isExhausted && !subIter.exists(_.hasNext)) {
+      if (nextJoinGroupIndex >= joinGroups.length) {
+        // try to pull in the next stream batch
+        if (streamPartitioner.hasInputBatches) {
+          streamPartitioner.partitionNextBatch()
+          nextJoinGroupIndex = 0
+          subIter = Some(moveToNextBuildGroup())
+        } else {
+          isExhausted = true
+          subIter = None
+        }
+      } else {
+        subIter = Some(moveToNextBuildGroup())
+      }
+    }
+  }
+
+  private def moveToNextBuildGroup(): Iterator[ColumnarBatch] = {
+    val builtBatch = buildPartitioner.getBuildBatch(nextJoinGroupIndex)
+    val group = joinGroups(nextJoinGroupIndex)
+    nextJoinGroupIndex += 1
+    val streamBatches = streamPartitioner.releasePartitions(group)
+    val lazyStream = new Iterator[LazySpillableColumnarBatch] {
+      onTaskCompletion(streamBatches.safeClose())
+
+      private var i = 0
+
+      override def hasNext: Boolean = i < streamBatches.length
+
+      override def next(): LazySpillableColumnarBatch = {
+        withResource(streamBatches(i)) { spillBatch =>
+          streamBatches(i) = null
+          i += 1
+          withResource(spillBatch.getColumnarBatch()) { batch =>
+            LazySpillableColumnarBatch(batch, "stream_batch")
+          }
+        }
+      }
+    }
+    GpuShuffledSymmetricHashJoinExec.createJoinIterator(info, builtBatch, lazyStream,
+      gpuBatchSizeBytes, metrics(OP_TIME), metrics(JOIN_TIME))
+  }
+}
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSortMergeJoinMeta.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSortMergeJoinMeta.scala
index bd24d9e6b4d..15cabdcb7a9 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSortMergeJoinMeta.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSortMergeJoinMeta.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -16,6 +16,7 @@
 
 package com.nvidia.spark.rapids
 
+import org.apache.spark.sql.catalyst.plans.Inner
 import org.apache.spark.sql.execution.SortExec
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.rapids.execution.{GpuHashJoin, JoinTypeChecks}
@@ -81,17 +82,31 @@ class GpuSortMergeJoinMeta(
       (None, condition)
     }
     val Seq(left, right) = childPlans.map(_.convertIfNeeded())
-    val joinExec = GpuShuffledHashJoinExec(
-      leftKeys.map(_.convertToGpu()),
-      rightKeys.map(_.convertToGpu()),
-      join.joinType,
-      buildSide,
-      joinCondition,
-      left,
-      right,
-      join.isSkewJoin)(
-      join.leftKeys,
-      join.rightKeys)
+    val joinExec = join.joinType match {
+      case Inner if conf.useShuffledSymmetricHashJoin =>
+        GpuShuffledSymmetricHashJoinExec(
+          leftKeys.map(_.convertToGpu()),
+          rightKeys.map(_.convertToGpu()),
+          joinCondition,
+          left,
+          right,
+          conf.isGPUShuffle,
+          conf.gpuTargetBatchSizeBytes)(
+          join.leftKeys,
+          join.rightKeys)
+      case _ =>
+        GpuShuffledHashJoinExec(
+          leftKeys.map(_.convertToGpu()),
+          rightKeys.map(_.convertToGpu()),
+          join.joinType,
+          buildSide,
+          joinCondition,
+          left,
+          right,
+          join.isSkewJoin)(
+          join.leftKeys,
+          join.rightKeys)
+    }
     // For inner joins we can apply a post-join condition for any conditions that cannot be
     // evaluated directly in a mixed join that leverages a cudf AST expression
     filterCondition.map(c => GpuFilterExec(c,
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
index 20a9482b70c..48f9de5a61a 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Attribut
 import org.apache.spark.sql.catalyst.plans.physical.IdentityBroadcastMode
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.execution._
-import org.apache.spark.sql.execution.adaptive.{AdaptiveSparkPlanExec, BroadcastQueryStageExec, ShuffleQueryStageExec}
+import org.apache.spark.sql.execution.adaptive._
 import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec
 import org.apache.spark.sql.execution.command.{DataWritingCommandExec, ExecutedCommandExec}
 import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanExecBase, DropTableExec, ShowTablesExec}
@@ -106,40 +106,40 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
           GpuRowToColumnarExec(newChild, goal)
       }
 
-      // adaptive plan final query stage with columnar output
-      case r2c @ RowToColumnarExec(child) if parent.isEmpty =>
-        val optimizedChild = optimizeAdaptiveTransitions(child, Some(r2c))
-        val projectedChild =
-          optimizedChild.getTagValue(GpuOverrides.preRowToColProjection).map { exprs =>
-            ProjectExec(exprs, optimizedChild)
-          }.getOrElse(optimizedChild)
-        GpuRowToColumnarExec(projectedChild, TargetSize(rapidsConf.gpuTargetBatchSizeBytes))
-
-      case ColumnarToRowExec(bb: GpuBringBackToHost) =>
-        // We typically want the final operator in the plan (the operator that has no parent) to be
-        // wrapped in `ColumnarToRowExec(GpuBringBackToHost(_))` operators to
-        // bring the data back onto the host and be translated to rows so that it can be returned
-        // from the Spark API. However, in the case of AQE, each exchange operator is treated as an
-        // individual query with no parent and we need to remove these operators in this case
-        // because we need to return an operator that implements `BroadcastExchangeLike` or
-        // `ShuffleExchangeLike`.
-        bb.child match {
-          case GpuShuffleCoalesceExec(e: GpuShuffleExchangeExecBase, _) if parent.isEmpty =>
-            // The coalesce step gets added back into the plan later on, in a
-            // future query stage that reads the output from this query stage. This
-            // is handled in the case clauses below.
-            e.withNewChildren(e.children.map(c => optimizeAdaptiveTransitions(c, Some(e))))
-          case GpuCoalesceBatches(e: GpuShuffleExchangeExecBase, _) if parent.isEmpty =>
-            // The coalesce step gets added back into the plan later on, in a
-            // future query stage that reads the output from this query stage. This
-            // is handled in the case clauses below.
-            e.withNewChildren(e.children.map(c => optimizeAdaptiveTransitions(c, Some(e))))
-          case _ => optimizeAdaptiveTransitions(bb.child, Some(bb)) match {
-            case e: GpuBroadcastExchangeExecBase => e
-            case e: GpuShuffleExchangeExecBase => e
-            case other => GpuColumnarToRowExec(other)
-          }
+    // adaptive plan final query stage with columnar output
+    case r2c @ RowToColumnarExec(child) if parent.isEmpty =>
+      val optimizedChild = optimizeAdaptiveTransitions(child, Some(r2c))
+      val projectedChild =
+        optimizedChild.getTagValue(GpuOverrides.preRowToColProjection).map { exprs =>
+          ProjectExec(exprs, optimizedChild)
+        }.getOrElse(optimizedChild)
+      GpuRowToColumnarExec(projectedChild, TargetSize(rapidsConf.gpuTargetBatchSizeBytes))
+
+    case ColumnarToRowExec(bb: GpuBringBackToHost) =>
+      // We typically want the final operator in the plan (the operator that has no parent) to be
+      // wrapped in `ColumnarToRowExec(GpuBringBackToHost(_))` operators to
+      // bring the data back onto the host and be translated to rows so that it can be returned
+      // from the Spark API. However, in the case of AQE, each exchange operator is treated as an
+      // individual query with no parent and we need to remove these operators in this case
+      // because we need to return an operator that implements `BroadcastExchangeLike` or
+      // `ShuffleExchangeLike`.
+      bb.child match {
+        case GpuShuffleCoalesceExec(e: GpuShuffleExchangeExecBase, _) if parent.isEmpty =>
+          // The coalesce step gets added back into the plan later on, in a
+          // future query stage that reads the output from this query stage. This
+          // is handled in the case clauses below.
+          e.withNewChildren(e.children.map(c => optimizeAdaptiveTransitions(c, Some(e))))
+        case GpuCoalesceBatches(e: GpuShuffleExchangeExecBase, _) if parent.isEmpty =>
+          // The coalesce step gets added back into the plan later on, in a
+          // future query stage that reads the output from this query stage. This
+          // is handled in the case clauses below.
+          e.withNewChildren(e.children.map(c => optimizeAdaptiveTransitions(c, Some(e))))
+        case _ => optimizeAdaptiveTransitions(bb.child, Some(bb)) match {
+          case e: GpuBroadcastExchangeExecBase => e
+          case e: GpuShuffleExchangeExecBase => e
+          case other => GpuColumnarToRowExec(other)
         }
+      }
 
     case s: ShuffleQueryStageExec =>
       // When reading a materialized shuffle query stage in AQE mode, we need to insert an
@@ -174,6 +174,28 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
       // We wrap custom shuffle readers with a coalesce batches operator here.
       addPostShuffleCoalesce(e.copy(child = optimizeAdaptiveTransitions(e.child, Some(e))))
 
+    case c2re: ColumnarToRowExec if
+        SparkShimImpl.checkCToRWithExecBroadcastAQECoalPart(c2re, parent) =>
+      val shuffle = SparkShimImpl.getShuffleFromCToRWithExecBroadcastAQECoalPart(c2re)
+      shuffle match {
+        case Some(s) =>
+          /*
+           * When we find this pattern, explicitly add in the GPU columnar to row and CPU
+           * exchange for executor broadcast.
+           * Note that this likely adds some performance overhead because we end up doing
+           * an extra exchange.
+           *
+           * +- Exchange (SinglePartition, EXECUTOR_BROADCAST)
+           *     +- GpuColumnarToRow
+           *         +- GpuShuffleCoalesce
+           *             +- ShuffleQueryStage
+           *                 +- GpuColumnarExchange
+           */
+          val gpuc2r = GpuColumnarToRowExec(optimizeAdaptiveTransitions(s, Some(plan)))
+          SparkShimImpl.addExecBroadcastShuffle(gpuc2r)
+        case _ => c2re
+      }
+
     case ColumnarToRowExec(e: ShuffleQueryStageExec) =>
       val c2r = GpuColumnarToRowExec(optimizeAdaptiveTransitions(e, Some(plan)))
       SparkShimImpl.addRowShuffleToQueryStageTransitionIfNeeded(c2r, e)
@@ -264,25 +286,37 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
   }
 
   /**
-   * Removes `GpuCoalesceBatches(GpuShuffleCoalesceExec(build side))` for the build side
-   * for the shuffled hash join. The coalesce logic has been moved to the
+   * Removes `GpuCoalesceBatches(GpuShuffleCoalesceExec(build side))` from either side of a
+   * GpuShuffledSymmetricHashJoinExec since that node handles shuffled data directly. For other
+   * joins, it removes it only for the build side. The coalesce logic has been moved to the
    * `GpuShuffleCoalesceExec` class, and is handled differently to prevent holding onto the
    * GPU semaphore for stream IO.
    */
-  def shuffledHashJoinOptimizeShuffle(plan: SparkPlan): SparkPlan = plan match {
-    case x@GpuShuffledHashJoinExec(
-         _, _, _, buildSide, _,
-        left: GpuShuffleCoalesceExec,
-        GpuCoalesceBatches(GpuShuffleCoalesceExec(rc, _), _),_) if buildSide == GpuBuildRight =>
-      x.withNewChildren(
-        Seq(shuffledHashJoinOptimizeShuffle(left), shuffledHashJoinOptimizeShuffle(rc)))
-    case x@GpuShuffledHashJoinExec(
-         _, _, _, buildSide, _,
-        GpuCoalesceBatches(GpuShuffleCoalesceExec(lc, _), _),
-        right: GpuShuffleCoalesceExec, _) if buildSide == GpuBuildLeft =>
-      x.withNewChildren(
-        Seq(shuffledHashJoinOptimizeShuffle(lc), shuffledHashJoinOptimizeShuffle(right)))
-    case p => p.withNewChildren(p.children.map(shuffledHashJoinOptimizeShuffle))
+  def shuffledHashJoinOptimizeShuffle(plan: SparkPlan): SparkPlan = {
+    plan match {
+      case j: GpuShuffledSymmetricHashJoinExec =>
+        val newChildren = Seq(j.left, j.right).map {
+          case GpuCoalesceBatches(GpuShuffleCoalesceExec(c, _), _) => c
+          case GpuShuffleCoalesceExec(c, _) => c
+          case c => c
+        }.map(shuffledHashJoinOptimizeShuffle)
+        j.withNewChildren(newChildren)
+      case x@GpuShuffledHashJoinExec(
+          _, _, _, buildSide, _,
+          left: GpuShuffleCoalesceExec,
+          GpuCoalesceBatches(GpuShuffleCoalesceExec(rc, _), _),_)
+          if buildSide == GpuBuildRight && rapidsConf.shuffledHashJoinOptimizeShuffle =>
+        x.withNewChildren(
+          Seq(shuffledHashJoinOptimizeShuffle(left), shuffledHashJoinOptimizeShuffle(rc)))
+      case x@GpuShuffledHashJoinExec(
+          _, _, _, buildSide, _,
+          GpuCoalesceBatches(GpuShuffleCoalesceExec(lc, _), _),
+          right: GpuShuffleCoalesceExec, _)
+          if buildSide == GpuBuildLeft && rapidsConf.shuffledHashJoinOptimizeShuffle =>
+        x.withNewChildren(
+          Seq(shuffledHashJoinOptimizeShuffle(lc), shuffledHashJoinOptimizeShuffle(right)))
+      case p => p.withNewChildren(p.children.map(shuffledHashJoinOptimizeShuffle))
+    }
   }
 
   private def insertCoalesce(plans: Seq[SparkPlan], goals: Seq[CoalesceGoal],
@@ -764,7 +798,7 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
         }
         updatedPlan = fixupHostColumnarTransitions(updatedPlan)
         updatedPlan = optimizeCoalesce(updatedPlan)
-        if (rapidsConf.shuffledHashJoinOptimizeShuffle) {
+        if (rapidsConf.shuffledHashJoinOptimizeShuffle || rapidsConf.useShuffledSymmetricHashJoin) {
           updatedPlan = shuffledHashJoinOptimizeShuffle(updatedPlan)
         }
         if (rapidsConf.exportColumnarRdd) {
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
index 659d87fcafe..4857bde2ac0 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -627,6 +627,13 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
       .booleanConf
       .createWithDefault(true)
 
+  val USE_SHUFFLED_SYMMETRIC_HASH_JOIN = conf("spark.rapids.sql.join.useShuffledSymmetricHashJoin")
+    .doc("Use the experimental shuffled symmetric hash join designed to improve handling of large " +
+      "joins. Requires spark.rapids.sql.shuffledHashJoin.optimizeShuffle=true.")
+    .internal()
+    .booleanConf
+    .createWithDefault(false)
+
   val STABLE_SORT = conf("spark.rapids.sql.stableSort.enabled")
       .doc("Enable or disable stable sorting. Apache Spark's sorting is typically a stable " +
           "sort, but sort stability cannot be guaranteed in distributed work loads because the " +
@@ -2295,6 +2302,8 @@ class RapidsConf(conf: Map[String, String]) extends Logging {
 
   lazy val shuffledHashJoinOptimizeShuffle: Boolean = get(SHUFFLED_HASH_JOIN_OPTIMIZE_SHUFFLE)
 
+  lazy val useShuffledSymmetricHashJoin: Boolean = get(USE_SHUFFLED_SYMMETRIC_HASH_JOIN)
+
   lazy val stableSort: Boolean = get(STABLE_SORT)
 
   lazy val isFileScanPrunePartitionEnabled: Boolean = get(FILE_SCAN_PRUNE_PARTITION_ENABLED)
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala
index f37b59f709a..5cb170968a6 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala
@@ -1154,7 +1154,6 @@ abstract class BaseExprMeta[INPUT <: Expression](
   //|         Value          | needTimeZoneCheck |           isTimeZoneSupported           |
   //+------------------------+-------------------+-----------------------------------------+
   //| TimezoneAwareExpression| True              | False by default, True when implemented |
-  //| UTCTimestamp           | True              | False by default, True when implemented |
   //| Others                 | False             | N/A (will not be checked)               |
   //+------------------------+-------------------+-----------------------------------------+
   lazy val needTimeZoneCheck: Boolean = {
@@ -1171,7 +1170,6 @@ abstract class BaseExprMeta[INPUT <: Expression](
         } else{
           true
         }
-      case _: UTCTimestamp => true
       case _ => false
     }
   }
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SparkShims.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SparkShims.scala
index b3a74f1251f..8f863d2962f 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SparkShims.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SparkShims.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -141,6 +141,22 @@ trait SparkShims {
   def addRowShuffleToQueryStageTransitionIfNeeded(c2r: ColumnarToRowTransition,
       sqse: ShuffleQueryStageExec): SparkPlan = c2r
 
+  /*
+   * The following two functions are used to recognize when an executor broadcast
+   * is being used to feed into a join but a columnar to row gets inserted between
+   * the exchange and the join. This causes issues on some versions of Spark so we
+   * have to shim it.
+   */
+  def checkCToRWithExecBroadcastAQECoalPart(p: SparkPlan,
+      parent: Option[SparkPlan]): Boolean = false
+
+  def getShuffleFromCToRWithExecBroadcastAQECoalPart(p: SparkPlan): Option[SparkPlan] = None
+
+  /**
+   * If the shim doesn't support executor broadcast, just return the plan passed in
+   */
+  def addExecBroadcastShuffle(p: SparkPlan): SparkPlan = p
+
   /**
    * Walk the plan recursively and return a list of operators that match the predicate
    */
diff --git a/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala b/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala
index 4818f25343f..3169d6bc543 100644
--- a/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala
+++ b/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/datetimeExpressions.scala
@@ -874,7 +874,9 @@ abstract class GpuToTimestamp
         if (GpuOverrides.isUTCTimezone(zoneId)) {
           res
         } else {
-          GpuTimeZoneDB.fromTimestampToUtcTimestamp(res, zoneId)
+          withResource(res) { _ =>
+            GpuTimeZoneDB.fromTimestampToUtcTimestamp(res, zoneId)
+          }
         }
       case _: DateType =>
         timeZoneId match {
diff --git a/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala b/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
index 21463506f97..c211b29a9f0 100644
--- a/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -27,7 +27,6 @@
 {"spark": "324"}
 {"spark": "330"}
 {"spark": "330cdh"}
-{"spark": "330db"}
 {"spark": "331"}
 {"spark": "332"}
 {"spark": "332cdh"}
diff --git a/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala b/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
index be92a516bef..1fb974ec9bd 100644
--- a/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark311/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -27,7 +27,6 @@
 {"spark": "324"}
 {"spark": "330"}
 {"spark": "330cdh"}
-{"spark": "330db"}
 {"spark": "331"}
 {"spark": "332"}
 {"spark": "332cdh"}
diff --git a/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala b/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
index fde797c6965..064701b811c 100644
--- a/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
+++ b/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -24,7 +24,6 @@
 {"spark": "324"}
 {"spark": "330"}
 {"spark": "330cdh"}
-{"spark": "330db"}
 {"spark": "331"}
 {"spark": "332"}
 {"spark": "332cdh"}
diff --git a/sql-plugin/src/main/spark321db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala b/sql-plugin/src/main/spark321db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
index 77265121efe..6623fc8765f 100644
--- a/sql-plugin/src/main/spark321db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark321db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2021-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION.
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -19,7 +19,6 @@
 
 /*** spark-rapids-shim-json-lines
 {"spark": "321db"}
-{"spark": "330db"}
 {"spark": "332db"}
 spark-rapids-shim-json-lines ***/
 package org.apache.spark.sql.rapids.execution.python.shims
diff --git a/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/Spark330PlusDBShims.scala b/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/Spark330PlusDBShims.scala
index cb45d0fa440..0c5594b8da0 100644
--- a/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/Spark330PlusDBShims.scala
+++ b/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/Spark330PlusDBShims.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -73,15 +73,22 @@ trait Spark330PlusDBShims extends Spark321PlusDBShims {
     }
   }
 
+  /*
+   * Explicitly add in the CPU exchange for executor broadcast. Generally
+   * we expect the plan to be passed in to be a GPU columnar to row but
+   * we are not explicitly limiting it.
+   */
+  override def addExecBroadcastShuffle(p: SparkPlan): SparkPlan = {
+    ShuffleExchangeExec(SinglePartition, p, EXECUTOR_BROADCAST)
+  }
 
   override def addRowShuffleToQueryStageTransitionIfNeeded(c2r: ColumnarToRowTransition,
       sqse: ShuffleQueryStageExec): SparkPlan = {
     val plan = GpuTransitionOverrides.getNonQueryStagePlan(sqse)
     plan match {
       case shuffle: ShuffleExchangeLike if shuffle.shuffleOrigin.equals(EXECUTOR_BROADCAST) =>
-        ShuffleExchangeExec(SinglePartition, c2r, EXECUTOR_BROADCAST)
-      case _ =>
-        c2r
+        addExecBroadcastShuffle(c2r)
+      case _ => c2r
     }
   }
 }
diff --git a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
similarity index 97%
rename from sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
rename to sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
index 9c8b180f344..f229e50528a 100644
--- a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
+++ b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonOutput.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -15,6 +15,7 @@
  */
 
 /*** spark-rapids-shim-json-lines
+{"spark": "330db"}
 {"spark": "341db"}
 spark-rapids-shim-json-lines ***/
 package org.apache.spark.sql.rapids.execution.python.shims
diff --git a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
similarity index 98%
rename from sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
rename to sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
index 2912bff097c..fc79236095a 100644
--- a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -14,6 +14,7 @@
  * limitations under the License.
  */
 /*** spark-rapids-shim-json-lines
+{"spark": "330db"}
 {"spark": "341db"}
 spark-rapids-shim-json-lines ***/
 package org.apache.spark.sql.rapids.execution.python.shims
diff --git a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
similarity index 98%
rename from sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
rename to sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
index ec2c8e67664..6f657950a91 100644
--- a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuCoGroupedArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -15,6 +15,7 @@
  */
 
 /*** spark-rapids-shim-json-lines
+{"spark": "330db"}
 {"spark": "341db"}
 spark-rapids-shim-json-lines ***/
 package org.apache.spark.sql.rapids.execution.python.shims
diff --git a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
similarity index 98%
rename from sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
rename to sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
index 1a7251b33f1..ec3bde02434 100644
--- a/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
+++ b/sql-plugin/src/main/spark330db/scala/org/apache/spark/sql/rapids/execution/python/shims/GpuGroupUDFArrowPythonRunner.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2021-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION.
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -18,6 +18,7 @@
  */
 
 /*** spark-rapids-shim-json-lines
+{"spark": "330db"}
 {"spark": "341db"}
 spark-rapids-shim-json-lines ***/
 package org.apache.spark.sql.rapids.execution.python.shims
diff --git a/sql-plugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/Spark341PlusDBShims.scala b/sql-plugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/Spark341PlusDBShims.scala
index 36ffc1db926..d5f554adcee 100644
--- a/sql-plugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/Spark341PlusDBShims.scala
+++ b/sql-plugin/src/main/spark341db/scala/com/nvidia/spark/rapids/shims/Spark341PlusDBShims.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -25,8 +25,10 @@ import com.nvidia.spark.rapids.GpuOverrides.pluginSupportedOrderableSig
 import org.apache.spark.rapids.shims.GpuShuffleExchangeExec
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
-import org.apache.spark.sql.execution.{CollectLimitExec, GlobalLimitExec, SparkPlan, TakeOrderedAndProjectExec}
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.adaptive._
 import org.apache.spark.sql.execution.exchange.ENSURE_REQUIREMENTS
+import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BroadcastNestedLoopJoinExec}
 import org.apache.spark.sql.rapids.GpuV1WriteUtils.GpuEmpty2Null
 import org.apache.spark.sql.rapids.execution.python.GpuPythonUDAF
 import org.apache.spark.sql.types.StringType
@@ -171,4 +173,43 @@ trait Spark341PlusDBShims extends Spark332PlusDBShims {
   override def getExecs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]] =
     super.getExecs ++ shimExecs
 
+  /*
+   * We are looking for the pattern described below. We end up with a ColumnarToRow that feeds
+   * into a CPU broadcast hash join which is using executor broadcast. This pattern fails on
+   * Databricks because it doesn't like the ColumnarToRow feeding into the BroadcastHashJoin.
+   * Note, in most other cases where we see executor broadcast, the Exchange would be a CPU
+   * single partition exchange explicitly marked with type EXECUTOR_BROADCAST.
+   *
+   *  +- BroadcastHashJoin || BroadcastNestedLoopJoin (using executor broadcast)
+   *  ^
+   *  +- ColumnarToRow
+   *      +- AQEShuffleRead ebj (uses coalesce partitions to go to 1 partition)
+   *        +- ShuffleQueryStage
+   *            +- GpuColumnarExchange gpuhashpartitioning
+   */
+  override def checkCToRWithExecBroadcastAQECoalPart(p: SparkPlan,
+      parent: Option[SparkPlan]): Boolean = {
+    p match {
+      case ColumnarToRowExec(AQEShuffleReadExec(_: ShuffleQueryStageExec, _, _)) =>
+        parent match {
+          case Some(bhje: BroadcastHashJoinExec) if bhje.isExecutorBroadcast => true
+          case Some(bhnlj: BroadcastNestedLoopJoinExec) if bhnlj.isExecutorBroadcast => true
+          case _ => false
+        }
+      case _ => false
+    }
+  }
+
+  /*
+   * If this plan matches the checkCToRWithExecBroadcastAQECoalPart() then get the shuffle
+   * plan out so we can wrap it. This function does not check that the parent is a
+   * BroadcastHashJoin doing executor broadcast, so is expected to be called only
+   * after checkCToRWithExecBroadcastAQECoalPart().
+   */
+  override def getShuffleFromCToRWithExecBroadcastAQECoalPart(p: SparkPlan): Option[SparkPlan] = {
+    p match {
+      case ColumnarToRowExec(AQEShuffleReadExec(s: ShuffleQueryStageExec, _, _)) => Some(s)
+      case _ => None
+    }
+  }
 }
diff --git a/tools/generated_files/supportedExprs.csv b/tools/generated_files/supportedExprs.csv
index fc32f187566..2dbc386656d 100644
--- a/tools/generated_files/supportedExprs.csv
+++ b/tools/generated_files/supportedExprs.csv
@@ -224,9 +224,9 @@ GetArrayItem,S, ,None,project,ordinal,NA,S,S,S,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
 GetArrayItem,S, ,None,project,result,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS
 GetArrayStructFields,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
 GetArrayStructFields,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA
-GetJsonObject,S,`get_json_object`,None,project,json,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-GetJsonObject,S,`get_json_object`,None,project,path,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA
-GetJsonObject,S,`get_json_object`,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
+GetJsonObject,NS,`get_json_object`,This is disabled by default because escape sequences are not processed correctly; the input is not validated; and the output is not normalized the same as Spark,project,json,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
+GetJsonObject,NS,`get_json_object`,This is disabled by default because escape sequences are not processed correctly; the input is not validated; and the output is not normalized the same as Spark,project,path,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA
+GetJsonObject,NS,`get_json_object`,This is disabled by default because escape sequences are not processed correctly; the input is not validated; and the output is not normalized the same as Spark,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
 GetMapValue,S, ,None,project,map,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA
 GetMapValue,S, ,None,project,key,S,S,S,S,S,S,S,S,PS,S,S,NS,NS,NS,NS,NS,NS,NS
 GetMapValue,S, ,None,project,result,S,S,S,S,S,S,S,S,PS,S,S,S,S,NS,PS,PS,PS,NS