
Add date and timestamp support to to_json [databricks] #9600

Merged
merged 26 commits into NVIDIA:branch-23.12 from to_json_dates on Nov 29, 2023

Conversation

andygrove
Contributor

@andygrove andygrove commented Nov 1, 2023

Closes #9515
Part of #9518

This PR adds support for date and timestamp in to_json.

Changes:

  • Allow date and timestamp in GpuOverrides
  • Check the JSON options for date/timestamp formats and fall back to CPU if they are not the defaults (see the sketch after this list; there is a follow-on issue, [FEA][JSON] Support custom date and timestamp formats in to_json #9602, for supporting custom formats, but I am not sure what its priority is)
  • Add quotes around date and timestamp values
  • Implement a new method for formatting timestamps for JSON
  • Update tests and docs
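
A minimal, standalone sketch of that default-format check (the object and helper names here are illustrative, not the actual GpuOverrides code; the default timestamp pattern is the one quoted later in this conversation, and the date pattern assumes Spark's yyyy-MM-dd default):

object ToJsonFormatCheck {
  // Default patterns Spark applies when no explicit format option is supplied.
  private val DefaultTimestampFormat = "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]"
  private val DefaultDateFormat = "yyyy-MM-dd"

  /** Returns a reason to fall back to the CPU, or None when the defaults are in use. */
  def fallbackReason(options: Map[String, String]): Option[String] = {
    val timestampFormat = options.getOrElse("timestampFormat", DefaultTimestampFormat)
    val dateFormat = options.getOrElse("dateFormat", DefaultDateFormat)
    if (timestampFormat != DefaultTimestampFormat) {
      Some(s"timestampFormat '$timestampFormat' is not supported on the GPU")
    } else if (dateFormat != DefaultDateFormat) {
      Some(s"dateFormat '$dateFormat' is not supported on the GPU")
    } else {
      None
    }
  }
}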

@andygrove andygrove self-assigned this Nov 1, 2023
@andygrove andygrove changed the title [WIP] Add date and timestamp support to to_json Add date and timestamp support to to_json Nov 1, 2023
Signed-off-by: Andy Grove <[email protected]>
@andygrove andygrove marked this pull request as ready for review November 1, 2023 21:55
@sameerz sameerz added the feature request New feature or request label Nov 2, 2023
@andygrove andygrove changed the title Add date and timestamp support to to_json Add date and timestamp support to to_json [databricks] Nov 14, 2023
@andygrove
Contributor Author

build

@andygrove
Contributor Author

Build failed with:

2023-11-14T17:22:13.3676904Z [2023-11-14T16:58:19.497Z] - convert large InternalRow iterator to cached batch single col *** FAILED ***
2023-11-14T17:22:13.3678719Z [2023-11-14T16:58:19.497Z]   java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-203-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found

@andygrove
Contributor Author

build

2 similar comments
@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@@ -628,7 +628,8 @@ def test_read_case_col_name(spark_tmp_path, v1_enabled_list, col_name):
     pytest.param(True, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/9517')),
     False
 ])
-def test_structs_to_json(spark_tmp_path, data_gen, ignore_null_fields, pretty):
+@pytest.mark.parametrize('timezone', ['UTC', 'Etc/UTC'])
+def test_structs_to_json(spark_tmp_path, data_gen, ignore_null_fields, pretty, timezone):
Collaborator

Can we add a fallback test for non-UTC?

-    pytest.param(date_gen, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/9515')),
-    pytest.param(timestamp_gen, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/9515')),
+    date_gen,
+    timestamp_gen,
Collaborator

Can we add some fallback tests for the options timestampFormat and dateFormat, and for the case where "spark.sql.legacy.timeParserPolicy" is set to LEGACY? From the code, all of these impact the format that is used to write the JSON output.

if (hasTimestamps) {
// check if the default format is being used
val defaultFormat = "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]"
val timestampFormat = a.options.getOrElse("timestampFormat", defaultFormat)
Collaborator

I don't think this does 100% of what we want

https://github.com/apache/spark/blob/7120e6b88f2327ffb71c4bca14b10b15aeb26c32/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L125-L130

The linked Spark code includes a check for the legacyTimeParserPolicy. Either we need to use JSONOptions directly to get the format that we should use, or we need to replicate its behavior exactly. I would prefer using JSONOptions if possible, even through a trampoline, because it lets us know exactly what the format is, even if it changes from one version of Spark to another.
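
A minimal sketch of the trampoline idea, assuming a Spark version whose private[sql] JSONOptions exposes dateFormatInWrite and timestampFormatInWrite (as recent Spark 3.x releases do); the package and object names here are hypothetical:

// Living under an org.apache.spark.sql package gives access to the private[sql]
// JSONOptions class, so we can read the exact write formats Spark will use,
// including the legacyTimeParserPolicy-dependent default for timestamps.
package org.apache.spark.sql.rapids

import org.apache.spark.sql.catalyst.json.JSONOptions

object GpuJsonWriteOptions {
  /** Returns (dateFormatInWrite, timestampFormatInWrite) for the given to_json options. */
  def writeFormats(options: Map[String, String], sessionTimeZone: String): (String, String) = {
    val jsonOptions = new JSONOptions(options, sessionTimeZone, "")
    (jsonOptions.dateFormatInWrite, jsonOptions.timestampFormatInWrite)
  }
}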

@andygrove andygrove marked this pull request as draft November 20, 2023 19:57
@andygrove andygrove marked this pull request as ready for review November 20, 2023 23:34
@andygrove
Contributor Author

build

@andygrove
Contributor Author

@revans2 I have addressed the feedback now.

rule: DataFromReplacementRule
) extends UnaryExprMeta[StructsToJson](expr, conf, parent, rule) {

lazy val supportedTimezones = Seq("UTC", "Etc/UTC").map(ZoneId.of)
Collaborator

nit: We can move this next to the existing definition in GpuOverrides:

val UTC_TIMEZONE_ID = ZoneId.of("UTC").normalized()

lazy val supportedTimezones = Seq("UTC", "Etc/UTC").map(ZoneId.of)

override def tagExprForGpu(): Unit = {
if (expr.options.get("pretty").exists(_.equalsIgnoreCase("true"))) {
Collaborator

Just FYI: #9719 touches the timezone check a little, given that StructsToJson is also a TimeZoneAwareExpression.

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

There is a failure after upmerging. I am looking into it.

@andygrove
Contributor Author

@revans2 Could you take another look when you get a chance?

@andygrove
Contributor Author

build

revans2
revans2 previously approved these changes Nov 28, 2023
Collaborator

@revans2 revans2 left a comment


We need to be careful to coordinate merging this with the other timezone work being done, but I don't see anything here as a blocker.

@@ -438,6 +438,7 @@ object GpuOverrides extends Logging {
     "the Unicode version used by cuDF and the JVM may differ, resulting in some " +
     "corner-case characters not changing case correctly."
   val UTC_TIMEZONE_ID = ZoneId.of("UTC").normalized()
+  val SUPPORTED_TIMEZONE_IDS = Seq("UTC", "Etc/UTC").map(ZoneId.of)
Collaborator

Is there a reason we need to have an explicit allow list for specific ZoneIds instead of normalizing them and comparing them to UTC_TIMEZONE_ID?
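
For reference, a minimal sketch of the normalization approach being suggested here (illustrative names only, not necessarily the exact change that was pushed):

import java.time.ZoneId

object TimezoneCheck {
  // ZoneId.normalized() resolves fixed-offset aliases such as "Etc/UTC" and "UTC"
  // to the same ZoneOffset, so a single comparison replaces the explicit allow list.
  val UTC_TIMEZONE_ID: ZoneId = ZoneId.of("UTC").normalized()

  def isUtc(zoneId: ZoneId): Boolean = zoneId.normalized() == UTC_TIMEZONE_ID

  def main(args: Array[String]): Unit = {
    println(isUtc(ZoneId.of("Etc/UTC")))           // true
    println(isUtc(ZoneId.of("America/New_York")))  // false
  }
}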

Contributor Author

Thanks. I wasn't familiar with ZoneId, but that makes sense. I have pushed that fix.

@andygrove
Contributor Author

build

@andygrove andygrove merged commit 7efcb81 into NVIDIA:branch-23.12 Nov 29, 2023
37 checks passed
@andygrove andygrove deleted the to_json_dates branch November 29, 2023 20:50
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support temporal types in to_json
5 participants