Improve fix for `from_json` function for empty data input column #8542

cindyyuanjiang · 2023-06-09T05:55:37Z

The `from_json` function fails when the input column contains only empty or null strings. The current fix, #8526, is inefficient in its GPU memory usage.

This follow-up issue tracks:

- Improve the fix for the `from_json` function for an empty data input column with respect to GPU memory usage
- Revert the current workaround for strip, rstrip, and lstrip once rapidsai/cudf#13533 ("Fix cudf::strings::strip for all-empty input column") is merged


revans2 · 2023-06-09T14:24:57Z

Actually I don't think we want to revert the change for strip, rstrip, and lstrip. It is likely still a performance/memory win, but very very small.

What I wanted to see was a fix for the empty string replacement in from_json. The current JSON parser acts differently from all of the other parsers in that it does not return the requested columns. It returns all of the columns it saw, and only uses the types passed in to match them up with the columns it saw. rapidsai/cudf#13473 describes this.

Prior to #8526 we would replace empty strings with {} so that the lines in from_json would not be stripped out. But #8526 changed it to include an entry for each column. That works, but it is not memory efficient. We added rapidsai/cudf#13477 to cuDF to work around most of the issues, so we could stick with {} as the replacement so long as there is at least one column in one row in the batch. But that is hard to detect, so we really should just replace the empty strings with something, ideally something that is in the input schema:

import org.apache.spark.sql.types.{DataType, StructType}

def constructEmptyRow(schema: DataType): String = schema match {
  case struct: StructType =>
    if (struct.fields.isEmpty) {
      // This needs to be fixed at a higher level, and we want to test for it. The output would be a batch with only rows and no columns,
      // so we should just return that. I suspect Spark on the CPU will have issues with this too, but who knows...
      throw new IllegalArgumentException("cannot construct an empty row for a struct with no fields")
    } else {
      // Wrap the first field in braces so the replacement is a complete JSON object.
      // escapeFieldName is assumed to be a helper defined elsewhere.
      s"""{"${escapeFieldName(struct.head.name)}":null}"""
    }
  case other =>
    throw new IllegalArgumentException(s"$other is not supported as a top level type")
}

This way the replacement string data is small and we should get the results that we want.
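To make the size difference concrete, here is a minimal sketch in plain Scala that compares the two kinds of placeholder and substitutes one into a batch of input strings. It is only an illustration, not the plugin's implementation: `Field`, `Schema`, `EmptyRowSketch`, and `fillEmptyRows` are hypothetical stand-ins so the example runs without Spark or cuDF, `escapeFieldName` is an assumed helper that handles only backslashes and double quotes, and treating whitespace-only rows as empty is also an assumption.

```scala
// Hypothetical stand-ins for Spark's StructField/StructType so the sketch runs without Spark.
final case class Field(name: String)
final case class Schema(fields: Seq[Field])

object EmptyRowSketch {
  // Assumed escaping: only backslash and double quote are handled here.
  def escapeFieldName(name: String): String =
    name.replace("\\", "\\\\").replace("\"", "\\\"")

  // Placeholder with an entry for every field (the memory-hungry variant).
  def allFieldsRow(schema: Schema): String =
    schema.fields
      .map(f => "\"" + escapeFieldName(f.name) + "\":null")
      .mkString("{", ",", "}")

  // Placeholder with an entry for the first field only (the proposal above).
  def firstFieldRow(schema: Schema): String = {
    require(schema.fields.nonEmpty, "schema must have at least one field")
    "{\"" + escapeFieldName(schema.fields.head.name) + "\":null}"
  }

  // Substitute the placeholder for null or whitespace-only rows; other rows pass through.
  def fillEmptyRows(rows: Seq[String], placeholder: String): Seq[String] =
    rows.map(r => if (r == null || r.trim.isEmpty) placeholder else r)
}
```

With a schema of many fields, the all-fields placeholder grows with the number of fields and is repeated for every empty row, while the first-field placeholder stays the size of one field name plus a few characters.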

andygrove · 2023-10-09T15:04:14Z

Closed by #9369

cindyyuanjiang added the feature request and ? - Needs Triage labels Jun 9, 2023

cindyyuanjiang self-assigned this Jun 9, 2023

cindyyuanjiang mentioned this issue Jun 9, 2023

Fix from_json function failure when input contains empty or null strings #8526

Merged

mattahrens removed the ? - Needs Triage label Jun 9, 2023

sameerz assigned andygrove and unassigned cindyyuanjiang Sep 13, 2023

andygrove mentioned this issue Oct 3, 2023

Improve JSON empty row fix to use less memory #9369

Merged

sameerz added the tech debt label and removed the feature request label Oct 7, 2023

andygrove closed this as completed Oct 10, 2023
