Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve fix for from_json function for empty data input column #8542

Closed
2 tasks
cindyyuanjiang opened this issue Jun 9, 2023 · 2 comments · Fixed by #9369
Closed
2 tasks

Improve fix for from_json function for empty data input column #8542

cindyyuanjiang opened this issue Jun 9, 2023 · 2 comments · Fixed by #9369
Assignees

Comments

@cindyyuanjiang
Copy link
Collaborator

from_json function fails when the input column contains all empty or null strings. Current fix #8526 is inefficient in terms of GPU memory usage.

This follow-up issue tracks:

@revans2
Copy link
Collaborator

revans2 commented Jun 9, 2023

Actually I don't think we want to revert the change for strip, rstrip, and lstrip. It is likely still a performance/memory win, but very very small.

What I wanted to see what a fix for the empty string replacement in from_json The current JSON parser acts differently from all of the other parsers in that it does not return the requested columns. It returns all of the columns it saw, and only uses the types passed in to match it up with the columns it saw. rapidsai/cudf#13473 describes this.

Prior to #8526 we would replace empty strings with {} so that the lines in from_json would not be stripped out. But #8526 changed it to include an entry for each column. That works, but it is not memory efficient. We added in rapidsai/cudf#13477 to CUDF to work around most of the issues, so we could stick with {} as the replacement so long as there is at least one column in one row in the batch. But that is hard to detect do we really should just replace the empty columns with something, ideally something that is in the input schema

constructEmptyRow(schema: DataType): String = schema match {
  case struct: StructType =>
    if (struct.fields.length <= 0) {
      // This needs to be fixed at a higher level and we want to test for it. The output would be a batch of only rows, no columns
      // so we should just return that. I suspect Spark on the CPU will have issues with this too? but who knows...
      throw new IllegalArgumentException()
    } else {
      s"\"${escapeFieldName(struct.head.name)}\":null"
    }
  case other =>
    throw new IllegalArgumentException(s"other is not supported as a top level type")
}

This way the replacement string data is small and we should get the results that we want.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jun 9, 2023
@sameerz sameerz assigned andygrove and unassigned cindyyuanjiang Sep 13, 2023
@sameerz sameerz added tech debt and removed feature request New feature or request labels Oct 7, 2023
@andygrove
Copy link
Contributor

Closed by #9369

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants