-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Improve GpuJsonToStructs
performance
#11560
Labels
cudf_dependency
An issue or PR with this label depends on a new feature in cudf
epic
Issue that encompasses a significant feature or body of work
feature request
New feature or request
improve
performance
A performance related task/issue
Comments
ttnghia
added
cudf_dependency
An issue or PR with this label depends on a new feature in cudf
epic
Issue that encompasses a significant feature or body of work
feature request
New feature or request
improve
performance
A performance related task/issue
labels
Oct 4, 2024
After discussion with @ttnghia, Here are the improvements planned for different sections:
|
This was referenced Oct 17, 2024
This was referenced Oct 25, 2024
rapids-bot bot
pushed a commit
to rapidsai/cudf
that referenced
this issue
Oct 29, 2024
In Spark, the `DecimalType` has a specific number of digits to represent the numbers. However, when creating a data Schema, only type and name of the column are stored, thus we lose that precision information. As such, it would be difficult to reconstruct the original decimal types from cudf's `Schema` instance. This PR adds a `precision` member variable to the `Schema` class in cudf Java, allowing it to store the precision number of the original decimal column. Partially contributes to NVIDIA/spark-rapids#11560. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #17176
rapids-bot bot
pushed a commit
to rapidsai/cudf
that referenced
this issue
Nov 8, 2024
…read_json` directly (#17180) With this PR, `Table.readJSON` will return the output from libcudf `read_json` directly without the need of reordering the columns to match with the input schema, as well as generating all-nulls columns for the ones in the input schema that do not exist in the JSON data. This is because libcudf `read_json` already does these thus we no longer have to do it. Depends on: * #17029 Partially contributes to NVIDIA/spark-rapids#11560. Closes #17002 Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #17180
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cudf_dependency
An issue or PR with this label depends on a new feature in cudf
epic
Issue that encompasses a significant feature or body of work
feature request
New feature or request
improve
performance
A performance related task/issue
The performance of our current
GpuJsonToStructs
is not good. When running the profiling, it looks like this:In the particular test case for the profiling above, the only useful work is only what to the end of the
read_json
range (just above 300ms), which is less than 50% of the entireGpuJsonToStructs
projection (>800ms). The rest are just overhead, but it consists mostly of hundreds of small kernel calls and stream syncs due to pure copying data from the intermediate result to the final output.We can do a lot better by reducing the unnecessary overhead, or improving them by a way that they can run in a much less time. If we divide the runtime of
GpuJsonToStructs
into sections:The improvement can be done by the following tasks:
cudf::read_json
. This needs help from cudf team. Beyond that, we can improve the performance of this section with some auxiliary work:concat_json
to join JSON strings given by strings column spark-rapids-jni#2457JSONUtils.concatenateJsonStrings
for concatenating JSON strings #11549cudf::read_json
into the output structs column with the desired read schema. Currently, this process may need to copy a lot of columns from the output table ofcudf::read_json
(hundreds columns), which is a significant overhead. We can see it from the profiling of this section. We can just move them instead. This can be achieved by:cudf::read_json
rapidsai/cudf#17002read_json
need to follow depth-first-search order as in the input schema rapidsai/cudf#17090read_json
should output all-nulls columns for the schema columns that do not exist in the input rapidsai/cudf#17091from_json_to_structs
spark-rapids-jni#2510.The text was updated successfully, but these errors were encountered: