-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON tokenizer memory optimizations #16978
Conversation
Benchmark used for profiling study: https://github.com/karthikeyann/cudf/blob/enh-profile_memusage_json/wm_benchmark.py Profiles before and after optimization: We see that the peak memory usage comes down from 20.8GiB to 10.3GiB and the runtime of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the benchmarks. The benchmarks have Out of memory allocation error for integral types.
./benchmarks/JSON_READER_NVBENCH --profile --benchmark json_read_data_type --axis data_type=INTEGRAL
I debugged further and noticed that, propagate_writes_scan_bytes
in
cudf/cpp/src/io/fst/logical_stack.cuh
Line 444 in bcf9425
propagate_writes_scan_bytes, |
18446744073707336447
Here are some numbers to debug further
json_in.size() = 2167967896
stack_ops_bufsize: 4922600
num_stack_ops: 4922600
stack_level_scan_bytes: 11519
stack_level_sort_bytes: 608511
match_level_scan_bytes: 11519
propagate_writes_scan_bytes: 18446744073707336447
num_symbols_in: 4922600
num_symbols_out: 2167967896
total_temp_storage_bytes: 18446744073707336447 (FAILURE is in this allocation)
Debugged further: This issue is unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
PR #17057 has been opened to resolve this overflow error in the benchmarks. To summarize the fix, we have enabled batching only for JSON lines inputs, not regular JSON. So running the benchmark for JSON inputs of sizes greater than 2GiB causes overflow bugs in the parser. |
The changes look good to me. I am just concerned about the potential performance downside we get from repeatedly running the FSTs and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially for scenarios when the performance downside of redundant computation would not be compensated by the reduced allocation we get with the changes of this PR. For the end-to-end runs on a V100 for
So, for larger inputs, a performance regression in the ballpark of 10% on the end-to-end timings. I just want to make sure that lowering the memory capacity is pressing enough for us to take this performance penalty for the time being? |
@elstehle I created the issue #17114 for calculating total output values of FST without redundant calculations. Assigned to you. |
From offline discussions, the current plan is to accept the end-to-end performance regression in favor of the reduction in memory footprint. Next steps would be to introduce a multi-stage FST implementation (#17114) that separates the tranduced buffer size estimation and final DFA run to remove the performance downside of redundantly running FSTs, |
/merge |
Description
The full push-down automata that tokenizes the input JSON string, as well as the bracket-brace FST over-estimates the total buffer size required for the translated output and indices. This PR splits the
transduce
calls for both FSTs into two invocations. The first invocation estimates the size of the translated buffer and the translated indices, and the second call performs the DFA run.Checklist