Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON tokenizer memory optimizations #16978

Merged
merged 9 commits into from
Oct 23, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Oct 2, 2024

Description

The full push-down automata that tokenizes the input JSON string, as well as the bracket-brace FST over-estimates the total buffer size required for the translated output and indices. This PR splits the transduce calls for both FSTs into two invocations. The first invocation estimates the size of the translated buffer and the translated indices, and the second call performs the DFA run.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 2, 2024
@shrshi shrshi added cuIO cuIO issue Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 5, 2024
@shrshi
Copy link
Contributor Author

shrshi commented Oct 5, 2024

Benchmark used for profiling study: https://github.com/karthikeyann/cudf/blob/enh-profile_memusage_json/wm_benchmark.py

Profiles before and after optimization:
image
image

We see that the peak memory usage comes down from 20.8GiB to 10.3GiB and the runtime of get_token_stream also reduces from 1.028s to 825.578ms

@shrshi shrshi marked this pull request as ready for review October 5, 2024 01:12
@shrshi shrshi requested a review from a team as a code owner October 5, 2024 01:12
@karthikeyann karthikeyann requested a review from elstehle October 7, 2024 19:45
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the benchmarks. The benchmarks have Out of memory allocation error for integral types.
./benchmarks/JSON_READER_NVBENCH --profile --benchmark json_read_data_type --axis data_type=INTEGRAL
I debugged further and noticed that, propagate_writes_scan_bytes in

propagate_writes_scan_bytes,
is a very large number 18446744073707336447
Here are some numbers to debug further

json_in.size() = 2167967896
stack_ops_bufsize: 4922600
num_stack_ops: 4922600
stack_level_scan_bytes: 11519
stack_level_sort_bytes: 608511
match_level_scan_bytes: 11519
propagate_writes_scan_bytes: 18446744073707336447
num_symbols_in: 4922600
num_symbols_out: 2167967896
total_temp_storage_bytes: 18446744073707336447 (FAILURE is in this allocation)

@karthikeyann
Copy link
Contributor

Debugged further: This issue is unrelated to this PR.
allocation goes to negative since in32_t > 2**31 is made negative.
json_in.size() = 2167967896 is more than 2 GiB (2.019 GiB).
We should have safe guards to prevent >2GiB passed to read_json due to int32_t limitations in cub algorithms.
created issue #17017

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Contributor

@lamarrr lamarrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@shrshi
Copy link
Contributor Author

shrshi commented Oct 11, 2024

Debugged further: This issue is unrelated to this PR. allocation goes to negative since in32_t > 2**31 is made negative. json_in.size() = 2167967896 is more than 2 GiB (2.019 GiB). We should have safe guards to prevent >2GiB passed to read_json due to int32_t limitations in cub algorithms. created issue #17017

PR #17057 has been opened to resolve this overflow error in the benchmarks. To summarize the fix, we have enabled batching only for JSON lines inputs, not regular JSON. So running the benchmark for JSON inputs of sizes greater than 2GiB causes overflow bugs in the parser.

@elstehle
Copy link
Contributor

The changes look good to me. I am just concerned about the potential performance downside we get from repeatedly running the FSTs and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially for scenarios when the performance downside of redundant computation would not be compensated by the reduced allocation we get with the changes of this PR.

For the end-to-end runs on a V100 for JSON_READER_NVBENCH.nested_json_gpu_parser_depth, I get the following performance numbers:


|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|   2^1   |     2^20      |   2.848 ms |      10.80% |   2.964 ms |      10.46% | 115.818 us |   4.07% |   PASS   |
|   2^2   |     2^20      |   2.817 ms |       8.62% |   2.952 ms |       8.29% | 135.483 us |   4.81% |   PASS   |
|   2^3   |     2^20      |   8.325 ms |       4.53% |   8.367 ms |       4.19% |  41.742 us |   0.50% |   PASS   |
|   2^4   |     2^20      |  10.352 ms |       3.47% |  10.475 ms |       3.89% | 122.863 us |   1.19% |   PASS   |
|   2^1   |     2^22      |   3.707 ms |       4.67% |   3.977 ms |       3.27% | 269.364 us |   7.27% |   FAIL   |
|   2^2   |     2^22      |   3.694 ms |       4.71% |   3.982 ms |       4.77% | 288.939 us |   7.82% |   FAIL   |
|   2^3   |     2^22      |   9.137 ms |       2.21% |   9.411 ms |       2.80% | 273.779 us |   3.00% |   FAIL   |
|   2^4   |     2^22      |  11.413 ms |       2.87% |  11.683 ms |       2.62% | 269.444 us |   2.36% |   PASS   |
|   2^1   |     2^24      |   7.800 ms |       2.31% |   8.490 ms |       0.46% | 689.604 us |   8.84% |   FAIL   |
|   2^2   |     2^24      |   7.805 ms |       2.38% |   8.512 ms |       2.10% | 707.728 us |   9.07% |   FAIL   |
|   2^3   |     2^24      |  12.846 ms |       1.73% |  13.629 ms |       1.94% | 782.743 us |   6.09% |   FAIL   |
|   2^4   |     2^24      |  16.278 ms |       1.88% |  17.111 ms |       1.46% | 832.984 us |   5.12% |   FAIL   |
|   2^1   |     2^26      |  23.387 ms |       0.31% |  25.726 ms |       0.33% |   2.339 ms |  10.00% |   FAIL   |
|   2^2   |     2^26      |  23.339 ms |       0.41% |  25.718 ms |       0.20% |   2.380 ms |  10.20% |   FAIL   |
|   2^3   |     2^26      |  28.909 ms |       0.89% |  31.439 ms |       1.14% |   2.530 ms |   8.75% |   FAIL   |
|   2^4   |     2^26      |  37.511 ms |       1.30% |  39.965 ms |       0.45% |   2.454 ms |   6.54% |   FAIL   |
|   2^1   |     2^28      |  85.176 ms |       0.08% |  94.190 ms |       0.09% |   9.014 ms |  10.58% |   FAIL   |
|   2^2   |     2^28      |  85.214 ms |       0.50% |  94.241 ms |       0.06% |   9.027 ms |  10.59% |   FAIL   |
|   2^3   |     2^28      | 110.835 ms |       0.22% | 119.999 ms |       0.49% |   9.164 ms |   8.27% |   FAIL   |
|   2^4   |     2^28      | 142.120 ms |       0.25% | 151.918 ms |       0.41% |   9.798 ms |   6.89% |   FAIL   |

So, for larger inputs, a performance regression in the ballpark of 10% on the end-to-end timings. I just want to make sure that lowering the memory capacity is pressing enough for us to take this performance penalty for the time being?

@lamarrr lamarrr self-requested a review October 14, 2024 06:00
@karthikeyann
Copy link
Contributor

The changes look good to me. I am just concerned about the potential performance downside we get from repeatedly running the FSTs and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially for scenarios when the performance downside of redundant computation would not be compensated by the reduced allocation we get with the changes of this PR.

So, for larger inputs, a performance regression in the ballpark of 10% on the end-to-end timings. I just want to make sure that lowering the memory capacity is pressing enough for us to take this performance penalty for the time being?

@elstehle I created the issue #17114 for calculating total output values of FST without redundant calculations. Assigned to you.

@shrshi
Copy link
Contributor Author

shrshi commented Oct 22, 2024

The changes look good to me. I am just concerned about the potential performance downside we get from repeatedly running the FSTs and redundantly computing some values, such as the state-transition vectors and write offsets in the FST, especially for scenarios when the performance downside of redundant computation would not be compensated by the reduced allocation we get with the changes of this PR.

For the end-to-end runs on a V100 for JSON_READER_NVBENCH.nested_json_gpu_parser_depth, I get the following performance numbers:


|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|   2^1   |     2^20      |   2.848 ms |      10.80% |   2.964 ms |      10.46% | 115.818 us |   4.07% |   PASS   |
|   2^2   |     2^20      |   2.817 ms |       8.62% |   2.952 ms |       8.29% | 135.483 us |   4.81% |   PASS   |
|   2^3   |     2^20      |   8.325 ms |       4.53% |   8.367 ms |       4.19% |  41.742 us |   0.50% |   PASS   |
|   2^4   |     2^20      |  10.352 ms |       3.47% |  10.475 ms |       3.89% | 122.863 us |   1.19% |   PASS   |
|   2^1   |     2^22      |   3.707 ms |       4.67% |   3.977 ms |       3.27% | 269.364 us |   7.27% |   FAIL   |
|   2^2   |     2^22      |   3.694 ms |       4.71% |   3.982 ms |       4.77% | 288.939 us |   7.82% |   FAIL   |
|   2^3   |     2^22      |   9.137 ms |       2.21% |   9.411 ms |       2.80% | 273.779 us |   3.00% |   FAIL   |
|   2^4   |     2^22      |  11.413 ms |       2.87% |  11.683 ms |       2.62% | 269.444 us |   2.36% |   PASS   |
|   2^1   |     2^24      |   7.800 ms |       2.31% |   8.490 ms |       0.46% | 689.604 us |   8.84% |   FAIL   |
|   2^2   |     2^24      |   7.805 ms |       2.38% |   8.512 ms |       2.10% | 707.728 us |   9.07% |   FAIL   |
|   2^3   |     2^24      |  12.846 ms |       1.73% |  13.629 ms |       1.94% | 782.743 us |   6.09% |   FAIL   |
|   2^4   |     2^24      |  16.278 ms |       1.88% |  17.111 ms |       1.46% | 832.984 us |   5.12% |   FAIL   |
|   2^1   |     2^26      |  23.387 ms |       0.31% |  25.726 ms |       0.33% |   2.339 ms |  10.00% |   FAIL   |
|   2^2   |     2^26      |  23.339 ms |       0.41% |  25.718 ms |       0.20% |   2.380 ms |  10.20% |   FAIL   |
|   2^3   |     2^26      |  28.909 ms |       0.89% |  31.439 ms |       1.14% |   2.530 ms |   8.75% |   FAIL   |
|   2^4   |     2^26      |  37.511 ms |       1.30% |  39.965 ms |       0.45% |   2.454 ms |   6.54% |   FAIL   |
|   2^1   |     2^28      |  85.176 ms |       0.08% |  94.190 ms |       0.09% |   9.014 ms |  10.58% |   FAIL   |
|   2^2   |     2^28      |  85.214 ms |       0.50% |  94.241 ms |       0.06% |   9.027 ms |  10.59% |   FAIL   |
|   2^3   |     2^28      | 110.835 ms |       0.22% | 119.999 ms |       0.49% |   9.164 ms |   8.27% |   FAIL   |
|   2^4   |     2^28      | 142.120 ms |       0.25% | 151.918 ms |       0.41% |   9.798 ms |   6.89% |   FAIL   |

So, for larger inputs, a performance regression in the ballpark of 10% on the end-to-end timings. I just want to make sure that lowering the memory capacity is pressing enough for us to take this performance penalty for the time being?

From offline discussions, the current plan is to accept the end-to-end performance regression in favor of the reduction in memory footprint. Next steps would be to introduce a multi-stage FST implementation (#17114) that separates the tranduced buffer size estimation and final DFA run to remove the performance downside of redundantly running FSTs,

@shrshi
Copy link
Contributor Author

shrshi commented Oct 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit cff1296 into rapidsai:branch-24.12 Oct 23, 2024
102 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants