You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
Spark has a config in 3.5.0+ for maxNestingDepth, which defaults to 1000. Prior to this the depth was unlimited. We really want to get as close to this 1000 as possible, but without sacrificing performance. Part of the issue is that by adding more possible maximum nesting depth to the kernel we think that the performance will suffer. But we don't know by how much.
So the plan is to do some experiments with how/where the matching data for [] and {} are stored to see what the performance impact is on these in practice. We should use NVIDIA/spark-rapids#10803 to help measure this.
I can see a few options/places where we could store the data.
registers - very fast, but would likely impact the occupancy if we use too much
shared memory - quite fast still, but there is a limited amount, and can also impact occupancy
main-memory - lots of this is available, but it is slow.
We also need a way to decide at run time which of these options, or some combination of them to use.
Our goal is to be able to support at least a max nesting depth of 1000 with very little performance regression. Each nesting level needs to store 1-bit to indicate if it was for an array, or for an object, so we can validate that both sides matched. is [} should be an error.
The first step is to add a template to the json_parser/tokenizer so we can replace how these bits are stored. I think we would start off with what we have today. It is a single long value that can store up to 64-bits. This would lets us measure the performance and verify that there was no regression for the tempateing.
Next we should try and store it in shared memory. A nesting depth of 1000 needs 125 bytes per thread. Out current get_json_object_kernel uses a block_size of 512, but we could adjust that down if needed.
indicates that we can get at least 48 KiB per block, but for Turing and newer that is at least 64 KiB. 512 * 125 = 62.5 KiB. A block size of 256 would let us fit in the 48 KiB maximum without any issues, but we might be able to adjust based on the compute architecture we see. My biggest concern here is that we also know that our memory access pattern is not good, and we might be able to speed things up by copying it to shared memory instead of reading it directly from main memory. So we need to think about this too.
The if we need more then the 1000 we can store the data in main memory, and allocate some temporary space before we launch the kernel based on the configured maximum per string (instead of per thread) like with shared memory.
We could even try to do a tiered hybrid approach where the first 64-bits are in registers, then we fall back to some shared memory, and finally fall back to main memory.
I really don't know how any of these options would perform. Especially for data that does not have that deep of a nesting level. So the first thing to do it to test them so we can figure that out.
If we see that there is no good one size fits all option, that would let us run a single kernel everywhere, then we might need some heuristics to help us decide at run time.
The problem is that I don't know of any really good and cheap heuristics to run. I see three options to over estimate the maximum nesting depth needed.
I think we want to do a segmented_reduce like operation on the strings' bytes. We would COUNT the number of '[' and '{' bytes found in the data, and then we would take the maximum COUNT value found for all of the strings. The would allocate a single int/long value per string to do the count as intermediate data, and then a maximum operation to get the final estimate.
Another options is to do a segmented scan on the bytes where any [ or { is a +1, and any ] or } is a -1 while everything else is a 0. After this we would need to do a max for all of the values. In the common case this would be more accurate, but it also would use a lot more memory, and could under count the maximum if String values have ] or } characters in them.
If we really want something accurate we could try and use the DFA solution from CUDF, one the performance improvements are in I am not 100% sure how that would work though, so I think in the short term we would just do the first option, but I am nervous in practice we would always end up running with the 1000 maximum kernel anyways.
The text was updated successfully, but these errors were encountered:
revans2
changed the title
[FEA] Support multiple different kernels for JSON tokenization/parsing for different nesting depths.
[FEA] Figure out how to properly support JSON parsing with a nesting depth that goes to at least 1000
May 13, 2024
Is your feature request related to a problem? Please describe.
Spark has a config in 3.5.0+ for
maxNestingDepth
, which defaults to 1000. Prior to this the depth was unlimited. We really want to get as close to this 1000 as possible, but without sacrificing performance. Part of the issue is that by adding more possible maximum nesting depth to the kernel we think that the performance will suffer. But we don't know by how much.So the plan is to do some experiments with how/where the matching data for [] and {} are stored to see what the performance impact is on these in practice. We should use NVIDIA/spark-rapids#10803 to help measure this.
I can see a few options/places where we could store the data.
We also need a way to decide at run time which of these options, or some combination of them to use.
Our goal is to be able to support at least a max nesting depth of 1000 with very little performance regression. Each nesting level needs to store 1-bit to indicate if it was for an array, or for an object, so we can validate that both sides matched. is
[}
should be an error.The first step is to add a template to the json_parser/tokenizer so we can replace how these bits are stored. I think we would start off with what we have today. It is a single long value that can store up to 64-bits. This would lets us measure the performance and verify that there was no regression for the tempateing.
Next we should try and store it in shared memory. A nesting depth of 1000 needs 125 bytes per thread. Out current
get_json_object_kernel
uses ablock_size
of 512, but we could adjust that down if needed.https://docs.nvidia.com/cuda/cuda-c-programming-guide/#features-and-technical-specifications-technical-specifications-per-compute-capability
indicates that we can get at least 48 KiB per block, but for Turing and newer that is at least 64 KiB. 512 * 125 = 62.5 KiB. A block size of 256 would let us fit in the 48 KiB maximum without any issues, but we might be able to adjust based on the compute architecture we see. My biggest concern here is that we also know that our memory access pattern is not good, and we might be able to speed things up by copying it to shared memory instead of reading it directly from main memory. So we need to think about this too.
The if we need more then the 1000 we can store the data in main memory, and allocate some temporary space before we launch the kernel based on the configured maximum per string (instead of per thread) like with shared memory.
We could even try to do a tiered hybrid approach where the first 64-bits are in registers, then we fall back to some shared memory, and finally fall back to main memory.
I really don't know how any of these options would perform. Especially for data that does not have that deep of a nesting level. So the first thing to do it to test them so we can figure that out.
If we see that there is no good one size fits all option, that would let us run a single kernel everywhere, then we might need some heuristics to help us decide at run time.
The problem is that I don't know of any really good and cheap heuristics to run. I see three options to over estimate the maximum nesting depth needed.
I think we want to do a segmented_reduce like operation on the strings' bytes. We would COUNT the number of '[' and '{' bytes found in the data, and then we would take the maximum COUNT value found for all of the strings. The would allocate a single int/long value per string to do the count as intermediate data, and then a maximum operation to get the final estimate.
Another options is to do a segmented scan on the bytes where any
[
or{
is a +1, and any]
or}
is a -1 while everything else is a 0. After this we would need to do a max for all of the values. In the common case this would be more accurate, but it also would use a lot more memory, and could under count the maximum if String values have]
or}
characters in them.If we really want something accurate we could try and use the DFA solution from CUDF, one the performance improvements are in I am not 100% sure how that would work though, so I think in the short term we would just do the first option, but I am nervous in practice we would always end up running with the 1000 maximum kernel anyways.
The text was updated successfully, but these errors were encountered: