-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Faster path for calculating total output symbols in FST #17114
Comments
After discussions with @elstehle, here's a brief overview of two possible approaches considered:
Though approach 2 has fewer global memory reads and writes, re-computing the number of translated characters per thread can be expensive. The plan is to start with approach 1 and implement |
For approach 2,
Instead, can we take approach similar to cub, where we can pass |
The output iterators don't have to be pointers but can be arbitrary iterators. It will be difficult to agree on a sentinel value (like I don't have a clear favorite. I think the advantage of the callback is that I expect it to be easier to implement, because we don't have to figure out the right way to properly re-enter the Here's an example function object implementation that we had in mind:
Another question that came up while I was thinking about all this: Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this? |
IMO, I think we can retain the decoupled look-back approach and introduce a 'low-memory' policy that adopts this new approach i.e. writing output offsets to global memory. |
Is your feature request related to a problem? Please describe.
To create a faster path for calculating total output symbols in FST, without redundantly computing some values, such as the state-transition vectors and write offsets in the FST.
Also, if the impact is significant, provide a way to optimize the FST kernel not to redundantly compute number of total output symbols, if it was passed with discard iterator.
Describe the solution you'd like
A reduce kernel to compute total number of output symbols.
Describe alternatives you've considered
used make_discard_iterator in #16978
The compiler should optimize and remove redundant parts, But still some redundant parts are still computed.
@elstehle shows it has 10% runtime impact. Impact on just FST alone would be much more.
Additional context
#16978 (comment) Benchmark shows addtional 10% impact on overall JSON reader runtime.
The text was updated successfully, but these errors were encountered: