
[FEA] Faster path for calculating total output symbols in FST #17114

Open
karthikeyann opened this issue Oct 17, 2024 · 4 comments
@karthikeyann
Contributor

Is your feature request related to a problem? Please describe.
To create a faster path for calculating the total number of output symbols in the FST, without redundantly computing values such as the state-transition vectors and write offsets in the FST.

Also, if the impact is significant, provide a way to optimize the FST kernel so that it does not redundantly compute the total number of output symbols when a discard iterator is passed as the output.

Describe the solution you'd like
A reduce kernel to compute the total number of output symbols.
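
For illustration only, a minimal sketch of such a reduce pass, assuming a device array d_output_counts that already holds the number of output symbols produced per input symbol (all names here are hypothetical, not the cuDF FST API):

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Sum the per-symbol output counts to obtain the total number of output symbols.
void total_output_symbols(std::uint32_t const* d_output_counts,
                          int num_symbols,
                          std::uint32_t* d_total,
                          cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  // First call only queries the required temporary storage size.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_output_counts, d_total, num_symbols, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
  // Second call performs the reduction.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_output_counts, d_total, num_symbols, stream);
  cudaFreeAsync(d_temp_storage, stream);
}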

Describe alternatives you've considered
Used make_discard_iterator in #16978.
The compiler should optimize away the redundant parts, but some of them are still computed.
@elstehle showed this has a 10% runtime impact; the impact on the FST alone would be much larger.

Additional context
#16978 (comment): the benchmark shows an additional 10% impact on the overall JSON reader runtime.

@shrshi
Contributor

shrshi commented Oct 17, 2024

After discussions with @elstehle, here's a brief overview of two possible approaches considered:

  1. Approach 1: Split SimulateDFAKernel after the offsets for the translated characters are computed
    1. Write the offsets to global memory and read the last element to get the size of the translated buffer
    2. Launch another kernel that uses this global offsets array and runs the DFA to translate the input
  2. Approach 2: Split SimulateDFAKernel after the number of translated characters per thread is computed
    1. DeviceReduce to compute the total translated buffer size
    2. In the next kernel, recompute the number of translated characters per thread, i.e. run the DFA simulation again for the start states
    3. Prefix sum on the number of translated characters per thread, then run the final DFA

Though approach 2 has fewer global memory reads and writes, re-computing the number of translated characters per thread can be expensive. The plan is to start with approach 1 (sketched below) and to implement DeviceMultiStageTransduce, which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices.
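
A rough host-side sketch of approach 1 follows; the kernel names, grid configuration, and parameters are purely illustrative, not the actual SimulateDFAKernel split:

// Pass 1: run the DFA and write the per-symbol write offsets (the prefix sum
// of the output counts) to global memory; one extra slot so that the last
// element holds the total translated size.
device_uvector<SymbolOffsetT> d_out_offsets(num_in_symbols + 1, stream);
simulate_dfa_offsets_kernel<<<grid, block, 0, stream.value()>>>(
  dfa, d_in, num_in_symbols, d_out_offsets.data());

// Read back the last offset to size the translated output buffers.
SymbolOffsetT total_out = d_out_offsets.back_element(stream);
device_uvector<SymbolT> d_translated(total_out, stream);
device_uvector<SymbolOffsetT> d_translated_idx(total_out, stream);

// Pass 2: re-run the DFA, this time reading the precomputed offsets from
// global memory and writing each translated symbol to its final position.
simulate_dfa_translate_kernel<<<grid, block, 0, stream.value()>>>(
  dfa, d_in, num_in_symbols, d_out_offsets.data(), d_translated.data(), d_translated_idx.data());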

@karthikeyann
Contributor Author

For approach 2,

which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices

Instead, can we take an approach similar to CUB, where we can pass nullptr and DeviceMultiStageTransduce decides to call DeviceReduce and estimates the output size? If it is not nullptr, it will calculate the transduced output and output indices. Will this work?
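
Roughly, the call pattern being suggested would look like this (DeviceMultiStageTransduce does not exist yet; its signature and all names below are purely illustrative):

// 1st call: null output iterators, so only the total output size is computed
// (internally via a DeviceReduce over the per-thread output counts).
SymbolOffsetT num_transduced_out = 0;
DeviceMultiStageTransduce(dfa, d_in, num_in_symbols,
                          /*transduced_out_it=*/nullptr,
                          /*transduced_out_idx_it=*/nullptr,
                          &num_transduced_out, stream);

// Allocate the outputs, then call again with real iterators to write them.
device_uvector<SymbolT> d_out(num_transduced_out, stream);
device_uvector<SymbolOffsetT> d_out_idx(num_transduced_out, stream);
DeviceMultiStageTransduce(dfa, d_in, num_in_symbols,
                          d_out.data(), d_out_idx.data(),
                          &num_transduced_out, stream);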

@elstehle
Contributor

elstehle commented Oct 18, 2024

Instead, can we take an approach similar to CUB, where we can pass nullptr and DeviceMultiStageTransduce decides to call DeviceReduce and estimates the output size? If it is not nullptr, it will calculate the transduced output and output indices. Will this work?

The output iterators don't have to be pointers but can be arbitrary iterators. It will be difficult to agree on a sentinel value (like nullptr for pointers) that indicates "only return the output size but do not write to these iterators". Maybe we could do it if the output iterator types are cub::NullType.
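
For illustration, such a cub::NullType check could be made at compile time along these lines (a sketch under that assumption, not existing cuDF or CUB code):

#include <cub/cub.cuh>
#include <type_traits>

// If the caller passes cub::NullType for both output iterator types, only the
// total output size is computed; otherwise the full transduction runs.
template <typename TranslatedOutItT, typename TranslatedOutIndexItT>
void dispatch_transduce(TranslatedOutItT transduced_out_it,
                        TranslatedOutIndexItT transduced_out_idx_it /*, ... */)
{
  constexpr bool size_only = std::is_same_v<TranslatedOutItT, cub::NullType> &&
                             std::is_same_v<TranslatedOutIndexItT, cub::NullType>;
  if constexpr (size_only) {
    // Reduce per-thread output counts to obtain the total output size only.
  } else {
    // Scan write offsets and write the translated output and output indices.
  }
}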

I don't have a clear favorite. I think the advantage of the callback is that I expect it to be easier to implement, because we don't have to figure out the right way to properly re-enter the Dispatch instance after allocating the right outputs.

Here's an example function object implementation that we had in mind:

template <typename OutT, typename OutIndexT>
struct allocate_and_get_output_iterators
{
  device_uvector<OutT>& stack_operations_out;
  device_uvector<OutIndexT>& stack_operations_indexes;

  // Invoked once the total number of transduced symbols is known: resize the
  // output buffers and return the iterators that the translation pass writes to.
  // (rmm::device_uvector::resize also takes a stream; omitted here for brevity.)
  ::cuda::std::tuple<TranslatedOutItT, TranslatedOutIndexItT> operator()(OffsetT num_transduced_out)
  {
    stack_operations_out.resize(num_transduced_out);
    stack_operations_indexes.resize(num_transduced_out);
    return ::cuda::std::make_tuple(
      stack_operations_out.data(),  // or thrust::make_transform_iterator(stack_operations_out.data(), my_op_t{})
      stack_operations_indexes.data());
  }
};

Another question that came up while I was thinking about all this:

Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.

@shrshi
Contributor

shrshi commented Oct 18, 2024

Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.

IMO, we can retain the decoupled look-back approach and introduce a 'low-memory' policy that adopts this new approach, i.e., writing the output offsets to global memory.
