
[FEA] Faster path for calculating total output symbols in FST #17114

Open
karthikeyann opened this issue Oct 17, 2024 · 4 comments
@karthikeyann
Contributor

Is your feature request related to a problem? Please describe.
To create a faster path for calculating the total number of output symbols in the FST, without redundantly computing values such as the state-transition vectors and write offsets in the FST.

Also, if the impact is significant, provide a way to optimize the FST kernel so that it does not redundantly compute the total number of output symbols when a discard iterator is passed as the output.

Describe the solution you'd like
A reduce kernel to compute the total number of output symbols.
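
For illustration only, a minimal sketch of such a reduce pass, assuming a device array d_output_counts that already holds the number of output symbols produced per input symbol (all names here are hypothetical, not the cuDF FST API):

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Sum the per-symbol output counts to obtain the total number of output symbols.
void total_output_symbols(std::uint32_t const* d_output_counts,
                          int num_symbols,
                          std::uint32_t* d_total,
                          cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  // First call only queries the required temporary storage size.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_output_counts, d_total, num_symbols, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
  // Second call performs the reduction.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_output_counts, d_total, num_symbols, stream);
  cudaFreeAsync(d_temp_storage, stream);
}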

Describe alternatives you've considered
Used make_discard_iterator in #16978.
The compiler should optimize away the redundant parts, but some of them are still computed.
@elstehle showed this has a 10% runtime impact; the impact on the FST alone would be much larger.

Additional context
#16978 (comment): the benchmark shows an additional 10% impact on the overall JSON reader runtime.

@shrshi
Contributor

shrshi commented Oct 17, 2024

After discussions with @elstehle, here's a brief overview of two possible approaches considered:

  1. Approach 1: Split SimulateDFAKernel after the offsets for the translated characters are computed
    1. Write the offsets to global memory and read the last element to get the size of the translated buffer
    2. Launch another kernel that uses this global offsets array and runs the DFA to translate the input
  2. Approach 2: Split SimulateDFAKernel after the number of translated characters per thread is computed
    1. DeviceReduce to compute the total translated buffer size
    2. In the next kernel, recompute the number of translated characters per thread, i.e. run the DFA simulation again for the start states
    3. Prefix sum on the number of translated characters per thread, then run the final DFA

Though approach 2 has fewer global memory reads and writes, re-computing the number of translated characters per thread can be expensive. The plan is to start with approach 1 (sketched below) and to implement DeviceMultiStageTransduce, which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices.
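
A rough host-side sketch of approach 1 follows; the kernel names, grid configuration, and parameters are purely illustrative, not the actual SimulateDFAKernel split:

// Pass 1: run the DFA and write the per-symbol write offsets (the prefix sum
// of the output counts) to global memory; one extra slot so that the last
// element holds the total translated size.
device_uvector<SymbolOffsetT> d_out_offsets(num_in_symbols + 1, stream);
simulate_dfa_offsets_kernel<<<grid, block, 0, stream.value()>>>(
  dfa, d_in, num_in_symbols, d_out_offsets.data());

// Read back the last offset to size the translated output buffers.
SymbolOffsetT total_out = d_out_offsets.back_element(stream);
device_uvector<SymbolT> d_translated(total_out, stream);
device_uvector<SymbolOffsetT> d_translated_idx(total_out, stream);

// Pass 2: re-run the DFA, this time reading the precomputed offsets from
// global memory and writing each translated symbol to its final position.
simulate_dfa_translate_kernel<<<grid, block, 0, stream.value()>>>(
  dfa, d_in, num_in_symbols, d_out_offsets.data(), d_translated.data(), d_translated_idx.data());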

@karthikeyann
Contributor Author

For approach 2,

which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices

Instead, can we take an approach similar to CUB, where we can pass nullptr and DeviceMultiStageTransduce decides to call DeviceReduce and estimates the output size? If it is not nullptr, it will calculate the transduced output and output indices. Will this work?
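
Roughly, the call pattern being suggested would look like this (DeviceMultiStageTransduce does not exist yet; its signature and all names below are purely illustrative):

// 1st call: null output iterators, so only the total output size is computed
// (internally via a DeviceReduce over the per-thread output counts).
SymbolOffsetT num_transduced_out = 0;
DeviceMultiStageTransduce(dfa, d_in, num_in_symbols,
                          /*transduced_out_it=*/nullptr,
                          /*transduced_out_idx_it=*/nullptr,
                          &num_transduced_out, stream);

// Allocate the outputs, then call again with real iterators to write them.
device_uvector<SymbolT> d_out(num_transduced_out, stream);
device_uvector<SymbolOffsetT> d_out_idx(num_transduced_out, stream);
DeviceMultiStageTransduce(dfa, d_in, num_in_symbols,
                          d_out.data(), d_out_idx.data(),
                          &num_transduced_out, stream);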

@elstehle
Contributor

elstehle commented Oct 18, 2024

Instead, can we take an approach similar to CUB, where we can pass nullptr and DeviceMultiStageTransduce decides to call DeviceReduce and estimates the output size? If it is not nullptr, it will calculate the transduced output and output indices. Will this work?

The output iterators don't have to be pointers but can be arbitrary iterators. It will be difficult to agree on a sentinel value (like nullptr for pointers) that indicates "only return the output size but do not write to these iterators". Maybe we could do it if the output iterator types are cub::NullType.
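
For illustration, such a cub::NullType check could be made at compile time along these lines (a sketch under that assumption, not existing cuDF or CUB code):

#include <cub/cub.cuh>
#include <type_traits>

// If the caller passes cub::NullType for both output iterator types, only the
// total output size is computed; otherwise the full transduction runs.
template <typename TranslatedOutItT, typename TranslatedOutIndexItT>
void dispatch_transduce(TranslatedOutItT transduced_out_it,
                        TranslatedOutIndexItT transduced_out_idx_it /*, ... */)
{
  constexpr bool size_only = std::is_same_v<TranslatedOutItT, cub::NullType> &&
                             std::is_same_v<TranslatedOutIndexItT, cub::NullType>;
  if constexpr (size_only) {
    // Reduce per-thread output counts to obtain the total output size only.
  } else {
    // Scan write offsets and write the translated output and output indices.
  }
}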

I don't have a clear favorite. I think the advantage of the callback is that I expect it to be easier to implement, because we don't have to figure out the right way to properly re-enter the Dispatch instance after allocating the right outputs.

Here's an example function object implementation that we had in mind:

template <typename OutT, typename OutIndexT>
struct allocate_and_get_output_iterators
{
  device_uvector<OutT>& stack_operations_out;
  device_uvector<OutIndexT>& stack_operations_indexes;

  // Invoked once the total number of transduced symbols is known: resize the
  // output buffers and return the iterators that the translation pass writes to.
  // (rmm::device_uvector::resize also takes a stream; omitted here for brevity.)
  ::cuda::std::tuple<TranslatedOutItT, TranslatedOutIndexItT> operator()(OffsetT num_transduced_out)
  {
    stack_operations_out.resize(num_transduced_out);
    stack_operations_indexes.resize(num_transduced_out);
    return ::cuda::std::make_tuple(
      stack_operations_out.data(),  // or thrust::make_transform_iterator(stack_operations_out.data(), my_op_t{})
      stack_operations_indexes.data());
  }
};

Another question that came up while I was thinking about all this:

Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.

@shrshi
Contributor

shrshi commented Oct 18, 2024

Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.

IMO, we can retain the decoupled look-back approach and introduce a 'low-memory' policy that adopts this new approach, i.e., writing the output offsets to global memory.
