
[Flash Attention 2] Performance improvement #28160

Open
li-plus opened this issue Dec 20, 2023 · 3 comments

@li-plus
Contributor

li-plus commented Dec 20, 2023

Feature request

The current Flash Attention 2 integration is sub-optimal in performance because it unpads and re-pads the activations in every layer. For example, in the Llama implementation:

batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
    query_states, key_states, value_states, attention_mask, query_length
)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens

attn_output_unpad = flash_attn_varlen_func(
    query_states,
    key_states,
    value_states,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_in_batch_q,
    max_seqlen_k=max_seqlen_in_batch_k,
    dropout_p=dropout,
    softmax_scale=softmax_scale,
    causal=causal,
)

attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)

These small unpad/pad kernels keep the GPU waiting for the CPU, as shown by the visible gaps between kernels in the CUDA stream below.

[profiler trace: visible gaps between kernels in the CUDA stream]

I suggest unpadding the activations once at the very beginning (right after the word embeddings) and padding them back at the end (perhaps just before lm_head); the gaps should then disappear. A rough sketch of this flow is given below.
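For illustration only, here is a minimal sketch of the proposed flow. It assumes flash_attn's bert_padding helpers (unpad_input / pad_input, the same helpers used per-layer today; their return signature varies across flash-attn versions), and it uses a hypothetical decoder-layer interface that accepts cu_seqlens / max_seqlen directly. The names model.embed_tokens, model.layers, model.norm, and lm_head stand in for the usual Llama components; this is not the actual transformers API.

from flash_attn.bert_padding import unpad_input, pad_input


def forward_unpadded(model, lm_head, input_ids, attention_mask):
    # Hypothetical top-level forward: unpad once, run all layers packed, pad once.
    batch_size, seq_len = input_ids.shape

    hidden_states = model.embed_tokens(input_ids)  # (batch, seq, hidden)

    # Unpad a single time right after the embeddings:
    # (batch, seq, hidden) -> (total_tokens, hidden), plus the metadata that
    # flash_attn_varlen_func needs. [:4] covers flash-attn versions that
    # return extra values.
    hidden_states, indices, cu_seqlens, max_seqlen = unpad_input(hidden_states, attention_mask)[:4]

    # Every decoder layer works on the packed tokens and passes cu_seqlens /
    # max_seqlen straight through to flash_attn_varlen_func, so no per-layer
    # pad/unpad kernels are launched (this layer signature is hypothetical).
    for layer in model.layers:
        hidden_states = layer(hidden_states, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)

    hidden_states = model.norm(hidden_states)

    # Pad back a single time before the LM head:
    # (total_tokens, hidden) -> (batch, seq, hidden).
    hidden_states = pad_input(hidden_states, indices, batch_size, seq_len)
    return lm_head(hidden_states)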

Motivation

To eliminate performance overhead of flash attention 2.

Your contribution

I can write the code when I'm less busy, but probably not right now.

@amyeroberts
Collaborator

cc @ArthurZucker @younesbelkada

@younesbelkada
Contributor

Hi @li-plus
Thanks a lot for the suggestion!
@fxmarty tried the approach of padding / unpadding at the beginning of the model's forward call here: younesbelkada#5, but the implementation ended up bloating the modeling code, so it was decided not to move forward with that approach. Maybe we can revisit it. cc @ArthurZucker

@ArthurZucker
Collaborator

I think this could be revisited, given that we now have more flexibility with the cache and the attention layer. I don't have bandwidth on my side, but I'm ready to review a PR, so I'll label it as a good difficult issue!
