With FSDP2 and transformer block compile, torch.compile saves both the SDPA output and the contiguous transposed tensor for backward:

torchtitan/torchtitan/models/llama/model.py, lines 210 to 213 in 7e93822
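For context, the referenced lines are the tail of `Attention.forward`. Here is a minimal runnable paraphrase (shapes and names are assumptions for illustration; see the permalink above for the exact code):

```python
import torch
import torch.nn.functional as F

bs, seq_len, n_heads, head_dim = 2, 16, 4, 8    # assumed toy shapes
dim = n_heads * head_dim
wo = torch.nn.Linear(dim, dim, bias=False)      # output projection
xq = xk = xv = torch.randn(bs, n_heads, seq_len, head_dim, requires_grad=True)

out = F.scaled_dot_product_attention(xq, xk, xv, is_causal=True)
# ^ the SDPA output: (bs, n_heads, seq_len, head_dim) -- saved in both setups
out_t = out.transpose(1, 2).contiguous()
# ^ the contiguous transposed tensor: (bs, seq_len, n_heads, head_dim)
#   -- the extra tensor saved under FSDP2 + transformer block compile
wo_input = out_t.view(bs, seq_len, dim)         # (bs, seq_len, dim): the input to wo
y = wo(wo_input)
```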
However, with simpleFSDP and full model compile, torch.compile only saves the SDPA output. This means that FSDP2 saves an extra (bs, seq_len, dim) tensor per transformer block.

Traditionally, the SDPA output is required for the SDPA backward, and the input to wo is required for the wo backward. However, it may be profitable memory-wise to recompute one from the other (e.g. recompute the SDPA output by undoing the transpose of the wo input), as sketched below.
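A minimal sketch of that recompute direction (shapes and names are assumptions for illustration): since `view` and `transpose` both return views, the SDPA output layout can be recovered from the saved wo input essentially for free.

```python
import torch

bs, seq_len, n_heads, head_dim = 2, 16, 4, 8    # assumed toy shapes

# Suppose only the wo input was saved for backward: (bs, seq_len, n_heads * head_dim).
wo_input = torch.randn(bs, seq_len, n_heads * head_dim)

# Undo the view and the transpose to recover the SDPA output layout.
sdpa_output = wo_input.view(bs, seq_len, n_heads, head_dim).transpose(1, 2)
assert sdpa_output.shape == (bs, n_heads, seq_len, head_dim)
# sdpa_output is a non-contiguous view over wo_input's storage: no copy is made.
```

Going the other way (rebuilding the wo input from the SDPA output) would repeat the `.contiguous()` copy, so recomputing in the direction shown seems the cheaper of the two.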
One question is why the activations saved for backward differ between simpleFSDP with full model compile and FSDP2 with transformer block compile.
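One way to investigate (a sketch using standard PyTorch tooling, not something from this issue): in eager mode, `torch.autograd.graph.saved_tensors_hooks` enumerates every tensor autograd stashes for backward, and for compiled runs the saved activations appear as extra forward-graph outputs, which `TORCH_LOGS="aot_graphs"` will dump for comparison between the two setups.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

# Toy stand-in for a transformer block (hypothetical, for illustration only).
block = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 64),
)
x = torch.randn(2, 16, 64, requires_grad=True)

saved_shapes = []

def pack(t):
    saved_shapes.append(tuple(t.shape))  # record each tensor saved for backward
    return t

with saved_tensors_hooks(pack, lambda t: t):
    out = block(x)
out.sum().backward()
print(saved_shapes)  # shapes of the activations autograd kept (eager mode)
```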