The hidden states in LlamaFlashAttention2 are cast in fp16 unexpectedly #26451

Closed
2 of 4 tasks
hiyouga opened this issue Sep 27, 2023 · 2 comments · Fixed by #26846
@hiyouga (Contributor) commented Sep 27, 2023

System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.17.1
  • Safetensors version: 0.3.3
  • Accelerate version: 0.23.0
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: A100 40GB
  • Using distributed or parallel set-up in script?: No

Who can help?

@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

As we discussed in this thread: #25598 (comment)

The hidden states may be cast to float16 even if we are using bf16 mixed-precision training.

# In LlamaFlashAttention2.forward, the query/key/value states are cast to fp16 unconditionally:
query_states = query_states.to(torch.float16)
key_states = key_states.to(torch.float16)
value_states = value_states.to(torch.float16)

It may be difficult to figure out the correct data type if the model is loaded in 4/8-bit mode.
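
To make the problem concrete, here is a minimal, self-contained sketch (not the actual modeling_llama.py code): in bf16 mixed-precision training the Llama norm layers return float32 activations, and the hardcoded .to(torch.float16) above then silently switches the attention inputs to fp16, whereas deriving the target dtype from the projection weights keeps them in bfloat16.

import torch
import torch.nn as nn

# Minimal sketch, not the actual Transformers code.
# In bf16 mixed-precision training the Llama norm layers return float32
# activations, so the attention block receives fp32 hidden states.
hidden_states = torch.randn(2, 8, 16, dtype=torch.float32)

# The projection weights carry the intended training dtype (bf16 here).
q_proj = nn.Linear(16, 16, bias=False).to(torch.bfloat16)

# Hardcoded cast (what the snippet above does): ends up in fp16.
query_states = hidden_states.to(torch.float16)
print(query_states.dtype)  # torch.float16 -- not the bf16 we trained with

# Deriving the dtype from the module instead keeps bf16.
query_states = hidden_states.to(q_proj.weight.dtype)
print(query_states.dtype)  # torch.bfloat16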

Expected behavior

The hidden states should be cast to bfloat16 in bf16 training.

@ArthurZucker (Collaborator) commented

Indeed we should not always cast if the dtype is float32

FYI @younesbelkada

@younesbelkada (Contributor) commented

Thanks @hiyouga, this makes sense.

> Indeed we should not always cast if the dtype is float32

Flash Attention only supports fp16 / bf16 input dtypes, so we should always cast to half precision if the input gets silently cast to full precision (e.g. by the layer norm in Llama).

I will work on it and let you know!
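
A rough sketch of that direction, assuming a hypothetical helper _target_half_dtype and a config attribute recording the pre-quantization dtype (names chosen for illustration; the actual patch in #26846 may differ):

import torch

def _target_half_dtype(module, config):
    # Hypothetical helper (illustration only, not the #26846 patch):
    # choose the half-precision dtype Flash Attention should receive.
    if torch.is_autocast_enabled():
        return torch.get_autocast_gpu_dtype()
    # For 4/8-bit models the projection weights are quantized, so fall back
    # to a dtype recorded on the config (attribute name assumed here).
    pre_quant = getattr(config, "_pre_quantization_dtype", None)
    if pre_quant is not None:
        return pre_quant
    return module.q_proj.weight.dtype

# Inside the attention forward, cast only when the input was silently
# upcast to fp32 (e.g. by the Llama norm layers):
#   if query_states.dtype == torch.float32:
#       target = _target_half_dtype(self, self.config)
#       query_states = query_states.to(target)
#       key_states = key_states.to(target)
#       value_states = value_states.to(target)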
