root@f24a8b4b662d:/home/Telechat5/inference_telechat# python telechat_infer_demo.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:12<00:00, 2.66it/s]
Multi-turn input demo
Question: Who are you?
Traceback (most recent call last):
  File "/home/Telechat5/inference_telechat/telechat_infer_demo.py", line 65, in <module>
    main()
  File "/home/Telechat5/inference_telechat/telechat_infer_demo.py", line 27, in main
    answer, history = model.chat(tokenizer=tokenizer, question=question, history=[], generation_config=generate_config,
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 878, in chat
    outputs = self.generate(inputs.to(self.device), generation_config=generation_config, **model_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 799, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 716, in forward
    outputs = block(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 540, in forward
    attn_outputs = self.self_attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 460, in forward
    context_layer = self.core_attention_flash(q, k, v)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 210, in forward
    output = flash_attn_unpadded_func(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 529, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 288, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask = _flash_attn_varlen_forward(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
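The error means the GPU running the demo is pre-Ampere (compute capability below 8.0), which the flash-attn CUDA kernels do not support. A minimal sketch to confirm what your GPU reports, using standard PyTorch APIs:

```python
import torch

# FlashAttention's CUDA kernels require NVIDIA Ampere (SM 8.0) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
if major < 8:
    print("Pre-Ampere GPU: flash_attn will raise "
          "'FlashAttention only supports Ampere GPUs or newer.'")
```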
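If the GPU is indeed pre-Ampere, one possible workaround is to make the model fall back to the standard attention path instead of flash-attn. The sketch below assumes the remote-code model reads a boolean flash-attention switch from its config; the flag name `flash_attn` is a guess here, so verify the exact key in the checkpoint's config.json before relying on it:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "telechat-7B"  # adjust to your local checkpoint path

# Assumption: modeling_telechat.py selects the attention implementation from a
# config flag; "flash_attn" is hypothetical -- check config.json for the real name.
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
config.flash_attn = False  # route attention through the non-flash implementation

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, config=config, trust_remote_code=True, device_map="auto"
)
```

From the trace, line 210 of modeling_telechat.py calls `flash_attn_unpadded_func` directly, so if no such config switch exists, the fallback would have to be patched into the modeling file itself.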