Roadmap #2
Comments
A 3090 and a 7900 XTX are about the same speed if the 3090 uses xformers and the XTX uses sub-quadratic attention in ComfyUI (Stable Diffusion XL 1.0, 30 steps, Euler, 1024x1024). I expect a performance lift once we get Flash Attention, then SDP.
Yeah. There are attention impls for Navi 3x in Composable Kernel, which is used in AITemplate and is said to reach 30 it/s for Stable Diffusion. I made a dirty Flash Attention impl and integrated it into PyTorch before, but I didn't see any performance difference compared to the default math impl, and the generated images were meaningless. There are too many parameters in CK and it's hard to correctly port XDL code to WMMA, so I gave up.
Flash Attention is coming, and supposedly that will enable PyTorch 2's SDP?
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html SDP is currently available for Navi 3x, but among its three underlying implementations (Flash Attention, Memory Efficient Attention, and the math impl), Navi 3x can only use the last one, which just invokes PyTorch methods from C++ and offers no substantial optimization. The current development of Flash Attention for ROCm is focused on CDNA, and I don't know when RDNA will truly be able to utilize Flash Attention. All I can say is that there is potential.
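For reference, the "math impl" fallback mentioned above is just the textbook computation softmax(QK^T / sqrt(d)) @ V, with no fused kernels. A minimal pure-Python sketch (illustrative only, not PyTorch's actual C++ code):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Textbook scaled dot-product attention on plain nested lists.

    This is the kind of computation the SDP math fallback performs,
    without the kernel fusion that Flash / Memory Efficient Attention
    backends provide on supported hardware.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [[scale * sum(q * k for q, k in zip(qr, kr)) for kr in K]
              for qr in Q]
    # row-wise softmax, subtracting the row max for numerical stability
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # output = weights @ V
    return [[sum(w * v[j] for w, v in zip(wr, V)) for j in range(len(V[0]))]
            for wr in weights]
```

Flash Attention computes the same result but tiles the softmax so the full scores matrix never materializes in memory, which is where the speedup on large sequence lengths comes from.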
Very sad that a card with all this potential hardware wise is falling down on the software side. |
BitsAndBytes
https://github.com/are-we-gfx1100-yet/bitsandbytes-rocm
GPTQ for LLaMA
https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm
AutoGPTQ
https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm
Good performance: 43 it/s for 7B, 25 it/s for 13B, 15 it/s for 30B, and 0.25 it/s for 40B at 3-bit, with 1 beam.
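At beam width 1, each iteration roughly corresponds to one generated token, so the figures above can be read as per-token latency. A quick conversion (my own arithmetic, not from the benchmark output):

```python
# Reported AutoGPTQ generation rates in iterations per second
rates = {"7B": 43.0, "13B": 25.0, "30B": 15.0}

# With 1 beam, 1 iteration ~= 1 token, so latency is the reciprocal
latency_ms = {name: 1000.0 / its for name, its in rates.items()}

for name, ms in latency_ms.items():
    print(f"{name}: {ms:.1f} ms/token")
```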
Triton
Navi 3x support is currently work in progress. Stay tuned.
About 13% of rocBLAS performance when running the 03-matrix-multiplication tutorial with this branch, which was recently merged back.
There is still a lot of room for improvement.
AITemplate
Navi 3x support is currently work in progress. Stay tuned.
Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, with this branch.
Somewhat disappointing. Is this really the limit of the RX 7900 XTX?
Flash Attention
To be ported to Navi 3x.
ROCm
ROCm 5.6.0 is available now, but we can't find Windows support anywhere.
I think it might be more appropriate to call it ROCm 5.5.2.