Roadmap #2
Comments
A 3090 and a 7900 XTX are about the same speed if the 3090 uses xformers and the XTX uses sub-quadratic attention in ComfyUI (Stable Diffusion XL 1.0, 30 steps, Euler, 1024x1024). I expect a performance lift once we get Flash Attention, then SDP.
Yeah. There are attention impls for Navi 3x in Composable Kernel, which is used in AITemplate and is said to reach 30 it/s for Stable Diffusion. I made a dirty Flash Attention impl and integrated it into PyTorch before, but I didn't see any performance difference compared to the default math impl, and the generated images were meaningless. There are too many parameters in CK and it's hard to correctly port XDL code to WMMA, so I gave up.
Flash Attention is coming, and supposedly that will enable PyTorch 2's SDP?
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html SDP is currently available for Navi 3x, but among its three underlying implementations (Flash Attention, Memory Efficient Attention, and the math impl), Navi 3x can only use the last one, which just invokes PyTorch methods from C++ and offers no substantial optimization. The current development of Flash Attention for ROCm is focused on CDNA, and I don't know when RDNA will truly be able to utilize Flash Attention. All I can say is that there is potential.
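For reference, the "math impl" fallback mentioned above is just the textbook computation softmax(QK^T / sqrt(d)) @ V, with no fused kernels. A minimal pure-Python sketch (illustrative only, not PyTorch's actual C++ code):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Textbook scaled dot-product attention on plain nested lists.

    This is the kind of computation the SDP math fallback performs,
    without the kernel fusion that Flash / Memory Efficient Attention
    backends provide on supported hardware.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [[scale * sum(q * k for q, k in zip(qr, kr)) for kr in K]
              for qr in Q]
    # row-wise softmax, subtracting the row max for numerical stability
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # output = weights @ V
    return [[sum(w * v[j] for w, v in zip(wr, V)) for j in range(len(V[0]))]
            for wr in weights]
```

Flash Attention computes the same result but tiles the softmax so the full scores matrix never materializes in memory, which is where the speedup on large sequence lengths comes from.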
Very sad that a card with all this potential hardware wise is falling down on the software side. |
BitsAndBytes
https://github.com/are-we-gfx1100-yet/bitsandbytes-rocm
GPTQ for LLaMA
https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm
AutoGPTQ
https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm
Good performance: 43 it/s for 7B, 25 it/s for 13B, 15 it/s for 30B, and 0.25 it/s for 40B at 3-bit, with 1 beam.
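At beam width 1, each iteration roughly corresponds to one generated token, so the figures above can be read as per-token latency. A quick conversion (my own arithmetic, not from the benchmark output):

```python
# Reported AutoGPTQ generation rates in iterations per second
rates = {"7B": 43.0, "13B": 25.0, "30B": 15.0}

# With 1 beam, 1 iteration ~= 1 token, so latency is the reciprocal
latency_ms = {name: 1000.0 / its for name, its in rates.items()}

for name, ms in latency_ms.items():
    print(f"{name}: {ms:.1f} ms/token")
```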
Triton
Navi 3x support is currently work in progress. Stay tuned.
About 13% of rocBLAS performance when running the 03-matrix-multiplication tutorial with this branch, which was recently merged back.
There is still a lot of room for improvement.
AITemplate
Navi 3x support is currently work in progress. Stay tuned.
Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, with this branch.
Somewhat disappointing. Is this really the limit of the RX 7900 XTX?
Flash Attention
To be ported to Navi 3x.
ROCm
ROCm 5.6.0 is available now, but we can't find Windows support anywhere.
I think it might be more appropriate to call it ROCm 5.5.2.