
Roadmap #2

Open
evshiron opened this issue May 22, 2023 · 5 comments

@evshiron
Owner

evshiron commented May 22, 2023

BitsAndBytes

GPTQ for LLaMA

https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm

AutoGPTQ

https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm

Good performance: 43 it/s for 7B, 25 it/s for 13B, 15 it/s for 30B, and 0.25 it/s for 40B 3-bit, with 1 beam.
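For reference, loading one of these quantized checkpoints looks like standard AutoGPTQ usage. A minimal sketch, assuming the ROCm fork keeps the upstream AutoGPTQ API and that a pre-quantized GPTQ checkpoint is already on disk (the model path below is a placeholder):

```python
# Minimal sketch: load a pre-quantized LLaMA checkpoint with AutoGPTQ and generate.
# Assumes the ROCm fork keeps the upstream API; the model path is a placeholder.
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/llama-7b-4bit-gptq"  # placeholder: your quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
# On ROCm, PyTorch exposes the GPU through the usual "cuda" device strings (HIP backend).
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_triton=False,  # Triton kernels for Navi 3x are still in progress (see below)
)

prompt = "The RX 7900 XTX is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64, num_beams=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```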

Triton

Navi 3x support is currently a work in progress. Stay tuned.

About 13% of rocBLAS performance when running 03-matrix-multiplication with this branch, which was recently merged back.

There is still a lot of room for improvement.
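For reference, the rocBLAS side of such a comparison can be timed with plain PyTorch, since torch.matmul dispatches to rocBLAS on a ROCm build. A sketch (matrix sizes are arbitrary; the Triton side would be the matmul from the 03-matrix-multiplication tutorial, timed the same way):

```python
# Sketch: time an fp16 GEMM through torch.matmul (rocBLAS on ROCm builds).
import torch

M = N = K = 4096
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)

# Warm up, then time with device events.
for _ in range(10):
    torch.matmul(a, b)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
reps = 100
start.record()
for _ in range(reps):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / reps
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
print(f"torch.matmul (rocBLAS): {ms:.3f} ms, {tflops:.1f} TFLOPS")

# Timing the Triton matmul() from the 03-matrix-multiplication tutorial the same
# way and taking the ratio gives the percentage figure quoted above.
```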

AITemplate

Navi 3x support is currently a work in progress. Stay tuned.

Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, with this branch.

Somewhat disappointing. Is this really the limit of the RX 7900 XTX?
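For comparison, a similar it/s number can be measured without AITemplate by timing the plain diffusers pipeline. A rough sketch, assuming a stock Stable Diffusion 1.x checkpoint from the Hub (the wall-clock rate also includes text encoder and VAE overhead, so it reads slightly lower than the sampler's progress bar):

```python
# Sketch: rough it/s for a 512x512 Stable Diffusion generation with plain diffusers,
# as a baseline against the AITemplate-compiled pipeline.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: any SD 1.x checkpoint works here
    torch_dtype=torch.float16,
).to("cuda")

steps = 50
start = time.perf_counter()
pipe("a photo of an astronaut riding a horse",
     height=512, width=512, num_inference_steps=steps)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"~{steps / elapsed:.1f} it/s (whole pipeline, not just the UNet loop)")
```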

Flash Attention

To be ported to Navi 3x.

ROCm

ROCm 5.6.0 is available now, but we can't find Windows support anywhere.

I think it might be more appropriate to call it ROCm 5.5.2.
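For what it's worth, the HIP version a given PyTorch build was compiled against can be checked directly; a small sketch:

```python
# Sketch: report the HIP/ROCm version behind the installed PyTorch build.
import torch

print("HIP version:", torch.version.hip)           # None on CUDA builds, e.g. "5.5.x" on ROCm
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. "AMD Radeon RX 7900 XTX"
```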

@evshiron evshiron changed the title TODOs Roadmap Jun 22, 2023
@evshiron evshiron pinned this issue Jun 22, 2023
@DarkAlchy

A 3090 and a 7900 XTX are about the same speed when the 3090 uses xformers and the XTX uses sub-quadratic attention in ComfyUI, running Stable Diffusion XL 1.0 at 1024x1024 with 30 Euler steps. I expect a performance lift once we get Flash Attention, then SDP.

@evshiron
Owner Author

evshiron commented Aug 3, 2023

Yeah. There are attention implementations for Navi 3x in Composable Kernel, which is used by AITemplate, and that is said to reach 30 it/s for Stable Diffusion.

I previously made a dirty Flash Attention implementation and integrated it into PyTorch, but I didn't see any performance difference compared to the default math implementation, and the generated images were meaningless.

There are too many parameters in CK, and it's hard to correctly port the XDL code to WMMA, so I gave up.

@DarkAlchy

Flash is coming, and supposedly that will enable PyTorch 2's SDP?

@evshiron
Owner Author

evshiron commented Aug 3, 2023

https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

SDP is currently available for Navi 3x, but among the three underlying implementations of SDP (Flash Attention, Memory Efficient Attention, and the math implementation), Navi 3x can only use the last one, which just invokes PyTorch operators from C++ and does not offer substantial optimization.
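For illustration, PyTorch 2.0 lets you restrict SDP to a particular backend with the torch.backends.cuda.sdp_kernel context manager; a small sketch (on Navi 3x, only the math path is expected to succeed today):

```python
# Sketch: run scaled_dot_product_attention with only the math backend enabled.
# On Navi 3x, forcing the flash or memory-efficient backends is expected to fail,
# since only the math implementation is currently usable there.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_mem_efficient=False,
                                    enable_math=True):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([1, 8, 1024, 64])
```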

The current development of Flash Attention for ROCm is focused on CDNA, and I don't know when RDNA will truly be able to utilize Flash Attention. All I can say is that there is potential.

@DarkAlchy

Very sad that a card with all this potential on the hardware side is falling down on the software side.
