
AWQ quantization is very slow for ONNX LLMs #1609

Open
PatriceVignola opened this issue Feb 10, 2024 · 1 comment
PatriceVignola commented Feb 10, 2024

I'm not sure if I'm missing an option somewhere, but AWQ quantization for large ONNX models is very slow. When quantizing a 7B LLaMA model, the following np.matmul calls take forever to execute, and at the current pace I estimate it would take days to quantize the model:

```python
org_out = np.matmul(inp, org_weight)  # n_token, oc
cur_out = np.matmul(inp, weight.T)
```

Would it make sense to allow the user to pass either a torch module or an ONNX model/session to compute the loss (or, at the very least, to do the matmul computation)? Even replacing the np.matmul calls with torch.matmul calls on a CUDA device makes them orders of magnitude faster, as in the sketch below.
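To illustrate the kind of substitution I mean, here is a minimal sketch. The `fast_matmul` helper is hypothetical and not part of the existing code; `inp`, `org_weight`, and `weight` are the arrays from the snippet above:

```python
import numpy as np
import torch

def fast_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Run the matmul on a CUDA device via torch when available, else fall back to numpy."""
    if torch.cuda.is_available():
        # np.ascontiguousarray handles non-contiguous views such as weight.T
        ta = torch.from_numpy(np.ascontiguousarray(a)).cuda()
        tb = torch.from_numpy(np.ascontiguousarray(b)).cuda()
        return (ta @ tb).cpu().numpy()
    return np.matmul(a, b)

# Drop-in replacements for the two calls above:
# org_out = fast_matmul(inp, org_weight)  # n_token, oc
# cur_out = fast_matmul(inp, weight.T)
```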

Otherwise, is there a workaround or an option I'm unaware of that would make it faster? I feel like I might be missing something.

@yuwenzho (Contributor) commented
It takes about 1 hour to run AWQ quantization on the Llama-2-7b model on our test device using the scripts in our llama weight-only quantization example. You can refer to the AWQ options in main.py#L325-L336.

We currently have no plans to support torch tensor computation in our ONNX weight-only quantization. However, you could consider CuPy as a GPU-accelerated, near drop-in replacement for numpy. You would need to implement that substitution yourself, for example along the lines sketched below.
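For example, a minimal sketch of the CuPy substitution. `cp.asarray`, `cp.matmul`, and `cp.asnumpy` are CuPy's public API; the helper name and the fallback logic are illustrative only:

```python
import numpy as np

try:
    import cupy as cp  # GPU-accelerated, numpy-compatible array library
    _HAS_CUPY = True
except ImportError:
    _HAS_CUPY = False

def matmul_maybe_gpu(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute a @ b on the GPU with CuPy when available, else with numpy."""
    if _HAS_CUPY:
        out = cp.matmul(cp.asarray(a), cp.asarray(b))  # host -> device copy, matmul on GPU
        return cp.asnumpy(out)                         # device -> host copy
    return np.matmul(a, b)
```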
