I'm not sure if I'm missing an option somewhere, but AWQ quantization for large ONNX models is very slow. When quantizing a 7B LLaMA model, the following four np.matmul calls take forever to execute, and I estimate it would take days to quantize the model at the current pace:

neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py, line 466 in 26b260e
neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py, line 488 in 26b260e
neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py, line 615 in 26b260e
neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py, line 636 in 26b260e

Would it make sense to allow the user to pass either a torch module or an ONNX model/session to compute the loss (or at the very least do the matmul computation)? Even replacing the np.matmul calls with simple torch.matmul calls on CUDA devices makes them dramatically faster, along the lines of the sketch below.

Otherwise, is there a current workaround or option I'm unaware of that would make it faster? I feel like I might be missing something.
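For illustration, here is a minimal sketch of the kind of drop-in wrapper I have in mind. fast_matmul is just a placeholder name, not an existing neural-compressor function; only the torch calls themselves are real APIs.

```python
import numpy as np
import torch


def fast_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Drop-in replacement for np.matmul that offloads to CUDA when available."""
    if torch.cuda.is_available():
        # Copy both operands to the GPU, multiply there, then bring the result back.
        out = torch.matmul(
            torch.as_tensor(a, device="cuda"),
            torch.as_tensor(b, device="cuda"),
        )
        return out.cpu().numpy()
    # Fall back to the original NumPy path on CPU-only machines.
    return np.matmul(a, b)
```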
We currently have no plans to support torch tensor computation in our ONNX weight-only quantization. However, as an alternative you could use CuPy instead of NumPy for GPU-accelerated computation. You can implement this substitution yourself, for example along the lines of the sketch below.
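A minimal sketch of that kind of substitution, assuming CuPy is installed alongside a compatible CUDA toolkit; matmul_gpu is a hypothetical helper, not part of neural-compressor.

```python
import numpy as np

try:
    import cupy as cp  # GPU-accelerated, NumPy-compatible array library
except ImportError:
    cp = None


def matmul_gpu(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Run the matmul on the GPU via CuPy when available, otherwise fall back to NumPy."""
    if cp is not None:
        # cp.asarray copies the inputs to the GPU; cp.asnumpy copies the result back to host memory.
        return cp.asnumpy(cp.matmul(cp.asarray(a), cp.asarray(b)))
    return np.matmul(a, b)
```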