Why must the linear input for Layer.pack be of type torch.half? #28
Comments
Hi, Marlin currently does not support BF16 inputs (though in many cases you can just convert your BF16 model to FP16). BF16 requires slightly different GPU instructions as well as a slightly different dequantization process (since BF16 has only 7 mantissa bits, versus FP16's 10). This is also why ...
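Since the mantissa width is the crux here, a small, hypothetical PyTorch illustration (not Marlin's actual kernel code) of the magic-number int4 dequantization trick shows why the constants differ between FP16 and BF16:

```python
import torch

# FP16 has 10 mantissa bits: OR-ing a 4-bit value q into the low mantissa bits
# of 1024.0 (bit pattern 0x6400) yields exactly 1024 + q, so subtracting 1024
# recovers q. BF16 has only 7 mantissa bits, so the magic constant becomes
# 128.0 (bit pattern 0x4300) and the correction becomes 128.
q = torch.randint(0, 16, (8,), dtype=torch.int16)   # unpacked 4-bit values

# FP16 path
fp16 = (q | 0x6400).view(torch.float16) - 1024.0
assert torch.equal(fp16, q.to(torch.float16))

# BF16 path: different magic constant because of the shorter mantissa
bf16 = (q | 0x4300).view(torch.bfloat16) - 128.0
assert torch.equal(bf16, q.to(torch.bfloat16))
```

The real kernels do the same thing on packed 16-bit lanes in registers, but the constant change above is the gist of why BF16 needs a separate dequantization path.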
Hi, I notice that the core mma function for BF16 is already supported in vLLM's GPTQ Marlin kernel, and it seems that only a few changes are needed to bring this over: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq_marlin/gptq_marlin.cu#L89 I really need BF16 input (most models are released in BF16 now), so if it doesn't take much time, could you merge that feature into the Marlin repo? If not, I can do it later.
@Azure-Tang
Didn't get it done yet, maybe next week?
It seems that if we remove the assert in Layer.pack, then we can pack a BF16 linear? By the way, will Marlin support "int4 × bf16" as input?
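If the goal is just to get a BF16 checkpoint through Layer.pack today, the FP16-conversion route mentioned above would look roughly like this; the marlin.Layer constructor arguments and pack() signature below are assumptions and may not match the repo exactly:

```python
import torch
import marlin  # the Marlin repo's Python package

# Assumed workaround: convert the BF16 linear to FP16 before packing, instead
# of removing the dtype assert. Shapes, group size, and scales are placeholders.
linear = torch.nn.Linear(256, 256, bias=False, dtype=torch.bfloat16)
linear = linear.to(torch.float16)                 # now satisfies the torch.half check

layer = marlin.Layer(256, 256, groupsize=128)     # assumed constructor signature
scales = torch.ones(256 // 128, 256, dtype=torch.float16)  # placeholder per-group scales
layer.pack(linear, scales)
```

At runtime the activations would still need to be cast to FP16 before calling the packed layer, since the kernel itself computes in FP16.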