Enable gptqmodel #35012
Conversation
@SunMarc GPTQModel is intended to replace AutoGPTQ entirely, given the lack of progress in that repo, but for the sake of compatibility they can co-exist in parallel until this integration is merged and everything is stable and tested. Later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed or maintained.
Hey @jiqing-feng, thanks for adding this!
Signed-off-by: jiqing-feng <[email protected]>
Thanks for this PR. Left a couple of comments. Note that we also need to modify the Dockerfile for our quantization tests if we decide to deprecate auto-gptq, and it would be nice to include a new version of a Colab notebook that works with gptqmodel.
```diff
 gptq_supports_cpu = (
     is_auto_gptq_available()
     and version.parse(importlib.metadata.version("auto-gptq")) > version.parse("0.4.2")
 ) or is_gptqmodel_available()
 if not gptq_supports_cpu and not torch.cuda.is_available():
     raise RuntimeError("GPU is required to quantize or run quantize model.")
-elif not (is_optimum_available() and is_auto_gptq_available()):
+elif not (is_optimum_available() and (is_auto_gptq_available() or is_gptqmodel_available())):
     raise ImportError(
-        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)"
+        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq or gptqmodel library (`pip install auto-gptq` or `pip install gptqmodel`)"
     )
-elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
+elif is_auto_gptq_available() and version.parse(importlib.metadata.version("auto_gptq")) < version.parse(
+    "0.4.2"
+):
     raise ImportError(
-        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`"
+        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq` or use gptqmodel by `pip install gptqmodel`"
     )
```
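For reviewers who want to try the new path locally, here is a minimal sketch of loading an already-quantized GPTQ checkpoint once `optimum` and `gptqmodel` are installed. The model repo id and the `pip` line are only illustrative assumptions, not part of this PR.

```python
# pip install --upgrade optimum gptqmodel  # the exact minimum versions are set by this PR/optimum
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any GPTQ checkpoint on the Hub should work; this repo id is just an example.
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# With gptqmodel installed, the environment check above accepts it in place of auto-gptq.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQ lets you run 4-bit models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```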
Can you add a message mentioning that auto-gptq will be deprecated? I think we can do it two versions of transformers from now. For optimum, maybe we can deprecate it a bit later than transformers, to make sure that we can still revert if there is a big issue.
done.
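For context, such a deprecation notice could look roughly like the sketch below. The wording, logger name, and helper name are illustrative assumptions, not the text that was merged.

```python
import logging

logger = logging.getLogger(__name__)

def _warn_autogptq_deprecation() -> None:
    # Illustrative only: tell users that the auto-gptq path is slated for removal
    # and point them at gptqmodel as the replacement.
    logger.warning(
        "auto-gptq support will be deprecated in upcoming transformers releases; "
        "please switch to gptqmodel (`pip install gptqmodel`) instead."
    )
```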
Don't forget that users need to use the latest version of optimum with gptqmodel.
I have added version constraints for optimum and gptqmodel. The constraints can be updated once the new gptqmodel and optimum releases are out.
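As a rough illustration of what "limiting the versions" means here, a sketch of such a gate is below. The minimum version strings and the helper name are placeholders; the real pins live in the PR and in optimum.

```python
import importlib.metadata

from packaging import version

# Placeholder minimums for illustration; the actual pins are defined in the PR/optimum.
MIN_OPTIMUM_VERSION = "1.23.99"
MIN_GPTQMODEL_VERSION = "1.4.99"

def check_gptqmodel_stack() -> None:
    # Fail early if the installed optimum/gptqmodel pair is too old for the new code path.
    if version.parse(importlib.metadata.version("optimum")) < version.parse(MIN_OPTIMUM_VERSION):
        raise ImportError(f"gptqmodel support requires optimum >= {MIN_OPTIMUM_VERSION}")
    if version.parse(importlib.metadata.version("gptqmodel")) < version.parse(MIN_GPTQMODEL_VERSION):
        raise ImportError(f"gptqmodel support requires gptqmodel >= {MIN_GPTQMODEL_VERSION}")
```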
@SunMarc This PR in its current state is not passing our internal tests. @jiqing-feng will merge some of our changes that pass both the inference and quantization tests. Please delay your review until then, since there are substantial changes relative to the current code/PR.
* gptqmodel needs to use checkpoint_format
* fix quantize
* Update quantization_config.py
* Update quantization_config.py
* Update quantization_config.py
---------
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
* revert quantizer_gptq.py change
* pass **kwargs
Testing changes include a refactor: CPU tests no longer need @require_torch_gpu.
Signed-off-by: jiqing-feng <[email protected]>
* revert quantizer_gptq.py change
* pass **kwargs
* add meta info
* cleanup
* cleanup
* Update quantization_config.py
* hf_select_quant_linear pass checkpoint_format and meta
* fix GPTQTestCUDA
* Update test_gptq.py
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* cleanup
* add backend
* cleanup
* cleanup
* no need check exllama version
* Update quantization_config.py
* lower checkpoint_format and backend
* check none
* cleanup
* Update quantization_config.py
* fix self.use_exllama == False
* spell
* fix unittest
* fix unittest
---------
Co-authored-by: LRL <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
@SunMarc Review can start. There may be some testing code tweaks, but I do not foresee any major changes from this point forward other than passing flaky tests and/or fixing some testing bugs.
@ArthurZucker @SunMarc @MekkCyber Is there anything else required of us to move this PR forward? Thanks! We still have a lingering PEFT PR that is contingent on this PR being merged first.
Hey, super sorry, reviewing in a bit. I hope I was not the blocker!
Hi @SunMarc @ArthurZucker @MekkCyber. The optimum PR has been merged, so this PR should be ready to merge.
Mostly wondering if it would not make more sense to create a separate backend, as we can now treat them as different libs, no? 🤗
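For reference, the commit log above mentions adding a `backend` field to the config, which is the alternative to a separate quantizer class. A hypothetical sketch of how a user would pass such a hint is below; the accepted values are defined by the PR/gptqmodel, not by this sketch.

```python
from transformers import GPTQConfig

# Hypothetical usage: a single GPTQ integration routes to different kernels
# via a `backend` hint instead of a second backend class.
# The value "auto" is an assumption used only for illustration.
quant_config = GPTQConfig(bits=4, backend="auto")
```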
docs/source/en/quantization/gptq.md
* Model support: GPTQModel continues to support all of the latest released LLM models.
* Multi-modal support: GPTQModel supports accurate quantization of Qwen2-VL and Ovis 1.6-VL image-to-text models.
* Platform support: validated macOS Apple Silicon and Windows 11 support.
* Hardware support: Apple Silicon M1+, Intel/AMD CPU, and Intel Datacenter Max + Arc GPUs.
* Asymmetric support: asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Datacenter Max + Arc) support.
* Updated Marlin kernel from Neural Magic, optimized for A100 (Ampere).
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
* Faster quantization, lower memory usage, and more accurate default quantization via the GPTQModel quantization APIs.
* User- and developer-friendly APIs.
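Since CPU support is one of the headline items in the list above, here is a small sketch of quantizing on a CPU-only machine with the new stack. The model id, calibration dataset, and output directory are example choices, not prescribed by this PR.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Example model; any causal LM from the Hub can be substituted.
model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# "c4" is one of the built-in calibration datasets; a list of strings also works.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# With gptqmodel installed, no CUDA GPU is required for this step.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    quantization_config=quant_config,
)
model.save_pretrained("opt-125m-gptq")
```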
nice! 🤗
For docs, we'd better put autogptq and gptqmodel in the same section, because autogptq is no longer maintained and we might deprecate autogptq in the future if it is incompatible.
To add to what @jiqing-feng mentioned: for this PR, backward compatibility and minimal code change in transformers/optimum/peft is the target. Deprecation of AutoGPTQ is fully planned, with good reason. Looking forward, long term, there is no reason to keep AutoGPTQ or to spend time re-architecting and splitting the HF GPTQ integration into two backends. The AutoGPTQ core maintainer has been MIA, literally unreachable by anyone, including @fxmarty (second maintainer), who did almost all the work in 2024 while the project was still active, until he ran out of spare time to work on it. I was invited by fxmarty to help as a third, part-time, restricted maintainer, but I later decided to build GPTQModel instead, following my own vision, unburdened by the legacy API and what I considered questionable foundation code. GPTQModel runs 100% CI feature and model coverage for each release; AutoGPTQ has no CI. We have bugs too, but there are so many hidden bugs in AutoGPTQ that we fixed that we have lost count. Transformers/Optimum/PEFT only use the kernel part of the AutoGPTQ code base, so the full short- and long-term problems there are not visible here.
Thanks for updating!
Co-authored-by: Steven Liu <[email protected]>
@stevhliu Thanks for the doc/text corrections.
I suppose this PR is ready to be merged as we got enough approvals; please let me know if there is anything I need to change.
Hi @ArthurZucker @Rocketknight1. Please let me know if there is anything I need to change before merging. Thanks!
@SunMarc @MekkCyber I think you have the ultimate authority here. Take one last look and feel free to merge it if you're happy!
Sounds good! I'll merge it then! cc @MekkCyber for visibility
We are going to replace `auto_gptq` with `gptqmodel`. Start with the quantizer check; we also need to change optimum: huggingface/optimum#2064. We intended to deprecate AutoGPTQ in this PR, but considering users' behavior, we would like to keep support for auto_gptq for the next few versions and emit a deprecation warning.
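The co-existence described above could look roughly like the sketch below. It assumes the availability helpers used in the diff are importable from `transformers.utils`; the helper name and warning text are illustrative, not the merged code.

```python
import logging

from transformers.utils import is_auto_gptq_available, is_gptqmodel_available

logger = logging.getLogger(__name__)

def pick_gptq_library() -> str:
    # Prefer gptqmodel when both libraries are installed; keep auto-gptq working
    # for a few more releases, but tell users it is on the way out.
    if is_gptqmodel_available():
        return "gptqmodel"
    if is_auto_gptq_available():
        logger.warning(
            "auto-gptq support is deprecated and will be removed in a future release; "
            "please migrate to gptqmodel (`pip install gptqmodel`)."
        )
        return "auto-gptq"
    raise ImportError("Neither gptqmodel nor auto-gptq is installed.")
```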