
Enable gptqmodel #35012

Merged: 63 commits merged into huggingface:main from the gptq branch on Jan 15, 2025

Conversation

@jiqing-feng (Contributor) commented Nov 29, 2024

We are going to replace auto_gptq with gptqmodel, starting with the quantizer check; optimum also needs to change: huggingface/optimum#2064.

We intended to deprecate AutoGPTQ in this PR, but considering existing user workflows, we will keep auto_gptq support for the next few versions and emit a deprecation warning.
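
For context, the user-facing flow this PR targets stays unchanged; only the backend library behind it changes. A minimal sketch, assuming `pip install optimum gptqmodel`, with an example GPTQ checkpoint (any GPTQ-quantized model works):

```python
# Minimal sketch: loading a GPTQ-quantized checkpoint through the existing
# transformers API. With this PR, gptqmodel can back this path instead of auto_gptq.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example GPTQ checkpoint; placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQModel replaces auto_gptq by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```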

@Rocketknight1 (Member)

cc @SunMarc @MekkCyber

@Qubitium (Contributor)

@SunMarc GPTQModel is intended to replace AutoGPTQ entirely, given the lack of progress in that repo for many reasons. For the sake of compatibility, though, the two can co-exist in parallel until this integration is merged and everything is stable and tested; later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed or maintained.

@MekkCyber (Contributor)

Hey @jiqing-feng, thanks for adding gptqmodel, LGTM! Could you update the PR description and title to make them clearer? Thanks 😊

@MekkCyber requested a review from SunMarc on November 29, 2024 14:48
@jiqing-feng marked this pull request as ready for review on December 2, 2024 05:14
@jiqing-feng changed the title from "gptqmodel" to "Enable gptqmodel" on Dec 2, 2024
@jiqing-feng marked this pull request as draft on December 2, 2024 09:11
@SunMarc (Member) left a comment

Thanks for this PR. Left a couple of comments. Note that we also need to modify the Dockerfile for our quantization tests if we decide to deprecate auto-gptq. It would also be nice to include a new version of a Colab notebook that works with gptqmodel.

Comment on lines 52 to 67

```diff
 gptq_supports_cpu = (
     is_auto_gptq_available()
     and version.parse(importlib.metadata.version("auto-gptq")) > version.parse("0.4.2")
 ) or is_gptqmodel_available()
 if not gptq_supports_cpu and not torch.cuda.is_available():
     raise RuntimeError("GPU is required to quantize or run quantize model.")
-elif not (is_optimum_available() and is_auto_gptq_available()):
+elif not (is_optimum_available() and (is_auto_gptq_available() or is_gptqmodel_available())):
     raise ImportError(
-        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)"
+        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq or gptqmodel library (`pip install auto-gptq` or `pip install gptqmodel`)"
     )
-elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
+elif is_auto_gptq_available() and version.parse(importlib.metadata.version("auto_gptq")) < version.parse(
+    "0.4.2"
+):
     raise ImportError(
-        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`"
+        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq` or use gptqmodel by `pip install gptqmodel`"
     )
```
Member

Can you add a message mentioning that autogptq will be deprecated? I think we can do it two transformers versions from now. For optimum, maybe we can deprecate a bit later than transformers to make sure we can still revert if there is a big issue.

@jiqing-feng (Contributor, Author)

Done.
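
The warning added here is along these lines (a hedged sketch only; the exact wording and placement that landed in the PR may differ):

```python
# Sketch of an auto-gptq deprecation notice, emitted when auto-gptq is used
# without gptqmodel. Message text and location (e.g. quantizer_gptq.py) are assumptions.
from transformers.utils import is_auto_gptq_available, is_gptqmodel_available, logging

logger = logging.get_logger(__name__)

if is_auto_gptq_available() and not is_gptqmodel_available():
    logger.warning(
        "Support for auto-gptq is deprecated and will be removed in a future release of "
        "Transformers. Please switch to gptqmodel: `pip install gptqmodel`."
    )
```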

Comment on lines 52 to 67 (same snippet as quoted above)
Member

Don't forget that users need to use the latest version of optimum with gptqmodel.

@jiqing-feng (Contributor, Author)

I have set minimum versions for optimum and gptqmodel. The version constraints can be changed after the next gptqmodel and optimum releases.
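
A minimum-version gate of this kind typically looks like the sketch below; the actual thresholds and wording in the merged code are not reproduced here, so the version numbers are placeholders:

```python
# Sketch of a minimum-version check for gptqmodel and optimum.
# The thresholds below are illustrative placeholders, not the merged values.
import importlib.metadata

from packaging import version

GPTQMODEL_MINIMUM_VERSION = version.parse("1.4.3")  # placeholder
OPTIMUM_MINIMUM_VERSION = version.parse("1.23.99")  # placeholder


def check_gptq_backend_versions() -> None:
    if version.parse(importlib.metadata.version("gptqmodel")) < GPTQMODEL_MINIMUM_VERSION:
        raise ImportError(
            f"gptqmodel >= {GPTQMODEL_MINIMUM_VERSION} is required: `pip install -U gptqmodel`"
        )
    if version.parse(importlib.metadata.version("optimum")) < OPTIMUM_MINIMUM_VERSION:
        raise ImportError(
            f"optimum >= {OPTIMUM_MINIMUM_VERSION} is required: `pip install -U optimum`"
        )
```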

@Qubitium (Contributor) commented Dec 2, 2024

@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng will merge in some of our changes that pass both the inference and quantization tests. Please delay your review until then, since there are substantial changes relative to the current code/PR.

* gptqmodel needs to use checkpoint_format

* fix quantize

* Update quantization_config.py

* Update quantization_config.py

* Update quantization_config.py

---------

Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
LRL-ModelCloud and others added 2 commits December 4, 2024 08:55
* revert quantizer_gptq.py change

* pass **kwargs
@jiqing-feng (Contributor, Author)
The testing changes include:

Refactor: CPU tests no longer need @require_torch_gpu.
GPTQ lib: @require_gptq means these tests can run with either gptqmodel or auto-gptq (see the sketch after this list).
Default model: tests now run llama by default instead of bloom because it is more common.
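
A decorator like the one described above can be sketched as follows; the helper name and implementation are illustrative and not necessarily what landed in transformers.testing_utils:

```python
# Sketch: skip a test unless either GPTQ backend (gptqmodel or auto-gptq) is installed.
import unittest

from transformers.utils import is_auto_gptq_available, is_gptqmodel_available


def require_gptq(test_case):
    """Decorator that skips a test unless gptqmodel or auto-gptq is available."""
    return unittest.skipUnless(
        is_gptqmodel_available() or is_auto_gptq_available(),
        "test requires gptqmodel or auto-gptq",
    )(test_case)
```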

jiqing-feng and others added 8 commits December 4, 2024 10:32
* revert quantizer_gptq.py change

* pass **kwargs

* add meta info

* cleanup

* cleanup

* Update quantization_config.py

* hf_select_quant_linear pass checkpoint_format and meta

* fix GPTQTestCUDA

* Update test_gptq.py

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* cleanup

* add backend

* cleanup

* cleanup

* no need check exllama version

* Update quantization_config.py

* lower checkpoint_format and backend

* check none

* cleanup

* Update quantization_config.py

* fix self.use_exllama == False

* spell

* fix unittest

* fix unittest

---------

Co-authored-by: LRL <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
@Qubitium (Contributor) commented Dec 5, 2024

@SunMarc Review can start with optimum first. I will write up a detailed explainer on the optimum PR covering some of the obvious small and large changes we pushed.

There may be some testing code tweaks, but I do not foresee any major changes from this point forward other than passing flaky tests and/or fixing some testing bugs. Since optimum has the largest diffs and holds most of the GPTQ quantization logic, we will first concentrate on getting optimum review-cleared, then peft/transformers, in that order.

@Qubitium (Contributor) commented Jan 6, 2025

@ArthurZucker @SunMarc @MekkCyber Is anything else required of us to move this PR forward? Thanks! We still have a lingering PEFT PR that is contingent on this PR being merged first.

@ArthurZucker (Collaborator)

Hey, super sorry, reviewing in a bit. I hope I was not the blocker!

@jiqing-feng (Contributor, Author)

Hi @SunMarc @ArthurZucker @MekkCyber. The optimum PR has been merged, so this PR should be ready to merge.

@ArthurZucker (Collaborator) left a comment

Mostly wondering if it would not make more sense to create a separate backend, as we can now treat them as different libs, no? 🤗

Comment on lines 29 to 38

* Model support: GPTQModel continues to support all of the latest released LLM models.
* Multi-modal support: GPTQModel supports accurate quantization of Qwen 2-VL and Ovis 1.6-VL image-to-text models.
* Platform support: validated macOS Apple Silicon and Windows 11 support.
* Hardware support: Apple Silicon M1+, Intel/AMD CPU, and Intel Datacenter Max + Arc GPUs.
* Asymmetric support: asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Datacenter Max + Arc) support.
* Updated Marlin kernel from Neural Magic, optimized for A100 (Ampere).
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
* Faster quantization, lower memory usage, and more accurate default quantization via the GPTQModel quantization APIs.
* User- and developer-friendly APIs.
Collaborator

nice! 🤗
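
As an aside on the asymmetric support mentioned in the quoted docs above, it maps onto the existing GPTQConfig `sym` flag. A hedged sketch, assuming gptqmodel is the installed backend and using placeholder model and dataset names:

```python
# Sketch: asymmetric GPTQ quantization (sym=False) through transformers' GPTQConfig.
# Asymmetric checkpoints are not backward compatible with AutoGPTQ, so gptqmodel is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

asym_config = GPTQConfig(bits=4, sym=False, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=asym_config)
```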

@jiqing-feng (Contributor, Author) commented Jan 9, 2025

> Mostly wondering if it would not make more sense to create a separate backend, as we can now treat them as different libs, no? 🤗

For the docs, we'd better keep autogptq and gptqmodel in the same section, because autogptq is no longer maintained and we might deprecate it in the future if it becomes incompatible.

@Qubitium (Contributor) commented Jan 9, 2025

> Mostly wondering if it would not make more sense to create a separate backend, as we can now treat them as different libs, no? 🤗

To add to what @jiqing-feng mentioned: for this PR, backward compatibility and minimal code change across transformers/optimum/peft is the target.

Deprecation of autogptq is fully planned, with good reason.

Looking forward, long term, there is no reason to keep autogptq or to spend time re-architecting and splitting HF GPTQ into two backends. The autogptq core maintainer has been MIA, literally unreachable by anyone, including @fxmarty (the second maintainer), who did almost all the work in 2024 while the project was still active, until he ran out of spare time to work on it. I was invited by fxmarty to help as a third part-time restricted maintainer, but I later decided to build GPTQModel instead, following my own vision, unburdened by the legacy API and what I considered questionable foundation code.

GPTQModel runs 100% CI feature and model coverage for each release. Autogptq has no CI. We have bugs too, but there are so many hidden bugs in autogptq that we fixed that we have lost count.

Transformers/Optimum/Peft only use the kernel part of the autogptq code base, so the full short- and long-term problems there are not visible here.

@stevhliu (Member) left a comment

Thanks for updating!

Review comments on docs/source/en/quantization/gptq.md and docs/source/en/quantization/overview.md (outdated, resolved).
@Qubitium (Contributor)

> Thanks for updating!

@stevhliu Thanks for the doc/text corrections.

@jiqing-feng (Contributor, Author)

I suppose this PR is ready to be merged, as we have enough approvals; please let me know if there is anything I need to change.

@SunMarc requested a review from ArthurZucker on January 10, 2025 09:39
@jiqing-feng (Contributor, Author)

Hi @ArthurZucker @Rocketknight1. Please let me know if there is anything I need to change before merging. Thanks!

@Rocketknight1 (Member)

@SunMarc @MekkCyber I think you have the ultimate authority here. Take one last look and feel free to merge it if you're happy!

@SunMarc (Member) commented Jan 15, 2025

Sounds good! I'll merge it then! cc @MekkCyber for visibility.

@SunMarc merged commit 387663e into huggingface:main on Jan 15, 2025
25 checks passed
@Qubitium deleted the gptq branch on January 15, 2025 16:15