
Enable GPTQModel #2064

Merged
merged 28 commits into huggingface:main on Dec 19, 2024

Conversation

jiqing-feng
Contributor

@jiqing-feng commented Oct 16, 2024

Enable GPTQModel in optimum.

@jiqing-feng changed the title from "align gptq check to transformers for supporting cpu" to "Enable GPTQModel" on Nov 29, 2024
@Qubitium
Contributor

@SunMarc GPTQModel is intended to replace AutoGPTQ entirely, given the lack of progress in that repo for many reasons. For the sake of compatibility, the two can co-exist in parallel until this integration is merged and everything is stable and tested; later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed and/or maintained.

Signed-off-by: jiqing-feng <[email protected]>
Member

@SunMarc left a comment

Thanks for the clean PR! LGTM! Thanks for creating this lib! Can you check that the tests in optimum and in transformers pass as expected?

@Qubitium
Contributor

Qubitium commented Dec 2, 2024

@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng will merge in some of our changes so that both the inference and quantization tests pass. Please delay your review until then, since the changes are substantial relative to the current code/PR.

* need checkpoint_format

* default value of checkpoint_format is gptq

* fix quantize

* fix quantize

* fix quantize

* Update quantizer.py

* need convert to v1 before gptqmodel save

* back checkpoint_format to gptq after convert

* cleanup code

* sym=False is not supported with auto-gptq

* add comments

* cleanup code

* Update quantizer.py

* always convert v2 to v1 if checkpoint_format = "gptq"

* Update quantizer.py

---------

Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
* keep gptq_v2 if sym is false

* use hf_convert_gptq_v1_to_v2_format, hf_convert_gptq_v2_to_v1_format, and hf_gptqmodel_post_init

* no need check backend

* use device_map

* cleanup

* Update quantizer.py

* move import

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>
@jiqing-feng
Contributor Author

Hi @Qubitium. The gptqmodel tests have been integrated. I can run RUN_SLOW=1 pytest tests/gptq/test_quantization.py on a CPU-only device; please check the CUDA tests.

@SunMarc Do we need to change any test YAML file in .github or any Dockerfile? If yes, please let me know the file location. Thanks! BTW, tests marked with @slow will not be triggered in CI; they only run when we manually set RUN_SLOW=1. So I was wondering how HF ensures all the slow tests pass, since they are not run in CI.

@jiqing-feng
Contributor Author

The testing changes include:

  1. Refactor: CPU tests do not need @require_torch_gpu.
  2. GPTQ lib: @require_gptq means these tests can run with either gptqmodel or auto-gptq (a rough sketch of such a decorator follows this list).
  3. Default model: tests now run llama by default instead of bloom because it is more common; there are still opt and bloom tests in GPTQUtilsTest.
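The sketch below is illustrative only: the decorator name matches the one discussed above, but the import path and exact skip message are assumptions, not the actual test utilities added in this PR.

import unittest

# Assumed import path for the availability helpers referenced in this PR.
from optimum.utils.import_utils import is_auto_gptq_available, is_gptqmodel_available


def require_gptq(test_case):
    # Skip the decorated test unless at least one supported GPTQ backend is installed.
    if is_gptqmodel_available() or is_auto_gptq_available():
        return test_case
    return unittest.skip("test requires gptqmodel or auto-gptq")(test_case)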

jiqing-feng and others added 5 commits December 4, 2024 10:56
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
* add meta info

* cleanup

* cleanup

* The value of quantizer should be an array

* Update quantizer.py

* If is_auto_gptq_available() also writes "auto_gptq:version" to "quantizer"

* If is_auto_gptq_available() also writes "auto_gptq:version" to "quantizer"

* Update quantizer.py

* cleanup

* comment on meta

* hf_select_quant_linear pass checkpoint_format

* add todo fix

* move convert code to quantizer.save()

* Update quantizer.py

* Optimize hf_convert_gptq_v2_to_v1_format()

* Optimize hf_convert_gptq_v1_to_v2_format()

* fix GPTQTestCUDA

* hf_select_quant_linear() always set pack=True

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* GPTQQuantizer add backend

* lower checkpoint_format and backend

* cleanup

* move backend to bottom

* no need to check gptqmodel version for ipex support

* Update import_utils.py

* Update quantizer.py

* fix UnboundLocalError: cannot access local variable 'version' where it is not associated with a value

* make version var short

* Update import_utils.py

* fix unittest

* use assertLessEqual

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: LRL <[email protected]>
Comment on lines 125 to 132
checkpoint_format (`str`, *optional*, defaults to `gptq`):
    GPTQ weight format. `gptq` (v1) is supported by both gptqmodel and auto-gptq. `gptq_v2` is gptqmodel only.
meta (`Dict[str, any]`, *optional*):
    Properties, such as tooling:version, that do not directly contribute to quantization or quant inference are stored in meta.
    i.e. `meta.quantizer`: ["optimum:_version_", "gptqmodel:_version_"]
backend (`str`, *optional*):
    Controls which gptq kernel is used. Valid values for gptqmodel are `auto`, `auto_trainable` and more. For auto-gptq, the only
    valid values are None and `auto_trainable`. Ref gptqmodel backends: https://github.com/ModelCloud/GPTQModel/blob/main/gptqmodel/utils/backend.py
Contributor

@Qubitium commented Dec 5, 2024

@SunMarc This is the biggest change and the one I need to go into explicit detail on, since the reasons are non-obvious.

checkpoint_format was added to auto-gptq (main) but never released, and carried over to gptqmodel; I see it as a good addition, since the GPTQ method produces many kernels and each may use a specific weight/disk format. Existing gptqmodel checkpoint_formats are gptq, gptq_v2, marlin, ipex, and bitblas, with more coming this year.

meta was added by gptqmodel to store info-only properties that are not related to the loading and execution of the quantized model. Most importantly, it stores a meta.quantizer property (a list of quantizer:version entries) recording the tooling that produced the quant. This is extremely valuable for two reasons:

  1. [Good to have but not essential] Debugging and tracing bad quants back to the code/tools that generated them. Who made the (bad/good) quants? This is a tooling fingerprint, since there are multiple tools that can produce the gptq format; they are not equal, and this lets everyone trace a model back to its origins. In this PR, meta.quantizer is a size-2 array holding both the optimum version and the gptqmodel (or auto-gptq) version.

  2. [Requirement for GPTQModel + future bug-proofing] Backward compatibility and future bug-proofing. GPTQModel uses this to test for a zero-point fix made by @qwopqwop200 that affects all gptq (v1) disk formats created before the fix. Models made before the fix have a broken sym=False zero-point; models quantized after the fix can load sym=False. This is a safety check, since there are two versions of the gptq v1 format that are compatible with each other when sym=True, and only after the fix for sym=False.

backend [essential for GPTQModel and also good for auto-gptq]: The old auto-gptq method of selecting which kernel/quant_linear to use for which task/model is extremely cryptic and is controlled by 3 params (disable_exllama, disable_exllamav2, exllama_config) plus 1 code state called use_exllama. Frankly, this control scheme no longer makes logical sense and is borderline crazy. GPTQModel uses a single backend parameter to signal kernel selection.

The kernel selection comes down to logic split into two core paths: does the model require training, i.e. will it enter the peft path? If true, select the best kernel that can be trained on; if false, select the best kernel for quant/inference.

You can ignore all the switches, since the basic need behind the 3+1 variables in auto-gptq boils down to the above logic while also trying to give users the ability to choose a specific kernel. But I can safely say there are maybe 3 people in the world who can select the correct auto-gptq kernel without actually reading the entire code, and there are even more toggles beyond the 3+1 within auto-gptq.

Due to the above, GPTQModel will not accept or adapt to auto-gptq's kernel-selection craziness. One clean variable is all you need, with two primary/auto states (auto, auto_trainable) plus individual kernels you can explicitly request via this single backend param. This is the best and only way out of the mess of kernel selection. There are currently 8 kernels in gptqmodel with more coming this year, plus even more checkpoint_formats beyond gptq, gptq_v2, marlin, ipex, and bitblas. We are not going down the auto-gptq path of adding a 1970s telco-switchboard-style variable for each kernel that then needs to be AND/ORed to compute the kernel selection state.

But, for the sake of compatibility, this PR contains code that lets users pass only the auto-gptq control vars and converts them to gptqmodel's auto loading states (auto and auto_trainable), and conversely lets backend=auto_trainable be mapped back to auto-gptq kernel-selection control. A minimal sketch of that mapping is shown below.
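Purely as an illustration of the mapping described above (the helper, its flag names, and its placement are assumptions, not code from this PR):

def legacy_flags_to_backend(disable_exllama: bool, trainable: bool) -> str:
    # Training (peft) path: pick the best kernel that supports training.
    if trainable:
        return "auto_trainable"
    # Inference/quantization path: let gptqmodel pick the best kernel,
    # regardless of how the old exllama switches were set.
    return "auto"


print(legacy_flags_to_backend(disable_exllama=True, trainable=True))   # auto_trainable
print(legacy_flags_to_backend(disable_exllama=False, trainable=False)) # auto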

I apologize for the overly verbose message, but this is the meat of the PR as far as potential review friction goes, since I can totally see anyone looking at these changes giving blank stares without me explaining every single detail of each new param/state var.

Comment on lines +217 to +228
def select_quant_linear(self, device_map: Union[str, dict]):
    if is_gptqmodel_available():
        self.quant_linear = hf_select_quant_linear(
            bits=self.bits,
            group_size=self.group_size,
            desc_act=self.desc_act,
            sym=self.sym,
            checkpoint_format=self.checkpoint_format,
            meta=self.meta,
            device_map=device_map,
            backend=self.backend,
        )
Contributor

@SunMarc GPTQModel is exposing hf_-prefixed stable APIs to transformers/peft/optimum that will not change over time, so any calls into GPTQModel will be hf_-prefixed.

Here the quant-linear selection requires full knowledge of sym, checkpoint_format, meta, device_map, and backend before deciding on the correct quant_linear to use.

Comment on lines +250 to +259
meta = gptq_dict["meta"]
# store both optimum:version and gptq_lib:version into quantize_config.meta.quantizer
if meta.get("quantizer") is None:
    meta["quantizer"] = [f"optimum:{optimum_version}"]

if is_gptqmodel_available():
    meta["quantizer"].append(f"gptqmodel:{gptqmodel_version}")
elif is_auto_gptq_available():
    meta["quantizer"].append(f"auto_gptq:{autogptq_version}")

Contributor

@Qubitium commented Dec 5, 2024

@SunMarc This is where we store the tooling fingerprints: the optimum name + version plus the gptqmodel (or auto-gptq) name + version. The fingerprint also doubles as future bug-proofing, since quant-weight bugs can then be detected and auto-fixed by newer code.
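For illustration, the resulting quantization config might carry a meta entry like the following (the field layout is simplified and the version strings are placeholders):

quantize_config = {
    "bits": 4,
    "group_size": 128,
    "checkpoint_format": "gptq",
    "meta": {
        # tooling fingerprint: which libraries (and versions) produced this quant
        "quantizer": ["optimum:x.y.z", "gptqmodel:x.y.z"],
    },
}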

Comment on lines 721 to 722
if is_gptqmodel_available():
    model, _ = hf_convert_gptq_v1_to_v2_format(model, self.bits, self.quant_linear, self.checkpoint_format, self.meta)
Contributor

@Qubitium commented Dec 5, 2024

@SunMarc GPTQModel uses v2 as the internal format for most kernels, except IPEX; this method automatically skips the conversion for IPEX. It returns the converted model plus a true/false flag indicating whether a conversion happened.

On model save to the gptq format we do the reverse, v2 to v1. It is very fast and minimal relative to the slow quantization phase.
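A minimal sketch of that save path, assuming the hf_convert_gptq_v2_to_v1_format helper from the commit log takes arguments similar to its v1-to-v2 counterpart and returns the converted model (this is not the PR's exact code):

def save(self, model, save_dir: str):
    # Convert the in-memory v2 weights back to the v1 disk format so the
    # checkpoint stays loadable by both gptqmodel and auto-gptq.
    if is_gptqmodel_available() and self.checkpoint_format == "gptq":
        model = hf_convert_gptq_v2_to_v1_format(
            model, self.bits, self.quant_linear, self.checkpoint_format, self.meta
        )
    model.save_pretrained(save_dir)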

Comment on lines +143 to +159
    v = version.parse(importlib_metadata.version("auto_gptq"))
    if v >= AUTOGPTQ_MINIMUM_VERSION:
        return True
    else:
        raise ImportError(
            f"Found an incompatible version of auto-gptq. Found version {v}, but only version >= {AUTOGPTQ_MINIMUM_VERSION} are supported"
        )


def is_gptqmodel_available():
    if _gptqmodel_available:
        v = version.parse(importlib_metadata.version("gptqmodel"))
        if v >= GPTQMODEL_MINIMUM_VERSION:
            return True
        else:
            raise ImportError(
                f"Found an incompatible version of gptqmodel. Found version {v}, but only version >= {GPTQMODEL_MINIMUM_VERSION} are supported"
Contributor

The variable names were just too verbose; v is enough to convey the message in such a short code context.

@IlyasMoutawwakil
Member

Since the GPTQModel tests will not be running on the CI to verify them, let's revert the modifications to GPTQ testing (auto-gptq + CUDA only) so we don't miss something that might be broken.

@jiqing-feng
Contributor Author

Since the GPTQModel tests will not be running on the CI to verify them, let's revert the modifications to GPTQ testing (auto-gptq + CUDA only) so we don't miss something that might be broken.

Done, please re-run the CI and do a second round of review. Thanks!

Comment on lines 93 to 95
checkpoint_format: str = "gptq",
meta: Optional[Dict[str, any]] = None,
backend: Optional[str] = None,
Member

These should probably be moved down so that code that relies on the order of args won't break.

Contributor Author

done

@@ -450,6 +564,8 @@ def store_input_hook(_, input, *args):
raise ValueError(f"Module {module_name} was not found in model")

torch.cuda.empty_cache()
if hasattr(torch, "xpu"):
Member

Not sure, but I have seen it multiple times: don't we also have to check torch.xpu.is_available()?

Contributor Author

done
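For reference, the guarded, device-agnostic cache clearing discussed here likely ends up looking something like the sketch below (assuming a torch build that exposes torch.xpu, e.g. via IPEX or a recent PyTorch release; this is not the exact diff):

import torch

torch.cuda.empty_cache()
# Only touch the XPU cache when that backend is actually present and available.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    torch.xpu.empty_cache()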

@jiqing-feng
Contributor Author

Hi @IlyasMoutawwakil, could you check the failed GPTQ tests? It looks like a torch version error; I can pass the tests locally with torch 2.5. Besides, the failures should already exist in the original repo; I see no gptq tests being triggered on previous optimum commits. Could you please help check this?

@IlyasMoutawwakil
Member

Hello, the error is RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' as seen in https://github.com/huggingface/optimum/actions/runs/12373472857/job/34533937633?pr=2064#step:4:11410

This error doesn't happen on main; it happens only on this branch, because a layer on CPU is trying to process 16-bit inputs.

pytorch/pytorch#96292


Scheduled CI from 14 hours ago on main runs successfully.

Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
Signed-off-by: jiqing-feng <[email protected]>
@jiqing-feng
Contributor Author

jiqing-feng commented Dec 18, 2024

Hi @IlyasMoutawwakil. The main branch of optimum/gptq cannot run gptqmodel on a CPU device because it hard-codes moving the model to cuda. In my changes I kept the original model's device, so the model can run on CPU if you set device_map="cpu". I have sent the detailed information on Slack, please check it. Thanks.

The point is that pytorch 2.2 does not support the CPU fp16 layer norm op, while pytorch 2.5 does support it but conflicts with the gptq exllama tests.

I am afraid we need to skip CPU tests, as the previous optimum never actually ran on CPU because of the hard-coded device.

Another way is to use an fp32 model on CPU so it can pass the tests.
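A minimal sketch of that workaround, assuming a standard Hugging Face from_pretrained call (the model id is a placeholder, not one used in the tests):

import torch
from transformers import AutoModelForCausalLM

device = "cpu"  # or "cuda"
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-test-model",  # placeholder id
    # fp16 LayerNorm is not implemented on CPU in older torch releases,
    # so fall back to fp32 on CPU and keep fp16 on GPU.
    torch_dtype=torch.float32 if device == "cpu" else torch.float16,
    device_map=device,
)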

We can book a meeting to align on this if I didn't make it clear; please let me know your time slot. Thanks!

@jiqing-feng
Contributor Author

Hi @IlyasMoutawwakil. I now use an fp32 model for the CPU tests and all tests pass; please trigger the CI. Thanks!

@jiqing-feng
Contributor Author

jiqing-feng commented Dec 18, 2024

Hi @IlyasMoutawwakil. I have fixed the tests according to your instructions; please re-run the CI, thanks!

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Dec 18, 2024

I am afraid we need to skip cpu tests.

There are no gptq+cpu tests in optimum...

The tests that failed on previous commits here are CUDA tests, which no longer worked because of the earlier code change.
The original code puts everything on cuda:0 layer by layer (even when the model or a layer is on CPU) because auto-gptq only supported GPU; to let users quantize big models even on a Google Colab, the idea is to move and process layers one at a time on the GPU, reducing the memory requirements for quantizing a model.

It's not CPU-only friendly, yes, but we can think about how to make it so in another PR that adds CPU tests as well; breaking compatibility with pytorch<2.5 is not an optimal solution IMO.
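A rough sketch of that layer-by-layer scheme, purely to illustrate the memory-saving idea (the function and its arguments are invented for this illustration, not optimum's actual implementation):

import torch

def quantize_layer_by_layer(blocks, quantize_block, device="cuda:0"):
    for block in blocks:
        original_device = next(block.parameters()).device
        block.to(device)            # move only this block onto the GPU
        quantize_block(block)       # run GPTQ on the block while it sits on the GPU
        block.to(original_device)   # move it back, freeing GPU memory for the next block
        torch.cuda.empty_cache()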

@jiqing-feng
Contributor Author

I am afraid we need to skip cpu tests.

There are no gptq+cpu tests in optimum...

The tests that failed on previous commits here are CUDA tests, which no longer worked because of the earlier code change. The original code puts everything on cuda:0 layer by layer (even when the model or a layer is on CPU) because auto-gptq only supported GPU; to let users quantize big models even on a Google Colab, the idea is to move and process layers one at a time on the GPU, reducing the memory requirements for quantizing a model.

It's not CPU-only friendly, yes, but we can think about how to make it so in another PR that adds CPU tests as well; breaking compatibility with pytorch<2.5 is not an optimal solution IMO.

I get your point; it makes sense. Yes, we will look at how to integrate CPU-only device tests in the next PR, since gptqmodel now supports CPU-only.

@jiqing-feng
Contributor Author

Hi @IlyasMoutawwakil. Please check whether any changes are required before merging, thanks!

@Qubitium
Contributor

Qubitium commented Dec 19, 2024

The tests that failed on previous commits here are CUDA tests, which no longer worked because of the earlier code change. The original code puts everything on cuda:0 layer by layer (even when the model or a layer is on CPU) because auto-gptq only supported GPU; to let users quantize big models even on a Google Colab, the idea is to move and process layers one at a time on the GPU, reducing the memory requirements for quantizing a model.

@IlyasMoutawwakil, I have already had a conversation with @jiqing-feng about this issue, and we will need to address it in a new PR (post-merge), as it requires even more internal changes that are well out of scope for this relatively simple first round of gptqmodel support.

For the next round of PRs:

  • Address the issue of device as it relates to quantization. The device/device_map passed to AutoModel is misleading, since both auto-gptq and gptqmodel actually load the model on CPU and switch to GPU per layer during quantization. With the new IPEX kernel, gptqmodel can do everything on CPU. We need a concept of a quantization device that is separate from the weight-loading device. This was a huge pain point for us, as no such distinction exists in the current params, but fixing it in this PR would cause unreasonable bloat and review burden. For quantization, the model should always load on CPU, with a separate device targeting the quant stage.
  • Optimum assumes the same quantization for the entire model. GPTQModel already has dynamic, per-module fine-grained control for quantization and inference, and we need to enable this control in optimum. The selection of QuantLinear needs to happen per module, not per model. dynamic also allows entire layers to be skipped for quantization, which the current logic doesn't allow (see the sketch after this list).
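The sketch below only illustrates the idea of per-module control; the exact schema GPTQModel uses for dynamic may differ, so treat the keys and values as assumptions.

dynamic = {
    # modules whose names match this pattern get different quantization settings
    r".*\.mlp\..*": {"bits": 8, "group_size": 64},
    # modules matching this pattern are left unquantized entirely
    r".*lm_head.*": False,
}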

Our plan is to integrate GPTQModel as simply as possible, with the least friction, first. Then we will introduce more advanced features such as dynamic control and fix the legacy issues with loading vs. quantization device, both of which require more internal changes.

Edit: There are actually 2 kernels in GPTQModel that support CPU inference/quantization: Torch and IPEX, where IPEX has acceleration via native AVX/AMX ops.

jiqing-feng and others added 2 commits December 19, 2024 16:53
Co-authored-by: Ilyas Moutawwakil <[email protected]>
Co-authored-by: Ilyas Moutawwakil <[email protected]>
@jiqing-feng
Contributor Author

Hi @IlyasMoutawwakil. Please re-run the CI if no more changes are required. Thanks!

@IlyasMoutawwakil
Member

I think we can go ahead with merging these changes for now. @jiqing-feng, do the tests pass locally when gptqmodel is installed?

@jiqing-feng
Contributor Author

I think we can go ahead with merging these changes for now. @jiqing-feng, do the tests pass locally when gptqmodel is installed?

Yes! Both CUDA and CPU pass all gptq tests.

@IlyasMoutawwakil merged commit 21de42f into huggingface:main on Dec 19, 2024
22 of 29 checks passed