BUG: Mixed-precision configuration not working with STATIC quantization #163

Open
sasha-hailo opened this issue Oct 27, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@sasha-hailo

Dear LLMC team,
I've been trying to run mixed-precision PTQ using RTN.
I suspect there's a bug, as the non-default settings in mix_bits are ignored.

My understanding of the code:

  • In the get_act_qparams() method of rtn.py, the values of qmax / qmin / scales / zeros are determined using the default quantizer bit precision.
  • These values are registered as buf_act_<xxx> buffers for all modules / layers.
  • At inference time, in the a_qdq() method of rtn.py, even though each layer's aquantizer object is configured correctly, it blindly loads the registered qmin / qmax / scales / zeros from the buffers and uses them, instead of values that match its own bit width (see the sketch below).
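
To illustrate what I mean, here is a minimal, self-contained sketch of the suspected flow (the names are purely illustrative, not LLMC's actual classes): calibration computes scale / zero-point with a single default 8-bit quantizer and buffers them, and at inference a layer configured for a different bit width still consumes those buffered parameters, so its mixed-precision setting has no effect.

```python
# Illustrative toy example, NOT LLMC code: static qparams computed once with a
# default 8-bit quantizer are reused by a layer configured for 16 bits.
import torch

class ToyQuantizer:
    def __init__(self, bits):
        self.qmin, self.qmax = 0, 2 ** bits - 1

    def get_qparams(self, x):
        # Asymmetric min/max calibration.
        scale = (x.max() - x.min()) / (self.qmax - self.qmin)
        zero = self.qmin - torch.round(x.min() / scale)
        return scale, zero

    def qdq(self, x, scale, zero):
        q = torch.clamp(torch.round(x / scale) + zero, self.qmin, self.qmax)
        return (q - zero) * scale

calib_act = torch.randn(4, 8)

# Calibration stage: a single default 8-bit quantizer produces the buffered qparams.
buf_scale, buf_zero = ToyQuantizer(bits=8).get_qparams(calib_act)

# Inference stage: this layer is configured for 16 bits via mix_bits...
layer_quantizer = ToyQuantizer(bits=16)
# ...but it consumes the buffered 8-bit scale/zero, so the result is still
# effectively 8-bit: the mixed-precision setting changes nothing.
out = layer_quantizer.qdq(calib_act, buf_scale, buf_zero)
```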

What do you think?
Thanks in advance!

@Harahan
Collaborator

Harahan commented Nov 1, 2024

There's no get_act_qparams() in rtn.py. You can print the bit-width of each linear to check the code.
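
For example, such a check might look like the sketch below (the attribute names aquantizer / wquantizer / bit are assumptions; adjust them to the actual quantized-linear classes in LLMC):

```python
# Hedged sketch: walk the model and print any quantizer bit-widths it finds,
# to verify whether the mix_bits configuration actually took effect.
# Attribute names ("aquantizer", "wquantizer", "bit") are assumptions.
def print_bit_widths(model):
    for name, module in model.named_modules():
        for attr in ("aquantizer", "wquantizer"):
            quantizer = getattr(module, attr, None)
            if quantizer is not None and hasattr(quantizer, "bit"):
                print(f"{name}.{attr}: {quantizer.bit} bits")
```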

PS: This part of the code hasn't been updated for a long time. If you confirm there's a bug, please feel free to contact me anytime.

@Harahan Harahan closed this as completed Nov 1, 2024
@sasha-hailo
Author

sasha-hailo commented Nov 4, 2024

Hi @Harahan,
Thank you for your response.
It turns out that a lot of changes have been made since my issue report (in this commit).
The functionality I was referring to as get_act_qparams() now resides in register_act_qparams() in file base_blockwise_quantization.py.

The bug, unfortunately, persists.

The "mechanism" is the same: function register_act_qparams() uses a single quantizer object (self.aquantizer) to determine the quantization parameters of all layers - and this quantizer is configured with the default settings. It determines the scale & zero point settings (w.r.t. incorrect bit width), and registers them via buf_act_scales / buf_act_zeros.

Note that the correct per-layer quantization configurations are loaded when the deploy() function runs, but they have no effect, since they rely on the incorrect scale and zero-point values determined in the previous stage.

To sum up: I think the core issue behind the [suspected] bug is that the calibration stage and register_act_qparams() are unaware of the configured mixed precision and work with the default quantization config.
This probably works fine for dynamic quantization, but not in a static quantization scenario.
I also suspect that the same issue can happen with other quantization methods; a sketch of the fix direction I have in mind follows.
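
Purely as an illustration of that direction (hypothetical names, not LLMC's actual API): during calibration, register the static activation qparams with each layer's own quantizer, which already reflects its mix_bits setting, rather than with a shared default-precision one.

```python
# Hypothetical sketch of the fix direction, not LLMC code: per-layer
# quantizers (reflecting mix_bits) compute their own static qparams.
def register_act_qparams_per_layer(layers, calib_acts):
    """layers: dict name -> module with an `aquantizer`;
    calib_acts: dict name -> calibration activations for that layer."""
    for name, layer in layers.items():
        # Use this layer's own quantizer, not a shared default one.
        scale, zero = layer.aquantizer.get_qparams(calib_acts[name])
        layer.register_buffer("buf_act_scales", scale)
        layer.register_buffer("buf_act_zeros", zero)
```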

Can you please look into it?
Thanks in advance!

@sasha-hailo
Author

P.S.
An unrelated question:
I also noticed that the commit I mentioned above added some limited support for additional quantization granularities, via the functions get_matmul_in_block(), get_softmax_in_block(), and get_act_fn_in_block().
Do you plan to extend this support to the more common LLM models like Qwen and Llama?
(That would be really cool.)

@Harahan
Collaborator

Harahan commented Nov 4, 2024

It depends on whether we encounter such a need or whether it turns out to be useful in our research. So, not sure.

@sasha-hailo
Author

Did you succeed in reproducing the mix_bits problem I reported?
I believe the issue should be reopened as a bug...

@Harahan
Collaborator

Harahan commented Nov 5, 2024

I'm sorry, but we do not have enough time to do this. If you are sure there's a bug, post the log/evidence and reopen the issue.

@sasha-hailo
Author

LLMC_RTN_W8A8_MixedA16_Bug.txt
LLMC_RTN_W8A8.txt

I'm pretty sure this is a bug.
And I now suspect that the issue affects not only RTN, but nearly any method based on static quantization.
Can you please reopen the issue? I don't think I have the permissions for this.

Please find attached two LLMC logs with an RTN configuration: one without mix_bits, the other with mix_bits.
If you compare the two files, you can see that:

  • The outputs of both runs are identical (same PPL score), hinting that the mix_bits configuration had no effect.
  • The mix_bits configuration of the deployed model is correct (see around line 2458 in the log), so the bug is not at the deployment stage but at the calibration stage (see my explanation in earlier messages).

@sasha-hailo sasha-hailo changed the title from "Mixed-precision configuration not working with RTN?" to "BUG: Mixed-precision configuration not working with STATIC quantization" on Nov 5, 2024
@Harahan Harahan reopened this Nov 7, 2024
@Harahan
Collaborator

Harahan commented Nov 7, 2024

I've reopened the issue. Since we currently don't have a requirement for static quantization, the bug may not be fixed for a long time. You'd best try other settings.

@Harahan Harahan added the bug Something isn't working label Nov 7, 2024