Add KV-Cache int8 quant support #10354

YanyunDuanIEI · 2024-11-15T06:00:52Z

Add KV-Cache int8 quant support

Support [layer_level] and [group_level] KV-Cache int8 quant.

[layer_level] uses common scaling factors for each layer.
[group_level] splits head_size into groups of group_size; within each group, the key/value elements share the same scaling factor.
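
The difference between the two granularities can be sketched in plain Python. This is only an illustration, not the PR's CUDA kernels: it assumes symmetric quantization with the scale chosen as absmax / 127, which the description above does not state, and the function names `quantize`, `layer_level`, and `group_level` are hypothetical.

```python
from typing import List

INT8_MAX = 127  # assumption: symmetric int8 range [-127, 127]


def quantize(values: List[float], scale: float) -> List[int]:
    """Symmetric int8 quantization: q = clamp(round(x / scale), -127, 127)."""
    return [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Inverse mapping: x is approximately q * scale."""
    return [x * scale for x in q]


def layer_level(head: List[float], scale: float) -> List[int]:
    """Every element of the head vector uses the single per-layer scale."""
    return quantize(head, scale)


def group_level(head: List[float], scales: List[float], group_size: int) -> List[int]:
    """head_size is split into chunks of group_size; chunk g uses scales[g]."""
    if len(head) % group_size != 0:
        raise ValueError("head_size must be divisible by group_size")
    if len(scales) != len(head) // group_size:
        raise ValueError("expected one scale per group")
    out: List[int] = []
    for g, scale in enumerate(scales):
        out.extend(quantize(head[g * group_size:(g + 1) * group_size], scale))
    return out


if __name__ == "__main__":
    head = [0.1, -0.3, 4.0, -8.0]  # toy head vector, head_size = 4
    print(layer_level(head, 8.0 / 127))
    print(group_level(head, [0.3 / 127, 8.0 / 127], group_size=2))
```

With a single per-layer scale, the small values in the first half of `head` are rounded coarsely because the scale is dominated by the largest element; per-group scales keep more precision there, at the cost of storing one scale per group.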

KV-Cache int8 quant

Get the scaling factor by calibration

The KV-cache can be calibrated on a dataset:

[examples/int8/calibrate.py] runs the calibration and saves the result to a .pth file.
[export_kv_params.py] saves the scaling factors to a JSON file.
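
As a rough sketch of what such a calibration pass computes, the snippet below tracks the largest key/value magnitude seen per layer and converts it to an int8 scale. It is not the code of the two scripts above: absmax calibration, the helper names, and the JSON layout are all assumptions made for illustration (only the `k_scale`/`v_scale` names are borrowed from a commit message in this PR).

```python
import json
from typing import Dict, Iterable, List

INT8_MAX = 127  # assumption: symmetric int8 range


def absmax_scale(batches: Iterable[List[float]]) -> float:
    """Return the scale that maps the largest observed magnitude to 127."""
    peak = 0.0
    for batch in batches:
        for v in batch:
            peak = max(peak, abs(v))
    # Guard against an all-zero calibration set.
    return peak / INT8_MAX if peak > 0 else 1.0


def calibrate(
    layers: Dict[int, Dict[str, List[List[float]]]]
) -> Dict[str, Dict[str, float]]:
    """layers[idx] = {"k": [batch, ...], "v": [batch, ...]} of observed values."""
    return {
        str(idx): {
            "k_scale": absmax_scale(samples["k"]),
            "v_scale": absmax_scale(samples["v"]),
        }
        for idx, samples in layers.items()
    }


def save_scales(scales: Dict[str, Dict[str, float]], path: str) -> None:
    """Write the scaling factors to JSON (hypothetical layout)."""
    with open(path, "w") as f:
        json.dump(scales, f, indent=2)


if __name__ == "__main__":
    observed = {0: {"k": [[0.5, -2.54]], "v": [[1.27, -0.1]]}}
    save_scales(calibrate(observed), "kv_quant_params.json")
```

A real calibration run would collect these statistics from the model's key and value activations over the dataset; the toy lists here only stand in for that data.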

Using KV-Cache int8

kv_cache_dtype="int8"
kv_quant_params_path=kv_quant_params_path
kv_quant_group=kv_quant_group

Signed-off-by: Yanyun Duan <[email protected]>

github-actions · 2024-11-15T06:01:04Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2024-11-17T02:04:05Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @YanyunDuanIEI.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

YanyunDuanIEI · 2024-12-11T05:59:15Z

Would it be viable to hasten the review process?

int8 kv-cache support

b8e7779

Signed-off-by: Yanyun Duan <[email protected]>

YanyunDuanIEI requested review from WoosukKwon, zhuohan123, youkaichao, alexm-neuralmagic, comaniac and njhill as code owners November 15, 2024 06:00

mergify bot added the needs-rebase label Nov 17, 2024

Merge branch 'main' into int8-kv-cache

6309f86

YanyunDuanIEI requested review from robertgshaw2-neuralmagic and ywang96 as code owners November 18, 2024 01:59

mergify bot removed the needs-rebase label Nov 18, 2024

mgoin reviewed Nov 19, 2024

csrc/attention/attention_kernels.cuh (outdated, resolved)

csrc/attention/attention_kernels.cuh (outdated, resolved)

YanyunDuanIEI added 13 commits November 20, 2024 10:58

Update dtype_fp8.cuh

3b3d784

Update quant_utils.cuh

f09c559

Update quant_utils.cuh

03476e5

Update cache_kernels.cu

9a426c0

Update attention_kernels.cuh

2a67ed4

Update paged_attention_v1.cu

83b3d41

Update paged_attention_v2.cu

1056a5f

Update layer.py

e870565

Update selector.py

95effce

Update config.py

4e8ddb6

Update utils.py

cf98d49

Update model_runner.py

c5da7f5

merge the k_scale/v_scale and k_scaling_factor/v_scaling_factor into tensors

36afe99
