Using QServe with TensorRT-LLM raises an error #31

Open
2 of 4 tasks
anaivebird opened this issue Nov 27, 2024 · 1 comment

@anaivebird
System Info

  • GPU: NVIDIA H100 80GB
  • TensorRT-LLM branch: main
  • TensorRT-LLM commit: 535c9cc6730f5ac999e4b1cb621402b58138f819

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# Download the base model
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir ./llama2-7b

# Set up deepcompressor
git clone https://github.com/mit-han-lab/deepcompressor
cd /root/deepcompressor
conda env create -f environment.yml
poetry install

# Quantize with the QoQ per-group recipe (group size 128)
python -m deepcompressor.app.llm.ptq \
    examples/llm/configs/qoq-g128.yaml \
    --model-name llama-2-7b --model-path /root/llama2-7b \
    --smooth-proj-alpha 0 --smooth-proj-beta 1 \
    --smooth-attn-alpha 0.5 --smooth-attn-beta 0 \
    --save-model /root/quantized-llama2-7b

# Convert the quantized checkpoint to a TensorRT-LLM checkpoint
# (QServe W4A8 with per-group scales, tensor parallelism 1)
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b \
                             --dtype float16 \
                             --quant_ckpt_path /root/quantized-llama2-7b \
                             --use_qserve \
                             --per_group \
                             --tp_size 1
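
As a sanity check before conversion, it can help to dump the scale and zero-point tensors stored in the quantized checkpoint. The sketch below is illustrative only: it assumes deepcompressor saved a plain torch state dict, and the file name model.pt is a guess (adjust to whatever the ptq step actually wrote):

import torch

# Hypothetical inspection script: "model.pt" is an assumed file name;
# point it at the file deepcompressor wrote under /root/quantized-llama2-7b.
ckpt = torch.load("/root/quantized-llama2-7b/model.pt", map_location="cpu")
for name, tensor in ckpt.items():
    if "scale" in name or "zero" in name:
        print(name, tuple(tensor.shape), tensor.dtype,
              float(tensor.min()), float(tensor.max()))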

Expected behavior

No error: convert_checkpoint.py should complete and write the TensorRT-LLM checkpoint to /root/trtllm-llama2-7b.

Actual behavior


user@/app/tensorrt_llm/examples/llama$ export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quantized-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
0.16.0.dev2024111900
[11/27/2024-11:19:05] [TRT-LLM] [I] Loading weights from lmquant torch checkpoint for QServe W4A8 inference...
[11/27/2024-11:19:12] [TRT-LLM] [I] Processing weights in layer: 0
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 416, in from_hugging_face
    weights = load_weights_from_lmquant(quant_ckpt_path, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2086, in load_weights_from_lmquant
    process_weight_and_params(qkv, f'{tllm_prex}.attention.qkv'))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2015, in process_weight_and_params
    qweight = qserve_quantize_weight_per_group(weight, s1_scales,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 328, in qserve_quantize_weight_per_group
    linear_weight.max() <= 15), "Stage 2: Quantized weight out of range"
AssertionError: Stage 2: Quantized weight out of range
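
For context, the assertion fires inside qserve_quantize_weight_per_group, which re-quantizes the FP16 weights using the checkpoint's two levels of scales. The following is a minimal sketch of that two-stage scheme, an assumption-laden reconstruction rather than TensorRT-LLM's actual code:

import torch

# Sketch (reconstruction, NOT TensorRT-LLM's code) of QServe W4A8
# per-group quantization: stage 1 scales each output channel onto the
# INT8 grid; stage 2 re-quantizes each group of 128 INT8 values to
# unsigned INT4 with a per-group scale and zero point.
def quantize_two_stage(w, s1, s2_scale, s2_zero, group_size=128):
    w_int8 = torch.round(w / s1)                        # s1: [out_ch, 1]
    w_grouped = w_int8.reshape(w.shape[0], -1, group_size)
    q = torch.round(w_grouped / s2_scale) + s2_zero     # s2_*: [out_ch, n_groups, 1]
    # The converter's failing check: UINT4 values must lie in [0, 15].
    assert q.min() >= 0 and q.max() <= 15, \
        "Stage 2: Quantized weight out of range"
    return q.to(torch.uint8)

If the scales or zero points in the checkpoint follow a different convention than the converter expects (for example, because the deepcompressor output format changed between versions), the stage-2 values fall outside [0, 15] and this assertion trips.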

Additional notes

None.

@bobboli (Contributor) commented Nov 27, 2024

Please refer to NVIDIA/TensorRT-LLM#2507 (comment)
