Using QServe with TensorRT-LLM raises an error #31

Open
2 of 4 tasks
anaivebird opened this issue Nov 27, 2024 · 1 comment

@anaivebird
System Info

  • GPU: NVIDIA H100 80GB
  • TensorRT-LLM branch: main
  • TensorRT-LLM commit: 535c9cc6730f5ac999e4b1cb621402b58138f819

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# Download the base model
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir ./llama2-7b

# Set up deepcompressor
git clone https://github.com/mit-han-lab/deepcompressor
cd /root/deepcompressor
conda env create -f environment.yml
poetry install

# Quantize with the QoQ per-group recipe (group size 128)
python -m deepcompressor.app.llm.ptq \
    examples/llm/configs/qoq-g128.yaml \
    --model-name llama-2-7b --model-path /root/llama2-7b \
    --smooth-proj-alpha 0 --smooth-proj-beta 1 \
    --smooth-attn-alpha 0.5 --smooth-attn-beta 0 \
    --save-model /root/quantized-llama2-7b

# Convert the quantized checkpoint to a TensorRT-LLM checkpoint
# (QServe W4A8 with per-group scales, tensor parallelism 1)
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b \
                             --dtype float16 \
                             --quant_ckpt_path /root/quantized-llama2-7b \
                             --use_qserve \
                             --per_group \
                             --tp_size 1
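
As a sanity check before conversion, it can help to dump the scale and zero-point tensors stored in the quantized checkpoint. The sketch below is illustrative only: it assumes deepcompressor saved a plain torch state dict, and the file name model.pt is a guess (adjust to whatever the ptq step actually wrote):

import torch

# Hypothetical inspection script: "model.pt" is an assumed file name;
# point it at the file deepcompressor wrote under /root/quantized-llama2-7b.
ckpt = torch.load("/root/quantized-llama2-7b/model.pt", map_location="cpu")
for name, tensor in ckpt.items():
    if "scale" in name or "zero" in name:
        print(name, tuple(tensor.shape), tensor.dtype,
              float(tensor.min()), float(tensor.max()))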

Expected behavior

No error: convert_checkpoint.py should complete and write the TensorRT-LLM checkpoint to /root/trtllm-llama2-7b.

Actual behavior


user@/app/tensorrt_llm/examples/llama$ export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quantized-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
0.16.0.dev2024111900
[11/27/2024-11:19:05] [TRT-LLM] [I] Loading weights from lmquant torch checkpoint for QServe W4A8 inference...
[11/27/2024-11:19:12] [TRT-LLM] [I] Processing weights in layer: 0
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 416, in from_hugging_face
    weights = load_weights_from_lmquant(quant_ckpt_path, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2086, in load_weights_from_lmquant
    process_weight_and_params(qkv, f'{tllm_prex}.attention.qkv'))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2015, in process_weight_and_params
    qweight = qserve_quantize_weight_per_group(weight, s1_scales,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 328, in qserve_quantize_weight_per_group
    linear_weight.max() <= 15), "Stage 2: Quantized weight out of range"
AssertionError: Stage 2: Quantized weight out of range
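
For context, the assertion fires inside qserve_quantize_weight_per_group, which re-quantizes the FP16 weights using the checkpoint's two levels of scales. The following is a minimal sketch of that two-stage scheme, an assumption-laden reconstruction rather than TensorRT-LLM's actual code:

import torch

# Sketch (reconstruction, NOT TensorRT-LLM's code) of QServe W4A8
# per-group quantization: stage 1 scales each output channel onto the
# INT8 grid; stage 2 re-quantizes each group of 128 INT8 values to
# unsigned INT4 with a per-group scale and zero point.
def quantize_two_stage(w, s1, s2_scale, s2_zero, group_size=128):
    w_int8 = torch.round(w / s1)                        # s1: [out_ch, 1]
    w_grouped = w_int8.reshape(w.shape[0], -1, group_size)
    q = torch.round(w_grouped / s2_scale) + s2_zero     # s2_*: [out_ch, n_groups, 1]
    # The converter's failing check: UINT4 values must lie in [0, 15].
    assert q.min() >= 0 and q.max() <= 15, \
        "Stage 2: Quantized weight out of range"
    return q.to(torch.uint8)

If the scales or zero points in the checkpoint follow a different convention than the converter expects (for example, because the deepcompressor output format changed between versions), the stage-2 values fall outside [0, 15] and this assertion trips.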

Additional notes

None.

@bobboli (Contributor) commented Nov 27, 2024

Please refer to NVIDIA/TensorRT-LLM#2507 (comment)
