System Info

TensorRT-LLM version: 0.16.0.dev2024111900 (taken from the conversion log under "actual behavior" below)

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
```bash
# Download the base model and set up deepcompressor
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir ./llama2-7b
git clone https://github.com/mit-han-lab/deepcompressor
cd /root/deepcompressor
conda env create -f environment.yml
poetry install

# Quantize Llama-2-7B with the QoQ g128 config
python -m deepcompressor.app.llm.ptq \
    examples/llm/configs/qoq-g128.yaml \
    --model-name llama-2-7b --model-path /root/llama2-7b \
    --smooth-proj-alpha 0 --smooth-proj-beta 1 \
    --smooth-attn-alpha 0.5 --smooth-attn-beta 0 \
    --save-model /root/quantized-llama2-7b

# Convert the quantized checkpoint to TensorRT-LLM format (QServe, per-group)
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
    --output_dir /root/trtllm-llama2-7b \
    --dtype float16 \
    --quant_ckpt_path /root/quantized-llama2-7b \
    --use_qserve \
    --per_group \
    --tp_size 1
```
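For context on the `--smooth-*` flags: deepcompressor's QoQ pipeline applies SmoothQuant-style smoothing, where a per-channel factor on the order of s_j = max|X_j|^alpha / max|W_j|^beta migrates quantization difficulty between activations X and weights W. A minimal sketch of that idea, assuming this is the formula the flags parameterize (the function and variable names below are illustrative, not deepcompressor's API):

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, wgt_absmax: torch.Tensor,
                  alpha: float, beta: float, eps: float = 1e-5) -> torch.Tensor:
    """Per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^beta.

    Activations are divided by s and weights are multiplied by s, so the
    product X @ W is unchanged while outlier magnitudes are rebalanced.
    """
    return (act_absmax.clamp(min=eps).pow(alpha)
            / wgt_absmax.clamp(min=eps).pow(beta)).clamp(min=eps)

# With alpha=0, beta=1 (the --smooth-proj setting above) the factor depends
# only on the weight range; with alpha=0.5, beta=0 (the --smooth-attn
# setting) it depends only on the activation range.
act_absmax = torch.tensor([10.0, 0.5, 3.0])
wgt_absmax = torch.tensor([0.2, 1.0, 0.8])
print(smooth_scales(act_absmax, wgt_absmax, alpha=0.0, beta=1.0))
print(smooth_scales(act_absmax, wgt_absmax, alpha=0.5, beta=0.0))
```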
Expected behavior

The checkpoint conversion completes with no error.
actual behavior

```text
user@/app/tensorrt_llm/examples/llama$ export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
    --output_dir /root/trtllm-llama2-7b \
    --dtype float16 \
    --quant_ckpt_path /root/quantized-llama2-7b \
    --use_qserve \
    --per_group \
    --tp_size 1
[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
0.16.0.dev2024111900
[11/27/2024-11:19:05] [TRT-LLM] [I] Loading weights from lmquant torch checkpoint for QServe W4A8 inference...
[11/27/2024-11:19:12] [TRT-LLM] [I] Processing weights in layer: 0
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 416, in from_hugging_face
    weights = load_weights_from_lmquant(quant_ckpt_path, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2086, in load_weights_from_lmquant
    process_weight_and_params(qkv, f'{tllm_prex}.attention.qkv'))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 2015, in process_weight_and_params
    qweight = qserve_quantize_weight_per_group(weight, s1_scales,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 328, in qserve_quantize_weight_per_group
    linear_weight.max() <= 15), "Stage 2: Quantized weight out of range"
AssertionError: Stage 2: Quantized weight out of range
```
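For context on where this trips: the assertion in `qserve_quantize_weight_per_group` checks that the stage-2 (per-group) integer weights fit the unsigned 4-bit range [0, 15]. A rough sketch of that kind of two-stage range check, with hypothetical names and shapes (not TensorRT-LLM's actual implementation):

```python
import torch

def quantize_w4_per_group(weight: torch.Tensor,
                          s1: torch.Tensor,     # (out_ch, 1) per-channel scales
                          s2: torch.Tensor,     # (out_ch, n_groups) per-group scales
                          zeros: torch.Tensor,  # (out_ch, n_groups) zero points
                          group_size: int = 128) -> torch.Tensor:
    out_ch, in_ch = weight.shape
    w = weight / s1                             # stage 1: per-channel scaling
    w = w.view(out_ch, in_ch // group_size, group_size)
    q = torch.round(w / s2.unsqueeze(-1)) + zeros.unsqueeze(-1)  # stage 2: per-group
    # The invariant the converter asserts: every value must fit in a uint4.
    assert q.min() >= 0 and q.max() <= 15, "Stage 2: Quantized weight out of range"
    return q.view(out_ch, in_ch).to(torch.uint8)
```

If the scales/zero points saved by the deepcompressor version used here are laid out differently from what `load_weights_from_lmquant` expects, the recomputed integers can land outside [0, 15], which is one way this exact assertion fires.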
additional notes

None.
Please refer to NVIDIA/TensorRT-LLM#2507 (comment)