We are releasing rtp-llm version 0.2.0, featuring several major updates:
- RPC mode for the scheduler
- device backend implementation for models
- more quantization methods
RPC mode
RPC mode reimplements the inference scheduler in C++, eliminating the performance bottleneck of query batching.
To use RPC mode, start the server with the environment variable USE_RPC_MODEL=1.
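As a minimal launch-script sketch (the server entry point below is hypothetical; substitute your usual rtp-llm start command):

```python
import os
import subprocess

# Enable the C++ RPC scheduler for this server process.
os.environ["USE_RPC_MODEL"] = "1"

# Hypothetical entry point; replace with your usual rtp-llm launch command.
subprocess.run(["python3", "-m", "rtp_llm.start_server"], env=os.environ, check=True)
```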
Device backend with fully managed GPU memory
The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.
The device backend requires RPC mode; enable it, then start with the environment variable USE_NEW_DEVICE_IMPL=1.
Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value reserves all available memory while leaving that many bytes free; the default is -134217728 (preallocate all GPU memory but leave 128MB free).
HOST_RESERVE_MEMORY_BYTES works the same way but reserves host memory, which improves framework performance; the default is 2GB.
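For example, the defaults can be reproduced (or tuned) explicitly before launch. This sketch only sets the environment variables named above; the arithmetic shows where -134217728 comes from:

```python
import os

MB = 1024 * 1024
GB = 1024 * MB

# Negative value: preallocate all available GPU memory, leaving this many bytes free.
# -128 * MB == -134217728, the documented default.
os.environ["DEVICE_RESERVE_MEMORY_BYTES"] = str(-128 * MB)

# Positive value: reserve this much host memory up front (default 2GB).
os.environ["HOST_RESERVE_MEMORY_BYTES"] = str(2 * GB)

# The device backend requires RPC mode to be enabled as well.
os.environ["USE_RPC_MODEL"] = "1"
os.environ["USE_NEW_DEVICE_IMPL"] = "1"
```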
Quantization
SmoothQuant and OmniQuant are supported on Llama and Qwen models.
Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.
Using OmniQuant, GPTQ, or AWQ requires adding a quantization_config field to the model config:
"quantization_config": {
"bits": 8,
"quant_method": "omni_quant"
}
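A minimal sketch of patching these fields into a checkpoint, assuming a Hugging Face-style config.json; the path is illustrative and only "omni_quant" is a method string confirmed by this release note:

```python
import json
from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoint")  # illustrative path
config_path = checkpoint_dir / "config.json"

# Load the existing model config and add the quantization fields.
config = json.loads(config_path.read_text())
config["quantization_config"] = {
    "bits": 8,
    "quant_method": "omni_quant",
}
config_path.write_text(json.dumps(config, indent=2))

# For SmoothQuant, no config change is needed; place smoothquant.ini
# in checkpoint_dir instead.
```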
All quantization methods are now supported on GPUs from SM70 (Volta) onward.
Other improvements
- GLM4, GLM4V, LLaVA-NeXT, and Qwen2 are now supported
- optimized performance on NVIDIA A100