We are releasing rtp-llm version 0.2.0, featuring several major updates:
- RPC mode for the scheduler
- device backend implementation for models
- more quantization methods
RPC mode
RPC mode reimplements the inference scheduler in C++, eliminating the performance bottleneck of query batching.
To use RPC mode, start the server with the environment variable USE_RPC_MODEL=1.
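As a minimal launch-script sketch (the server entry point below is hypothetical; substitute your usual rtp-llm start command):

```python
import os
import subprocess

# Enable the C++ RPC scheduler for this server process.
os.environ["USE_RPC_MODEL"] = "1"

# Hypothetical entry point; replace with your usual rtp-llm launch command.
subprocess.run(["python3", "-m", "rtp_llm.start_server"], env=os.environ, check=True)
```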
Device backend with fully managed GPU memory
The newly introduced device implementation preallocates all GPU memory and optimizes GPU memory usage.
The device backend requires RPC mode; enable it, then start with the environment variable USE_NEW_DEVICE_IMPL=1.
Set DEVICE_RESERVE_MEMORY_BYTES to change the number of bytes of GPU memory reserved for rtp-llm. A negative value reserves all available memory while leaving that many bytes free; the default is -134217728 (preallocate all GPU memory but leave 128MB free).
HOST_RESERVE_MEMORY_BYTES works the same way but reserves host memory, which improves framework performance; the default is 2GB.
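For example, the defaults can be reproduced (or tuned) explicitly before launch. This sketch only sets the environment variables named above; the arithmetic shows where -134217728 comes from:

```python
import os

MB = 1024 * 1024
GB = 1024 * MB

# Negative value: preallocate all available GPU memory, leaving this many bytes free.
# -128 * MB == -134217728, the documented default.
os.environ["DEVICE_RESERVE_MEMORY_BYTES"] = str(-128 * MB)

# Positive value: reserve this much host memory up front (default 2GB).
os.environ["HOST_RESERVE_MEMORY_BYTES"] = str(2 * GB)

# The device backend requires RPC mode to be enabled as well.
os.environ["USE_RPC_MODEL"] = "1"
os.environ["USE_NEW_DEVICE_IMPL"] = "1"
```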
Quantization
SmoothQuant and OmniQuant are supported on Llama and Qwen models.
Using SmoothQuant requires a smoothquant.ini file under the checkpoint directory.
Using OmniQuant, GPTQ, or AWQ requires adding a quantization_config field to the model config:
"quantization_config": {
"bits": 8,
"quant_method": "omni_quant"
}
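A minimal sketch of patching these fields into a checkpoint, assuming a Hugging Face-style config.json; the path is illustrative and only "omni_quant" is a method string confirmed by this release note:

```python
import json
from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoint")  # illustrative path
config_path = checkpoint_dir / "config.json"

# Load the existing model config and add the quantization fields.
config = json.loads(config_path.read_text())
config["quantization_config"] = {
    "bits": 8,
    "quant_method": "omni_quant",
}
config_path.write_text(json.dumps(config, indent=2))

# For SmoothQuant, no config change is needed; place smoothquant.ini
# in checkpoint_dir instead.
```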
All quantization methods are now supported on GPUs from SM70 (Volta) onward.
Other improvements
- GLM4, GLM4V, LLaVA-NeXT, and Qwen2 are now supported
- optimized performance on NVIDIA A100