
Upgrade to support latest vLLM version (max_lora_rank) #2389

Open
dreamiter opened this issue Sep 16, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@dreamiter

Description

In the current version (using the LMI SageMaker image), we are running into the following error:

File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1288, in __post_init__
raise ValueError(
ValueError: max_lora_rank (128) must be one of (8, 16, 32, 64)

It looks like the above error was fixed in vLLM v0.5.5.
See release notes here: https://github.com/vllm-project/vllm/releases/tag/v0.5.5
See PR here: vllm-project/vllm#7146

References

N/A

@dreamiter added the enhancement (New feature or request) label on Sep 16, 2024
@dreamiter
Author

Hi @frankfliu - would you be able to help? Thanks.

@siddvenk
Contributor

We are planning a release that will use vLLM 0.6.0 (or 0.6.1.post2) soon.

In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or a later version) to get around this.
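
For example, a minimal requirements.txt along those lines (assuming it sits in the model directory next to serving.properties, where LMI pip-installs it at startup) would be:

# requirements.txt
vllm==0.5.5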

@dreamiter
Author

dreamiter commented Sep 18, 2024

Thank you @siddvenk for your suggestions.

I tried rebuilding the custom image by running pip install vllm==0.5.5 in a Dockerfile based on your latest stable image 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124.

We specified the following in the serving.properties file:

option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128

We tried setting max_tokens to a really high number, but the response is still very short.
We also get this log, and it appears the vLLM backend does not support the max_tokens param.

The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()

Do you have any insights?

@siddvenk
Contributor

Yes, you should use max_new_tokens.

You can find the schema for our inference API here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md.

We also support the OpenAI chat completions schema, details here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md.
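
With the chat completions schema, a request body would look roughly like this (a minimal sketch assuming the OpenAI-style format from the linked doc; the prompt and token limit are illustrative):

{
    "messages": [
        {"role": "user", "content": "What is Deep Learning?"}
    ],
    "max_tokens": 512
}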

@dreamiter
Author

Thanks again for your quick response @siddvenk -

Just want to make sure, should we:

  • Add max_new_tokens to the serving.properties file, e.g. option.max_new_tokens=16000
  • Or, pass max_new_tokens as a parameter when invoking the endpoint, such as:
curl -X POST https://my.sample.endpoint.com/invocations \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {
            "do_sample": true,
            "max_new_tokens": 16000,
            "details": true
        },
        "stream": true
      }'

@dreamiter
Author

BTW, forgot to mention: we are deploying this to SageMaker.

@siddvenk
Contributor

There are two different configurations.

On a per request basis, you can specify max_new_tokens to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.

You can limit the maximum length of sequences globally by setting option.max_model_len in serving.properties. This enforces a limit that applies to all requests, which includes both the input (prompt) tokens and generated output tokens.
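
Roughly, the two look like this (values are illustrative, not recommendations):

# serving.properties: global cap on prompt + generated tokens, applies to every request
option.max_model_len=16000

# request payload: per-request cap on generated (output) tokens only
{
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 512
    }
}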

@dreamiter
Author

Thanks, @siddvenk.

We did more tests, and it turns out the "short response" issue was specific to the custom image I built (mentioned above).

So we suspect we missed some key steps when building the image - can you help us review our process?

Steps:

  1. Create the following files:
|- Dockerfile
|- requirements.txt
  2. In Dockerfile:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

# Copy files
COPY ./requirements.txt /opt/requirements.txt

# Install third-party Python dependencies within the Docker environment
RUN pip install --upgrade pip && \
    pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
    pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt
  3. In requirements.txt:
vllm==0.5.5
  4. Build the new docker image using docker build (sketch below)
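
For reference, the build and push would be something like this (image name, account ID, and region are placeholders):

# build the image locally
docker build -t my-lmi-vllm055 .

# log in to your private ECR registry, then tag and push so SageMaker can pull the image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag my-lmi-vllm055 <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-lmi-vllm055:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-lmi-vllm055:latest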
