

vLLM Fork: RuntimeError: CUDA error #21

Open
guydc opened this issue Oct 20, 2024 · 2 comments
guydc commented Oct 20, 2024

When running the PoC vLLM fork on a g2-standard-48 machine in GKE and calling the /v1/completions API directly (not via a proxy), an internal server error is returned:

curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 500 Internal Server Error
date: Sun, 20 Oct 2024 11:26:19 GMT
server: uvicorn
content-length: 21
content-type: text/plain; charset=utf-8

Internal Server Error

The vLLM container logs show the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
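CUBLAS_STATUS_ALLOC_FAILED during `cublasCreate(handle)` typically means cuBLAS could not allocate its handle/workspace on the GPU, which usually points to GPU memory exhaustion rather than a code bug. A possible way to narrow this down on the node (a diagnostic sketch: nvidia-smi and the standard vLLM --gpu-memory-utilization flag are real, but whether the fork still honors that flag is an assumption):

# Check whether the GPUs on the node are already near capacity
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# If memory is tight, try launching the server with a lower fraction
# of GPU memory reserved by vLLM (the upstream default is 0.9)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --gpu-memory-utilization 0.8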

When running the non-forked vllm/vllm-openai image in the same environment, the same API call succeeds:

curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sun, 20 Oct 2024 10:36:00 GMT
server: uvicorn
content-length: 747
content-type: application/json

{"id":"cmpl-0f853acbec694cbda25c881446bf3709","object":"text_completion","created":1729420560,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100}}
@zhaohuabing commented

@terrytangyuan @kfswain Could you please help us with this? Thanks!

@liu-cong (Contributor) commented

We should be ready to switch to the latest vLLM this week; #22 should fix this.
