

vLLM Fork: RuntimeError: CUDA error #21

Open
guydc opened this issue Oct 20, 2024 · 2 comments
guydc commented Oct 20, 2024

When running the PoC vLLM fork on a g2-standard-48 machine in GKE and calling the /v1/completions API directly (not via a proxy), an internal server error is returned:

curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 500 Internal Server Error
date: Sun, 20 Oct 2024 11:26:19 GMT
server: uvicorn
content-length: 21
content-type: text/plain; charset=utf-8

Internal Server Error

The vLLM container logs show the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
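CUBLAS_STATUS_ALLOC_FAILED during `cublasCreate(handle)` typically means cuBLAS could not allocate its handle/workspace on the GPU, which usually points to GPU memory exhaustion rather than a code bug. A possible way to narrow this down on the node (a diagnostic sketch: nvidia-smi and the standard vLLM --gpu-memory-utilization flag are real, but whether the fork still honors that flag is an assumption):

# Check whether the GPUs on the node are already near capacity
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# If memory is tight, try launching the server with a lower fraction
# of GPU memory reserved by vLLM (the upstream default is 0.9)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --gpu-memory-utilization 0.8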

When running the non-forked vllm/vllm-openai image in the same environment, the same API call succeeds:

curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sun, 20 Oct 2024 10:36:00 GMT
server: uvicorn
content-length: 747
content-type: application/json

{"id":"cmpl-0f853acbec694cbda25c881446bf3709","object":"text_completion","created":1729420560,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100}}
@zhaohuabing commented

@terrytangyuan @kfswain Could you please help us with this? Thanks!

@liu-cong (Contributor) commented

We should be ready to switch to the latest vLLM this week; #22 should fix this.
