When running the PoC vLLM fork on a g2-standard-48 machine in GKE and calling the /v1/completions API directly (not via the proxy), an internal server error is returned:
curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 500 Internal Server Error
date: Sun, 20 Oct 2024 11:26:19 GMT
server: uvicorn
content-length: 21
content-type: text/plain; charset=utf-8
Internal Server Error
The vLLM container logs show the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
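CUBLAS_STATUS_ALLOC_FAILED from `cublasCreate` usually means cuBLAS could not allocate GPU memory for its handle, i.e. the device was already at or near capacity when the forked server initialized. A minimal sketch of isolating the error from the container logs (here `vllm.log` is a stand-in for the output of `kubectl logs <vllm-pod>`; the pod name and log contents are illustrative):

```shell
# Simulated container log; in the real cluster this would come from
# `kubectl logs <vllm-pod>` on the g2-standard-48 node.
cat > vllm.log <<'EOF'
INFO:     Started server process [1]
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
EOF

# Pull out the cuBLAS allocation failure with its line number.
grep -n "CUBLAS_STATUS_ALLOC_FAILED" vllm.log
```

If the GPU is in fact out of memory (`nvidia-smi` inside the pod would confirm), one thing worth trying is lowering vLLM's `--gpu-memory-utilization` flag from its default of 0.9, since the fork may be allocating additional memory that the stock image does not.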
When running the non-forked vllm/vllm-openai image in the same environment, the API call succeeds:
curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sun, 20 Oct 2024 10:36:00 GMT
server: uvicorn
content-length: 747
content-type: application/json
{"id":"cmpl-0f853acbec694cbda25c881446bf3709","object":"text_completion","created":1729420560,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100}}