RuntimeError: Engine loop has died #9418

jovi-s · 2024-10-16T11:20:36Z

jovi-s
Oct 16, 2024

Hi everyone,
I am trying to perform inference using TheBloke/Mistral-7B-Instruct-v0.2-AWQ with vLLM Installation with CPU using Docker and I keep receiving this error:

CRITICAL 10-16 09:58:33 launcher.py:99] MQLLMEngine is already dead, terminating server process

I am successful in building the CPU Docker image, and when using the default facebook/opt-125m model, I am also successful in running the server and performing inference to receive a completions response.

As for the Mistral model, AWQ is a part of the supported hardware for quantization kernels, and I am able to start the server with my Docker run command as follows:

docker run -it --rm -v Mistral:/mnt/models/Mistral --network=host --ipc=host -e VLLM_CPU_KVCACHE_SPACE=40 vllm-cpu-env --model="/mnt/models/Mistral/Mistral-7B-Instruct-v0.2-AWQ" --dtype="half" --quantization awq --device "cpu" --max-model-len 2048

When I send an inference query, I am also able to see the following log:

INFO engine.py:292] Added request cmpl-4f556df75d1f4acd96e83932879a8273-0

It is only after a few seconds that I receive RuntimeError('Engine loop has died') which kills the server, and shuts down the Docker container.

I have tried various parameters of the VLLM_CPU_KVCACHE_SPACE value and have increased VLLM_ENGINE_ITERATION_TIMEOUT_S, as well as setting VLLM_CPU_OMP_THREADS_BIND to my physical cores, but to no avail.

I'm reaching out in the hopes that this error can be rectified. Thank you for your attention thus far. Cheers

AMohamedAakhil · 2024-10-21T02:34:22Z

AMohamedAakhil
Oct 21, 2024

I'm having the same issue

1 reply

Chengyiao0730 Oct 22, 2024

me too. i use model qwen2.5 in Ubuntu24.04.I am trying to switch between the model and the virtual environment But it still reports an error.

ERROR 10-22 12:20:26 client.py:250] RuntimeError('Engine loop has died')
ERROR 10-22 12:20:26 client.py:250] Traceback (most recent call last):
ERROR 10-22 12:20:26 client.py:250] File "/home/chengyiao/miniconda3/envs/cls3/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 150, in run_heartbeat_loop

SusanLiu0709 · 2024-10-24T10:35:26Z

SusanLiu0709
Oct 24, 2024

Same error here.

1 reply

gxm651182644 Oct 28, 2024

same to me
model :Qwen2-VL-72B-Instruct-GPTQ-Int4
env : (A10 24g )* 8
scripts:
vllm serve /model-repo/Qwen2-VL-72B-Instruct-GPTQ-Int4 --max-model-len 32768 --served-model-name Qwen2-VL-72B-Instruct-GPTQ-Int4 --device cuda --tensor-parallel-size $gpus --limit_mm_per_prompt 'image=10' --gpu_memory_utilization 0.8

vllm server can run successfully
but when i curl model concurrently
RuntimeError: Engine loop has died

colinTmx · 2024-10-31T08:18:44Z

colinTmx
Oct 31, 2024

same to me
model:qwen2.5-32b-instruct-gptq-int3

0 replies

rjwharry · 2024-11-04T10:07:27Z

rjwharry
Nov 4, 2024

Consider increasing the VLLM_RPC_TIMEOUT environment variable to address the timeout issue. In vllm, the run_heartbeat_loop function within the MQLLMEngineClient class (found in vllm/engine/multiprocessing/client.py) is responsible for checking the engine's status periodically. This function relies on VLLM_RPC_TIMEOUT to determine the maximum allowable time for each check. If the function exceeds this duration, a timeout error occurs.

I encountered a similar issue that was resolved by increasing the VLLM_RPC_TIMEOUT value. Adjusting this environment variable allows more time for the engine checks, helping to prevent timeout errors during engine monitoring.

2 replies

hughesadam87 Nov 4, 2024

I thought VLLM_RPC_TIMEOUT was used to disconnect clients that are taking too long to respond - or do I have that wrong?

Why would the MQLLMEngineClient depend on this? For example, what if I want a short RPC timeout? All of the sudden, the MQLLMEngienClient is crashing the entire server every few minutes. SHouldn't this have it's own variable - like VLLM_ENGINE_HC_TIMEOUT or something?

rjwharry Nov 5, 2024

The run_heartbeat_loop method in MQLLMEngineClient is responsible for monitoring the health of the engine by periodically polling the vllm engine. Thus, in my view, VLLM_RPC_TIMEOUT does not refer to disconnecting due to a slow response; rather, it represents the timeout duration for an RPC request sent from MQLLMEngineClient to MQLLMEngine for health checks. If an RPC request times out, MQLLMEngineClient interprets this as an indication of an unhealthy vllm engine and raises exception to shut down the vllm server.

sam-huang1223 · 2024-11-06T18:57:54Z

sam-huang1223
Nov 6, 2024

similar issue #10002

0 replies

simoroma · 2024-11-13T14:51:26Z

simoroma
Nov 13, 2024

Should changing VLLM_RPC_TIMEOUT change the frequency of heartbeat check?

INFO 11-13 14:50:01 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-13 14:50:11 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

No matter what I set, I get checks every 10 seconds.

13 replies

rjwharry Nov 14, 2024

On my server, I ran two vllm using same model you mentioned but no error occurred. Please share your server environment

simoroma Nov 14, 2024

Thanks for testing this @rjwharry . I can run 2 as well. But I cannot run 3 or more.

rjwharry Nov 14, 2024

It's strange.. I ran 4 vllms and test with openai. It works fine. Maybe your error is occurred because of the server environment such as gpu

simoroma Nov 22, 2024

Could you tell me in which environment you can run 4 vllms please? GPU, how much RAM, how much VRAM.

I don't understand why it shouldn't work on a Runpod with a RTX 3090 (24 GB VRAM) and 8 vCPU 30 GB RAM. Maybe it is some other settings. Like might be a limit on concurrency.

simoroma Nov 23, 2024

I tested out other parameters: --num-gpu-blocks-override 200 and --enforce-eager. Still cannot run more than 2 vllm.

I tried different GPUs: 3090, 3070, A2000. I am starting to think that is it a RAM (cpu RAM) issue. I can set gpu blocks but not cpu blocks. And I see that the RAM is full. But I cannot tell exactly what the issue is from the error:

INFO 11-23 21:37:02 model_runner.py:1077] Loading model weights took 0.2543 GB
INFO 11-23 21:37:03 worker.py:232] Memory profiling results: total_gpu_memory=7.69GiB initial_memory_usage=1.57GiB peak_torch_memory=0.71GiB memory_usage_post_profile=1.60GiB non_torch_memory=1.34GiB kv_cache_size=-2.05GiB gpu_memory_utilization=0.00
INFO 11-23 21:37:03 llm_engine.py:491] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=200
INFO 11-23 21:37:03 gpu_executor.py:113] # GPU blocks: 200, # CPU blocks: 11650
INFO 11-23 21:37:03 gpu_executor.py:117] Maximum concurrency for 2048 tokens per request: 1.56x
ERROR 11-23 21:37:10 engine.py:366] CUDA error: invalid argument
ERROR 11-23 21:37:10 engine.py:366] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-23 21:37:10 engine.py:366] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-23 21:37:10 engine.py:366] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-23 21:37:10 engine.py:366] Traceback (most recent call last):
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-23 21:37:10 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-23 21:37:10 engine.py:366]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-23 21:37:10 engine.py:366]     return cls(ipc_path=ipc_path,
ERROR 11-23 21:37:10 engine.py:366]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-23 21:37:10 engine.py:366]     self.engine = LLMEngine(*args, **kwargs)
ERROR 11-23 21:37:10 engine.py:366]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 350, in __init__
ERROR 11-23 21:37:10 engine.py:366]     self._initialize_kv_caches()
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 500, in _initialize_kv_caches
ERROR 11-23 21:37:10 engine.py:366]     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 120, in initialize_cache
ERROR 11-23 21:37:10 engine.py:366]     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 276, in initialize_cache
ERROR 11-23 21:37:10 engine.py:366]     self._init_cache_engine()
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 281, in _init_cache_engine
ERROR 11-23 21:37:10 engine.py:366]     self.cache_engine = [
ERROR 11-23 21:37:10 engine.py:366]                         ^
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 282, in <listcomp>
ERROR 11-23 21:37:10 engine.py:366]     CacheEngine(self.cache_config, self.model_config,
ERROR 11-23 21:37:10 engine.py:366]   File "/root/miniconda/envs/llm-venv/lib/python3.11/site-packages/vllm/worker/cache_engine.py", line 64, in __init__
ERROR 11-23 21:37:10 engine.py:366]

simoroma · 2024-11-14T15:05:31Z

simoroma
Nov 14, 2024

I will check it out and let you know. Thanks.

…

On Thu, Nov 14, 2024, 4:28 PM JAEWON ROH ***@***.***> wrote: It's strange.. I ran 4 vllms and test with openai. It works fine. Maybe your error is occurred because of the server environment such as gpu — Reply to this email directly, view it on GitHub <#9418 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJG6WKDUULUFYRM3WHTA6ML2ASXRXAVCNFSM6AAAAABQBHGDEOVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCMRVGUZDKNA> . You are receiving this because you commented.Message ID: ***@***.***>

2 replies

simoroma Nov 15, 2024

Just an update - I am running it in Runpod with a RTX 3090 (24 GB VRAM) and 8 vCPU 30 GB RAM.

The problem is that when I add the 3rd process one of the other two running gets the heatbeat error. I don't understand if it is a concurrency problem. But I cannot really tell now from the errors I see.

rjwharry Nov 17, 2024

I'd have to test it in a similar environment, I don't know at this point.

donpromax · 2024-11-22T08:48:58Z

What does that option mean?

Force to use AsyncEngine instead of MQEngine

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: Engine loop has died #9418

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 22 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

RuntimeError: Engine loop has died #9418

Replies: 8 comments · 22 replies

Replies: 8 comments 22 replies