Use startup time in async worker thread instead of worker timeout #3315

mreso · 2024-09-13T22:30:49Z

Description

This PR introduces startup time as a parameter to llm_launcher and uses it in async worker thread instead of work_timeout (which is usually only the python worker coming up without loading the model)

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Test setting startup timeout too short to trigger killing the worker

 CUDA_VISIBLE_DEVICES=0,1 python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct  --disable_token_auth --startup_timeout 5                                                                                                                                                                           [76/92358]
TorchServe is not currently running.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-09-13T22:18:08,653 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-09-13T22:18:08,693 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-09-13T22:18:08,841 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
2024-09-13T22:18:09,004 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.11.1
TS Home: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve
Current directory: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve
Temp directory: /tmp
Metrics config path: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 2
Max heap size: 30688 M
Python executable: /data/home/mreso/miniconda3/envs/serve/bin/python
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
Initial Models: model
Log dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Metrics dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false                                                                                                                                                                                                                                                                                                                                             Workflow Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: false
2024-09-13T22:18:09,022 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-09-13T22:18:09,082 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model
2024-09-13T22:18:09,105 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/e4a76e6a0b544ec0a7012228b017fde4
2024-09-13T22:18:09,108 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/e4a76e6a0b544ec0a7012228b017fde4/model
2024-09-13T22:18:09,159 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model model
2024-09-13T22:18:09,161 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model model
2024-09-13T22:18:09,162 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2024-09-13T22:18:09,162 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: model, count: 1
2024-09-13T22:18:09,186 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Device Ids: null
2024-09-13T22:18:09,189 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-09-13T22:18:09,196 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/data/home/mreso/miniconda3/envs/serve/bin/python, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mre
so/serve/ts/configs/metrics.yaml, --async]
2024-09-13T22:18:09,448 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-09-13T22:18:09,449 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-09-13T22:18:09,451 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-09-13T22:18:09,453 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-09-13T22:18:09,454 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-09-13T22:18:10,200 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
[...]
2024-09-13T22:18:19,547 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Worker did not load the model within 5 seconds
at org.pytorch.serve.wlm.AsyncWorkerThread.run(AsyncWorkerThread.java:85) [server-1.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]

Test with default value

CUDA_VISIBLE_DEVICES=0,1 python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct  --disable_token_auth
TorchServe is not currently running.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-09-13T22:24:04,821 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-09-13T22:24:04,859 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-09-13T22:24:04,998 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
2024-09-13T22:24:05,123 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.11.1
TS Home: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve                                                                                                                                                                                                                                                                                                        Current directory: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve
Temp directory: /tmp
Metrics config path: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 2
Max heap size: 30688 M
Python executable: /data/home/mreso/miniconda3/envs/serve/bin/python
Config file: N/A
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
Initial Models: model
Log dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Metrics dir: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/model_store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: false
2024-09-13T22:24:05,139 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-09-13T22:24:05,186 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model
2024-09-13T22:24:05,195 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/4fc49bb14ebf4b8c8c03febc8a1d615a
2024-09-13T22:24:05,204 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/4fc49bb14ebf4b8c8c03febc8a1d615a/model
2024-09-13T22:24:05,225 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model model
2024-09-13T22:24:05,226 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model model
2024-09-13T22:24:05,226 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2024-09-13T22:24:05,226 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: model, count: 1
2024-09-13T22:24:05,238 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-09-13T22:24:05,273 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Device Ids: null
2024-09-13T22:24:05,288 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/data/home/mreso/miniconda3/envs/serve/bin/python, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mre
so/serve/ts/configs/metrics.yaml, --async]
2024-09-13T22:24:05,414 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-09-13T22:24:05,414 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-09-13T22:24:05,423 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2024-09-13T22:24:05,423 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-09-13T22:24:05,424 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2024-09-13T22:24:06,116 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
[....]
2024-09-13T22:24:10,019 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=2716751
2024-09-13T22:24:10,022 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2024-09-13T22:24:10,046 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/mreso/serve/ts/configs/metrics.yaml.
2024-09-13T22:24:10,046 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]2716751
2024-09-13T22:24:10,046 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.                                                                                                                                                                                                                                                                                 2024-09-13T22:24:10,046 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.14
2024-09-13T22:24:10,047 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change null -> WORKER_STARTED
2024-09-13T22:24:10,052 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Connecting to: /tmp/.ts.sock.9000
2024-09-13T22:24:10,079 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2024-09-13T22:24:10,079 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - handle_connection_async
2024-09-13T22:24:10,086 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Getting requests from model: org.pytorch.serve.wlm.Model@51460249
2024-09-13T22:24:10,086 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Adding job to jobs: 84d0cdcf-f405-48d7-b66a-1df94319a89f
2024-09-13T22:24:10,087 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1726266250087
2024-09-13T22:24:10,170 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-09-13T22:24:10,171 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Successfully flushed req
2024-09-13T22:24:12,941 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 09-13 22:24:12 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
2024-09-13T22:24:24,029 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-09-13T22:24:24,035 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-09-13T22:24:24,036 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-09-13T22:24:24,036 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-09-13T22:24:24,257 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:24 config.py:897] Defaulting to use mp for distributed inference
2024-09-13T22:24:24,257 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 09-13 22:24:24 arg_utils.py:890] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
2024-09-13T22:24:24,258 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:24 config.py:1006] Chunked prefill is enabled with max_num_batched_tokens=512.
2024-09-13T22:24:24,271 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:24 llm_engine.py:232] Initializing an LLM engine (v0.6.1) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_conf
ig=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, dev
ice_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=Fa
lse, use_async_output_proc=True)
2024-09-13T22:24:25,088 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 09-13 22:24:25 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2024-09-13T22:24:25,686 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:25 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2024-09-13T22:24:31,426 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 09-13 22:24:31 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
2024-09-13T22:24:37,165 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:37 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-09-13T22:24:38,374 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:38 utils.py:977] Found nccl from library libnccl.so.2
2024-09-13T22:24:38,377 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:38 utils.py:977] Found nccl from library libnccl.so.2
2024-09-13T22:24:38,391 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:38 pynccl.py:63] vLLM is using nccl==2.20.5
2024-09-13T22:24:38,394 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:38 pynccl.py:63] vLLM is using nccl==2.20.5
2024-09-13T22:24:43,002 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /data/home/mreso/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
2024-09-13T22:24:43,002 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /data/home/mreso/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
2024-09-13T22:24:43,031 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:43 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f99792cead0>, local_subscribe_port=48355, remote_subscribe_port=None)
2024-09-13T22:24:43,189 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:43 model_runner.py:997] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct...
2024-09-13T22:24:43,190 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:43 model_runner.py:997] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct...
2024-09-13T22:24:43,858 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:24:43 weight_utils.py:242] Using model weights format ['*.safetensors']
2024-09-13T22:24:43,890 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:24:43 weight_utils.py:242] Using model weights format ['*.safetensors']
2024-09-13T22:24:43,990 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:24:43,991 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
2024-09-13T22:24:57,204 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:24:57,205 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:13<00:39, 13.21s/it]
2024-09-13T22:25:12,518 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:1.1|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,533 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:535.6403617858887|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,535 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:84.42888641357422|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,543 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:13.6|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,543 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.267916477641952|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,544 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:9190.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,545 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.267916477641952|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,545 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:9190.0|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,546 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.004904425017472014|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,555 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:4.0|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,556 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.004904425017472014|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,556 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:4.0|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,557 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:1.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,565 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:1.0|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,566 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,566 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,566 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:1931181.68359375|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:104237.60546875|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:12,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:5.7|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266312
2024-09-13T22:25:16,545 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:25:16,545 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:32<00:33, 16.82s/it]
2024-09-13T22:25:20,252 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:25:20,252 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:36<00:10, 10.83s/it]
2024-09-13T22:25:37,526 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:25:37,527 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:53<00:00, 13.38s/it]
2024-09-13T22:25:37,531 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:25:37,532 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:53<00:00, 13.39s/it]
2024-09-13T22:25:37,531 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -                                                                                                                                                                                                                                                                                              [0/92748]
2024-09-13T22:25:37,532 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:53<00:00, 13.39s/it]
2024-09-13T22:25:37,532 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -
2024-09-13T22:25:38,258 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:25:38 model_runner.py:1008] Loading model weights took 7.5122 GB
2024-09-13T22:25:38,325 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:25:38 model_runner.py:1008] Loading model weights took 7.5122 GB
2024-09-13T22:25:41,496 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:25:41 distributed_gpu_executor.py:57] # GPU blocks: 61913, # CPU blocks: 4096
2024-09-13T22:25:48,692 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:25:48 model_runner.py:1309] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-09-13T22:25:48,693 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:25:48 model_runner.py:1313] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-09-13T22:25:48,760 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:25:48 model_runner.py:1309] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-09-13T22:25:48,762 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:25:48 model_runner.py:1313] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-09-13T22:26:09,852 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:1.4|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,853 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:535.5823554992676|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,853 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:84.48689270019531|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,854 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:13.6|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,854 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:88.73085741610367|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,854 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:72368.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,855 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:88.39735651491559|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,855 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:72096.0|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,855 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.004904425017472014|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,856 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:4.0|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,856 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.004904425017472014|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,857 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:4.0|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,857 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,857 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:1|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,869 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,870 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,870 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:1922531.46484375|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,870 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:104620.89453125|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:09,871 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:6.1|#Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266369
2024-09-13T22:26:20,870 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:26:20 custom_all_reduce.py:223] Registering 2275 cuda graph addresses
2024-09-13T22:26:20,874 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:26:20 custom_all_reduce.py:223] Registering 2275 cuda graph addresses
2024-09-13T22:26:20,874 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 09-13 22:26:20 model_runner.py:1428] Graph capturing finished in 32 secs.
2024-09-13T22:26:20,874 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=2716901) INFO 09-13 22:26:20 model_runner.py:1428] Graph capturing finished in 32 secs.
2024-09-13T22:26:22,782 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncBatchAggregator - Predictions is empty. This is from initial load....
2024-09-13T22:26:22,783 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncWorkerThread - Worker loaded the model successfully
2024-09-13T22:26:22,784 [DEBUG] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-09-13T22:26:22,784 [INFO ] epollEventLoopGroup-5-1 TS_METRICS - WorkerLoadTime.Milliseconds:137550.0|#WorkerName:W-9000-model_1.0,Level:Host|#hostname:cr6-p548xlarge-3,timestamp:1726266382

Checklist:

Did you have fun?

agunapal

LGTM

Use startup time in async worker thread instead of worker timeout

86f81d4

mreso requested review from agunapal and namannandan September 13, 2024 22:31

Merge branch 'master' into fix/startuptime_in_async_thread

1a41f98

agunapal approved these changes Sep 17, 2024

View reviewed changes

Merge branch 'master' into fix/startuptime_in_async_thread

814a518

mreso enabled auto-merge September 17, 2024 23:25

mreso added 2 commits September 18, 2024 00:35

Fix lint

79f0e41

Merge branch 'master' into fix/startuptime_in_async_thread

f6a6d9d

mreso disabled auto-merge September 18, 2024 00:46

mreso added 2 commits September 18, 2024 00:48

Update yaml files to use startupTimeout

0a13ab0

Update vllm/lora readme

e1f00f6

mreso added this pull request to the merge queue Sep 18, 2024

Merged via the queue into master with commit e212294 Sep 18, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use startup time in async worker thread instead of worker timeout #3315

Use startup time in async worker thread instead of worker timeout #3315

mreso commented Sep 13, 2024

agunapal left a comment

Use startup time in async worker thread instead of worker timeout #3315

Use startup time in async worker thread instead of worker timeout #3315

Conversation

mreso commented Sep 13, 2024

Description

Type of change

Feature/Issue validation/testing

Checklist:

agunapal left a comment

Choose a reason for hiding this comment