ts.llm_launcher most definitely has --dtype specified (but the list of possible values for dtype isn't specified at all):

22:05 $ python -m ts.llm_launcher --help
usage: llm_launcher.py [-h] [--model_name MODEL_NAME] [--model_store MODEL_STORE] [--model_id MODEL_ID] [--disable_token_auth]
                       [--vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS] [--vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN]
                       [--vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR] [--startup_timeout STARTUP_TIMEOUT] [--engine ENGINE]
                       [--dtype DTYPE] [--trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE]
                       [--trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION]

options:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        Model name
  --model_store MODEL_STORE
                        Model store
  --model_id MODEL_ID   Model id
  --disable_token_auth  Disable token authentication
  --vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS
                        Max sequences in vllm engine
  --vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN
                        Model context length
  --vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR
                        Cache dir
  --startup_timeout STARTUP_TIMEOUT
                        Model startup timeout in seconds
  --engine ENGINE       LLM engine
  --dtype DTYPE         Data type   # <-- this is the flag I am talking about
  --trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE
                        Max batch size
  --trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION
                        KV Cache free gpu memory fraction
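One way the missing value list could be surfaced is via argparse `choices`, which makes `--help` print the accepted values and rejects anything else. This is only a sketch of the idea; the dtype values below are assumed from vLLM's conventions, not taken from the actual ts.llm_launcher source:

```python
import argparse

# Hypothetical version of the --dtype flag using "choices" so that
# --help shows the accepted values, e.g. --dtype {auto,half,float16,bfloat16,float32}.
parser = argparse.ArgumentParser(prog="llm_launcher.py")
parser.add_argument(
    "--dtype",
    default="auto",
    choices=["auto", "half", "float16", "bfloat16", "float32"],
    help="Data type",
)

args = parser.parse_args(["--dtype", "half"])
print(args.dtype)
```

With `choices` set, an unsupported value such as `--dtype bogus` fails fast with a usage error instead of being silently accepted.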
🐛 Describe the bug

tl;dr: Quickstart examples not working as expected

Our team is currently evaluating torchserve to serve various LLM models, one of which is meta-llama/Llama-3.1-8B-Instruct. The GPU that we are relying on is

We have only begun exploring torchserve, and I'd like to report issues here with the vLLM examples specified in the Quickstart sections. While trying out the quickstart examples to deploy the specified Llama-3.1 model, we encountered various issues:
1. ts.llm_launcher crashing with ValueError

ts.llm_launcher was started like so:

This resulted in the server raising an exception like so:
2. On setting --dtype to half, ts.llm_launcher isn't honoring that flag

This time, ts.llm_launcher was invoked like so:

ts.llm_launcher most definitely has --dtype specified (but the possible values for dtype aren't listed anywhere); see the --help output quoted at the top of this report.

3. The Quickstart LLM with Docker example points to a model that doesn't exist
The docker.... snippet provided looks like so:

When executing this, this (abridged) stack trace is produced:

On searching for this model on HuggingFace, I see this, indicating that this model probably doesn't exist.
Error logs

When running with --dtype and otherwise:

When starting the docker container:
Installation instructions

Install torchserve from source: No
Clone this repo as-is: Yes
Docker image built like so (as specified in the Quickstart):
Model Packaging

We do not know whether calling ts.llm_launcher packages the model for us.

config.properties

No response
Versions

This script is unable to locate java, hence:

Repro instructions

master
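As a side note on the Versions step above: the version-collection script presumably shells out to java, so its failure can be sanity-checked independently. A generic shell sketch (not part of torchserve) to confirm whether java is discoverable on PATH:

```shell
#!/bin/sh
# Report whether java is discoverable on PATH; torchserve's version
# script fails when this lookup comes up empty.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
else
    echo "java not found on PATH"
fi
```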
Possible Solution

We don't know what the possible solution could be for these issues, and as a result cannot propose one.