Quickstart vLLM examples do not work as expected #3365

Open
prashant-warrier-echelonvi opened this issue Nov 18, 2024 · 0 comments
prashant-warrier-echelonvi commented Nov 18, 2024

🐛 Describe the bug

tl;dr: the Quickstart examples do not work as expected.

Our team is currently evaluating torchserve to serve various LLMs, one of which is meta-llama/Llama-3.1-8B-Instruct.

The GPUs we are relying on are:

22:16 $ nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: <UUID-1>)
GPU 1: Tesla V100-PCIE-32GB (UUID: <UUID-2>)

We have only begun exploring torchserve, and I'd like to report issues here with the vLLM examples specified in the Quickstart sections:

  1. Quickstart with TorchServe
  2. Quickstart LLM Deployment
  3. Quickstart LLM Deployment with Docker

While trying out the quickstart examples to deploy the specified Llama-3.1 model, we encountered various issues:

1. ts.llm_launcher crashes with a ValueError

ts.llm_launcher was started like so:

python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth

This resulted in the server raising an exception:


2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 301, in <module>
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     worker.run_server()
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 266, in run_server
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self.handle_connection_async(cl_socket)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 220, in handle_connection_async
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 133, in load_model
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/torch_handler/vllm_handler.py", line 47, in initialize
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self.vllm_engine = AsyncLLMEngine.from_engine_args(vllm_engine_config)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     engine = cls(
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self.engine = self._engine_class(*args, **kwargs)
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 257, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     super().__init__(*args, **kwargs)
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 317, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self.model_executor = executor_class(
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 222, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     super().__init__(*args, **kwargs)
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     super().__init__(*args, **kwargs)
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self._init_executor()
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 124, in _init_executor
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     self._run_workers("init_device")
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     driver_worker_output = driver_worker_method(*args, **kwargs)
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 168, in init_device
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     _check_if_gpu_supports_dtype(self.model_config.dtype)
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 461, in _check_if_gpu_supports_dtype
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     raise ValueError(
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
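For context, the capability check that vLLM performs here can be reproduced with plain PyTorch. A minimal sketch (not part of the quickstart, just to confirm why bfloat16 is rejected on these GPUs):

# Minimal sketch: reproduce vLLM's bfloat16 capability check with plain PyTorch.
# bfloat16 requires compute capability >= 8.0 (Ampere or newer); V100s report 7.0.
import torch

for idx in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(idx)
    major, minor = torch.cuda.get_device_capability(idx)
    supports_bf16 = (major, minor) >= (8, 0)
    print(f"GPU {idx}: {name}, compute capability {major}.{minor}, bfloat16 supported: {supports_bf16}")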

2. When --dtype is set to half, ts.llm_launcher does not honor the flag

This time, ts.llm_launcher was invoked like so:

python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth --dtype=half

ts.llm_launcher most definitely accepts --dtype (but the list of possible values for dtype is not documented at all):

22:05 $ python -m ts.llm_launcher --help
usage: llm_launcher.py [-h] [--model_name MODEL_NAME] [--model_store MODEL_STORE] [--model_id MODEL_ID] [--disable_token_auth] [--vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS]
                       [--vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN] [--vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR] [--startup_timeout STARTUP_TIMEOUT] [--engine ENGINE]
                       [--dtype DTYPE] [--trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE]
                       [--trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION]

options:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        Model name
  --model_store MODEL_STORE
                        Model store
  --model_id MODEL_ID   Model id
  --disable_token_auth  Disable token authentication
  --vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS
                        Max sequences in vllm engine
  --vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN
                        Model context length
  --vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR
                        Cache dir
  --startup_timeout STARTUP_TIMEOUT
                        Model startup timeout in seconds
  --engine ENGINE       LLM engine
  --dtype DTYPE         Data type # This is the flag I am talking about
  --trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE
                        Max batch size
  --trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION
                        KV Cache free gpu memory fraction
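
For reference, here is the behavior we expected --dtype=half to produce. vLLM itself accepts a dtype string ("auto", "half"/"float16", "bfloat16", "float"/"float32") in its engine arguments, so a sketch of what we assumed the launcher does under the hood looks roughly like this (how ts.llm_launcher actually forwards the flag may differ):

# Minimal sketch, bypassing TorchServe, of how we expected --dtype=half to reach vLLM.
# This only illustrates the expected behavior; the launcher's actual wiring may differ.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    dtype="half",  # float16, so the compute-capability-7.0 V100s pass vLLM's check
)
engine = AsyncLLMEngine.from_engine_args(engine_args)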

3. The Quickstart LLM Deployment with Docker example points to a model that doesn't exist

The docker run snippet provided looks like so:


docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth

When executing this, the following (abridged) stack trace is produced:


2024-11-18T16:43:45,496 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     get_hf_file_metadata(url, token=token)
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     return fn(*args, **kwargs)
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1296, in get_hf_file_metadata
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     r = _request_wrapper(
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 277, in _request_wrapper
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     response = _request_wrapper(
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 301, in _request_wrapper
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     hf_raise_for_status(response)
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -     raise _format(GatedRepoError, message, response) from e
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - huggingface_hub.errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-673b6ec1-716b9ee4567a8483100930bb;77b39a73-4aed-4ccd-9b6b-3fdf1c8cd81c)
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
2024-11-18T16:43:45,501 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct to ask for access.

On searching for this model on HuggingFace, I see the following, indicating that this model probably doesn't exist.

(Screenshot attached: "Screenshot from 2024-11-18 22-14-41")
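
To tell apart "the repo does not exist" from "the repo is gated and the token lacks access", the check can be done directly with huggingface_hub. A small sketch, assuming the same HUGGING_FACE_HUB_TOKEN that is passed to the container:

# Small sketch: distinguish a missing repo from a gated repo the token cannot access.
# Assumes the same HUGGING_FACE_HUB_TOKEN passed to the Docker container.
import os
from huggingface_hub import model_info
from huggingface_hub.errors import GatedRepoError, RepositoryNotFoundError

repo = "meta-llama/Meta-Llama-3-8B-Instruct"
token = os.environ.get("HUGGING_FACE_HUB_TOKEN")

try:
    model_info(repo, token=token)
    print(f"{repo} is accessible with this token.")
except GatedRepoError:  # subclass of RepositoryNotFoundError, so check it first
    print(f"{repo} exists but is gated; this token is not on the authorized list.")
except RepositoryNotFoundError:
    print(f"{repo} does not exist (or is private and invisible to this token).")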

Error logs

When running with and without --dtype:


(Identical to the traceback shown under issue 1 above.)

When starting the Docker container:


(Identical to the traceback shown under issue 3 above.)

Installation instructions

Install torchserve from Source: No

Clone this repo as-is: Yes

Docker image built like so (as specified in the Quickstart):

docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm

Model Packaging

We do not know if calling ts.llm_launcher packages the model for us.

config.properties

No response

Versions


------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:

torchserve==0.12.0
torch-model-archiver==0.12.0

Python version: 3.10 (64-bit runtime)
Python executable: /home/<USER>/torchenv/bin/python

Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
nvgpu==0.10.0
pillow==10.3.0
psutil==5.9.8
requests==2.32.0
sentencepiece==0.2.0
torch==2.4.0+cu121
torch-model-archiver==0.12.0
torch-workflow-archiver==0.2.15
torchaudio==2.4.0+cu121
torchserve==0.12.0
torchvision==0.19.0+cu121
transformers==4.46.2
wheel==0.42.0
torch==2.4.0+cu121
**Warning: torchtext not present ..
torchvision==0.19.0+cu121
torchaudio==2.4.0+cu121

Java Version:


OS: Ubuntu 22.04.5 LTS
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: N/A

Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.5


Environment:
library_path (LD_/DYLD_): 

The environment script is unable to locate java, even though it is on the PATH:


22:19 $ command -v java
/usr/bin/java

Repro instructions

  1. Clone this repo
  2. Checkout master
  3. Execute: python ./ts_scripts/install_dependencies.py --cuda=cu121
  4. For Quickstart LLM Deployment: python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
  5. For Quickstart LLM Deployment with Docker: docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth

Possible Solution

We don't know what the solutions to these issues could be, so we cannot propose one.
