README update, remove model_source
justinthelaw committed Sep 16, 2024
1 parent 962dd4d commit a7e3e88
Showing 2 changed files with 6 additions and 20 deletions.
21 changes: 6 additions & 15 deletions packages/vllm/README.md
@@ -14,24 +14,15 @@ See the LeapfrogAI documentation website for [system requirements](https://docs.

### Model Selection

The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Hermes-2-Pro-Mistral-7B model](https://huggingface.co/defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g).
The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Synthia-7b model](https://huggingface.co/TheBloke/SynthIA-7B-v2.0-GPTQ).

You can optionally specify different models or quantization types using the following Docker build arguments (see the example build command after this list):
You can optionally specify different models during Zarf package creation:

- `--build-arg MAX_CONTEXT_LENGTH="32768"`: Maximum context length, which cannot exceed the model's maximum length; the greater the length, the greater the vRAM requirements
- `--build-arg TENSOR_PARALLEL_SIZE="1"`: The number of GPUs to spread the tensor processing across
- `--build-arg TRUST_REMOTE_CODE="True"`: Whether to trust inferencing code downloaded as part of the model download
- `--build-arg ENGINE_USE_RAY="False"`: Distributed, multi-node inferencing mode for the engine
- `--build-arg WORKER_USE_RAY="False"`: Distributed, multi-node inferencing mode for the worker(s)
- `--build-arg GPU_MEMORY_UTILIZATION="0.90"`: Max memory utilization (fraction, out of 1.0) for the vLLM process
- `--build-arg ENFORCE_EAGER="False"`: Set to True to disable CUDA graphs and force eager execution, which lowers GPU memory usage at the cost of inferencing speed (e.g., time-to-first-token); keep this set to False for production
- `--build-arg QUANTIZATION="None"`: None is recommended, as vLLM auto-detects the model's quantization configuration and optimizes from there. For example, GPTQ can be converted to GPTQ Marlin in certain cases, improving time-to-first-token and tokens/second performance.
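
For reference, here is a minimal sketch of how these build arguments might be combined into a single image build. The Dockerfile context path, image tag, and argument values below are illustrative assumptions, not the project's published build command:

```bash
# Illustrative only: build the vLLM backend image with overridden engine settings.
# The build context (packages/vllm) and the tag are assumptions -- adjust to the actual layout.
docker build \
  --build-arg MAX_CONTEXT_LENGTH="32768" \
  --build-arg TENSOR_PARALLEL_SIZE="1" \
  --build-arg GPU_MEMORY_UTILIZATION="0.90" \
  --build-arg ENFORCE_EAGER="False" \
  --build-arg QUANTIZATION="None" \
  -t leapfrogai/vllm:dev \
  packages/vllm
```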

## Prompt Formats

The pre-packaged model, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, contains special prompt templates for activating the function calling and JSON response modes. The default prompt template is the ChatML format.
```bash
uds zarf package create --confirm --set MODEL_REPO_ID=defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g --set MODEL_REVISION=main
```

These prompt templates are a result of the model's training data and process. Please refer to [this section of the Hugging Face model card](https://huggingface.co/defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g#prompt-format-for-function-calling) for more details.
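
For reference, a ChatML-formatted prompt looks like the following; the system and user messages are placeholder examples:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is LeapfrogAI?<|im_end|>
<|im_start|>assistant
```
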
If you change the model, you will likely need to change the generation and engine runtime configurations as well; please see the [Zarf package definition](./zarf.yaml) and [values override file](./values/upstream-values.yaml) for details on which runtime parameters can be modified. These parameters are model-specific and can be found in the Hugging Face model cards and/or configuration files (e.g., prompt templates).

### Deployment

5 changes: 0 additions & 5 deletions packages/vllm/src/config.py
@@ -5,11 +5,6 @@


class ConfigOptions(BaseConfig):
model_source: str = Field(
title="Model Files Location",
description="Location of the model files to be loaded into the vLLM engine.",
examples=["/data/.model"],
)
tensor_parallel_size: int = Field(
title="GPU Utilization Count",
description="The number of gpus to spread the tensor processing across."
