README update, remove model_source
justinthelaw committed Sep 16, 2024
1 parent 962dd4d commit a7e3e88
Showing 2 changed files with 6 additions and 20 deletions.
21 changes: 6 additions & 15 deletions packages/vllm/README.md
@@ -14,24 +14,15 @@ See the LeapfrogAI documentation website for [system requirements](https://docs.

### Model Selection

The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Hermes-2-Pro-Mistral-7B model](https://huggingface.co/defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g).
The default model that comes with this backend in this repository's officially released images is a [4-bit quantization of the Synthia-7b model](https://huggingface.co/TheBloke/SynthIA-7B-v2.0-GPTQ).

You can optionally specify different models or quantization types using the following Docker build arguments (see the example build command after this list):
You can optionally specify different models during Zarf package creation:

- `--build-arg MAX_CONTEXT_LENGTH="32768"`: Maximum context length, which cannot exceed the model's maximum length; the greater the length, the greater the vRAM requirements
- `--build-arg TENSOR_PARALLEL_SIZE="1"`: The number of GPUs to spread the tensor processing across
- `--build-arg TRUST_REMOTE_CODE="True"`: Whether to trust inferencing code downloaded as part of the model download
- `--build-arg ENGINE_USE_RAY="False"`: Distributed, multi-node inferencing mode for the engine
- `--build-arg WORKER_USE_RAY="False"`: Distributed, multi-node inferencing mode for the worker(s)
- `--build-arg GPU_MEMORY_UTILIZATION="0.90"`: Max memory utilization (fraction, out of 1.0) for the vLLM process
- `--build-arg ENFORCE_EAGER="False"`: Set to True to disable CUDA graphs and force eager execution, which lowers GPU memory usage at the cost of inferencing speed (e.g., time-to-first-token); keep this set to False for production
- `--build-arg QUANTIZATION="None"`: None is recommended, as vLLM auto-detects the model's quantization configuration and optimizes from there. For example, GPTQ can be converted to GPTQ Marlin in certain cases, improving time-to-first-token and tokens/second performance.
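
For reference, here is a minimal sketch of how these build arguments might be combined into a single image build. The Dockerfile context path, image tag, and argument values below are illustrative assumptions, not the project's published build command:

```bash
# Illustrative only: build the vLLM backend image with overridden engine settings.
# The build context (packages/vllm) and the tag are assumptions -- adjust to the actual layout.
docker build \
  --build-arg MAX_CONTEXT_LENGTH="32768" \
  --build-arg TENSOR_PARALLEL_SIZE="1" \
  --build-arg GPU_MEMORY_UTILIZATION="0.90" \
  --build-arg ENFORCE_EAGER="False" \
  --build-arg QUANTIZATION="None" \
  -t leapfrogai/vllm:dev \
  packages/vllm
```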

## Prompt Formats

The pre-packaged model, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, contains special prompt templates for activating the function calling and JSON response modes. The default prompt template is the ChatML format.
```bash
uds zarf package create --confirm --set MODEL_REPO_ID=defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g --set MODEL_REVISION=main
```

These prompt templates are a result of the model's training data and process. Please refer to [this section of the Hugging Face model card](https://huggingface.co/defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g#prompt-format-for-function-calling) for more details.
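
For reference, a ChatML-formatted prompt looks like the following; the system and user messages are placeholder examples:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is LeapfrogAI?<|im_end|>
<|im_start|>assistant
```
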
If you change the model, you will likely need to change the generation and engine runtime configurations as well; please see the [Zarf package definition](./zarf.yaml) and [values override file](./values/upstream-values.yaml) for details on which runtime parameters can be modified. These parameters are model-specific and can be found in the Hugging Face model cards and/or configuration files (e.g., prompt templates).

### Deployment

5 changes: 0 additions & 5 deletions packages/vllm/src/config.py
@@ -5,11 +5,6 @@


class ConfigOptions(BaseConfig):
model_source: str = Field(
title="Model Files Location",
description="Location of the model files to be loaded into the vLLM engine.",
examples=["/data/.model"],
)
tensor_parallel_size: int = Field(
title="GPU Utilization Count",
description="The number of gpus to spread the tensor processing across."
