CUDA Memory - GRPCs do not get reused or alternatively removed #1729

Open
35develr opened this issue Feb 20, 2024 · 4 comments
Labels
bug (Something isn't working), unconfirmed

Comments

@35develr

LocalAI version:
master-cublas-cuda12-ffmpeg (from 19.02.2024)

Environment, CPU architecture, OS, and Version:
Ubuntu 22.04.3 LTS, LocalAI running in Docker, CUDA12

Config:
docker-compose.yaml:

services:

  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

gpt-4.yaml:

context_size: 4096
f16: false
#gpu_layers: 10 # GPU Layers (only used when built with cublas)
mmap: true
name: gpt-4
parameters:
  model: mistral-7b-instruct-v0.2.Q5_K_M.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
stopwords:
- <|im_end|>
template:
  chat: chatml-block
  chat_message: chatml
  completion: completion
threads: 6

Describe the bug
For each chat request a new gRPC process is started, but it is not unloaded once the request completes.
As a result my CUDA memory fills up more and more until CUDA runs out of memory.

nvitop:

╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.3.2      Driver Version: 525.147.05      CUDA Driver Version: 12.0 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  GeForce GTX 1060    On   │ 00000000:01:00.0 Off │                  N/A │
│ N/A   43C    P8      6W / 88W │    6126MiB / 6144MiB │      0%      Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛
[ CPU: ██████████▏ 35.1%                 ]  ( Load Average:  0.77  1.02  1.14 )
[ MEM: ███████████████████▍ 66.6%        ]  [ SWP: ▍ 1.6%                     ]

╒═════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                   max@max-PA70Hx │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM     TIME  COMMAND            │
╞═════════════════════════════════════════════════════════════════════════════╡
│   0    3113 G    root  4.45MiB   0   0.0   0.6  3:57:11  /usr/lib/xorg/Xo.. │
│   0  112775 C    root  1588MiB   0   0.0  11.6  1:48:22  python /build/ba.. │
│   0  165903 C    root  1588MiB   0   0.0  11.6    37:17  python /build/ba.. │
│   0  170069 C    root  1588MiB   0   0.0  11.6    33:23  python /build/ba.. │
│   0  175275 C    root  1282MiB   0   0.0  10.6    29:09  python /build/ba.. │
╘═════════════════════════════════════════════════════════════════════════════╛

To Reproduce
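A minimal reproduction sketch (assuming the API is reachable on localhost:8080 and uses the gpt-4 and paraphrase-multilingual-mpnet-base-v2 models configured above): alternate chat and embedding requests and print GPU memory usage after each round; with the bug present, usage grows on every iteration.

# Sketch only: alternate chat and embedding requests, then report GPU memory
# (localhost:8080 and the model names are taken from the configs in this issue).
for i in 1 2 3 4; do
  curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}' > /dev/null
  curl -s http://localhost:8080/v1/embeddings -H "Content-Type: application/json" \
    -d '{"model": "paraphrase-multilingual-mpnet-base-v2", "input": "Hello"}' > /dev/null
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
done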

Expected behavior
The gRPC process should be stopped and its CUDA memory freed once the request completes, or the existing gRPC process should be reused.

Logs
10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'

10:52AM DBG Parameter Config: &{PredictionOptions:{Model:paraphrase-multilingual-mpnet-base-v2 Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:paraphrase-multilingual-mpnet-base-v2 F16:false Threads:4 Debug:true Roles:map[] Embeddings:true Backend:sentencetransformers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[us der Versicherungsbranche.] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}

10:52AM DBG [single-backend] Stopping mistral-7b-instruct-v0.2.Q5_K_M.gguf

10:52AM INF Loading model 'paraphrase-multilingual-mpnet-base-v2' with backend sentencetransformers

10:52AM DBG Loading model in memory from file: /models/paraphrase-multilingual-mpnet-base-v2

10:52AM DBG Loading Model paraphrase-multilingual-mpnet-base-v2 with gRPC (file: /models/paraphrase-multilingual-mpnet-base-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:paraphrase-multilingual-mpnet-base-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000640000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}

10:52AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh

10:52AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh

10:52AM DBG GRPC Service for paraphrase-multilingual-mpnet-base-v2 will be running at: '127.0.0.1:38613'

10:52AM DBG GRPC Service state dir: /tmp/go-processmanager1753323238

10:52AM DBG GRPC Service Started

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Server started. Listening on: 127.0.0.1:38613

10:52AM DBG GRPC Service Ready

10:52AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:paraphrase-multilingual-mpnet-base-v2 ContextSize:0 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:true NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/paraphrase-multilingual-mpnet-base-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self.fget.get(instance, owner)()

10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'

10:52AM DBG Model already loaded in memory: paraphrase-multilingual-mpnet-base-v2

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: Gebe 20 Beispiele für Arbeitgeberpositionierung a

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[172.18.0.1]:57394 500 - POST /v1/embeddings

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: us der Versicherungsbranche.

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)

[172.18.0.1]:57408 500 - POST /v1/embeddings

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

35develr added the bug (Something isn't working) and unconfirmed labels on Feb 20, 2024
@35develr
Author

Only the gRPC process of the embedding model is affected. When I run the embedding on another machine without CUDA and
the main model on the local machine with CUDA, memory management is fine.

Embedding config:

name: paraphrase-multilingual-mpnet-base-v2
backend: sentencetransformers
embeddings: true
parameters:
  model: paraphrase-multilingual-mpnet-base-v2
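To isolate the leak to the sentencetransformers backend, the embeddings endpoint can be exercised directly (a sketch, assuming the API listens on localhost:8080 as in the examples further below):

curl http://localhost:8080/v1/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "Your text string goes here",
  "model": "paraphrase-multilingual-mpnet-base-v2"
}'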

@thfrei

thfrei commented Apr 14, 2024

Same issue here. I'm running localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11

I first try

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
   "model": "gpt-4-vision-preview",
   "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Which works, and then another model

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "prompt": "Why is the sky blue? Short and concise answer",
  "temperature": 0.1, "top_p": 0.1
}'

Which fails.

Doing them separately after a docker restart localai always works, just not one after another. So I always run out of memory when using different models.
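One possibly related detail: in the logs of the first report singleActiveBackend is true (LocalAI stops other backends before loading a new one), while in the log below it is false, so both models stay resident on the GPU at the same time. A sketch of a possible mitigation, assuming the option is exposed as a SINGLE_ACTIVE_BACKEND environment variable (not verified here):

# Hypothetical: start the container with single-active-backend enabled so only one
# backend is kept loaded at a time (the env var name is an assumption).
docker run --rm --gpus all -p 8080:8080 \
  -e SINGLE_ACTIVE_BACKEND=true \
  localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11

Whether the stopped process then actually releases its CUDA memory is exactly what this issue is about, so this may only reduce the problem rather than eliminate it.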

Relevant section of docker log

3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5563.66 MiB on device 0: cudaMalloc failed: out of memory

Full output of docker logs localai:
===> LocalAI All-in-One (AIO) container starting...
NVIDIA GPU detected
Sun Apr 14 15:42:17 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:09:00.0  On |                  N/A |
|  0%   48C    P0              43W / 170W |   1893MiB / 12288MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                        
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
NVIDIA GPU detected. Attempting to find memory size...
Total GPU Memory: 12288 MiB
===> Starting LocalAI[gpu-8g] with the following models: /aio/gpu-8g/embeddings.yaml,/aio/gpu-8g/text-to-speech.yaml,/aio/gpu-8g/image-gen.yaml,/aio/gpu-8g/text-to-text.yaml,/aio/gpu-8g/speech-to-text.yaml,/aio/gpu-8g/vision.yaml
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name      : AMD Ryzen 7 5700X 8-Core Processor
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
3:42PM INF Starting LocalAI using 1 threads, with models path: /build/models
3:42PM INF LocalAI version: v2.12.4 (0004ec8be3ca150ce6d8b79f2991bfe3a9dc65ad)
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/embeddings.yaml
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/text-to-speech.yaml
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/image-gen.yaml
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/text-to-text.yaml
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/speech-to-text.yaml
3:42PM DBG [startup] resolved local model: /aio/gpu-8g/vision.yaml
3:42PM INF Preloading models from /build/models

Model name: gpt-4                                                           



curl http://localhost:8080/v1/chat/completions -H "Content-Type:            
application/json" -d '{ "model": "gpt-4", "messages": [{"role": "user",     
"content": "How are you doing?", "temperature": 0.1}] }'                    


3:42PM DBG Checking "ggml-whisper-base.bin" exists and matches SHA
3:42PM DBG File "/build/models/ggml-whisper-base.bin" already exists and matches the SHA. Skipping download

Model name: whisper-1                                                       



## example audio file                                                       
                                                                            
wget --quiet --show-progress -O gb1.ogg                                     
https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
                                                                            
## Send the example audio file to the transcriptions endpoint               
                                                                            
curl http://localhost:8080/v1/audio/transcriptions  -H "Content-Type:       
multipart/form-data"  -F file="@$PWD/gb1.ogg" -F model="whisper-1"          


3:42PM DBG Checking "DreamShaper_8_pruned.safetensors" exists and matches SHA
3:42PM DBG File "/build/models/DreamShaper_8_pruned.safetensors" already exists. Skipping download

Model name: stablediffusion                                                 



3:42PM DBG Checking "llava-v1.6-mistral-7b.Q5_K_M.gguf" exists and matches SHA
curl http://localhost:8080/v1/images/generations  -H "Content-Type:         
application/json"  -d '{ "prompt": "|", "step": 25, "size": "512x512" }'    


3:42PM DBG File "/build/models/llava-v1.6-mistral-7b.Q5_K_M.gguf" already exists. Skipping download
3:42PM DBG Checking "llava-v1.6-7b-mmproj-f16.gguf" exists and matches SHA
3:42PM DBG File "/build/models/llava-v1.6-7b-mmproj-f16.gguf" already exists. Skipping download

Model name: gpt-4-vision-preview                                            



curl http://localhost:8080/v1/chat/completions -H "Content-Type:            
3:42PM DBG Checking "voice-en-us-amy-low.tar.gz" exists and matches SHA
3:42PM DBG File "/build/models/voice-en-us-amy-low.tar.gz" already exists. Skipping download
application/json" -d '{ "model": "gpt-4-vision-preview", "messages": [{"role":
"user", "content": [{"type":"text", "text": "What is in the image?"},       
{"type": "image_url", "image_url": {"url":                                  
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-   
madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-   
boardwalk.jpg" }}], "temperature": 0.9}]}'                                  



Model name: tts-1                                                           



To test if this model works as expected, you can use the following curl     
command:                                                                    
                                                                            
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{    
"model":"tts-1", "input": "Hi, this is a test." }'                          



Model name: text-embedding-ada-002                                          



You can test this model with curl like this:                                
                                                                            
curl http://localhost:8080/embeddings -X POST -H "Content-Type:             
application/json" -d '{ "input": "Your text string goes here", "model": "text-
embedding-ada-002" }'                                                       


3:42PM DBG Model: text-embedding-ada-002 (config: {PredictionOptions:{Model:all-MiniLM-L6-v2 Language: N:0 TopP:0xc0004a7f20 TopK:0xc0004a7f28 Temperature:0xc0004a7f30 Maxtokens:0xc0004a7f38 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0004a7f60 TypicalP:0xc0004a7f58 Seed:0xc0004a7f78 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text-embedding-ada-002 F16:0xc0004a7f18 Threads:0xc0004a7f10 Debug:0xc0004a7f70 Roles:map[] Embeddings:false Backend:sentencetransformers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0004a7f50 MirostatTAU:0xc0004a7f48 Mirostat:0xc0004a7f40 NGPULayers:0xc0004a7f68 MMap:0xc0004a7f70 MMlock:0xc0004a7f71 LowVRAM:0xc0004a7f71 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0004a7f08 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:You can test this model with curl like this:

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'})
3:42PM DBG Model: gpt-4 (config: {PredictionOptions:{Model:5c7cd056ecf9a4bb5b527410b97f48cb Language: N:0 TopP:0xc000014830 TopK:0xc000014838 Temperature:0xc000014840 Maxtokens:0xc000014848 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000014890 TypicalP:0xc000014868 Seed:0xc0000148c8 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4 F16:0xc000014600 Threads:0xc000014820 Debug:0xc0000148c0 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
ChatMessage:<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
{{- if .FunctionCall }}<tool_call>{{end}}
{{- if eq .RoleName "tool" }}<tool_result>{{end }}
{{- if .Content}}
{{.Content}}
{{- end }}
{{- if .FunctionCall}}{{toJson .FunctionCall}}{{end }}
{{- if .FunctionCall }}</tool_call>{{end }}
{{- if eq .RoleName "tool" }}</tool_result>{{end }}
<|im_end|>
Completion:{{.Input}}
Edit: Functions:<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools>
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
</tools>
Use the following pydantic model json schema for each tool call you will make:
{'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
<|im_end|>
{{.Input -}}
<|im_start|>assistant
<tool_call>
} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000014860 MirostatTAU:0xc000014858 Mirostat:0xc000014850 NGPULayers:0xc000014898 MMap:0xc00001456d MMlock:0xc0000148c1 LowVRAM:0xc0000148c1 Grammar: StopWords:[<|im_end|> <dummy32000> 
</tool_call> 


] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000145f0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
})
3:42PM DBG Model: whisper-1 (config: {PredictionOptions:{Model:ggml-whisper-base.bin Language: N:0 TopP:0xc000014a38 TopK:0xc000014a50 Temperature:0xc000014a58 Maxtokens:0xc000014a60 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000014a98 TypicalP:0xc000014a90 Seed:0xc000014ad0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:whisper-1 F16:0xc000014a30 Threads:0xc000014a28 Debug:0xc000014aa8 Roles:map[] Embeddings:false Backend:whisper TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000014a78 MirostatTAU:0xc000014a70 Mirostat:0xc000014a68 NGPULayers:0xc000014aa0 MMap:0xc000014aa8 MMlock:0xc000014aa9 LowVRAM:0xc000014aa9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000014a20 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:ggml-whisper-base.bin SHA256:60ed5bc3dd14eea856493d334349b405782ddcaf0028d4b5df4088345fba2efe URI:https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin}] Description: Usage:## example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions \
    -H "Content-Type: multipart/form-data" \
    -F file="@$PWD/gb1.ogg" -F model="whisper-1"
})
3:42PM DBG Model: stablediffusion (config: {PredictionOptions:{Model:DreamShaper_8_pruned.safetensors Language: N:0 TopP:0xc000014da8 TopK:0xc000014db0 Temperature:0xc000014db8 Maxtokens:0xc000014dc0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000014de8 TypicalP:0xc000014de0 Seed:0xc000014e10 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:stablediffusion F16:0xc000014cf5 Threads:0xc000014d78 Debug:0xc000014df8 Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000014dd8 MirostatTAU:0xc000014dd0 Mirostat:0xc000014dc8 NGPULayers:0xc000014df0 MMap:0xc000014df8 MMlock:0xc000014df9 LowVRAM:0xc000014df9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000014d70 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType:StableDiffusionPipeline SchedulerType:k_dpmpp_2m EnableParameters:negative_prompt,num_inference_steps CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:25 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:DreamShaper_8_pruned.safetensors SHA256: URI:huggingface://Lykon/DreamShaper/DreamShaper_8_pruned.safetensors}] Description: Usage:curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
    "prompt": "<positive prompt>|<negative prompt>",
    "step": 25,
    "size": "512x512"
}'})
3:42PM DBG Model: gpt-4-vision-preview (config: {PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc0000159c8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
})
3:42PM DBG Model: tts-1 (config: {PredictionOptions:{Model:en-us-amy-low.onnx Language: N:0 TopP:0xc000015b38 TopK:0xc000015b40 Temperature:0xc000015b48 Maxtokens:0xc000015b50 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000015b88 TypicalP:0xc000015b80 Seed:0xc000015ba0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:tts-1 F16:0xc000015b30 Threads:0xc000015b28 Debug:0xc000015b98 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015b78 MirostatTAU:0xc000015b70 Mirostat:0xc000015b58 NGPULayers:0xc000015b90 MMap:0xc000015b98 MMlock:0xc000015b99 LowVRAM:0xc000015b99 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000015b20 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:voice-en-us-amy-low.tar.gz SHA256: URI:https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-en-us-amy-low.tar.gz}] Description: Usage:To test if this model works as expected, you can use the following curl command:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model":"tts-1",
"input": "Hi, this is a test."
}'})
3:42PM DBG Extracting backend assets files to /tmp/localai/backend_data
3:42PM INF core/startup process completed!
3:42PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
3:42PM DBG No configuration file found at /tmp/localai/config/assistants.json
3:42PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json

┌───────────────────────────────────────────────────┐ 
│                   Fiber v2.52.0                   │ 
│               http://127.0.0.1:8080               │ 
│       (bound on host 0.0.0.0 and port 8080)       │ 
│                                                   │ 
│ Handlers ........... 181  Processes ........... 1 │ 
│ Prefork ....... Disabled  PID ................. 1 │ 
└───────────────────────────────────────────────────┘ 

[127.0.0.1]:39546 200 - GET /readyz
3:43PM DBG Request received: {"model":"gpt-4-vision-preview","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":[{"text":"What is in the image?","type":"text"},{"image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},"type":"image_url"}]}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
3:43PM DBG Configuration read: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc00034e0e8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:43PM DBG Parameters: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc00034e0e8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:43PM DBG Prompt (before templating): USER:[img-0]What is in the image?
3:43PM DBG Template found, input modified to: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:43PM DBG Prompt (after templating): A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:43PM INF Loading model 'llava-v1.6-mistral-7b.Q5_K_M.gguf' with backend llama-cpp
3:43PM DBG Loading model in memory from file: /build/models/llava-v1.6-mistral-7b.Q5_K_M.gguf
3:43PM DBG Loading Model llava-v1.6-mistral-7b.Q5_K_M.gguf with gRPC (file: /build/models/llava-v1.6-mistral-7b.Q5_K_M.gguf) (backend: llama-cpp): {backendString:llama-cpp model:llava-v1.6-mistral-7b.Q5_K_M.gguf threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9800 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:43PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
3:43PM DBG GRPC Service for llava-v1.6-mistral-7b.Q5_K_M.gguf will be running at: '127.0.0.1:36827'
3:43PM DBG GRPC Service state dir: /tmp/go-processmanager2197649184
3:43PM DBG GRPC Service Started
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout Server listening on 127.0.0.1:36827
3:43PM DBG GRPC Service Ready
3:43PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llava-v1.6-mistral-7b.Q5_K_M.gguf ContextSize:4096 Seed:1939866591 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/llava-v1.6-mistral-7b.Q5_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109439,"level":"INFO","function":"load_model","line":449,"message":"Multi Modal Mode Enabled"}
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr ggml_cuda_init: found 1 CUDA devices:
3:43PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /build/models/llava-v1.6-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   0:                       general.architecture str              = llama
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   1:                               general.name str              = 1.6
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   4:                          llama.block_count u32              = 32
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  11:                          general.file_type u32              = 17
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - kv  23:               general.quantization_version u32              = 2
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - type  f32:   65 tensors
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - type q5_K:  193 tensors
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_model_loader: - type q6_K:   33 tensors
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ).
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: format           = GGUF V3 (latest)
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: arch             = llama
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: vocab type       = SPM
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_vocab          = 32000
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_merges         = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_ctx_train      = 32768
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_embd           = 4096
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_head           = 32
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_head_kv        = 8
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_layer          = 32
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_rot            = 128
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_embd_head_k    = 128
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_embd_head_v    = 128
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_gqa            = 4
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_embd_k_gqa     = 1024
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_embd_v_gqa     = 1024
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: f_norm_eps       = 0.0e+00
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_ff             = 14336
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_expert         = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_expert_used    = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: causal attn      = 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: pooling type     = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: rope type        = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: rope scaling     = linear
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: freq_base_train  = 1000000.0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: freq_scale_train = 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: n_yarn_orig_ctx  = 32768
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: rope_finetuned   = unknown
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: ssm_d_conv       = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: ssm_d_inner      = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: ssm_d_state      = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: ssm_dt_rank      = 0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: model type       = 7B
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: model ftype      = Q5_K - Medium
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: model params     = 7.24 B
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: general.name     = 1.6
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: BOS token        = 1 '<s>'
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: EOS token        = 2 '</s>'
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: UNK token        = 0 '<unk>'
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: PAD token        = 0 '<unk>'
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_print_meta: LF token         = 13 '<0x0A>'
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors: ggml ctx size =    0.22 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors: offloading 32 repeating layers to GPU
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors: offloading non-repeating layers to GPU
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors: offloaded 33/33 layers to GPU
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors:        CPU buffer size =    85.94 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llm_load_tensors:      CUDA0 buffer size =  4807.05 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr ...................................................................................................
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: n_ctx      = 4096
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: n_batch    = 512
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: n_ubatch   = 512
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: freq_base  = 1000000.0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: freq_scale = 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model:      CUDA0 compute buffer size =   296.00 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: graph nodes  = 1030
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stderr llama_new_context_with_model: graph splits = 2
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: model name:   vit-large336-custom
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: description:  image encoder for LLaVA
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: GGUF version: 3
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: alignment:    32
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: n_tensors:    378
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: n_kv:         25
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: ftype:        f16
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout 
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: loaded meta data with 25 key-value pairs and 378 tensors from /build/models/llava-v1.6-7b-mmproj-f16.gguf
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   0:                       general.architecture str              = clip
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   4:                          general.file_type u32              = 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   5:                               general.name str              = vit-large336-custom
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   7:                        clip.projector_type str              = mlp
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  16:           clip.vision.image_grid_pinpoints arr[i32,10]      = [336, 672, 672, 336, 672, 672, 1008, ...
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  17:          clip.vision.image_crop_resolution u32              = 224
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  18:             clip.vision.image_aspect_ratio str              = anyres
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  19:         clip.vision.image_split_resolution u32              = 224
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  20:            clip.vision.mm_patch_merge_type str              = spatial_unpad
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  21:              clip.vision.mm_projector_type str              = mlp2x_gelu
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - kv  24:                              clip.use_gelu bool             = false
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - type  f32:  236 tensors
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: - type  f16:  142 tensors
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: CLIP using CUDA backend
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: text_encoder:   0
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: vision_encoder: 1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: llava_projector:  1
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: model size:     595.50 MB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: metadata size:  0.14 MB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: params backend buffer size =  595.50 MB (378 tensors)
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout clip_model_load: compute allocated memory: 32.89 MB
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109441,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":1}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109441,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":4096}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109441,"level":"INFO","function":"launch_slot_with_data","line":884,"message":"slot is processing task","slot_id":0,"task_id":0}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109441,"level":"INFO","function":"update_slots","line":1783,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":0,"p0":0}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: 5 segments encoded in   297.51 ms
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image embedding created: 2880 tokens
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout 
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image encoded in   364.55 ms by CLIP (    0.13 ms per image patch)
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109447,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =    2977.07 ms /     1 tokens ( 2977.07 ms per token,     0.34 tokens per second)","slot_id":0,"task_id":0,"t_prompt_processing":2977.069,"num_prompt_tokens_processed":1,"t_token":2977.069,"n_tokens_second":0.33590084744424803}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109447,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =    2828.64 ms /   117 runs   (   24.18 ms per token,    41.36 tokens per second)","slot_id":0,"task_id":0,"t_token_generation":2828.635,"n_decoded":117,"t_token":24.176367521367524,"n_tokens_second":41.362706747247344}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109447,"level":"INFO","function":"print_timings","line":351,"message":"          total time =    5805.70 ms","slot_id":0,"task_id":0,"t_prompt_processing":2977.069,"t_token_generation":2828.635,"t_total":5805.704}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109447,"level":"INFO","function":"update_slots","line":1594,"message":"slot released","slot_id":0,"task_id":0,"n_ctx":4096,"n_past":3041,"n_system_tokens":0,"n_cache_tokens":118,"truncated":false}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109447,"level":"INFO","function":"update_slots","line":1547,"message":"all slots are idle and system prompt is empty, clear the KV cache"}
3:44PM DBG Response: {"created":1713109338,"object":"chat.completion","id":"00b8b7a8-86f2-47cb-9435-5a7e7c748366","model":"gpt-4-vision-preview","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a boardwalk or a wooden walkway, and it is surrounded by a natural landscape. The sky is clear and blue, suggesting a sunny day. The grass is green and appears to be quite tall, indicating that it might be late spring or early summer. There are no visible buildings or other man-made structures in the immediate vicinity of the pathway. The overall scene is peaceful and serene, with a sense of tranquility and solitude. "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.18.0.1]:44274 200 - POST /v1/chat/completions
3:44PM DBG Request received: {"model":"gpt-4-vision-preview","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":[{"text":"What is in the image?","type":"text"},{"image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},"type":"image_url"}]}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
3:44PM DBG Configuration read: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc000128ad8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:44PM DBG Parameters: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc000128ad8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:44PM DBG Prompt (before templating): USER:[img-0]What is in the image?
3:44PM DBG Template found, input modified to: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:44PM DBG Prompt (after templating): A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:44PM INF Loading model 'llava-v1.6-mistral-7b.Q5_K_M.gguf' with backend llama-cpp
3:44PM DBG Model already loaded in memory: llava-v1.6-mistral-7b.Q5_K_M.gguf
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109450,"level":"INFO","function":"launch_slot_with_data","line":884,"message":"slot is processing task","slot_id":0,"task_id":119}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109450,"level":"INFO","function":"update_slots","line":1783,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":119,"p0":0}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: 5 segments encoded in   306.36 ms
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image embedding created: 2880 tokens
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout 
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image encoded in   372.08 ms by CLIP (    0.13 ms per image patch)
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109455,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =    2988.96 ms /     1 tokens ( 2988.96 ms per token,     0.33 tokens per second)","slot_id":0,"task_id":119,"t_prompt_processing":2988.962,"num_prompt_tokens_processed":1,"t_token":2988.962,"n_tokens_second":0.3345643069400012}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109455,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =    2087.34 ms /    87 runs   (   23.99 ms per token,    41.68 tokens per second)","slot_id":0,"task_id":119,"t_token_generation":2087.337,"n_decoded":87,"t_token":23.992379310344827,"n_tokens_second":41.67990123300646}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109455,"level":"INFO","function":"print_timings","line":351,"message":"          total time =    5076.30 ms","slot_id":0,"task_id":119,"t_prompt_processing":2988.962,"t_token_generation":2087.337,"t_total":5076.299}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109455,"level":"INFO","function":"update_slots","line":1594,"message":"slot released","slot_id":0,"task_id":119,"n_ctx":4096,"n_past":3011,"n_system_tokens":0,"n_cache_tokens":88,"truncated":false}
3:44PM DBG Response: {"created":1713109338,"object":"chat.completion","id":"00b8b7a8-86f2-47cb-9435-5a7e7c748366","model":"gpt-4-vision-preview","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a boardwalk or a wooden walkway, and it is surrounded by lush green grass. The sky is clear and blue, indicating a sunny day. There are no visible people or animals in the image. The overall scene suggests a peaceful, natural setting, possibly in a rural or semi-rural area. "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.18.0.1]:33348 200 - POST /v1/chat/completions
3:44PM DBG Request received: {"model":"gpt-4-vision-preview","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":[{"text":"What is in the image?","type":"text"},{"image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},"type":"image_url"}]}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
3:44PM DBG Configuration read: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc0003da658 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:44PM DBG Parameters: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.Q5_K_M.gguf Language: N:0 TopP:0xc000015910 TopK:0xc0000158e8 Temperature:0xc0000158c8 Maxtokens:0xc000015980 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0000159a8 TypicalP:0xc0000159a0 Seed:0xc000015930 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4-vision-preview F16:0xc0000158c0 Threads:0xc000015958 Debug:0xc0003da658 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama-cpp TemplateConfig:{Chat:A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
{{.Input}}
ASSISTANT:
ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000015998 MirostatTAU:0xc000015990 Mirostat:0xc000015988 NGPULayers:0xc0000159c0 MMap:0xc0000158c1 MMlock:0xc0000159c9 LowVRAM:0xc0000159c9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000158b0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj:llava-v1.6-7b-mmproj-f16.gguf RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[{Filename:llava-v1.6-mistral-7b.Q5_K_M.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf} {Filename:llava-v1.6-7b-mmproj-f16.gguf SHA256: URI:huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf}] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4-vision-preview",
    "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
}
3:44PM DBG Prompt (before templating): USER:[img-0]What is in the image?
3:44PM DBG Template found, input modified to: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:44PM DBG Prompt (after templating): A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
USER:[img-0]What is in the image?
ASSISTANT:

3:44PM INF Loading model 'llava-v1.6-mistral-7b.Q5_K_M.gguf' with backend llama-cpp
3:44PM DBG Model already loaded in memory: llava-v1.6-mistral-7b.Q5_K_M.gguf
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109457,"level":"INFO","function":"launch_slot_with_data","line":884,"message":"slot is processing task","slot_id":0,"task_id":208}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109457,"level":"INFO","function":"update_slots","line":1783,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":208,"p0":0}
[127.0.0.1]:39302 200 - GET /readyz
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: 5 segments encoded in   304.81 ms
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image embedding created: 2880 tokens
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout 
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout encode_image_with_clip: image encoded in   368.64 ms by CLIP (    0.13 ms per image patch)
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109462,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =    2986.61 ms /     1 tokens ( 2986.61 ms per token,     0.33 tokens per second)","slot_id":0,"task_id":208,"t_prompt_processing":2986.609,"num_prompt_tokens_processed":1,"t_token":2986.609,"n_tokens_second":0.33482789344035324}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109462,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =    2532.34 ms /   105 runs   (   24.12 ms per token,    41.46 tokens per second)","slot_id":0,"task_id":208,"t_token_generation":2532.345,"n_decoded":105,"t_token":24.117571428571427,"n_tokens_second":41.463544659199286}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109462,"level":"INFO","function":"print_timings","line":351,"message":"          total time =    5518.95 ms","slot_id":0,"task_id":208,"t_prompt_processing":2986.609,"t_token_generation":2532.345,"t_total":5518.954}
3:44PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:36827): stdout {"timestamp":1713109462,"level":"INFO","function":"update_slots","line":1594,"message":"slot released","slot_id":0,"task_id":208,"n_ctx":4096,"n_past":3029,"n_system_tokens":0,"n_cache_tokens":106,"truncated":false}
3:44PM DBG Response: {"created":1713109338,"object":"chat.completion","id":"00b8b7a8-86f2-47cb-9435-5a7e7c748366","model":"gpt-4-vision-preview","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a simple, unpaved walkway, possibly in a rural or natural setting. The sky is clear and blue, suggesting a sunny day. The grass is green and appears to be quite tall, indicating that it might be late spring or early summer. There are no visible buildings or structures in the immediate vicinity of the pathway, which adds to the sense of tranquility and natural beauty. "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.18.0.1]:33350 200 - POST /v1/chat/completions
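Up to here the two follow-up vision requests are handled by the same llava gRPC process ("Model already loaded in memory"), so reuse works as long as the same model is requested. The gpt-4 request below is what starts a second llama-cpp gRPC process, while the llava process keeps its VRAM allocated. The loader options printed further down show singleActiveBackend:false; assuming that corresponds to LocalAI's SINGLE_ACTIVE_BACKEND environment variable (an assumption based on the flag name, not verified), a partial workaround could be to allow only one resident backend, e.g. in docker-compose:

services:
  api:
    environment:
      # Assumption: maps to the singleActiveBackend flag visible in the gRPC
      # loader options below; other backends are stopped before a new model loads.
      - SINGLE_ACTIVE_BACKEND=true

This would not fix the missing unload itself, but it should keep only one model's weights on the GPU at a time.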
3:44PM DBG Request received: {"model":"gpt-4","language":"","n":0,"top_p":0.1,"top_k":null,"temperature":0.1,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":"Why is the sky blue? Short and concise answer","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
3:44PM DBG `input`: &{PredictionOptions:{Model:gpt-4 Language: N:0 TopP:0xc00039e500 TopK:<nil> Temperature:0xc00039e450 Maxtokens:<nil> Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:<nil> TypicalP:<nil> Seed:<nil> NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Context:context.Background.WithCancel Cancel:0x4ab9a0 File: ResponseFormat:{Type:} Size: Prompt:Why is the sky blue? Short and concise answer Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Tools:[] ToolsChoice:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil> Backend: ModelBaseName:}
3:44PM DBG Parameter Config: &{PredictionOptions:{Model:5c7cd056ecf9a4bb5b527410b97f48cb Language: N:0 TopP:0xc00039e500 TopK:0xc000014838 Temperature:0xc00039e450 Maxtokens:0xc000014848 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000014890 TypicalP:0xc000014868 Seed:0xc0000148c8 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4 F16:0xc000014600 Threads:0xc000014820 Debug:0xc00039fa58 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
ChatMessage:<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
{{- if .FunctionCall }}<tool_call>{{end}}
{{- if eq .RoleName "tool" }}<tool_result>{{end }}
{{- if .Content}}
{{.Content}}
{{- end }}
{{- if .FunctionCall}}{{toJson .FunctionCall}}{{end }}
{{- if .FunctionCall }}</tool_call>{{end }}
{{- if eq .RoleName "tool" }}</tool_result>{{end }}
<|im_end|>
Completion:{{.Input}}
Edit: Functions:<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools>
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
</tools>
Use the following pydantic model json schema for each tool call you will make:
{'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call>
<|im_end|>
{{.Input -}}
<|im_start|>assistant
<tool_call>
} PromptStrings:[Why is the sky blue? Short and concise answer] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000014860 MirostatTAU:0xc000014858 Mirostat:0xc000014850 NGPULayers:0xc000014898 MMap:0xc00001456d MMlock:0xc0000148c1 LowVRAM:0xc0000148c1 Grammar: StopWords:[<|im_end|> <dummy32000> 
</tool_call> 


] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0000145f0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
}
3:44PM DBG Template found, input modified to: Why is the sky blue? Short and concise answer

3:44PM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/petals/run.sh
3:44PM INF [llama-cpp] Attempting to load
3:44PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
3:44PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-cpp): {backendString:llama-cpp model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:44PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
3:44PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44793'
3:44PM DBG GRPC Service state dir: /tmp/go-processmanager1974255675
3:44PM DBG GRPC Service Started
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stdout Server listening on 127.0.0.1:44793
3:44PM DBG GRPC Service Ready
3:44PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest))
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   0:                       general.architecture str              = llama
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   1:                               general.name str              = jeffq
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   4:                          llama.block_count u32              = 32
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  11:                          general.file_type u32              = 18
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32032]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32032]   = [0.000000, 0.000000, 0.000000, 0.0000...
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32032]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - kv  21:               general.quantization_version u32              = 2
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - type  f32:   65 tensors
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_loader: - type q6_K:  226 tensors
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ).
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: format           = GGUF V3 (latest)
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: arch             = llama
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: vocab type       = SPM
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_vocab          = 32032
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_merges         = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_ctx_train      = 32768
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_embd           = 4096
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_head           = 32
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_head_kv        = 8
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_layer          = 32
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_rot            = 128
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_embd_head_k    = 128
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_embd_head_v    = 128
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_gqa            = 4
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_embd_k_gqa     = 1024
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_embd_v_gqa     = 1024
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: f_norm_eps       = 0.0e+00
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_ff             = 14336
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_expert         = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_expert_used    = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: causal attn      = 1
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: pooling type     = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: rope type        = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: rope scaling     = linear
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: freq_base_train  = 10000.0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: freq_scale_train = 1
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: n_yarn_orig_ctx  = 32768
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: rope_finetuned   = unknown
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: ssm_d_conv       = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: ssm_d_inner      = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: ssm_d_state      = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: ssm_dt_rank      = 0
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: model type       = 7B
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: model ftype      = Q6_K
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: model params     = 7.24 B
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW) 
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: general.name     = jeffq
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: BOS token        = 1 '<s>'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: UNK token        = 0 '<unk>'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_print_meta: LF token         = 13 '<0x0A>'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_cuda_init: found 1 CUDA devices:
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llm_load_tensors: ggml ctx size =    0.22 MiB
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5563.66 MiB on device 0: cudaMalloc failed: out of memory
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_model_load: error loading model: unable to allocate backend buffer
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_load_model_from_file: failed to load model
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stderr llama_init_from_gpt_params: error: failed to load model '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44793): stdout {"timestamp":1713109491,"level":"ERROR","function":"load_model","line":464,"message":"unable to load model","model":"/build/models/5c7cd056ecf9a4bb5b527410b97f48cb"}
3:44PM INF [llama-cpp] Fails: could not load model: rpc error: code = Canceled desc = 
3:44PM INF [llama-ggml] Attempting to load
3:44PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-ggml
3:44PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-ggml): {backendString:llama-ggml model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:44PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-ggml
3:44PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:41631'
3:44PM DBG GRPC Service state dir: /tmp/go-processmanager3608292457
3:44PM DBG GRPC Service Started
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr 2024/04/14 15:44:51 gRPC Server listening at 127.0.0.1:41631
3:44PM DBG GRPC Service Ready
3:44PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr create_gpt_params: loading model /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr ggml_init_cublas: found 1 CUDA devices:
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr llama.cpp: loading model from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr error loading model: unknown (magic, version) combination: 46554747, 00000003; is this really a GGML file?
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr llama_load_model_from_file: failed to load model
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr llama_init_from_gpt_params: error: failed to load model '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41631): stderr load_binding_model: error: unable to load model
3:44PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
3:44PM INF [gpt4all] Attempting to load
3:44PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend gpt4all
3:44PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: gpt4all): {backendString:gpt4all model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:44PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
3:44PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:43031'
3:44PM DBG GRPC Service state dir: /tmp/go-processmanager2392465823
3:44PM DBG GRPC Service Started
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:43031): stderr 2024/04/14 15:44:57 gRPC Server listening at 127.0.0.1:43031
3:44PM DBG GRPC Service Ready
3:44PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:43031): stderr load_model: error 'Model format not supported (no matching implementation found)'
3:44PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
3:44PM INF [bert-embeddings] Attempting to load
3:44PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend bert-embeddings
3:44PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:44PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: bert-embeddings): {backendString:bert-embeddings model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:44PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/bert-embeddings
3:44PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:40077'
3:44PM DBG GRPC Service state dir: /tmp/go-processmanager3129684704
3:44PM DBG GRPC Service Started
3:44PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:40077): stderr 2024/04/14 15:44:59 gRPC Server listening at 127.0.0.1:40077
3:45PM DBG GRPC Service Ready
3:45PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:40077): stderr bert_load_from_file: invalid model file '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb' (bad magic)
3:45PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:40077): stderr bert_bootstrap: failed to load model from '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
3:45PM INF [rwkv] Attempting to load
3:45PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend rwkv
3:45PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:45PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: rwkv): {backendString:rwkv model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:45PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/rwkv
3:45PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:46379'
3:45PM DBG GRPC Service state dir: /tmp/go-processmanager851098290
3:45PM DBG GRPC Service Started
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr 2024/04/14 15:45:01 gRPC Server listening at 127.0.0.1:46379
3:45PM DBG GRPC Service Ready
3:45PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr 
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv_file_format.inc:93: header.magic == 0x67676d66
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr Invalid file header
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv_model_loading.inc:158: rwkv_fread_file_header(file.file, model.header)
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr 
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv.cpp:63: rwkv_load_model_from_file(file_path, *ctx->model)
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:46379): stderr 2024/04/14 15:45:03 InitFromFile /build/models/5c7cd056ecf9a4bb5b527410b97f48cb failed
3:45PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
3:45PM INF [whisper] Attempting to load
3:45PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend whisper
3:45PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:45PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: whisper): {backendString:whisper model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:45PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
3:45PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:38449'
3:45PM DBG GRPC Service state dir: /tmp/go-processmanager18879996
3:45PM DBG GRPC Service Started
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38449): stderr 2024/04/14 15:45:03 gRPC Server listening at 127.0.0.1:38449
3:45PM DBG GRPC Service Ready
3:45PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38449): stderr whisper_init_from_file_with_params_no_state: loading model from '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38449): stderr whisper_model_load: loading model
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38449): stderr whisper_model_load: invalid model data (bad magic)
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38449): stderr whisper_init_with_params_no_state: failed to load model
3:45PM INF [whisper] Fails: could not load model: rpc error: code = Unknown desc = unable to load model
3:45PM INF [stablediffusion] Attempting to load
3:45PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend stablediffusion
3:45PM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
3:45PM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: stablediffusion): {backendString:stablediffusion model:5c7cd056ecf9a4bb5b527410b97f48cb threads:1 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001f9a00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
3:45PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/stablediffusion
3:45PM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:41989'
3:45PM DBG GRPC Service state dir: /tmp/go-processmanager437365595
3:45PM DBG GRPC Service Started
3:45PM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:41989): stderr 2024/04/14 15:45:05 gRPC Server listening at 127.0.0.1:41989
3:45PM DBG GRPC Service Ready
3:45PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:755952728 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:1 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
3:45PM INF [stablediffusion] Loads OK
[172.18.0.1]:54412 500 - POST /v1/completions
docker-compose.yml
version: "3.9"
    services:
    api:
        image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
        container_name: localai
        # For a specific version:
        # image: localai/localai:v2.12.4-aio-cpu
        # For Nvidia GPUs decomment one of the following (cuda11 or cuda12):
        # Find out which version: `nvcc --version` (be aware, `nvidia-smi` only gives you max compatibility, it is
        # not the nvidia container toolkit version installed)
        # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
        # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-12
        # image: localai/localai:latest-aio-gpu-nvidia-cuda-11
        # image: localai/localai:latest-aio-gpu-nvidia-cuda-12
        healthcheck:
        test: [ "CMD", "curl", "-f", "http://localhost:8080/readyz" ]
        interval: 1m
        timeout: 20m
        retries: 5
        ports:
        - 8080:8080
        environment:
        - DEBUG=true
        - THREADS=1
        #- GALLERIES: '[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
        volumes:
        - ./models:/build/models:cached
        #- ./images:/tmp
        # decomment the following piece if running with Nvidia GPUs
        restart: unless-stopped
        deploy:
        resources:
            reservations:
            devices:
                - driver: nvidia
                count: 1
                capabilities: [ gpu ]

@thfrei

thfrei commented Apr 14, 2024

@35develr I think I found a solution. After digging into the watchdog implementation from #1341, I found another issue that mentions a flag to allow only a single active backend: #909

Solution: #925

SINGLE_ACTIVE_BACKEND=true

docker-compose.yml

version: "3.9"
services:
  api:
    image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
    container_name: localai
    # For a specific version:
    # image: localai/localai:v2.12.4-aio-cpu
    # For Nvidia GPUs uncomment one of the following (cuda11 or cuda12):
    # Find out which CUDA version you have with `nvcc --version` (note that `nvidia-smi` only reports the
    # maximum supported CUDA version, not the NVIDIA container toolkit version installed)
    # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
    # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-12
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-11
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080/readyz" ]
      interval: 1m
      timeout: 20m
      retries: 5
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - SINGLE_ACTIVE_BACKEND=true
      - PARALLEL_REQUESTS=false
      - WATCHDOG_IDLE=true
      - WATCHDOG_BUSY=true
      - WATCHDOG_IDLE_TIMEOUT=5m
      - WATCHDOG_BUSY_TIMEOUT=5m
      #- GALLERIES: '[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
    volumes:
      - ./models:/build/models:cached
      #- ./images:/tmp
    # uncomment the following piece if running with Nvidia GPUs
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

EDIT: I also added the WATCHDOG env vars; these free up the VRAM from time to time by stopping backends that have been idle or busy longer than the configured timeouts.
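
To sanity-check that the watchdog really frees the VRAM, one option (assuming `curl` and the NVIDIA driver tools are available on the host, and using the `gpt-4` model name from the config above) is to fire a request and then watch GPU memory drop once the idle timeout has passed:

# send a test request to the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "hello"}]}'

# then keep an eye on GPU memory; it should drop again roughly WATCHDOG_IDLE_TIMEOUT after the last request
watch -n 10 nvidia-smi --query-gpu=memory.used,memory.total --format=csv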

@ER-EPR

ER-EPR commented Apr 27, 2024

Thanks for the workaround, but I would still like a better strategy, e.g. when the VRAM can't hold a new model, kill all currently idle processes.
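
Until something like that is implemented upstream, a very blunt external workaround (only a sketch, assuming the container is named `localai` as in the compose above and that `nvidia-smi` works on the host) would be to restart the container whenever free VRAM drops below a threshold:

#!/usr/bin/env bash
# Crude stand-in for "evict backends when a new model does not fit":
# restart the LocalAI container when free VRAM falls below THRESHOLD_MIB.
THRESHOLD_MIB=1024
while true; do
  free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
  if [ "$free_mib" -lt "$THRESHOLD_MIB" ]; then
    docker restart localai
  fi
  sleep 60
done

Restarting the whole container obviously drops the currently loaded model as well, so this is only a stopgap, not the idle-only eviction described above.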
