CUDA Memory - GRPCs do not get reused or alternatively removed #1729
Comments
Only the gRPC process of the embedding is affected, when I run the embedding on another machine without CUDA and Embedding config:
|
Same issue here. I'm running I first try
Which works, and then another model
Which fails. Doing them separately after
Relevant section of docker log
docker logs localaiLog
docker-compose.yml
|
@35develr I think I found a solution. After some digging into the #1341 watchdog implementation, I found another issue that mentions a flag for "one single active backend": #909 Solution: #925
docker-compose.yml
version: "3.9"
services:
  api:
    image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
    container_name: localai
    # For a specific version:
    # image: localai/localai:v2.12.4-aio-cpu
    # For Nvidia GPUs uncomment one of the following (cuda11 or cuda12):
    # Find out which version: `nvcc --version` (be aware, `nvidia-smi` only gives you max compatibility, it is
    # not the nvidia container toolkit version installed)
    # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-11
    # image: localai/localai:v2.12.4-aio-gpu-nvidia-cuda-12
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-11
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080/readyz" ]
      interval: 1m
      timeout: 20m
      retries: 5
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - SINGLE_ACTIVE_BACKEND=true
      - PARALLEL_REQUESTS=false
      - WATCHDOG_IDLE=true
      - WATCHDOG_BUSY=true
      - WATCHDOG_IDLE_TIMEOUT=5m
      - WATCHDOG_BUSY_TIMEOUT=5m
      #- GALLERIES: '[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
    volumes:
      - ./models:/build/models:cached
      #- ./images:/tmp
    restart: unless-stopped
    # uncomment the following piece if running with Nvidia GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
EDIT: I also added the WATCHDOG env vars, which will clear all VRAM from time to time. |
Thanks for the workaround, but I would still want a better strategy, e.g. when VRAM can't hold a new model, kill all currently idle processes. |
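To make the strategy proposed above concrete, here is a minimal Python sketch. It is not LocalAI code: the `Backend` record, the `evict_idle_until_fits` function, and the VRAM figures are all hypothetical names and numbers chosen for illustration. It is also a slightly more conservative variant of "kill all idle processes": it stops idle backends largest first and only as many as the new model needs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Backend:
    """Hypothetical record for one running backend process."""
    model: str
    vram_mib: int       # VRAM currently held by the process
    busy: bool = False  # True while the process is serving a request


def evict_idle_until_fits(backends: Dict[str, Backend],
                          free_mib: int,
                          needed_mib: int) -> List[str]:
    """Return the models to stop so that `needed_mib` of VRAM becomes free.

    Idle backends are chosen largest first; busy backends are never chosen.
    Raises MemoryError if stopping every idle backend is still not enough.
    """
    if free_mib >= needed_mib:
        return []  # the new model already fits, nothing to stop

    idle = sorted((b for b in backends.values() if not b.busy),
                  key=lambda b: b.vram_mib, reverse=True)
    to_stop: List[str] = []
    for backend in idle:
        to_stop.append(backend.model)
        free_mib += backend.vram_mib
        if free_mib >= needed_mib:
            return to_stop

    raise MemoryError(
        f"need {needed_mib} MiB but only {free_mib} MiB can be freed")
```

The function only decides what to stop; actually terminating the processes and measuring free VRAM are left to the caller, since those parts depend on how the server manages its backends.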
LocalAI version:
master-cublas-cuda12-ffmpeg (from 19.02.2024)
Environment, CPU architecture, OS, and Version:
Ubuntu 22.04.3 LTS, LocalAI running in Docker, CUDA12
Config:
docker-compose.yaml:
gpt-4.yaml:
Describe the bug
For each chat request a new gRPC process is started, but it is not unloaded when the request completes.
So my CUDA memory fills up more and more until I get a CUDA out-of-memory error.
nvitop:
To Reproduce
Expected behavior
Stop the gRPC process after the request completes and free its CUDA memory, or reuse the existing gRPC process.
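As an illustration of the expected behavior, the following Python sketch keeps at most one process per model and, optionally, stops every other model's process before serving a request. This is an assumption-laden sketch, not LocalAI's implementation: `BackendRegistry`, `spawn`, `stop`, and `single_active` are hypothetical names.

```python
from typing import Callable, Dict


class BackendRegistry:
    """Hypothetical registry that keeps at most one process per model."""

    def __init__(self,
                 spawn: Callable[[str], object],
                 stop: Callable[[object], None],
                 single_active: bool = False) -> None:
        self._spawn = spawn                  # starts a process for a model
        self._stop = stop                    # terminates a process
        self._single_active = single_active  # keep only one model loaded
        self._running: Dict[str, object] = {}

    def get(self, model: str) -> object:
        """Return the process for `model`, starting one only if none exists."""
        if self._single_active:
            # Stop every backend except the requested one to free its memory.
            for other in [m for m in self._running if m != model]:
                self._stop(self._running.pop(other))
        if model not in self._running:
            self._running[model] = self._spawn(model)
        return self._running[model]
```

With a registry like this, repeated requests for the same model reuse one process, so memory use stays bounded by the number of distinct models rather than the number of requests.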
Logs
10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'
10:52AM DBG Parameter Config: &{PredictionOptions:{Model:paraphrase-multilingual-mpnet-base-v2 Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:paraphrase-multilingual-mpnet-base-v2 F16:false Threads:4 Debug:true Roles:map[] Embeddings:true Backend:sentencetransformers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[us der Versicherungsbranche.] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
10:52AM DBG [single-backend] Stopping mistral-7b-instruct-v0.2.Q5_K_M.gguf
10:52AM INF Loading model 'paraphrase-multilingual-mpnet-base-v2' with backend sentencetransformers
10:52AM DBG Loading model in memory from file: /models/paraphrase-multilingual-mpnet-base-v2
10:52AM DBG Loading Model paraphrase-multilingual-mpnet-base-v2 with gRPC (file: /models/paraphrase-multilingual-mpnet-base-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:paraphrase-multilingual-mpnet-base-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000640000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
10:52AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
10:52AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
10:52AM DBG GRPC Service for paraphrase-multilingual-mpnet-base-v2 will be running at: '127.0.0.1:38613'
10:52AM DBG GRPC Service state dir: /tmp/go-processmanager1753323238
10:52AM DBG GRPC Service Started
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Server started. Listening on: 127.0.0.1:38613
10:52AM DBG GRPC Service Ready
10:52AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:paraphrase-multilingual-mpnet-base-v2 ContextSize:0 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:true NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/paraphrase-multilingual-mpnet-base-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self.fget.get(instance, owner)()
10:52AM DBG Stopping all backends except 'paraphrase-multilingual-mpnet-base-v2'
10:52AM DBG Model already loaded in memory: paraphrase-multilingual-mpnet-base-v2
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: Gebe 20 Beispiele für Arbeitgeberpositionierung a
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[172.18.0.1]:57394 500 - POST /v1/embeddings
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Calculated embeddings for: us der Versicherungsbranche.
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr Traceback (most recent call last):
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/grpc/_server.py", line 552, in _call_behavior
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr response_or_iterator = behavior(argument, context)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/build/backend/python/sentencetransformers/sentencetransformers.py", line 80, in Embedding
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr sentence_embeddings = self.model.encode(request.Embeddings)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr self.to(device)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return self._apply(convert)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr module._apply(fn)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr [Previous line repeated 4 more times]
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr param_applied = fn(param)
[172.18.0.1]:57408 500 - POST /v1/embeddings
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr File "/opt/conda/envs/transformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10:52AM DBG GRPC(paraphrase-multilingual-mpnet-base-v2-127.0.0.1:38613): stderr torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.93 GiB of which 17.62 MiB is free. Process 112775 has 1.55 GiB memory in use. Process 165903 has 1.55 GiB memory in use. Process 170069 has 1.55 GiB memory in use. Process 175275 has 1.25 GiB memory in use. Of the allocated memory 878.23 MiB is allocated by PyTorch, and 17.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
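The out-of-memory message in the log above lists how much memory each process holds. The following Python sketch extracts those numbers so they can be totalled. The regular expressions are written against the message text shown in this log (including its "capacty" spelling) and are an assumption about its format, which may differ in other PyTorch versions; `summarize_oom` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional, Tuple

# Patterns matched against the OOM text as it appears in the log above.
PROCESS_RE = re.compile(r"Process (\d+) has ([\d.]+) (MiB|GiB) memory in use")
TOTAL_RE = re.compile(r"total capac\w* of ([\d.]+) (MiB|GiB)")


def to_mib(value: str, unit: str) -> float:
    """Convert a '<number> MiB|GiB' pair to MiB."""
    return float(value) * (1024 if unit == "GiB" else 1)


def summarize_oom(message: str) -> Tuple[Optional[float], Dict[int, float]]:
    """Return (total GPU memory in MiB, {pid: used MiB}) from an OOM message."""
    total = TOTAL_RE.search(message)
    total_mib = to_mib(*total.groups()) if total else None
    per_process = {int(pid): to_mib(value, unit)
                   for pid, value, unit in PROCESS_RE.findall(message)}
    return total_mib, per_process
```

Applied to the message above, this yields four distinct process IDs holding 1.55 + 1.55 + 1.55 + 1.25 = 5.90 GiB of the 5.93 GiB total, which is consistent with several backend processes being alive at the same time.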