I have a multi-GPU system and use koboldcpp-rocm with row-split (7900 XTX + 2x 7600 XT, Kubuntu 24.04 LTS).
Prompt processing speed is much slower, but generation speed is faster (~70%).
In version 1.78 the splitting works differently, and I now run out of memory with the largest model (120B, IQ3_XXS), which still works with version 1.77 and earlier.
There is also CPU offloading now:
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
1.78.yr0:
llm_load_print_meta: max token length = 48
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
llm_load_tensors: offloading 88 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 89/89 layers to GPU
llm_load_tensors: ROCm0_Split model buffer size = 18665.34 MiB
llm_load_tensors: ROCm1_Split model buffer size = 13116.19 MiB
llm_load_tensors: ROCm2_Split model buffer size = 12875.72 MiB
llm_load_tensors: CPU model buffer size = 165.00 MiB
llm_load_tensors: ROCm0 model buffer size = 3.47 MiB
llm_load_tensors: ROCm1 model buffer size = 2.44 MiB
llm_load_tensors: ROCm2 model buffer size = 2.39 MiB
load_all_data: buffer type ROCm0_Split is not the default buffer type for device ROCm0 for async uploads
.........................................load_all_data: buffer type ROCm1_Split is not the default buffer type for device ROCm1 for async uploads
.............................load_all_data: buffer type ROCm2_Split is not the default buffer type for device ROCm2 for async uploads
.............................load_all_data: device CPU does not support async, host buffers or events
load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
load_all_data: using async uploads for device ROCm1, buffer type ROCm1, backend ROCm1
load_all_data: using async uploads for device ROCm2, buffer type ROCm2, backend ROCm2
.
Applying Tensor Split...Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 12288
llama_new_context_with_model: n_ctx_per_seq = 12288
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (12288) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: ROCm0 KV buffer size = 499.50 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 351.00 MiB
llama_kv_cache_init: ROCm2 KV buffer size = 337.50 MiB
llama_new_context_with_model: KV self size = 1188.00 MiB, K (q4_0): 594.00 MiB, V (q4_0): 594.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm1 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm2 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 48.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 4
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
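The 1188 MiB KV self size in this log is consistent with the settings used. A minimal sketch of the arithmetic in Python, assuming the usual Mistral Large 2407 dimensions (88 layers, 8 KV heads, head dim 128 are assumptions; the q4_0 cache type is taken from the log line above):

# Sketch: estimate the quantized KV cache size for this run (assumed model dims).
n_ctx     = 12288        # contextsize
n_layers  = 88           # repeating layers offloaded to GPU
n_embd_kv = 8 * 128      # KV heads * head dim = 1024 (assumed for Mistral Large 2407)
q4_0_bpv  = 18 / 32      # q4_0 packs 32 values into 18 bytes (~0.5625 bytes/value)

k_mib = n_ctx * n_layers * n_embd_kv * q4_0_bpv / 1024**2
print(round(k_mib, 2))      # 594.0  -> matches "K (q4_0): 594.00 MiB"
print(round(2 * k_mib, 2))  # 1188.0 -> matches "KV self size = 1188.00 MiB"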
1.77.yr1:
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 1.31 MiB
llm_load_tensors: offloading 88 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 89/89 layers to GPU
llm_load_tensors: ROCm_Split buffer size = 44657.25 MiB
llm_load_tensors: ROCm0 buffer size = 8.30 MiB
llm_load_tensors: ROCm_Host buffer size = 165.00 MiB
load_all_data: buffer type ROCm_Split is not the default buffer type for device ROCm0 for async uploads
...................................................................................................load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
load_all_data: buffer type ROCm_Host is not the default buffer type for device ROCm0 for async uploads
.
Applying Tensor Split...Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 12288
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1188.00 MiB
llama_new_context_with_model: KV self size = 1188.00 MiB, K (q4_0): 594.00 MiB, V (q4_0): 594.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 48.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 2
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
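Comparing the two logs: the total weight allocation is identical, only how it is attributed per device changes, and the 165 MiB of non-offloadable tensors moves from a pinned ROCm_Host buffer (1.77) to a plain CPU buffer (1.78). A quick check of the numbers printed above (a sketch, values in MiB taken from the logs):

# 1.78 per-device split buffers vs. 1.77 single ROCm_Split buffer (MiB, from the logs)
split_178 = [18665.34, 13116.19, 12875.72]   # ROCm0/1/2_Split
split_177 = 44657.25                         # ROCm_Split

print(round(sum(split_178), 2))                           # 44657.25 -> same total as 1.77
print([round(x / sum(split_178), 3) for x in split_178])  # [0.418, 0.294, 0.288],
                                                          # close to tensor_split 11:8:8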
Both load with the same config file.
config
Namespace(model='', model_param='/home/user/program/kobold/Mistral-Large-Instruct-2407.IQ3_XXS-00001-of-00002.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=['normal', '0', 'mmq', 'rowsplit'], usevulkan=None, useclblast=None, usecpu=False, contextsize=12288, gpulayers=100, tensor_split=[11.0, 8.0, 8.0], checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=True, nofastforward=False, nommap=True, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=True, quantkv=2, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdt5xxl='', sdclipl='', sdclipg='', sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
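For reproduction, the Namespace above corresponds roughly to a launch like the one below (a hedged reconstruction from the parsed arguments, assuming a source launch via koboldcpp.py and the standard koboldcpp flag spellings; the model path is the one from the config):

python koboldcpp.py \
  --model /home/user/program/kobold/Mistral-Large-Instruct-2407.IQ3_XXS-00001-of-00002.gguf \
  --usecublas normal 0 mmq rowsplit \
  --gpulayers 100 --tensor_split 11 8 8 \
  --contextsize 12288 --blasbatchsize 512 \
  --flashattention --quantkv 2 \
  --threads 7 --blasthreads 7 \
  --noshift --nommap --port 5001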