I have a multi-GPU system and use koboldcpp-rocm with row-split (7900 XTX + 2x 7600 XT, Kubuntu 24.04 LTS).
Prompt processing speed is much slower, but generation speed is faster (~70%).
In version 1.78 the splitting works differently, and I now run out of memory with the largest model (120B, IQ3_XXS), which still works with version 1.77 and earlier.
There is also CPU offloading now:
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
1.78.yr0:
llm_load_print_meta: max token length = 48
llm_load_tensors: tensor 'token_embd.weight' (iq3_s) (and 177 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
llm_load_tensors: offloading 88 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 89/89 layers to GPU
llm_load_tensors: ROCm0_Split model buffer size = 18665.34 MiB
llm_load_tensors: ROCm1_Split model buffer size = 13116.19 MiB
llm_load_tensors: ROCm2_Split model buffer size = 12875.72 MiB
llm_load_tensors: CPU model buffer size = 165.00 MiB
llm_load_tensors: ROCm0 model buffer size = 3.47 MiB
llm_load_tensors: ROCm1 model buffer size = 2.44 MiB
llm_load_tensors: ROCm2 model buffer size = 2.39 MiB
load_all_data: buffer type ROCm0_Split is not the default buffer type for device ROCm0 for async uploads
.........................................load_all_data: buffer type ROCm1_Split is not the default buffer type for device ROCm1 for async uploads
.............................load_all_data: buffer type ROCm2_Split is not the default buffer type for device ROCm2 for async uploads
.............................load_all_data: device CPU does not support async, host buffers or events
load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
load_all_data: using async uploads for device ROCm1, buffer type ROCm1, backend ROCm1
load_all_data: using async uploads for device ROCm2, buffer type ROCm2, backend ROCm2
.
Applying Tensor Split...Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 12288
llama_new_context_with_model: n_ctx_per_seq = 12288
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (12288) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: ROCm0 KV buffer size = 499.50 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 351.00 MiB
llama_kv_cache_init: ROCm2 KV buffer size = 337.50 MiB
llama_new_context_with_model: KV self size = 1188.00 MiB, K (q4_0): 594.00 MiB, V (q4_0): 594.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm1 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm2 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 48.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 4
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
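The 1188 MiB KV self size in this log is consistent with the settings used. A minimal sketch of the arithmetic in Python, assuming the usual Mistral Large 2407 dimensions (88 layers, 8 KV heads, head dim 128 are assumptions; the q4_0 cache type is taken from the log line above):

# Sketch: estimate the quantized KV cache size for this run (assumed model dims).
n_ctx     = 12288        # contextsize
n_layers  = 88           # repeating layers offloaded to GPU
n_embd_kv = 8 * 128      # KV heads * head dim = 1024 (assumed for Mistral Large 2407)
q4_0_bpv  = 18 / 32      # q4_0 packs 32 values into 18 bytes (~0.5625 bytes/value)

k_mib = n_ctx * n_layers * n_embd_kv * q4_0_bpv / 1024**2
print(round(k_mib, 2))      # 594.0  -> matches "K (q4_0): 594.00 MiB"
print(round(2 * k_mib, 2))  # 1188.0 -> matches "KV self size = 1188.00 MiB"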
1.77.yr1:
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 1.31 MiB
llm_load_tensors: offloading 88 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 89/89 layers to GPU
llm_load_tensors: ROCm_Split buffer size = 44657.25 MiB
llm_load_tensors: ROCm0 buffer size = 8.30 MiB
llm_load_tensors: ROCm_Host buffer size = 165.00 MiB
load_all_data: buffer type ROCm_Split is not the default buffer type for device ROCm0 for async uploads
...................................................................................................load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
load_all_data: buffer type ROCm_Host is not the default buffer type for device ROCm0 for async uploads
.
Applying Tensor Split...Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 12288
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1188.00 MiB
llama_new_context_with_model: KV self size = 1188.00 MiB, K (q4_0): 594.00 MiB, V (q4_0): 594.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 196.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 48.01 MiB
llama_new_context_with_model: graph nodes = 2471
llama_new_context_with_model: graph splits = 2
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
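Comparing the two logs: the total weight allocation is identical, only how it is attributed per device changes, and the 165 MiB of non-offloadable tensors moves from a pinned ROCm_Host buffer (1.77) to a plain CPU buffer (1.78). A quick check of the numbers printed above (a sketch, values in MiB taken from the logs):

# 1.78 per-device split buffers vs. 1.77 single ROCm_Split buffer (MiB, from the logs)
split_178 = [18665.34, 13116.19, 12875.72]   # ROCm0/1/2_Split
split_177 = 44657.25                         # ROCm_Split

print(round(sum(split_178), 2))                           # 44657.25 -> same total as 1.77
print([round(x / sum(split_178), 3) for x in split_178])  # [0.418, 0.294, 0.288],
                                                          # close to tensor_split 11:8:8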
Both load with the same config file.
config
Namespace(model='', model_param='/home/user/program/kobold/Mistral-Large-Instruct-2407.IQ3_XXS-00001-of-00002.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=['normal', '0', 'mmq', 'rowsplit'], usevulkan=None, useclblast=None, usecpu=False, contextsize=12288, gpulayers=100, tensor_split=[11.0, 8.0, 8.0], checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=True, nofastforward=False, nommap=True, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=True, quantkv=2, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdt5xxl='', sdclipl='', sdclipg='', sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
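For reproduction, the Namespace above corresponds roughly to a launch like the one below (a hedged reconstruction from the parsed arguments, assuming a source launch via koboldcpp.py and the standard koboldcpp flag spellings; the model path is the one from the config):

python koboldcpp.py \
  --model /home/user/program/kobold/Mistral-Large-Instruct-2407.IQ3_XXS-00001-of-00002.gguf \
  --usecublas normal 0 mmq rowsplit \
  --gpulayers 100 --tensor_split 11 8 8 \
  --contextsize 12288 --blasbatchsize 512 \
  --flashattention --quantkv 2 \
  --threads 7 --blasthreads 7 \
  --noshift --nommap --port 5001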