hipBLAS(ROCm)+noavx2? #60

EmanuelOverride · 2024-07-27T09:51:26Z


This doesn't seem to be a standard feature, but I'm wondering whether it's technically possible and what I would need to do to make it happen.

I'm trying out the different configurations available to me on an i7-3770 and RX-480 to squeeze out as much performance as possible at the lowest memory cost, and right now that's Vulkan. (Update: actually Zluda-CUDA with noavx2.) It's not like I have many other options that use my GPU without AVX2 support. Curiously, using Zluda with noavx2 CLBlast shaves 100-150 sec off the benchmark, and if not for the long-term garbage accumulation in VRAM, it could be another contender: in my experience it uses less memory than Vulkan at the same settings, despite whatever CUDA-esque data Zluda initially stores in VRAM.

I also tried hipBLAS (ROCm) with AVX2 emulation via Intel SDE. It's stable enough and produces some of the quickest generation speeds on my hardware, with low VRAM overhead, excellent GPU utilization and, from what I could see, no residual junk in VRAM. But SDE eats up almost all of my RAM for the AVX2 emulation, and processing time increases by a factor of 30, at least initially; after prolonged use it improves slightly, to maybe 20 times slower than Vulkan. It also causes stutters in my PC's responsiveness.

But using AVX2 emulation seems rather silly. At least to a layman like myself, it looks like Kobold uses the same kind of HIP and ROCm chicanery that Zluda relies on to run CUDA applications on AMD cards. I'm getting as good performance as I could reasonably expect with image models, ip-adapters and controlnets using that sort of setup without AVX2, but I freely admit that I don't know whether the comparison is applicable, and I'm hoping that someone can enlighten me: is hipblas-noavx2 just a missing DLL, does it mean rebuilding the application from the ground up, or is it some impossible contradiction in terms?
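For anyone comparing notes on similar hardware, here is a minimal sketch of a best-effort AVX2 check. The helper names `flags_from_cpuinfo` and `has_avx2` are made up for illustration. The Windows branch assumes that `IsProcessorFeaturePresent` recognizes feature number 40 (`PF_AVX2_INSTRUCTIONS_AVAILABLE`) on the installed Windows build, which may not hold on older versions.

```python
import ctypes
import sys

# Value of PF_AVX2_INSTRUCTIONS_AVAILABLE in the Windows SDK headers.
# Assumption: older Windows builds may not know this number and report False.
PF_AVX2_INSTRUCTIONS_AVAILABLE = 40


def flags_from_cpuinfo(text):
    """Return the set of CPU flags found in /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx2():
    """Best-effort AVX2 check; returns None when the platform is not handled."""
    if sys.platform == "win32":
        return bool(
            ctypes.windll.kernel32.IsProcessorFeaturePresent(
                PF_AVX2_INSTRUCTIONS_AVAILABLE
            )
        )
    if sys.platform.startswith("linux"):
        try:
            with open("/proc/cpuinfo") as handle:
                return "avx2" in flags_from_cpuinfo(handle.read())
        except OSError:
            return None
    return None


if __name__ == "__main__":
    print("AVX2 available:", has_avx2())
```

The same parsing helper can be extended to look for `fma` on Linux; there is no equivalent FMA check in this sketch for Windows.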

Update: I've started messing with my own compilations and feel like I'm gradually getting the hang of it. However, I don't have any previous experience with this sort of stuff, so my approach is to wing it and then study the aftermath to hopefully understand what's going on. The experimental approach and results so far:

Precompiled client with hipBLAS gives a DLL error:

OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed

Emulating AVX2 fixes it.
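To reproduce the load failure outside the client, a small probe like the following might help. It is only a sketch: it assumes the client loads the DLL through Python's `ctypes` (the `OSError` prefix suggests this, but I have not confirmed it), the function name `probe_library` is hypothetical, and the path needs to point at wherever the DLL actually lives.

```python
import ctypes


def probe_library(path):
    """Try to load a shared library and return (loaded, detail)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # On Windows the Win32 code is in exc.winerror; elsewhere the attribute is absent.
        code = getattr(exc, "winerror", None)
        return False, f"winerror={code}: {exc}"
    return True, "loaded"


if __name__ == "__main__":
    # Hypothetical path; adjust to the DLL that sits next to the client.
    print(probe_library("koboldcpp_hipblas.dll"))
```

Running it once natively and once under SDE should show whether the failure depends on the emulated instruction set rather than on anything the client does later.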

Self-compiled client with default settings and hipBLAS, using both my own and the precompiled koboldcpp_hipblas.dll: same result. It breaks without AVX2 and works with AVX2 emulation. Building the DLL with LLAMA_HIPBLAS as part of Kobold fails (paths in the makefile?), building with w64dev fails (access to clang denied), and building it separately with cmake as per the instructions works.
Building Kobold with LLAMA_NOAVX2 and hipblas.dll with -DLLAMA_AVX2=off generates a seemingly appropriate ninja file. The client works with the default Vulkan/OpenCL backends with AVX2 off and successfully loads hipblas.dll, but fails on model load:

OSError: [WinError -1073741795] Windows Error 0xc000001d

Possible discrepancy: is FMA still enabled in the hipBLAS build? To be continued.
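A note on reading these codes: Python reports the value as a signed 32-bit integer, so masking it to 32 bits recovers the usual hexadecimal form. 0xC000001D is STATUS_ILLEGAL_INSTRUCTION, which fits the idea that the build still contains an instruction the CPU lacks. The sketch below uses a tiny hand-written lookup table (the names `KNOWN_CODES` and `describe` are made up here); it is nowhere near a complete list.

```python
# Small hand-written lookup; deliberately incomplete.
KNOWN_CODES = {
    1114: "ERROR_DLL_INIT_FAILED",             # Win32 error from the DLL load
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",  # NTSTATUS: unsupported CPU instruction
    0xC0000005: "STATUS_ACCESS_VIOLATION",     # NTSTATUS: invalid memory access
}


def describe(code):
    """Convert a signed 32-bit code to unsigned hex and look up its name."""
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_CODES.get(unsigned, "unknown")
    return f"0x{unsigned:08X} {name}"


if __name__ == "__main__":
    print(describe(-1073741795))  # 0xC000001D STATUS_ILLEGAL_INSTRUCTION
    print(describe(1114))         # 0x0000045A ERROR_DLL_INIT_FAILED
```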

Update2: OK, so I guess that was an oversight on my part. Adding -DLLAMA_FMA=off gets me a tick further in: the GUI now detects the GPU right away, but it still fails on model load with

OSError: exception: access violation reading 0x0000000000000000

All of this was with 1.71; I also tried 1.67. There is a huge size difference between the generated DLLs, 60 MB vs. 17 MB, but no discernible difference in operation: both error out at the exact same point in the sequence. Guess I'll comb over the wreckage to try and understand what it is I'm actually doing, because atm it's like I'm just flipping random switches. :P
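To keep track of which switches have been flipped, the configure step can be written down as a small helper. This is only a sketch of the flag combination mentioned in the updates above (LLAMA_HIPBLAS, LLAMA_AVX2, LLAMA_FMA, with Ninja as the generator); the function name `cmake_args` and the directory arguments are hypothetical, and other instruction-set options may matter as well.

```python
def cmake_args(source_dir, build_dir, avx2=False, fma=False, hipblas=True):
    """Assemble a cmake configure command as a list of arguments."""

    def onoff(value):
        return "ON" if value else "OFF"

    return [
        "cmake", "-S", source_dir, "-B", build_dir, "-G", "Ninja",
        f"-DLLAMA_HIPBLAS={onoff(hipblas)}",
        f"-DLLAMA_AVX2={onoff(avx2)}",
        f"-DLLAMA_FMA={onoff(fma)}",
    ]


if __name__ == "__main__":
    # Hypothetical directories; print the command instead of running it.
    print(" ".join(cmake_args(".", "build")))
```

The returned list can be handed to `subprocess.run(args, check=True)` once the directories are right, which makes it easy to rebuild with one flag changed at a time.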

Update3: Tried building llama.cpp with -DGGML_HIPBLAS=ON using the same general process, and it works, by which I mean it's able to load the model. However, the benchmark produces the same result as with AVX2 emulation: stutters in PC responsiveness and extreme generation times. Ergo, the responsible party is not SDE but ROCm (presumably in combination with my GPU). (Related? ROCm/rocBLAS#1218 (comment)) I confirmed the hypothesis by running Vulkan under AVX2 emulation via SDE and noticing no real performance impact. Further testing is necessary.


Replies: 0 comments
