hipBLAS(ROCm)+noavx2? #60
Unanswered
EmanuelOverride
asked this question in
Q&A
Doesn't seem to be a standard feature, but wondering whether it's technically possible and what I would need to do to make it happen.
I'm trying out the different configurations available to me on an i7-3770 and RX-480 to squeeze out as much performance as possible at the lowest memory cost, and right now that's Vulkan. (Update: actually Zluda-CUDA noavx2.) I don't have many other options that use my GPU without AVX2 support. Curiously, though, using Zluda with noavx2 CLBlast shaves 100-150 sec off the benchmark. If not for the long-term garbage accumulation in VRAM, that could be another contender, since in my experience it uses less memory than Vulkan at the same settings, despite whatever CUDA-esque data Zluda initially stores in VRAM.
I also tried hipBLAS(ROCm) with AVX2 emulation via Intel SDE. It's stable enough and produces some of the quickest generation speeds on my hardware, with low VRAM overhead, excellent GPU utilization and, from what I could see, no residual junk in VRAM. However, SDE eats up almost all of my RAM for the AVX2 emulation, and processing time increases by a factor of 30, at least initially; after prolonged use it improves slightly, to maybe 20 times slower than Vulkan. It also causes stutters in my PC's responsiveness.
But using AVX2 emulation seems rather silly. At least to a layman like myself, it looks like Kobold relies on the same sort of HIP and ROCm chicanery that Zluda uses to run CUDA applications on AMD cards. I'm getting as good performance as I could reasonably expect with image models, ip-adapters and controlnets using this sort of setup without AVX2, but I freely admit that I don't know whether the comparison is applicable, and I'm hoping that someone can enlighten me: is hipblas-noavx2 just a missing dll, does it mean rebuilding the application from the ground up, or is it some impossible contradiction in terms?
Update: I've started messing with my own compilations and feel like I'm gradually getting the hang of it. However, I don't have any previous experience with this sort of stuff, so my approach is to wing it and then study the aftermath to hopefully understand what's going on. The experimental approach and results so far:
Emulating AVX2 fixes.
Possible discrepancy: is FMA enabled in the hipBLAS build? TBC.
Update2: OK, so I guess that was an oversight on my part. Adding -DLLAMA_FMA=off gets me a tick further in: the GUI now detects the GPU right away, but model load still fails with
This is all with 1.71; I also tried 1.67. There's a huge size difference between the generated DLLs (60 MB vs. 17 MB), but no discernible difference in operation: both error out at the exact same part of the sequence. Guess I'll comb over the wreckage to try and understand what it is I'm actually doing, because at the moment it's like I'm just flipping random switches. :P
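For anyone retracing the switch-flipping in Update2, here is a minimal sketch of how the instruction-set options could be derived from what the CPU actually supports, rather than toggled one at a time. It assumes the build exposes CMake options named LLAMA_AVX, LLAMA_AVX2, LLAMA_FMA and LLAMA_F16C (only -DLLAMA_FMA=off is confirmed above; the others, and the GGML_ prefix for newer llama.cpp trees, are assumptions to check against the CMakeLists.txt in use). The helper name `cmake_isa_flags` is made up for illustration. The one hard fact it leans on is that Ivy Bridge chips such as the i7-3770 have AVX and F16C but neither AVX2 nor FMA.

```python
# Instruction-set extensions mapped to CMake option suffixes.
# ASSUMPTION: the build system names its options <PREFIX>_AVX, <PREFIX>_AVX2,
# <PREFIX>_FMA and <PREFIX>_F16C; verify against the project's CMakeLists.txt.
ISA_OPTIONS = {"avx": "AVX", "avx2": "AVX2", "fma": "FMA", "f16c": "F16C"}

# Ivy Bridge (e.g. i7-3770) supports AVX and F16C, but not AVX2 or FMA.
IVY_BRIDGE = {"avx", "f16c"}


def cmake_isa_flags(cpu_features, prefix="LLAMA"):
    """Return -D flags that switch each extension ON or OFF to match the CPU."""
    flags = []
    for feature, option in ISA_OPTIONS.items():
        state = "ON" if feature in cpu_features else "OFF"
        flags.append(f"-D{prefix}_{option}={state}")
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted onto a cmake command line.
    print(" ".join(cmake_isa_flags(IVY_BRIDGE)))
```

The point of generating all four flags together is that leaving any one of them at its default (as happened with FMA) can produce a binary that builds fine but crashes on the first unsupported instruction.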
Update3: Tried building llama.cpp with -DGGML_HIPBLAS=ON using the same general process and it works, by which I mean it's able to load the model. However, the benchmark produces the same result as with AVX2 emulation: stutters in PC responsiveness and extreme generation times. Ergo, the responsible party is not SDE but ROCm (presumably in combination with my GPU). (Related? ROCm/rocBLAS#1218 (comment)) I confirmed the hypothesis by running Vulkan with AVX2 emulated via SDE and noticing no real performance impact. Further testing necessary.
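The elimination reasoning in Update3 can be written down as a small sketch: tag each benchmark run with the components it involved, then look for the component present in every slow run and absent from every fast one. The function name `common_culprit`, the 2x threshold, and the timings in the usage example are all hypothetical placeholders, not measurements from this thread.

```python
def common_culprit(runs, baseline_s, threshold=2.0):
    """Return components shared by all slow runs and absent from all fast runs.

    runs: list of (set_of_component_tags, benchmark_seconds) pairs.
    A run counts as slow when it takes more than threshold * baseline_s.
    """
    slow = [set(tags) for tags, secs in runs if secs / baseline_s > threshold]
    fast = [set(tags) for tags, secs in runs if secs / baseline_s <= threshold]
    if not slow:
        return set()
    # Start with whatever every slow run has in common...
    culprits = set.intersection(*slow)
    # ...then clear anything that also appears in a run that was fast.
    for tags in fast:
        culprits -= tags
    return culprits


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    runs = [
        ({"vulkan"}, 100.0),         # native Vulkan (baseline)
        ({"vulkan", "sde"}, 105.0),  # Vulkan under SDE emulation
        ({"rocm", "sde"}, 2500.0),   # hipBLAS under SDE emulation
        ({"rocm"}, 2400.0),          # hipBLAS built without AVX2
    ]
    print(common_culprit(runs, baseline_s=100.0))
```

With runs shaped like the four configurations described above, SDE is cleared because it also appears in a fast run, which leaves ROCm as the only common factor.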