Replies: 2 comments
-
Hoping to catch someone's eye here. Looking a little more into the problem, I thought I'd try explicitly pointing to the main GPU per the notes here:
Setting the environment variable, I can control which GPU is selected; however, pointing OneTrainer/ZLUDA to the 6800 XT still fails to report a valid index and immediately returns the same failure message from the ZLUDA test code. I'm still curious why the implementation in OneTrainer doesn't work when a very similar arrangement in SDnext has no issue. This no longer seems to be a problem linked to the presence of an iGPU, so I'm not sure what to look at next. I'd appreciate any help offered!
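For reference, here is a minimal sketch of what I'm doing to restrict the visible devices and then check what PyTorch actually reports. The variable name `HIP_VISIBLE_DEVICES` and the index `0` are my assumptions; substitute whatever the notes you follow actually recommend:

```python
import os

# Restrict which GPU the HIP/ZLUDA runtime can see *before* torch is imported.
# HIP_VISIBLE_DEVICES and the index "0" are assumptions -- use whatever
# variable/index applies on your system.
os.environ.setdefault("HIP_VISIBLE_DEVICES", "0")

import torch

# Print what PyTorch reports after the restriction, to compare against the
# "index=None" in the OneTrainer failure message.
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```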
-
Following this up in case it helps others in the future. Thanks to the Discord chat, I can confirm this is not an issue relating to the presence of both an iGPU and a GPU, but rather what seems to be an incompatibility between PyTorch 2.5 builds and ZLUDA. Rolling back to the last PyTorch 2.3 build of OneTrainer now allows my GPU to pass the basic ZLUDA test without issue.
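For anyone wanting to check which build their OneTrainer environment is actually using, a quick sketch (the 2.5-vs-2.3 boundary is just what I observed; other versions may behave differently):

```python
import torch

# Report the installed PyTorch build; a 2.5.x build failed the ZLUDA basic
# operation test for me, while the last 2.3.x OneTrainer build passed.
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
print("torch", torch.__version__)
if (major, minor) >= (2, 5):
    print("2.5+ build: this is the combination that failed with ZLUDA for me")
else:
    print("pre-2.5 build: expected to pass the basic ZLUDA test")
```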
-
Hi all, hoping someone can kindly assist with an issue using ZLUDA with OneTrainer after a PC upgrade. I've recently moved to a new system (carrying over my old GPU, an RX 6800 XT) which has an integrated GPU. Since the upgrade I've been unable to use OneTrainer when requesting ZLUDA. I receive the following error as soon as any training starts:
ZLUDA device failed to pass basic operation test: index=None, device_name=AMD Radeon RX 6800 XT [ZLUDA]
Caching steps are completed; however, no training iterations occur past that point.
I've read elsewhere that there is an issue with the HIP/ROCm SDK when an iGPU is present. What is interesting to me is that SDnext does not have the same issue, despite appearing to run very similar code. Looking at the zluda.py in the module folder of both applications, the basic operation test is identical. I'm also curious that SDnext seems able to provide specific GPU information as well as the device ID, while OneTrainer only seems to get the device name correct. I can see SDnext is looking for an 'optimal device', but I understand OneTrainer is generally set up to take a more manual approach, relying on the user to define the primary and backup devices in the UI.
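To illustrate what I mean by the basic operation test, here is a rough sketch of the kind of check both apps appear to run. This is my approximation only; the real zluda.py test may differ in the exact operation and how it judges failure:

```python
import torch

def basic_op_test(device: str = "cuda:0") -> bool:
    """Run a small matmul on the target device and check the result is finite.
    This approximates the shared 'basic operation test', not the literal code
    from either application's zluda.py."""
    try:
        a = torch.randn(64, 64, device=device)
        b = torch.randn(64, 64, device=device)
        c = a @ b
        return bool(torch.isfinite(c).all().item())
    except Exception as exc:
        print(f"basic operation test failed on {device}: {exc}")
        return False

print("ZLUDA device test:", "passed" if basic_op_test() else "failed")
```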
I'm hesitant to call this a bug, as I'm not familiar enough with how this all works; however, if people think that is appropriate, I can raise it separately as an issue. In any case, I'd be very grateful for any help getting OneTrainer up and running again. Thank you!