Replies: 2 comments
-
Hoping to catch someone's eye here. Looking a little more into the problem, I thought I'd try explicitly pointing to the main GPU per the notes here:
Setting the environment variable, I can control which GPU is selected; however, pointing OneTrainer/ZLUDA to the 6800 XT still fails to report a valid index and immediately returns the same failure message from the ZLUDA test code. I'm still curious why the implementation in OneTrainer doesn't work when a very similar arrangement in SDnext has no issue. This no longer seems to be a problem linked to the presence of an iGPU, so I'm not sure what to look at next. I'd appreciate any help offered!
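For reference, here is a minimal sketch of what I'm doing to restrict the visible devices and then check what PyTorch actually reports. The variable name `HIP_VISIBLE_DEVICES` and the index `0` are my assumptions; substitute whatever the notes you follow actually recommend:

```python
import os

# Restrict which GPU the HIP/ZLUDA runtime can see *before* torch is imported.
# HIP_VISIBLE_DEVICES and the index "0" are assumptions -- use whatever
# variable/index applies on your system.
os.environ.setdefault("HIP_VISIBLE_DEVICES", "0")

import torch

# Print what PyTorch reports after the restriction, to compare against the
# "index=None" in the OneTrainer failure message.
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```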
-
Following this up in case it helps others in the future. Thanks to the Discord chat, I can confirm this is not an issue relating to the presence of both an iGPU and a GPU, but rather what seems to be an incompatibility between PyTorch 2.5 builds and ZLUDA. Rolling back to the last PyTorch 2.3 build of OneTrainer now allows my GPU to pass the basic ZLUDA test without issue.
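For anyone wanting to check which build their OneTrainer environment is actually using, a quick sketch (the 2.5-vs-2.3 boundary is just what I observed; other versions may behave differently):

```python
import torch

# Report the installed PyTorch build; a 2.5.x build failed the ZLUDA basic
# operation test for me, while the last 2.3.x OneTrainer build passed.
major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
print("torch", torch.__version__)
if (major, minor) >= (2, 5):
    print("2.5+ build: this is the combination that failed with ZLUDA for me")
else:
    print("pre-2.5 build: expected to pass the basic ZLUDA test")
```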
-
Hi all, hoping someone can kindly assist with an issue using ZLUDA with OneTrainer after a PC upgrade. I've recently moved to a new system (carrying over my old GPU, an RX 6800 XT) which has an integrated GPU. Since the upgrade I've been unable to use OneTrainer when requesting ZLUDA. I receive the following error as soon as any training starts:
ZLUDA device failed to pass basic operation test: index=None, device_name=AMD Radeon RX 6800 XT [ZLUDA]
Caching steps are completed; however, no training iterations occur past that point.
I've read elsewhere that there is an issue with the HIP/ROCm SDK when an iGPU is present. What is interesting to me is that SDnext does not have the same issue, despite appearing to run very similar code. Looking at the zluda.py in the module folder of both applications, the basic operation test is identical. I'm also curious that SDnext seems able to provide specific GPU information as well as the device ID, while OneTrainer only seems to get the device name correct. I can see SDnext is looking for an 'optimal device', but I understand OneTrainer is generally set up to take a more manual approach, relying on the user to define the primary and backup devices in the UI.
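To illustrate what I mean by the basic operation test, here is a rough sketch of the kind of check both apps appear to run. This is my approximation only; the real zluda.py test may differ in the exact operation and how it judges failure:

```python
import torch

def basic_op_test(device: str = "cuda:0") -> bool:
    """Run a small matmul on the target device and check the result is finite.
    This approximates the shared 'basic operation test', not the literal code
    from either application's zluda.py."""
    try:
        a = torch.randn(64, 64, device=device)
        b = torch.randn(64, 64, device=device)
        c = a @ b
        return bool(torch.isfinite(c).all().item())
    except Exception as exc:
        print(f"basic operation test failed on {device}: {exc}")
        return False

print("ZLUDA device test:", "passed" if basic_op_test() else "failed")
```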
I'm hesitant to call this a bug, as I'm not familiar enough with how this all works; however, if people think that is appropriate, I can raise it separately as an issue. In any case, I'd be very grateful for any help getting OneTrainer up and running again. Thank you!