Sporadic GPU crashes on load #39

Open
kvark opened this issue Jul 9, 2023 · 1 comment
Labels
type: bug Something isn't working

Comments

@kvark
Owner

kvark commented Jul 9, 2023

Seeing them occasionally as VK_ERROR_DEVICE_LOST coming out of vkQueueSubmit.
Symptoms:

  • only happens on Sponza, which is a giant scene.
  • once it starts happening, it keeps happening. Once it works, it keeps working.
  • seems to be more likely when on battery?
  • seems to be more likely in debug mode?
  • often falls on the first or second frame rendered with UI.
  • when I needed to present this at the Rust Gamedev meetup, I disabled most of the UI rendering, and it seemed to improve things.
  • Markers from GPU crash handler in Vulkan #38 always point to BLAS construction. There is only one BLAS, and it's giant.
  • All primitive/instance/geometry counts are well within limits. All device address alignments are good, too.
@kvark kvark added the type: bug Something isn't working label Jul 9, 2023
@kvark
Owner Author

kvark commented Jul 9, 2023

I think it's just a TDR in the driver.

Story

I clear the asset cache and then try to load the big scene. The model is processed and then served. This is where the BLAS is constructed. It schedules a bunch of transfers (for big meshes as well as textures that are loaded on the side), and then runs the BLAS construction.

This is a giant BLAS, and constructing it on the GPU takes significant time. However, all the CPU threads are busy doing texture compression for the assets that haven't been cached yet, so the AMD power management can't allocate enough power for the GPU operations. On top of that, we are running on an integrated APU, which means the memory bandwidth is shared between CPU and GPU operations. It's easy to starve the GPU while heavy-loading assets on many threads.

This is also affected by whether or not we run on battery, and by what other kinds of rendering are requested (UI may need some texture updates as well, and there are other apps like WezTerm consuming GPU). The result: the job takes too long and is considered to be hanging. The driver kills the job, and I get DEVICE_LOST. All of the textures in process are then dropped, meaning they will be converted again on the next run, repeating the cycle.

Workarounds

  1. There might be a way to configure TDR? Probably locally only, which isn't going to help other users.
  2. Mark texture loading to be dependent on the model being served. This would mean there are fewer (or no) things running during BLAS construction.
  3. Detect if the system has an integrated GPU and limit the number of worker threads more, e.g. 1/2 instead of 2/3 of the logical cores.
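
A minimal sketch of workaround 3, under assumptions: `worker_thread_count` is a hypothetical helper, not something that exists in this project, and the `integrated_gpu` flag is assumed to be detected elsewhere (for example by checking whether the Vulkan physical device type is `VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU`). It only shows the 1/2 vs 2/3 split of logical cores mentioned above.

```rust
use std::thread;

/// Hypothetical helper: picks how many worker threads to spawn for asset
/// processing. `integrated_gpu` is assumed to come from a device query
/// done elsewhere; it is not detected here.
fn worker_thread_count(logical_cores: usize, integrated_gpu: bool) -> usize {
    // 1/2 of the logical cores on an integrated GPU, 2/3 otherwise.
    let (num, den) = if integrated_gpu { (1, 2) } else { (2, 3) };
    // Integer division rounds down; always keep at least one worker.
    (logical_cores * num / den).max(1)
}

fn main() {
    // Fall back to a single core if the parallelism query fails.
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("discrete GPU workers: {}", worker_thread_count(cores, false));
    println!("integrated GPU workers: {}", worker_thread_count(cores, true));
}
```

Whether 1/2 leaves enough headroom for the BLAS build on an APU is untested here; the ratio is just the one suggested in the list.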
