Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return errors more often #160

Closed
wants to merge 1 commit into from

Conversation

ConradIrwin
Copy link
Contributor

At Zed, by far the most common panics we experience across the userbase
are from Linux users who crash during a draw call in the Blade
renderer.

We'd like to be able to detect that something has gone wrong and show a
notification via dbus to help people understand the problem and debug.

This PR is a work in progress, as before I completely change the API I wanted
to check that this is the right direction.

Even with these changes I'd stil like to try and reduce the frequency that
users see these problems, but I'm not sure how to do that.

At Zed, by far the most common panics we experience across the userbase
are from Linux users who crash during a `draw` call in the Blade
renderer.

We'd like to be able to detect that something has gone wrong and show a
notification via dbus to help people understand the problem and debug.
@kvark
Copy link
Owner

kvark commented Aug 23, 2024

I'd strongly lean on the way of ergonomics. Blade isn't fully safe by design (that's what wgpu is for). So it can't comprehensively return errors from submit() anyway. Instead, the user should be responsible for not doing silly things that crash GPU or cause UB.
What kind of errors are you trying to surface up to the user? I believe any error in submit() should be treated like "internal" and addressed as a bug (preferably, quickly).

@ConradIrwin
Copy link
Contributor Author

@kvark In the last 2 hours:

  • 9 of:
GPU has crashed, and no debug information is available.
core::panicking::panic_fmt
<blade_graphics::hal::Context as blade_graphics::traits::CommandDevice>::submit
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as gpui::platform::PlatformWindow>::draw
  • 7 of
Aquire image error ERROR_SURFACE_LOST_KHR
core::panicking::panic_fmt
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::acquire_frame
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as 
  • 3 of:
called `Result::unwrap()` on an `Err` value: ERROR_INITIALIZATION_FAILED
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::init_impl
<gpui::platform::linux::x11::client::X11Client as gpui::platform::linux::platform::LinuxClient>::open_window
  • 1 of:
called `Result::unwrap()` on an `Err` value: ERROR_OUT_OF_HOST_MEMORY
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::resize

Unfortunately the way our panic reporting works, it's tricky for me to get more information about these. I also know that some of these are caused by the initialization failures on intel we havent' fixed yet: #144.

@kvark
Copy link
Owner

kvark commented Aug 25, 2024

Thanks for the info! That's quite high of a number. I wish we had one of the users to work with...
Would you have a way to collect hardware infos from those users? Something like vulkaninfo plus the log coming from Blade (e.g. at "Info" level) would help a lot.

@ConradIrwin
Copy link
Contributor Author

In theory we could run vulkaninfo during the panic handler (or maybe during the upload process), but it makes me a bit squeamish to upload everything.

Are there other specific things you think we should check for that would make the output size smaller / could be queried in-process? or is it more of an unknown unknown at this point?

@kvark
Copy link
Owner

kvark commented Aug 29, 2024

Are you concerned that VulkanInfo contains too much private information about the machine? I imagine this would be behind some kind of uesr concern.
At the very minimum, we'd want to know the instance version/extensions, and the list of physical devices and the basic info about them: vendor, driver version, supported extensions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants