Return errors more often #160

ConradIrwin · 2024-08-21T17:17:44Z

At Zed, by far the most common panics we experience across the userbase
are from Linux users who crash during a draw call in the Blade
renderer.

We'd like to be able to detect that something has gone wrong and show a
notification via dbus to help people understand the problem and debug.

This PR is a work in progress, as before I completely change the API I wanted
to check that this is the right direction.

Even with these changes I'd stil like to try and reduce the frequency that
users see these problems, but I'm not sure how to do that.

At Zed, by far the most common panics we experience across the userbase are from Linux users who crash during a `draw` call in the Blade renderer. We'd like to be able to detect that something has gone wrong and show a notification via dbus to help people understand the problem and debug.

kvark · 2024-08-23T04:08:45Z

I'd strongly lean on the way of ergonomics. Blade isn't fully safe by design (that's what wgpu is for). So it can't comprehensively return errors from submit() anyway. Instead, the user should be responsible for not doing silly things that crash GPU or cause UB.
What kind of errors are you trying to surface up to the user? I believe any error in submit() should be treated like "internal" and addressed as a bug (preferably, quickly).

ConradIrwin · 2024-08-23T17:49:48Z

@kvark In the last 2 hours:

9 of:

GPU has crashed, and no debug information is available.
core::panicking::panic_fmt
<blade_graphics::hal::Context as blade_graphics::traits::CommandDevice>::submit
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as gpui::platform::PlatformWindow>::draw

7 of

Aquire image error ERROR_SURFACE_LOST_KHR
core::panicking::panic_fmt
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::acquire_frame
gpui::platform::blade::blade_renderer::BladeRenderer::draw
<gpui::platform::linux::x11::window::X11Window as

3 of:

called `Result::unwrap()` on an `Err` value: ERROR_INITIALIZATION_FAILED
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::init_impl
<gpui::platform::linux::x11::client::X11Client as gpui::platform::linux::platform::LinuxClient>::open_window

1 of:

called `Result::unwrap()` on an `Err` value: ERROR_OUT_OF_HOST_MEMORY
core::panicking::panic_fmt
core::result::unwrap_failed
blade_graphics::hal::init::<impl blade_graphics::hal::Context>::resize

Unfortunately the way our panic reporting works, it's tricky for me to get more information about these. I also know that some of these are caused by the initialization failures on intel we havent' fixed yet: #144.

kvark · 2024-08-25T02:05:12Z

Thanks for the info! That's quite high of a number. I wish we had one of the users to work with...
Would you have a way to collect hardware infos from those users? Something like vulkaninfo plus the log coming from Blade (e.g. at "Info" level) would help a lot.

ConradIrwin · 2024-08-29T03:10:40Z

In theory we could run vulkaninfo during the panic handler (or maybe during the upload process), but it makes me a bit squeamish to upload everything.

Are there other specific things you think we should check for that would make the output size smaller / could be queried in-process? or is it more of an unknown unknown at this point?

kvark · 2024-08-29T05:34:43Z

Are you concerned that VulkanInfo contains too much private information about the machine? I imagine this would be behind some kind of uesr concern.
At the very minimum, we'd want to know the instance version/extensions, and the list of physical devices and the basic info about them: vendor, driver version, supported extensions.

ConradIrwin closed this Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return errors more often #160

Return errors more often #160

ConradIrwin commented Aug 21, 2024

kvark commented Aug 23, 2024

ConradIrwin commented Aug 23, 2024

kvark commented Aug 25, 2024

ConradIrwin commented Aug 29, 2024

kvark commented Aug 29, 2024

Return errors more often #160

Return errors more often #160

Conversation

ConradIrwin commented Aug 21, 2024

kvark commented Aug 23, 2024

ConradIrwin commented Aug 23, 2024

kvark commented Aug 25, 2024

ConradIrwin commented Aug 29, 2024

kvark commented Aug 29, 2024