Device valid #4163

bradwerth · 2023-09-21T19:25:19Z

Checklist

Run cargo clippy.
Run cargo clippy --target wasm32-unknown-unknown if applicable.
Add change to CHANGELOG.md. See simple instructions inside file.

Connections
None.

Description
This implements a simple version of "lose the device", which marks a Device as invalid, causing all subsequent operations to return an error. Nothing actually calls this yet.

Testing
This adds an integration test device_lose_then_more that attempts various operations on a lost device, which are all expected to fail.

This tracks Device validity explicity, but it could be changed to instead change Device Storage to Error when the Device is invalid. This also could go further with standardizing the error reporting for an invalid Device.

This tracks device validity explicitly, and short-circuits almost all operations on devices when they are invalid. This doesn't replace the storage of the device with an Error, because it's important for poll_devices to be able to monitor the devices (even if invalid) for emptiness so they can be cleaned up.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Co-authored-by: Connor Fitzgerald <[email protected]>

Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.47 to 1.0.48. - [Release notes](https://github.com/dtolnay/thiserror/releases) - [Commits](dtolnay/thiserror@1.0.47...1.0.48) --- updated-dependencies: - dependency-name: thiserror dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [serde](https://github.com/serde-rs/serde) from 1.0.186 to 1.0.188. - [Release notes](https://github.com/serde-rs/serde/releases) - [Commits](serde-rs/serde@v1.0.186...v1.0.188) --- updated-dependencies: - dependency-name: serde dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…fx-rs#4115) Rename `wgpu_hal::vulkan::Instance::required_extensions` to `desired_extensions`, to match its behavior. Document the function to clarify its role.

Due to an issue with Mesa versions less than 21.2 presentation on Vulkan was forced to Nvidia only. This in itself brought new issues around the Nvidia driver specfic format modifers. As of Mesa 21.2 the Intel vulkan issue is fixed. This commit enables presentation on versions 21.2 and above for Intel. References: - NVIDIA/egl-wayland#72 Closes [gfx-rs#4101](gfx-rs#4101)

For each backend `blah`, log::debug/trace whether we were able to populate `Instance::blah`.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…fx-rs#4130) Co-authored-by: Nicolas Silva <[email protected]>

Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.105 to 1.0.106. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](serde-rs/json@v1.0.105...v1.0.106) --- updated-dependencies: - dependency-name: serde_json dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Co-authored-by: Nicolas Silva <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Implements timer queries via write_timestamp on Metal for encoders (whenever timer queries are available) and passes (for Intel/AMD GPUs, where we should advertise TIMESTAMP_QUERY_INSIDE_PASSES now). Due to some bugs in Metal this was a lot harder than expected. I believe the solution is close to optimal with the current restrictions in place. For details see code comments.

Co-authored-by: Connor Fitzgerald <[email protected]>

Bumps [profiling](https://github.com/aclysma/profiling) from 1.0.10 to 1.0.11. - [Changelog](https://github.com/aclysma/profiling/blob/master/CHANGELOG.md) - [Commits](https://github.com/aclysma/profiling/commits) --- updated-dependencies: - dependency-name: profiling dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [termcolor](https://github.com/BurntSushi/termcolor) from 1.2.0 to 1.3.0. - [Commits](BurntSushi/termcolor@1.2.0...1.3.0) --- updated-dependencies: - dependency-name: termcolor dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…evice_lose.

Wumpf

Thank you for your contribution! But I fear this might be the wrong general direction for implementing device loss 😟

According to spec, entering the device-lost state is not directly exposed other than via a destroy call, but otherwise is an error state the device can enter.
https://www.w3.org/TR/webgpu/#lose-the-device

There is however a lost promise that we need to implement here, MDN has some example code on this what it should look like from JS https://developer.mozilla.org/en-US/docs/Web/API/GPUDevice/lost

I think we should try to handle device lost states through the HAL layer - that's where they typically occur. Device loss via destroy is harder to map to that ofc, but we should investigate if we can "poison" the hal device so that this actually has the desired effect through the entire stack.

nical · 2023-09-22T13:14:07Z

I think we should try to handle device lost states through the HAL layer - that's where they typically occur. Device loss via destroy is harder to map to that ofc, but we should investigate if we can "poison" the hal device so that this actually has the desired effect through the entire stack.

I think that we should be able to implement most of device.destroy and device loss via the same infrastructure in wgpu-core, it would be simple to keep the associated validation there. Also, all of the asynchronous deallocation logic is in core so I don't think we can escape having the destroyed state handle there.

What advantages do you expect from moving it in hal?

Wumpf · 2023-09-22T13:21:25Z

Generally, a device loss event is an actual thing that happens to the hal device; different implementations have different error states. Relatedly, I think destroying/loosing a device should actually have an actual effect that is communicated with the driver.

Wumpf · 2023-09-22T13:24:35Z

I fear if we put too much logic in wgpu-core, we have two sources of device-lost state: The hal device that is actually in a bad state / deallocated / etc. and some flag on wgpu-core. So to keep this in sync we could instead solely rely on hal owning this state.

nical · 2023-09-22T13:34:36Z

It looks like you are looking at this primarily through the lense of device loss, while Brad's priority here is to get device.destroy working (we direly need it for CTS runs to not randomly OOM). The effect of that will be to clean up all of the device's allocated resources (while the handles are still alive, hence the need for validation everywhere which is in core) but that's not reflected in the PR yet.

I haven't looked at the PR in details yet but perhaps some of the "lose" terminology in there could be replaced by destroy to make that clearer.

We could store the device's state in the hal structure, but it would still be mostly accessed by the validation in wgpu-core.

bradwerth · 2023-09-22T14:55:40Z

This is definitely a partial implementation of "lose the device" which is itself a partial implementation of device.destroy. From my read of the spec, we lose the device in 3 ways:

It's the final steps of device.destroy.
As needed from each backend.
As part of the error response in adapter.request_device. This case seems least important since it already returns an error and not a device, so there's nothing for the caller to check a lost promise against. But if there's internal state cleanup associated with "lose the device" it happens in this case, too.

In this patch, my goal was to support these cases at least in theory, but since the integration test is checking the outcome of lose the device, it makes it look like a new callable function on the device object itself. It also doesn't attempt to handle the lost promise resolution for two reasons: because of my lack of familiarity with the direct and web protocols, and the implementation in Firefox will be external to wgpu. Regarding the direct and web implementations of the lost promises, I should either add much better TODOs to document the gaps, or fill the gaps with an implementation.

… direct and web.

jimblandy · 2023-09-22T18:32:15Z

@Wumpf I think you're right to ask how this code is going to interact with device loss reported from the underlying platform APIs. This PR needs to include a clear plan for how that is going to be incorporated.

As @nical said, our immediate goal is to implement something corresponding to the spec's GPUDevice.destroy.

However, in doing so we've run into an unfortunate part of wgpu's current architecture.

When the hal reports a lost device, WebGPU says we should execute the "lose the device" steps you mentioned above. The first step there is "Make device invalid."

Everywhere else in wgpu-core treats WebGPU invalidity by placing an Element::Error in the appropriate Registry field of the Hub. However, that was designed for objects that are invalid from the point of creation, not objects that become invalid after having previously been valid. If we were to implement GPUDevice.destroy by removing the Device from the Hub, it would be impossible for wgpu-core to continue to process the queue, as required by steps 4 and 5 of "lose the device":

Complete any outstanding mapAsync() steps.

Complete any outstanding onSubmittedWorkDone() steps.

In gpuweb/gpuweb#4294 I confirmed with @Kangz that we're not supposed to be able to make further requests of the device after losing it, so it does seem important that the device be marked invalid at the indicated point in execution.

@bradwerth explained this to me, and said that adding the valid flag to the wgpu_core Device seemed like it would be required in order to prevent the Device from handling new user requests, while still being able to process the queue. We agreed it was ugly and unfortunate, but we couldn't see a better way to do it.

I believe #3626 will change wpgu-core to use Arc for all internal references to devices, so when that lands we may be able to remove the valid flag, and implement "lose the device" simply by removing the Device's entry from its Registry. Then all API requests will automatically be unable to access the Device, and we can remove the valid flag.

bradwerth · 2023-09-22T19:09:23Z

Presuming that the wgpu-core implementation of "lose the device" can be correctly signaled by the driver, then I don't think there is any danger of inconsistent state. Since wgpu-core device has the valid/invalid state, and prevents further requests from reaching the hal device, then I don't believe there is double-state that can become inconsistent. The hal device is never again accessed.

jimblandy · 2023-09-22T20:11:19Z

Presuming that the wgpu-core implementation of "lose the device" can be correctly signaled by the driver, then I don't think there is any danger of inconsistent state. Since wgpu-core device has the valid/invalid state, and prevents further requests from reaching the hal device, then I don't believe there is double-state that can become inconsistent. The hal device is never again accessed.

This isn't quite right. For example, Vulkan requires that all the resources created from a device be freed before the device itself is freed, and this requirement holds even if vkQueueSubmit returns VK_ERROR_DEVICE_LOST. The architecture of wgpu is that wgpu-hal is not responsible for tracking a device's resources and cleaning them up; enforcing anything but the simplest invariants is wgpu-core's role. So when wgpu_hal::vulkan::Queue::submit calls ash::device::Device::queue_submit (the Rust binding for vkQueueSubmit) and that returns vk::Result::ERROR_DEVICE_LOST, it just returns it directly. Then it's up to wgpu_core to go about cleaning up the resources - and that will definitely entail continuing to use the hal device.

It looks to me like wgpu_core doesn't implement this cleanup, however. EDIT: Operations will continue to fail after device loss, and resources get cleaned up when the user drops the device.

Wumpf

All very good arguments why we should get this started with a validity flag. Alright, let's do it then and evolve it as we add lost device handling on HAL!

However, (digging a bit more into the code now) we should try a bit harder not to expose device_lose on the user facing wgpu layer as the spec doesn't include it. I'm also very skeptical of having it as a non public function on wgpu-core (wgpu-core is usually just an implementation of the public interface), but I can see how that might be useful in some places.
It seems pretty clear that destroy is what serves the actual purpose for the user. If the desire is to move fast on giving the user the ability to cause a device loss, we should instead expose destroy already and note down that its implementation is still work in progress - causing the device invalidation but no other sideeffects like resource destruction yet :)
Hope that makes sense!

CHANGELOG.md

.cargo/config.toml

tests/tests/device.rs

wgpu/src/lib.rs

wgpu/src/backend/web.rs

wgpu-core/src/device/global.rs

…the lose function.

Wumpf

nice! Thanks for adding all those verbose & helpful comments in there!

bradwerth and others added 30 commits September 11, 2023 16:44

Fixup merge.

dfb34f5

Drop texture clear_views in surface_texture_discard (gfx-rs#4057)

2887ad6

Bump profiling from 1.0.9 to 1.0.10 (gfx-rs#4102)

74bbc7d

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Fix limits interface on web. (gfx-rs#4107)

451dfeb

Add details to InstanceError and CreateSurfaceError. (gfx-rs#4066)

8fc3eb5

Co-authored-by: Connor Fitzgerald <[email protected]>

Skip test_multithreaded_compute on MoltenVK. (gfx-rs#4096)

8fc8962

Co-authored-by: Connor Fitzgerald <[email protected]>

hal/vulkan: Instance::required_extensions -> desired_extensions (g…

cdcf934

…fx-rs#4115) Rename `wgpu_hal::vulkan::Instance::required_extensions` to `desired_extensions`, to match its behavior. Document the function to clarify its role.

wgpu_core: Add logging to Instance::new.

7c47c49

For each backend `blah`, log::debug/trace whether we were able to populate `Instance::blah`.

Fix D3D12 Surface Leak (gfx-rs#4106)

1acf857

Bump bytemuck from 1.13.1 to 1.14.0 (gfx-rs#4123)

6ac37a5

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Update naga to 0.13.0@git:cc87b8f9eb30bb55d0735b89d3df3e099e1a6e7c (g…

8c75d36

…fx-rs#4130) Co-authored-by: Nicolas Silva <[email protected]>

Print errors in a more readable format in the player. (gfx-rs#4137)

118b7e6

Co-authored-by: Nicolas Silva <[email protected]>

Workaround NV bug (gfx-rs#4132)

cd3e1f3

Bump serde_json from 1.0.106 to 1.0.107 (gfx-rs#4133)

e8b839d

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bump libc from 0.2.147 to 0.2.148 (gfx-rs#4134)

eede6bd

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Add details to RequestDeviceError. (gfx-rs#4145)

2f5dc46

Tests for wgpu#4139. (gfx-rs#4148)

118359a

Update Naga to df8107b (2023-9-15). (gfx-rs#4149)

4dbff2e

[d3d12] Document map_blend_factor (gfx-rs#4151)

36a21e4

Make StoreOp an enum instead of a bool (gfx-rs#4147)

573fdee

Co-authored-by: Connor Fitzgerald <[email protected]>

bradwerth added 4 commits September 21, 2023 13:16

Fixup web.rs implementation of device_lose.

a4a5403

Stub out web.rs device_lose.

e1f3d7c

Fixup test problems on different backends, and fixup stub of web.rs d…

2feef8f

…evice_lose.

Run rustfmt, fix issues in tests/device.rs.

084d036

Wumpf requested changes Sep 22, 2023

View reviewed changes

Test skipped in DX12 for validation errors, more comments in test, in…

b756739

… direct and web.

jimblandy mentioned this pull request Sep 22, 2023

WebGL: Improve error message when a webglcontextlost event occurs? #3679

Open

Wumpf self-requested a review September 22, 2023 21:17

Wumpf added the PR: waiting for review label Sep 22, 2023

cwfitzgerald requested a review from a team September 23, 2023 03:02

Wumpf requested changes Sep 23, 2023

View reviewed changes

Wumpf added PR: waiting for author and removed PR: waiting for review labels Sep 23, 2023

bradwerth added 3 commits September 26, 2023 13:42

Update test to use device.destroy, and remove the public exposure of …

48b2812

…the lose function.

Restore config.toml.

166549c

rustfmt device.rs.

835a733

bradwerth requested a review from Wumpf September 26, 2023 21:02

Wumpf approved these changes Sep 27, 2023

View reviewed changes

Wumpf removed the PR: waiting for author label Sep 27, 2023

Wumpf merged commit 57f8757 into gfx-rs:trunk Sep 27, 2023
20 checks passed

bradwerth deleted the deviceValid branch September 27, 2023 16:25

teoxoy mentioned this pull request Aug 9, 2024

start-capture (and probably stop-capture) are unsound #1625

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Device valid #4163

Device valid #4163

bradwerth commented Sep 21, 2023

Wumpf left a comment •

edited

Loading

nical commented Sep 22, 2023

Wumpf commented Sep 22, 2023

Wumpf commented Sep 22, 2023 •

edited

Loading

nical commented Sep 22, 2023

bradwerth commented Sep 22, 2023

jimblandy commented Sep 22, 2023

bradwerth commented Sep 22, 2023

jimblandy commented Sep 22, 2023 •

edited

Loading

Wumpf left a comment

Wumpf left a comment

Device valid #4163

Device valid #4163

Conversation

bradwerth commented Sep 21, 2023

Wumpf left a comment • edited Loading

Choose a reason for hiding this comment

nical commented Sep 22, 2023

Wumpf commented Sep 22, 2023

Wumpf commented Sep 22, 2023 • edited Loading

nical commented Sep 22, 2023

bradwerth commented Sep 22, 2023

jimblandy commented Sep 22, 2023

bradwerth commented Sep 22, 2023

jimblandy commented Sep 22, 2023 • edited Loading

Wumpf left a comment

Choose a reason for hiding this comment

Wumpf left a comment

Choose a reason for hiding this comment

Wumpf left a comment •

edited

Loading

Wumpf commented Sep 22, 2023 •

edited

Loading

jimblandy commented Sep 22, 2023 •

edited

Loading