-
Notifications
You must be signed in to change notification settings - Fork 948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Device valid #4163
Device valid #4163
Conversation
This tracks Device validity explicity, but it could be changed to instead change Device Storage to Error when the Device is invalid. This also could go further with standardizing the error reporting for an invalid Device.
This tracks device validity explicitly, and short-circuits almost all operations on devices when they are invalid. This doesn't replace the storage of the device with an Error, because it's important for poll_devices to be able to monitor the devices (even if invalid) for emptiness so they can be cleaned up.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Connor Fitzgerald <[email protected]>
Co-authored-by: Connor Fitzgerald <[email protected]>
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.47 to 1.0.48. - [Release notes](https://github.com/dtolnay/thiserror/releases) - [Commits](dtolnay/thiserror@1.0.47...1.0.48) --- updated-dependencies: - dependency-name: thiserror dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde](https://github.com/serde-rs/serde) from 1.0.186 to 1.0.188. - [Release notes](https://github.com/serde-rs/serde/releases) - [Commits](serde-rs/serde@v1.0.186...v1.0.188) --- updated-dependencies: - dependency-name: serde dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…fx-rs#4115) Rename `wgpu_hal::vulkan::Instance::required_extensions` to `desired_extensions`, to match its behavior. Document the function to clarify its role.
Due to an issue with Mesa versions less than 21.2 presentation on Vulkan was forced to Nvidia only. This in itself brought new issues around the Nvidia driver specfic format modifers. As of Mesa 21.2 the Intel vulkan issue is fixed. This commit enables presentation on versions 21.2 and above for Intel. References: - NVIDIA/egl-wayland#72 Closes [gfx-rs#4101](gfx-rs#4101)
For each backend `blah`, log::debug/trace whether we were able to populate `Instance::blah`.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…fx-rs#4130) Co-authored-by: Nicolas Silva <[email protected]>
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.105 to 1.0.106. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](serde-rs/json@v1.0.105...v1.0.106) --- updated-dependencies: - dependency-name: serde_json dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nicolas Silva <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Implements timer queries via write_timestamp on Metal for encoders (whenever timer queries are available) and passes (for Intel/AMD GPUs, where we should advertise TIMESTAMP_QUERY_INSIDE_PASSES now). Due to some bugs in Metal this was a lot harder than expected. I believe the solution is close to optimal with the current restrictions in place. For details see code comments.
Co-authored-by: Connor Fitzgerald <[email protected]>
Bumps [profiling](https://github.com/aclysma/profiling) from 1.0.10 to 1.0.11. - [Changelog](https://github.com/aclysma/profiling/blob/master/CHANGELOG.md) - [Commits](https://github.com/aclysma/profiling/commits) --- updated-dependencies: - dependency-name: profiling dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [termcolor](https://github.com/BurntSushi/termcolor) from 1.2.0 to 1.3.0. - [Commits](BurntSushi/termcolor@1.2.0...1.3.0) --- updated-dependencies: - dependency-name: termcolor dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution! But I fear this might be the wrong general direction for implementing device loss 😟
According to spec, entering the device-lost state is not directly exposed other than via a destroy
call, but otherwise is an error state the device can enter.
https://www.w3.org/TR/webgpu/#lose-the-device
There is however a lost
promise that we need to implement here, MDN has some example code on this what it should look like from JS https://developer.mozilla.org/en-US/docs/Web/API/GPUDevice/lost
I think we should try to handle device lost states through the HAL layer - that's where they typically occur. Device loss via destroy
is harder to map to that ofc, but we should investigate if we can "poison" the hal device so that this actually has the desired effect through the entire stack.
I think that we should be able to implement most of What advantages do you expect from moving it in hal? |
Generally, a device loss event is an actual thing that happens to the hal device; different implementations have different error states. Relatedly, I think destroying/loosing a device should actually have an actual effect that is communicated with the driver. |
I fear if we put too much logic in wgpu-core, we have two sources of device-lost state: The hal device that is actually in a bad state / deallocated / etc. and some flag on wgpu-core. So to keep this in sync we could instead solely rely on hal owning this state. |
It looks like you are looking at this primarily through the lense of device loss, while Brad's priority here is to get I haven't looked at the PR in details yet but perhaps some of the "lose" terminology in there could be replaced by destroy to make that clearer. We could store the device's state in the hal structure, but it would still be mostly accessed by the validation in wgpu-core. |
This is definitely a partial implementation of "lose the device" which is itself a partial implementation of device.destroy. From my read of the spec, we lose the device in 3 ways:
In this patch, my goal was to support these cases at least in theory, but since the integration test is checking the outcome of lose the device, it makes it look like a new callable function on the device object itself. It also doesn't attempt to handle the lost promise resolution for two reasons: because of my lack of familiarity with the direct and web protocols, and the implementation in Firefox will be external to wgpu. Regarding the direct and web implementations of the lost promises, I should either add much better TODOs to document the gaps, or fill the gaps with an implementation. |
@Wumpf I think you're right to ask how this code is going to interact with device loss reported from the underlying platform APIs. This PR needs to include a clear plan for how that is going to be incorporated. As @nical said, our immediate goal is to implement something corresponding to the spec's However, in doing so we've run into an unfortunate part of wgpu's current architecture. When the hal reports a lost device, WebGPU says we should execute the "lose the device" steps you mentioned above. The first step there is "Make device invalid." Everywhere else in
In gpuweb/gpuweb#4294 I confirmed with @Kangz that we're not supposed to be able to make further requests of the device after losing it, so it does seem important that the device be marked invalid at the indicated point in execution. @bradwerth explained this to me, and said that adding the I believe #3626 will change wpgu-core to use |
Presuming that the wgpu-core implementation of "lose the device" can be correctly signaled by the driver, then I don't think there is any danger of inconsistent state. Since wgpu-core device has the valid/invalid state, and prevents further requests from reaching the hal device, then I don't believe there is double-state that can become inconsistent. The hal device is never again accessed. |
This isn't quite right. For example, Vulkan requires that all the resources created from a device be freed before the device itself is freed, and this requirement holds even if It looks to me like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All very good arguments why we should get this started with a validity flag. Alright, let's do it then and evolve it as we add lost device handling on HAL!
However, (digging a bit more into the code now) we should try a bit harder not to expose device_lose
on the user facing wgpu
layer as the spec doesn't include it. I'm also very skeptical of having it as a non public function on wgpu-core (wgpu-core is usually just an implementation of the public interface), but I can see how that might be useful in some places.
It seems pretty clear that destroy
is what serves the actual purpose for the user. If the desire is to move fast on giving the user the ability to cause a device loss, we should instead expose destroy
already and note down that its implementation is still work in progress - causing the device invalidation but no other sideeffects like resource destruction yet :)
Hope that makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! Thanks for adding all those verbose & helpful comments in there!
Checklist
cargo clippy
.cargo clippy --target wasm32-unknown-unknown
if applicable.Connections
None.
Description
This implements a simple version of "lose the device", which marks a Device as invalid, causing all subsequent operations to return an error. Nothing actually calls this yet.
Testing
This adds an integration test
device_lose_then_more
that attempts various operations on a lost device, which are all expected to fail.