Wait for submissions to complete on `Queue` drop #6413
2f16a44 to eade5cf
CI was failing because we were timing out when targeting WebGL. eade5cf fixes CI and behaves the same as current trunk, where the timeout is ignored (related: #3601 & #4589), but that behavior is not correct. This is another argument for requiring users of …
Conclusions after discussing the idea of requiring users of …
Some follow-up work is still needed, though: instead of panicking, we need to add a retry mechanism for when we time out or run into OOMs. Device loss needs to be propagated to the device (making it invalid).
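For illustration, a retry loop of the kind described above could look roughly like the following. This is only a sketch and not the code added in this PR: `WaitError`, `WaitOutcome` and `wait_with_retry` are hypothetical names, and the `wait_once` closure stands in for whatever bounded fence wait the backend provides.

```rust
use std::time::Duration;

/// Hypothetical error from a single bounded wait on a submission.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WaitError {
    Timeout,
    OutOfMemory,
    DeviceLost,
}

/// Hypothetical result of the whole retry loop.
#[derive(Debug, PartialEq, Eq)]
enum WaitOutcome {
    /// The submission completed on the given (1-based) attempt.
    Completed { attempts: u32 },
    /// The device was lost; the caller should mark the device invalid.
    DeviceLost,
    /// Every attempt failed with a retryable error.
    GaveUp { last_error: WaitError },
}

/// Calls `wait_once` up to `max_attempts` times instead of panicking on the
/// first timeout or OOM. Device loss is not retried.
fn wait_with_retry<F>(mut wait_once: F, max_attempts: u32, timeout: Duration) -> WaitOutcome
where
    F: FnMut(Duration) -> Result<(), WaitError>,
{
    // Reported if `max_attempts` is 0 and no wait is ever issued.
    let mut last_error = WaitError::Timeout;
    for attempt in 1..=max_attempts {
        match wait_once(timeout) {
            Ok(()) => return WaitOutcome::Completed { attempts: attempt },
            // Retrying cannot help once the device is gone.
            Err(WaitError::DeviceLost) => return WaitOutcome::DeviceLost,
            // Timeouts and OOMs are treated as transient here (an assumption).
            Err(err) => last_error = err,
        }
    }
    WaitOutcome::GaveUp { last_error }
}
```

Returning a distinct `DeviceLost` outcome leaves it to the caller to invalidate the device, which matches the propagation requirement mentioned above.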
bb90304 to f03855a
I revised the comment explaining why it's actually ok that we time out on WebGL.
2d27279 to 86c4b4d
I added the retry mechanism as well; this should be ready for review.
…still active submissions
`Global::device_drop` wrongly assumed that `device_poll` with `Maintain::Wait` had been called, but this is not a documented invariant and only `wgpu` was upholding it.
c421f62 to 004a982
The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`). Note that `PendingWrites` internally has `Arc`s to resources. I think this change also makes more sense conceptually since most operations that use `PendingWrites` are on the `Queue`.
We should rely on the ranks in `wgpu-core\src\lock\rank.rs`.
Inclined to approve if we can get my questions answered.
The `Device` should not contain any `Arc`s to resources as that creates cycles (since all resources hold strong references to the `Device`). Note that `LifetimeTracker` internally has `Arc`s to resources.
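The cycle described in these commit messages can be reproduced with plain `Arc`s. The sketch below uses simplified, hypothetical stand-in types (`CyclicDevice`, `CyclicBuffer`, and stripped-down `Device`, `Buffer`, `Queue`), not the real `wgpu-core` structs. It shows that a device which tracks resources that point back at it is never freed, while moving the tracking into the queue lets everything drop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Sets a shared flag when dropped, so we can observe whether a value was freed.
struct DropFlag(Arc<AtomicBool>);

impl Drop for DropFlag {
    fn drop(&mut self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

// Before: the device tracks resources, and each resource points back at the device.
struct CyclicDevice {
    _flag: DropFlag,
    tracked: Mutex<Vec<Arc<CyclicBuffer>>>,
}

struct CyclicBuffer {
    _device: Arc<CyclicDevice>,
}

// After: the device holds no resources; the queue owns the tracking instead.
struct Device {
    _flag: DropFlag,
}

struct Buffer {
    _device: Arc<Device>,
}

struct Queue {
    _device: Arc<Device>,
    tracked: Mutex<Vec<Arc<Buffer>>>,
}

/// Returns whether the device was freed after the user dropped their handle.
fn device_freed_with_cycle() -> bool {
    let freed = Arc::new(AtomicBool::new(false));
    let device = Arc::new(CyclicDevice {
        _flag: DropFlag(freed.clone()),
        tracked: Mutex::new(Vec::new()),
    });
    let buffer = Arc::new(CyclicBuffer { _device: device.clone() });
    device.tracked.lock().unwrap().push(buffer);
    // The device still owns the buffer, which owns the device: intentionally leaked.
    drop(device);
    freed.load(Ordering::SeqCst)
}

/// Same scenario, but with the tracking living in the queue.
fn device_freed_without_cycle() -> bool {
    let freed = Arc::new(AtomicBool::new(false));
    let device = Arc::new(Device { _flag: DropFlag(freed.clone()) });
    let queue = Queue { _device: device.clone(), tracked: Mutex::new(Vec::new()) };
    let buffer = Arc::new(Buffer { _device: device.clone() });
    queue.tracked.lock().unwrap().push(buffer);
    drop(device);
    // Dropping the queue releases the tracked buffer and the last device references.
    drop(queue);
    freed.load(Ordering::SeqCst)
}
```

In this sketch `device_freed_with_cycle()` reports that the device is never freed, while `device_freed_without_cycle()` reports that it is.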
…d `Queue` instead
…ission
This gets the `wgpu_test::ray_tracing::as_build::out_of_order_as_build` test to pass. This seems to be an issue even on trunk: the number of calls to `create_command_encoder` and to `destroy_command_encoder` in hal are not equal. So I'm not sure why the validation layers don't raise `VUID-vkDestroyDevice-device-05137`. There is still an issue with previous command buffers being leaked, but I will fix this in a follow-up.
LGTM, with the discussions we've had. If you're still interested in @jimblandy's feedback, I'd suggest giving 'im...maybe 24 hours, if you're feeling patient? Otherwise, I think it's reasonable to do review after merging, if really necessary.
Let's land it; I already had to rebase it and fix conflicts a few times.
This PR addresses leaks caused by circular references due to the `Device` still holding strong references to resources when removed from the registry and while submissions are still active. `Global::device_drop` wrongly assumed that `device_poll` with `Maintain::Wait` had been called, but this is not a documented invariant and only `wgpu` was upholding it.

Instead of calling `device_poll` in `device_drop` (which would solve the issue), this PR takes a different approach, since we want to remove the registries in the future (#5121):
- move `LifetimeTracker` and `PendingWrites` into the `Queue` so that we never have circular references
- wait for submissions to complete in the `Queue`'s `Drop` impl

One thing to note is that ideally we shouldn't be waiting in the `Queue`'s `Drop` implementation, but that's what `wgpu` was previously doing (by calling `device_poll`) and the alternative is more involved: we could put the burden of keeping the `Queue` alive on `wgpu-core` users if there are any active submissions. With the changes in this PR we will panic if we hit a timeout or run into any errors; the alternative would solve that too. I want to talk about this approach at the next maintainers call before making any changes, though.
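As a rough illustration of waiting in a `Drop` implementation, the sketch below models the idea with standard-library primitives only. It is not the `wgpu-core` implementation: `Fence`, the simplified `Queue`, the 5-second timeout and `drop_waits_for_gpu` are all assumptions made for the example, and a worker thread plays the role of the GPU.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a GPU fence: index of the last completed submission.
#[derive(Default)]
struct Fence {
    completed: Mutex<u64>,
    signal: Condvar,
}

impl Fence {
    /// Marks every submission up to `index` as completed and wakes waiters.
    fn complete(&self, index: u64) {
        *self.completed.lock().unwrap() = index;
        self.signal.notify_all();
    }

    /// Blocks until `index` has completed; returns `false` on timeout.
    fn wait(&self, index: u64, timeout: Duration) -> bool {
        let guard = self.completed.lock().unwrap();
        let (_guard, result) = self
            .signal
            .wait_timeout_while(guard, timeout, |done| *done < index)
            .unwrap();
        !result.timed_out()
    }
}

/// Simplified queue that owns the resources used by in-flight submissions.
struct Queue {
    fence: Arc<Fence>,
    last_submitted: u64,
    in_flight: Vec<Arc<String>>,
}

impl Drop for Queue {
    fn drop(&mut self) {
        // Wait for the last submission before releasing anything it may still use.
        if !self.fence.wait(self.last_submitted, Duration::from_secs(5)) {
            panic!("timed out waiting for submissions to complete");
        }
        self.in_flight.clear();
    }
}

/// Drops a queue while a submission is still "running" on another thread.
fn drop_waits_for_gpu() -> bool {
    let fence = Arc::new(Fence::default());
    let resource = Arc::new(String::from("staging buffer"));
    let queue = Queue {
        fence: fence.clone(),
        last_submitted: 1,
        in_flight: vec![resource.clone()],
    };
    let observer = fence.clone();
    // The "GPU" finishes submission 1 a little later.
    let gpu = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        fence.complete(1);
    });
    drop(queue); // blocks until the fence reaches 1
    let completed_at_drop = *observer.completed.lock().unwrap();
    gpu.join().unwrap();
    completed_at_drop == 1 && Arc::strong_count(&resource) == 1
}
```

Panicking on timeout mirrors the behavior described above; the alternative of having `wgpu-core` users keep the `Queue` alive would remove the need to block in `Drop` at all.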