Dropped devices leak memory on Windows dx12 #3498

kpreid · 2023-02-19T01:59:38Z

Description
If I repeatedly create devices with substantial resources, then eventually I will get a failure to create a device (or, if one of the existing devices is used further, an allocation failure within it) even if the devices are dropped.

(This was discovered inside my application's test suite — I recognize that creating many devices is an irregular situation.)

Repro steps
Stick this test case in wgpu/tests/ and run it (branch with needed changes @ master...kpreid:wgpu:winleak):

use crate::common::{initialize_adapter, initialize_device};

use wasm_bindgen_test::wasm_bindgen_test;
use wgpu::*;

#[wasm_bindgen_test]
#[test]
fn allocate_large_3d_textures_in_devices() {
    let (adapter, _surface_guard) = initialize_adapter();

    for i in 0..100 {
        let (device, _queue) = pollster::block_on(initialize_device(
            &adapter,
            Features::empty(),
            Limits::default(),
        ));

        println!("{i}");
        // Each one of these textures occupies at least 256*256*256*4 ≈ 67MB.
        let _texture = device.create_texture(&TextureDescriptor {
            label: None,
            size: Extent3d {
                width: 256,
                height: 256,
                depth_or_array_layers: 256,
            },
            mip_level_count: 1,
            sample_count: 1,
            dimension: wgpu::TextureDimension::D3,
            format: wgpu::TextureFormat::Rgba8UnormSrgb,
            view_formats: &[],
            usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
        });
    }
}

The leak does not (visibly) occur if the same textures are created and dropped while using a single device.

Platform

wgpu 0.15 or git master (b534379).
I believe this did not happen with wgpu 0.14.
GitHub Actions Windows runner

Adapter 0:
	   Backend: Dx12
	      Name: "Microsoft Basic Render Driver"
	  VendorID: 5140
	  DeviceID: 140

The text was updated successfully, but these errors were encountered:

ErichDonGubler · 2023-02-20T23:36:52Z

Hmm, both the driver and nature of issue (memory leak, maybe refcounting issue?) Seem like they might be related to #3485?

AdrianEddy · 2023-02-21T02:08:09Z

I've also seen similar issues on Windows. I didn't have time to investigate them yet, so I don't have any details, but definitely something like that happens.

Elabajaba · 2023-02-21T18:21:03Z

Disabling dx12 suballocation makes the test pass on my system.

--- a/wgpu/Cargo.toml
+++ b/wgpu/Cargo.toml
@@ -142,7 +142,7 @@ hal = { workspace = true }
 hal = { workspace = true, features = ["renderdoc"] }

 [target.'cfg(windows)'.dependencies]
-hal = { workspace = true, features = ["dxc_shader_compiler", "renderdoc", "windows_rs"] }
+hal = { workspace = true, features = ["dxc_shader_compiler", "renderdoc"] }

If you have a non-test reproduction, gpu-allocator should be outputting warn level logs if it's leaking memory when we're dropping it.

Elabajaba · 2023-02-21T19:50:17Z

I turned your test code into an example and shoved a println! into Device::exit(...) (in wgpu-hal/src/dx12/device.rs), and it looks like Device::exit doesn't get called until all the loop iterations complete (Vulkan seems to have the same issue, but for whatever reason doesn't end up with the same kind of memory bloat).

edit: More testing, device::exit() seems to only run if the instance creation is in the loop. Note that with instance out of the loop (like in the OP), device::exit(...) runs for every created device in the loop after the loop is completed and the instance is dropped.

Elabajaba · 2023-02-21T21:42:50Z

Some more context:

The texture seems to be freed properly (println debugging shows destroy_texture() being called), and it leaks even without the texture.

The issue is that when you create a new gpu-allocator allocator (which happens when we create a new device) it allocates a block of memory that it uses to suballocate smaller blocks from for buffers and textures, and there's a bug somewhere that's stopping devices from being dropped until the Instance is dropped. (sidenote, I'd almost consider Vulkan not leaking memory here a bug, as I thought our gpu-alloc implementation worked similarly to gpu-allocator with regards to creating and holding onto large ~256MB blocks of memory to suballocate from)

kpreid · 2023-02-22T05:08:49Z

The issue is that when you create a new gpu-allocator allocator (which happens when we create a new device) it allocates a block of memory that it uses to suballocate smaller blocks from for buffers and textures, and there's a bug somewhere that's stopping devices from being dropped until the Instance is dropped.

Ironically, the reasoning behind my original code creating many Devices was “Well, a Device is the largest stateful-in-the-API unit, so creating a new one per test case is probably the best tradeoff between test isolation and performance”. If that's in fact not a good option (perhaps for reasons beyond this particular delayed dealloc bug) then it would be nice if there were clues in the API documentation.

jimblandy · 2023-09-27T17:18:18Z

(This was discovered inside my application's test suite — I recognize that creating many devices is an irregular situation.)

Firefox needs to be able to create endless numbers of devices without leaking, because device creation is under web content's control. We could decouple WebGPU API Devices from wgpu Devices, but it seems just as easy to fix wgpu, and that'd help a broader audience.

Ironically, the reasoning behind my original code creating many Devices was “Well, a Device is the largest stateful-in-the-API unit, so creating a new one per test case is probably the best tradeoff between test isolation and performance”. If that's in fact not a good option (perhaps for reasons beyond this particular delayed dealloc bug) then it would be nice if there were clues in the API documentation.

Creating separate devices for isolation seems like something that ought to work.

cwfitzgerald · 2023-09-27T18:05:28Z

Tests for memory leaks could potentially use https://crates.io/crates/sysinfo to get such information, but it would not help the case where we're leaking gpu memory.

jimblandy · 2023-09-27T19:13:54Z

Are the applications calling poll properly? It seems like dx12's Device::exit gets called in a pretty straightforward way if Global::exit_device is ever called at all. That should always happen if the device is genuinely unreferenced and we're calling poll_devices.

kpreid · 2023-09-27T22:21:36Z

Are the applications calling poll properly?

In the situation which prompted me to file this issue, they were — each device belonged to a headless render test, so it would poll to retrieve the rendered image. Of course, the test doesn't poll after it completes, and won't poll if it somehow panics before getting to that point.

If all that is needed to avoid a leak is to poll after the resources are garbage, maybe wgpu should automatically poll just before creating a device?

Elabajaba · 2024-01-18T22:43:57Z

This seems to be fixed on wgpu 0.19.

kpreid · 2024-06-01T03:35:41Z

I have just learned that the WebGPU specification says that you can only requestDevice() once per Adapter — you're supposed to request a fresh Adapter to request another device. (See #5764 for more.) So, the test case I filed this issue with is incorrect — it should fail when it tries to use the same Adapter a second time. This issue might therefore be entirely moot, or not; I don't know enough about wgpu internals to say.

teoxoy · 2024-06-03T10:25:35Z

Right, D3D12 will also give you back the same device from the same adapter (they are singletons per adapter).

kpreid · 2024-06-03T15:53:31Z

@teoxoy I think you misunderstand. I'm saying that, according to the WebGPU specification, calling requestDevice() more than once on a given Adapter value should return an error. An Adapter value isn't in a 1:1 relationship with the hardware; it's a handle to the opportunity to create one Device value. Therefore, the question of what happens when you have multiple Devices created from one Adapter is moot — that's prohibited. But none of that has anything to do with the cardinality of the underlying driver/platform-API adapter and device entities.

So, the original test code in principle ought to be rewritten to repeatedly call initialize_adapter, not just initialize_device, to make it correct. This API-usage fix is independent of whether or not there's (still) a memory leak, unless it happens that the leak only happens in this case, in which case the bug can be fixed simply by making wgpu spec-conformant.

teoxoy · 2024-06-04T19:11:43Z

I agree we should implement the validation, I was trying to provide relevant info as to why we are leaking without having the validation implemented. After all, this validation is in the spec most likely because of D3D12's behavior.

teoxoy · 2024-07-11T12:39:38Z

Disabling dx12 suballocation makes the test pass on my system.

#5943 fixed this but I'll leave this open since the original issue hasn't been solved.

teoxoy · 2024-07-30T17:28:04Z

The repro in the OP is no longer leaking. This was fixed by 6e21f7a (#3626).

teoxoy added type: bug Something isn't working api: dx12 Issues with DX12 or DXGI labels Feb 21, 2023

grovesNL mentioned this issue Feb 22, 2023

Memory utilization way too high on integrated GPU. #3518

Closed

kpreid mentioned this issue Mar 20, 2023

Possible memory leak on Windows CI kpreid/all-is-cubes#346

Closed

cryscan mentioned this issue May 2, 2024

Some buffer memory does not get freed after the device is dropped #5648

Closed

teoxoy added this to WebGPU for Firefox Jul 26, 2024

github-project-automation bot moved this to Todo in WebGPU for Firefox Jul 26, 2024

teoxoy self-assigned this Jul 26, 2024

teoxoy closed this as completed Jul 30, 2024

github-project-automation bot moved this from Todo to Done in WebGPU for Firefox Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dropped devices leak memory on Windows dx12 #3498

Dropped devices leak memory on Windows dx12 #3498

kpreid commented Feb 19, 2023

ErichDonGubler commented Feb 20, 2023

AdrianEddy commented Feb 21, 2023

Elabajaba commented Feb 21, 2023

Elabajaba commented Feb 21, 2023 •

edited

Loading

Elabajaba commented Feb 21, 2023

kpreid commented Feb 22, 2023

jimblandy commented Sep 27, 2023 •

edited

Loading

cwfitzgerald commented Sep 27, 2023

jimblandy commented Sep 27, 2023

kpreid commented Sep 27, 2023

Elabajaba commented Jan 18, 2024 •

edited

Loading

kpreid commented Jun 1, 2024

teoxoy commented Jun 3, 2024

kpreid commented Jun 3, 2024

teoxoy commented Jun 4, 2024

teoxoy commented Jul 11, 2024

teoxoy commented Jul 30, 2024

Dropped devices leak memory on Windows dx12 #3498

Dropped devices leak memory on Windows dx12 #3498

Comments

kpreid commented Feb 19, 2023

ErichDonGubler commented Feb 20, 2023

AdrianEddy commented Feb 21, 2023

Elabajaba commented Feb 21, 2023

Elabajaba commented Feb 21, 2023 • edited Loading

Elabajaba commented Feb 21, 2023

kpreid commented Feb 22, 2023

jimblandy commented Sep 27, 2023 • edited Loading

cwfitzgerald commented Sep 27, 2023

jimblandy commented Sep 27, 2023

kpreid commented Sep 27, 2023

Elabajaba commented Jan 18, 2024 • edited Loading

kpreid commented Jun 1, 2024

teoxoy commented Jun 3, 2024

kpreid commented Jun 3, 2024

teoxoy commented Jun 4, 2024

teoxoy commented Jul 11, 2024

teoxoy commented Jul 30, 2024

Elabajaba commented Feb 21, 2023 •

edited

Loading

jimblandy commented Sep 27, 2023 •

edited

Loading

Elabajaba commented Jan 18, 2024 •

edited

Loading