Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropped devices leak memory on Windows dx12 #3498

Closed
kpreid opened this issue Feb 19, 2023 · 17 comments
Closed

Dropped devices leak memory on Windows dx12 #3498

kpreid opened this issue Feb 19, 2023 · 17 comments
Assignees
Labels
api: dx12 Issues with DX12 or DXGI type: bug Something isn't working

Comments

@kpreid
Copy link
Contributor

kpreid commented Feb 19, 2023

Description
If I repeatedly create devices with substantial resources, then eventually I will get a failure to create a device (or, if one of the existing devices is used further, an allocation failure within it) even if the devices are dropped.

(This was discovered inside my application's test suite — I recognize that creating many devices is an irregular situation.)

Repro steps
Stick this test case in wgpu/tests/ and run it (branch with needed changes @ master...kpreid:wgpu:winleak):

use crate::common::{initialize_adapter, initialize_device};

use wasm_bindgen_test::wasm_bindgen_test;
use wgpu::*;

#[wasm_bindgen_test]
#[test]
fn allocate_large_3d_textures_in_devices() {
    let (adapter, _surface_guard) = initialize_adapter();

    for i in 0..100 {
        let (device, _queue) = pollster::block_on(initialize_device(
            &adapter,
            Features::empty(),
            Limits::default(),
        ));

        println!("{i}");
        // Each one of these textures occupies at least 256*256*256*4 ≈ 67MB.
        let _texture = device.create_texture(&TextureDescriptor {
            label: None,
            size: Extent3d {
                width: 256,
                height: 256,
                depth_or_array_layers: 256,
            },
            mip_level_count: 1,
            sample_count: 1,
            dimension: wgpu::TextureDimension::D3,
            format: wgpu::TextureFormat::Rgba8UnormSrgb,
            view_formats: &[],
            usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
        });
    }
}

The leak does not (visibly) occur if the same textures are created and dropped while using a single device.

Platform

  • wgpu 0.15 or git master (b534379).
  • I believe this did not happen with wgpu 0.14.
  • GitHub Actions Windows runner
Adapter 0:
	   Backend: Dx12
	      Name: "Microsoft Basic Render Driver"
	  VendorID: 5140
	  DeviceID: 140
@ErichDonGubler
Copy link
Member

Hmm, both the driver and nature of issue (memory leak, maybe refcounting issue?) Seem like they might be related to #3485?

@AdrianEddy
Copy link
Contributor

I've also seen similar issues on Windows. I didn't have time to investigate them yet, so I don't have any details, but definitely something like that happens.

@teoxoy teoxoy added type: bug Something isn't working api: dx12 Issues with DX12 or DXGI labels Feb 21, 2023
@Elabajaba
Copy link
Contributor

Disabling dx12 suballocation makes the test pass on my system.

--- a/wgpu/Cargo.toml
+++ b/wgpu/Cargo.toml
@@ -142,7 +142,7 @@ hal = { workspace = true }
 hal = { workspace = true, features = ["renderdoc"] }

 [target.'cfg(windows)'.dependencies]
-hal = { workspace = true, features = ["dxc_shader_compiler", "renderdoc", "windows_rs"] }
+hal = { workspace = true, features = ["dxc_shader_compiler", "renderdoc"] }

If you have a non-test reproduction, gpu-allocator should be outputting warn level logs if it's leaking memory when we're dropping it.

@Elabajaba
Copy link
Contributor

Elabajaba commented Feb 21, 2023

I turned your test code into an example and shoved a println! into Device::exit(...) (in wgpu-hal/src/dx12/device.rs), and it looks like Device::exit doesn't get called until all the loop iterations complete (Vulkan seems to have the same issue, but for whatever reason doesn't end up with the same kind of memory bloat).

edit: More testing, device::exit() seems to only run if the instance creation is in the loop. Note that with instance out of the loop (like in the OP), device::exit(...) runs for every created device in the loop after the loop is completed and the instance is dropped.

@Elabajaba
Copy link
Contributor

Some more context:

The texture seems to be freed properly (println debugging shows destroy_texture() being called), and it leaks even without the texture.

The issue is that when you create a new gpu-allocator allocator (which happens when we create a new device) it allocates a block of memory that it uses to suballocate smaller blocks from for buffers and textures, and there's a bug somewhere that's stopping devices from being dropped until the Instance is dropped. (sidenote, I'd almost consider Vulkan not leaking memory here a bug, as I thought our gpu-alloc implementation worked similarly to gpu-allocator with regards to creating and holding onto large ~256MB blocks of memory to suballocate from)

@kpreid
Copy link
Contributor Author

kpreid commented Feb 22, 2023

The issue is that when you create a new gpu-allocator allocator (which happens when we create a new device) it allocates a block of memory that it uses to suballocate smaller blocks from for buffers and textures, and there's a bug somewhere that's stopping devices from being dropped until the Instance is dropped.

Ironically, the reasoning behind my original code creating many Devices was “Well, a Device is the largest stateful-in-the-API unit, so creating a new one per test case is probably the best tradeoff between test isolation and performance”. If that's in fact not a good option (perhaps for reasons beyond this particular delayed dealloc bug) then it would be nice if there were clues in the API documentation.

@jimblandy
Copy link
Member

jimblandy commented Sep 27, 2023

(This was discovered inside my application's test suite — I recognize that creating many devices is an irregular situation.)

Firefox needs to be able to create endless numbers of devices without leaking, because device creation is under web content's control. We could decouple WebGPU API Devices from wgpu Devices, but it seems just as easy to fix wgpu, and that'd help a broader audience.

Ironically, the reasoning behind my original code creating many Devices was “Well, a Device is the largest stateful-in-the-API unit, so creating a new one per test case is probably the best tradeoff between test isolation and performance”. If that's in fact not a good option (perhaps for reasons beyond this particular delayed dealloc bug) then it would be nice if there were clues in the API documentation.

Creating separate devices for isolation seems like something that ought to work.

@cwfitzgerald
Copy link
Member

Tests for memory leaks could potentially use https://crates.io/crates/sysinfo to get such information, but it would not help the case where we're leaking gpu memory.

@jimblandy
Copy link
Member

Are the applications calling poll properly? It seems like dx12's Device::exit gets called in a pretty straightforward way if Global::exit_device is ever called at all. That should always happen if the device is genuinely unreferenced and we're calling poll_devices.

@kpreid
Copy link
Contributor Author

kpreid commented Sep 27, 2023

Are the applications calling poll properly?

In the situation which prompted me to file this issue, they were — each device belonged to a headless render test, so it would poll to retrieve the rendered image. Of course, the test doesn't poll after it completes, and won't poll if it somehow panics before getting to that point.

If all that is needed to avoid a leak is to poll after the resources are garbage, maybe wgpu should automatically poll just before creating a device?

@Elabajaba
Copy link
Contributor

Elabajaba commented Jan 18, 2024

This seems to be fixed on wgpu 0.19.

@kpreid
Copy link
Contributor Author

kpreid commented Jun 1, 2024

I have just learned that the WebGPU specification says that you can only requestDevice() once per Adapter — you're supposed to request a fresh Adapter to request another device. (See #5764 for more.) So, the test case I filed this issue with is incorrect — it should fail when it tries to use the same Adapter a second time. This issue might therefore be entirely moot, or not; I don't know enough about wgpu internals to say.

@teoxoy
Copy link
Member

teoxoy commented Jun 3, 2024

Right, D3D12 will also give you back the same device from the same adapter (they are singletons per adapter).

@kpreid
Copy link
Contributor Author

kpreid commented Jun 3, 2024

@teoxoy I think you misunderstand. I'm saying that, according to the WebGPU specification, calling requestDevice() more than once on a given Adapter value should return an error. An Adapter value isn't in a 1:1 relationship with the hardware; it's a handle to the opportunity to create one Device value. Therefore, the question of what happens when you have multiple Devices created from one Adapter is moot — that's prohibited. But none of that has anything to do with the cardinality of the underlying driver/platform-API adapter and device entities.

So, the original test code in principle ought to be rewritten to repeatedly call initialize_adapter, not just initialize_device, to make it correct. This API-usage fix is independent of whether or not there's (still) a memory leak, unless it happens that the leak only happens in this case, in which case the bug can be fixed simply by making wgpu spec-conformant.

@teoxoy
Copy link
Member

teoxoy commented Jun 4, 2024

I agree we should implement the validation, I was trying to provide relevant info as to why we are leaking without having the validation implemented. After all, this validation is in the spec most likely because of D3D12's behavior.

@teoxoy
Copy link
Member

teoxoy commented Jul 11, 2024

Disabling dx12 suballocation makes the test pass on my system.

#5943 fixed this but I'll leave this open since the original issue hasn't been solved.

@teoxoy
Copy link
Member

teoxoy commented Jul 30, 2024

The repro in the OP is no longer leaking. This was fixed by 6e21f7a (#3626).

@teoxoy teoxoy closed this as completed Jul 30, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in WebGPU for Firefox Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: dx12 Issues with DX12 or DXGI type: bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

7 participants