WebGPU API pain points/blockers #6499

benvanik opened this issue Jul 20, 2021 · 14 comments
Labels: discussion (Proposals, open-ended questions, etc), hal/webgpu (Runtime WebGPU HAL backend)

benvanik commented Jul 20, 2021

Filing this to start tracking issues with WebGPU that arise during initial porting. Some of these are just pain, others may be blockers - or at least block a prototype due to the work required to get around them.

Blockers:

Inefficiencies:

Confusion:

Besides the buffer mapping/readback issue I think this is all doable. It won't be nearly as good as native for multi-workload or highly parallelizable workloads due to the single-queue/implicit-barrier API limitation, and it will have more overhead, but if the model is compute-bound it'd still likely be a win. Worth prototyping if we can.

benvanik commented Jul 20, 2021

First up is the lack of a fillBuffer. The hope is that most fills from codegen get folded into dispatches such that we don't have many, but they are useful from human-authored code and will be emitted when there's no dispatch to fold into. They are also useful to initialize sparse buffers, reset subregions of ringbuffers/pools, etc. I'm implementing a workaround that embeds a WGSL shader performing the fill; however, the amount of API work and overhead involved is staggering. Without push constants or mutable/pooled bind groups, each fill requires an update to a ringbuffer (which requires code to manage/maintain the ringbuffer) of uniform values defining the fill pattern and buffer size. It also requires a unique bind group for each buffer being filled (cacheable, but nasty). Worst of all, due to the lack of explicit barriers, any fill operation on a subregion of a buffer will introduce a false dependency with any other use of that buffer unless the bind group explicitly specifies the fill size - which then defeats the ability to cache the bind group.

Filed a spec issue here to request fillBuffer: gpuweb/gpuweb#1968
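
For reference, a minimal sketch of what the emulation looks like. Names like `encodeFill`/`fillPipeline` are illustrative, and this uses current API spellings (`dispatchWorkgroups`/`end`) rather than the 2021-era ones (`dispatch`/`endPass`):

```ts
// WGSL for the emulated fill: one invocation per u32 element.
const fillShaderWGSL = `
  struct FillParams { value: u32, count: u32, }
  @group(0) @binding(0) var<uniform> params: FillParams;
  @group(0) @binding(1) var<storage, read_write> target: array<u32>;
  @compute @workgroup_size(64)
  fn fill(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x < params.count) { target[gid.x] = params.value; }
  }
`;

// fillPipeline is assumed created from fillShaderWGSL; paramsBuffer is a
// uniform slice from the ringbuffer holding {value, count}.
function encodeFill(
  device: GPUDevice,
  encoder: GPUCommandEncoder,
  fillPipeline: GPUComputePipeline,
  paramsBuffer: GPUBuffer,
  target: GPUBuffer,
  count: number,
) {
  // No mutable/pooled bind groups: a fresh one is created per fill.
  const bindGroup = device.createBindGroup({
    layout: fillPipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: paramsBuffer } },
      { binding: 1, resource: { buffer: target } },
    ],
  });
  const pass = encoder.beginComputePass();
  pass.setPipeline(fillPipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(count / 64));
  pass.end();
}
```

Note the fresh bind group per fill - exactly the overhead the next comment digs into.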

benvanik commented Jul 20, 2021

Something that implementing the fillBuffer emulation highlighted is that the ability to manage bind groups is missing from the API. The major issue is that there are no explicit barriers, so if you are performing suballocation you must specify the exact range of a bound buffer in order to prevent false dependencies between operations acting on independent regions of that buffer. While you can provide dynamic offsets when binding a group, you cannot provide dynamic sizes, so even when using dynamic offsets (of which there are a practical maximum of 4 available per dispatch) you will introduce tons of false dependencies so long as the bindings are specified for the whole buffer (required for offsetting) - see the sketch below. A workaround to both the dynamic offset limit and the false dependencies would be a unique bind group per unique set of buffer bindings and buffer ranges, which quickly turns ludicrous. This points back to the inability to update bind groups after they are created: new bind groups must be allocated effectively per dispatch, and the immutability makes user-mode pooling impossible, creating a spiral of sadness.
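
To make the size limitation concrete, a sketch (hypothetical `arena`/`sliceSize`/`sliceOffset`; the binding size is baked in at bind group creation):

```ts
// Dynamic offsets can rebind a suballocated region per dispatch, but the
// binding *size* is fixed when the bind group is created, so hazard tracking
// sees the same fixed range on every use.
function bindArenaSlice(
  device: GPUDevice, pass: GPUComputePassEncoder,
  arena: GPUBuffer, sliceSize: number, sliceOffset: number,
) {
  const layout = device.createBindGroupLayout({
    entries: [{
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: "storage", hasDynamicOffset: true },
    }],
  });
  const group = device.createBindGroup({
    layout,
    entries: [{
      binding: 0,
      // Size declared here, once; only the offset can vary per setBindGroup.
      resource: { buffer: arena, offset: 0, size: sliceSize },
    }],
  });
  // sliceOffset must be 256-byte aligned (minStorageBufferOffsetAlignment).
  pass.setBindGroup(0, group, [sliceOffset]);
}
```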

I'm not actually sure it's possible to make things efficient with the current API. The workaround I'm going to start with is to create a new bind group for each dispatch, which will highlight the badness there while not hurting our utilization. Of course this assumes that implementations are actually doing barrier insertion based on buffer subranges - if they aren't then none of this will work well and WebGPU utilization will look quite terrible next to native implementations. It'd still give a good data point, but it'd be way better if we didn't have to potentially wait a full spec design cycle for any improvements ;(

This old issue talks about this a bit: gpuweb/gpuweb#915 - the GPUBindGroupArena discussed there would likely be what I'm reaching for here - necro'ed the issue.

benvanik commented Jul 20, 2021

Currently in WebGPU there is only a single queue, and within that queue all submissions are processed in order (full barriers in between). This means that two independent workloads cannot execute concurrently even on devices able to do so. There are some proposals for a multi-queue extension that look interesting, as some introduce the fences we'd need to communicate this pipelining information to the implementation, but they are a ways out. gpuweb/gpuweb#1073 may be the latest; it has fences but may not have a way to indicate that two submitted workloads have no overlap.

We don't really need multi-queue yet (we don't generate multi-queue workloads from the compiler), but allowing concurrent execution within the same queue would be really nice - especially when overlapping invocations or executing multiple models concurrently. Today this will result in lower utilization/more bubbles/higher latency than a native implementation that can properly perform out of order/overlapped execution when told to do so.

(The spec may make it possible for two subsequent submissions to overlap so long as the implicit dependency tracking handles things right; however, unlike the Vulkan spec there's no wording about multiple command buffers being treated as if they were recorded as a single command buffer.)
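
The serialization point in API terms (`modelACommands`/`modelBCommands` are hypothetical command buffers for fully independent workloads):

```ts
// Each submit behaves as if a full barrier separates it from the previous
// one, so these cannot overlap on the device even with zero shared buffers.
function submitBoth(queue: GPUQueue,
                    modelACommands: GPUCommandBuffer,
                    modelBCommands: GPUCommandBuffer) {
  queue.submit([modelACommands]);
  queue.submit([modelBCommands]); // serialized after modelA
}
```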

benvanik commented Jul 20, 2021

It's low priority as there are no browser tools (though wgpu/dawn may handle it), but it's hard to tell how debug groups work. PushDebugGroup/PopDebugGroup are specified on both the CommandEncoder and ComputePassEncoder, and each says they must be perfectly nested. I can't tell from the spec whether this means it's impossible to interleave root command encoder and compute pass encoder groups, whether they can span multiple passes, etc. I suspect it's a stack internal to each encoder but haven't verified in an implementation yet, and it'd be nice if the spec clarified usage. In general the valid usage of the encoders/passes is a bit opaque (I never liked it in Metal for the same reasons) and clarifications there would be welcome.

Update: this clarifies it:
[screenshot of the spec's debug-group validation rules]

A group within a pass encoder cannot escape that encoder, and groups on the primary encoder may span passes but may only be pushed/popped while no pass is open. This will require a bit of state tracking in our command buffer as it tries to reuse open compute pass encoders: it'll need to know whether it must close the current pass in order to pop a debug group that had been pushed on the encoder.
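
A sketch of the nesting rules as read above (group labels are illustrative):

```ts
function recordWithDebugGroups(device: GPUDevice): GPUCommandBuffer {
  const encoder = device.createCommandEncoder();
  encoder.pushDebugGroup("invocation"); // encoder-level: may span passes
  const pass = encoder.beginComputePass();
  pass.pushDebugGroup("dispatch 0");    // pass-level: cannot escape the pass
  // ... setPipeline/setBindGroup/dispatchWorkgroups ...
  pass.popDebugGroup();                 // must pop before end()
  pass.end();
  // Encoder-level groups may only be popped while no pass is open:
  encoder.popDebugGroup();
  return encoder.finish();
}
```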

benvanik commented Jul 20, 2021

Due to the Metal-like encoder nature of command buffers it will likely be less efficient to interleave copies and dispatches, as copyBufferToBuffer is only available on GPUCommandEncoder and may only be called when no other passes are active. This means that a sequence of copy+dispatch+copy+dispatch results in two compute passes interleaved with the copies. I don't know if this matters in practice; it will depend on the implementation and may be complicated by the implicit barrier tracking (I could see most implementations just putting barriers at the head/tail of each pass, for example). It would have been clearer that these are exclusive operations if they lived in their own GPUTransferPassEncoder instead of effectively being in one by implicit specification, though that would introduce even more API overhead when recording sequences of these commands.

I think this was mainly just an unexpected difference from Vulkan/Metal where the WebGPU solution feels like a mashup of the two without the benefits of either. In Vulkan you can just interleave vkCmdCopy* commands with any other command while in Metal there is a dedicated MTLBlitCommandEncoder. Here you can't interleave but there's also not a dedicated encoder.
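
The copy+dispatch+copy+dispatch sequence in code (buffer names and sizes are hypothetical):

```ts
// Each copy forces the current compute pass closed, yielding two passes
// where a single pass would otherwise suffice.
function recordCopyDispatchSequence(
  device: GPUDevice,
  stagingA: GPUBuffer, inputA: GPUBuffer, sizeA: number,
  stagingB: GPUBuffer, inputB: GPUBuffer, sizeB: number,
): GPUCommandBuffer {
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(stagingA, 0, inputA, 0, sizeA);
  let pass = encoder.beginComputePass();
  // ... dispatch 1 ...
  pass.end(); // the pass must close before the next copy is legal
  encoder.copyBufferToBuffer(stagingB, 0, inputB, 0, sizeB);
  pass = encoder.beginComputePass();
  // ... dispatch 2 ...
  pass.end();
  return encoder.finish();
}
```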

Filed some feedback that some examples may be useful in the spec for both this and the debug groups given the nesting/pass interplay: gpuweb/gpuweb#1969

benvanik commented Jul 20, 2021

There don't seem to be reusable command buffers or secondary/nested command buffers in WebGPU. That's unfortunate but something we can work around with the deferred command buffer approach we use in CUDA. The advantage of reusable command buffers in WebGPU would be significantly greater than on native platforms, as the API call overhead and object lifetime tracking are much more expensive in WebGPU/JS/etc, making it a little disappointing that they aren't considered in the spec - but compared to the bind group issues 🤷
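
A minimal sketch of what a deferred approach can look like (illustrative, not the actual IREE implementation): record commands once as closures, then replay them into a fresh encoder on every submission to fake reusability.

```ts
type RecordedCommand = (encoder: GPUCommandEncoder) => void;

class DeferredCommandBuffer {
  private readonly commands: RecordedCommand[] = [];

  record(command: RecordedCommand): void {
    this.commands.push(command);
  }

  // Pays the full encoding cost again on every replay - the overhead that
  // native reusable command buffers would avoid.
  replay(device: GPUDevice): GPUCommandBuffer {
    const encoder = device.createCommandEncoder();
    for (const command of this.commands) command(encoder);
    return encoder.finish();
  }
}
```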

Filed for clarification: gpuweb/gpuweb#1971

benvanik commented Jul 20, 2021

A big one here that may block the prototype for a bit is that buffers cannot be mapped while any dispatch using them is in-flight. From GPUQueue.submit:
[screenshot of the GPUQueue.submit validation rules]

In the long term that may be fine, as we shouldn't need persistent mapping in most cases once the compiler properly attributes buffers; however, the fact that mapping a buffer is a promise-based async operation throws a giant irradiated wrench into things: we just won't be able to efficiently do any kind of readback, as far as I can tell, without bouncing out to the performance-sapping craziness that is asyncify. Reading the spec I'm not even sure how to properly use staging buffers: I can't write from the host to buffers that aren't mapped, mapping them is an async operation, and I can't submit or have any work in-flight while something is mapped.

This may actually be a deal breaker: I don't see how to write efficiently pipelined compute code when even non-interfering regions of buffers can't be touched without a promise and full device synchronization. I need to do more digging but I'm worried.

Update: I just mocked up using a pool of readback buffers. Each readback issues a command buffer with a copyBufferToBuffer, synchronizes with the device, maps the buffer, and fetches the contents. This is workable, but only with asyncify: once the buffer is ready we need to map it, and that can only happen asynchronously. Batching these is difficult, and since we can't always know the sizes of things coming back (without more expensive computation in the face of dynamic shapes/etc) this leads to either overcommitting these buffers or needing to map multiple - which the API does not make easy (each mapping returns a promise). Worst-case I think I could make this work, but I won't be happy about it: it'll require asyncify (effectively a 50%+ performance penalty on the entire host-side runtime) and introduce a non-trivial amount of latency (~100us-1ms per readback, vs the near 0 of native). Going to file an issue and see if anyone from the webgpu side knows something I'm missing.
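
The readback dance as a single async round trip (a hypothetical helper; a real pool would recycle `staging` rather than allocate/destroy it per call):

```ts
async function readBuffer(
  device: GPUDevice, source: GPUBuffer, sourceOffset: number, size: number,
): Promise<ArrayBuffer> {
  const staging = device.createBuffer({
    size,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(source, sourceOffset, staging, 0, size);
  device.queue.submit([encoder.finish()]);
  // The async-only part: there is no synchronous way to wait for and map
  // the result, hence the need for asyncify in a synchronous runtime.
  await staging.mapAsync(GPUMapMode.READ);
  const contents = staging.getMappedRange().slice(0);
  staging.unmap();
  staging.destroy();
  return contents;
}
```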

Filed a request for a GPUQueue.readBuffer: gpuweb/gpuweb#1972

benvanik commented Dec 3, 2021

Poking at this again - looks like most of the issues still exist and I found some others I hadn't yet run into.

One is that there's no way to release resources like bind groups/layouts/command buffers/etc besides tearing down the entire WebGPU stack - there's some recent activity in webgpu-native/webgpu-headers#15 but it's been unpromisingly stalled for 2 years.

There's also still no great way to get around the purely async, browser-focused behavior: the non-blocking wgpuInstanceProcessEvents got added to webgpu.h, but it looks like dawn doesn't expose it yet. There's a wgpu::Device::tick method that can be used, but it also doesn't block like wgpu-native's wgpuDevicePoll(..., /*force_wait=*/true), which is unfortunate. Some discussion here: webgpu-native/webgpu-headers#117 + webgpu-native/webgpu-headers#91, but no good solutions. Since there's still no synchronous mapping or concurrent use of buffers - even from web workers - it looks like a complex pthreads-based solution with semaphores and callbacks is going to be needed in both native and browser environments 😢

Still not feeling like WebGPU is really ready yet, but if I can at least get something working it'll be easier to make some informed arguments. At this point I'm really just watching out for the things that make it extremely difficult/impossible to use in practice, because I'm tired of not having this.

benvanik commented Dec 3, 2021

I was going to use dawn, but wgpu-native may be a better choice today - its additions to webgpu.h have most of what's needed: https://github.com/gfx-rs/wgpu-native/blob/master/ffi/wgpu.h (drop and wait being the big ones), and it's possible to write a reasonable compute program with it: https://github.com/gfx-rs/wgpu-native/blob/master/examples/compute/main.c - while dawn requires a busy-loop and sleeps: https://dawn.googlesource.com/dawn/+/refs/heads/main/examples/ComputeBoids.cpp#330

benvanik commented Dec 3, 2021

wgpu-native setup was fairly painless, and having a blocking sync will be really helpful in the initial bringup. For getting things working in emscripten/dawn we're going to need a callback manager; the idea being that when we give the webgpu API a callback (for mapping, submission done, etc) we can associate it back with the caller, ref count things, and track pending work. Simulating blocking in a way that works is going to be a bit special, though, and I'm not sure it's even possible in the browser: ideally we'd have a futex the blocking thread could wait on and the callbacks could signal, but it's unclear whether in the browser any callback will ever be called without first returning to the main browser loop. That makes sense on the main javascript thread but not from worker threads, and unfortunately it doesn't seem like the webgpu design has handled this :/

benvanik commented Dec 9, 2021

Found another issue with bind groups and the lack of native pooling in WebGPU: each bind group strongly retains the buffers it references, meaning that trying to build a pool in user code is not possible - retaining a bind group to save a call to create a new one would keep all of the referenced buffers alive, potentially hundreds of MB across an entire pool. WebGPU really needs a built-in pool with weak references to avoid this. For now I'm just flushing the pool after every submission - it means we'll need to recreate the bind groups per command buffer set, but compared to retaining all that memory it seems better.
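
A sketch of the flush-after-submit policy (illustrative names; keying the cache is left abstract): keeping the cache alive only within one submission bounds how long the bind groups, and thus the buffers they retain, stay alive.

```ts
class BindGroupCache {
  private readonly groups = new Map<string, GPUBindGroup>();

  getOrCreate(key: string, create: () => GPUBindGroup): GPUBindGroup {
    let group = this.groups.get(key);
    if (group === undefined) {
      group = create();
      this.groups.set(key, group);
    }
    return group;
  }

  // Called after every queue.submit(): drop all references so the
  // implementation can release the retained buffers.
  flush(): void {
    this.groups.clear();
  }
}
```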

benvanik commented Dec 9, 2021

kvark helped clarify that implementations should drop buffers as soon as possible (which extends to after all in-flight operations complete): https://gpuweb.github.io/gpuweb/#buffer-destruction

So an implementation should be fine with outstanding bind groups that reference buffers that were destroyed - it'd just be invalid to use them (that is to say they are already weak).

benvanik commented

Some discussion on this issue about how to work around the biggest blocker, the async-only behavior: gpuweb/gpuweb#2217 (comment)
I think if we had that we could be running in the browser with no issue (and work with dawn without a spin-loop).

benvanik commented

In bringing up dawn I got this validation error:

[DAWN Validation] Buffer usages (BufferUsage::(MapRead|MapWrite|CopySrc|CopyDst|Uniform|Storage|Indirect)) is invalid. If a buffer usage contains BufferUsage::MapWrite the only other allowed usage is BufferUsage::CopySrc.

Unfortunately this is true in the spec:
[screenshot of the spec's buffer usage validation rules]

wgpu-native allowed it so I didn't hit it there and thought I was in the clear. This restriction - that anything mappable cannot be used for anything but transfer, and that a buffer cannot be used for both upload and download - will be a significant point of complexity in any compute usage of WebGPU. I'm not quite sure how to support this without some really nasty copies/extra allocations, which are unfortunately just something the WebGPU path is going to have to eat :( It'd be less bad if GPUQueue had readBuffer in addition to writeBuffer (as then we'd get timeline-ordered operations and let the implementation manage its ringbuffer), but we don't have that :(
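
The usage combinations at issue, as a sketch (sizes illustrative):

```ts
function demonstrateUsageRestrictions(device: GPUDevice) {
  // MAP_WRITE may only be combined with COPY_SRC, so this fails validation:
  const rejected = device.createBuffer({
    size: 4096,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
  });
  // What the spec permits: mappable buffers are transfer-only, and each
  // direction needs its own staging buffer.
  const upload = device.createBuffer({
    size: 4096,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  const download = device.createBuffer({
    size: 4096,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
}
```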
