Buffer & Texture Upload
This document aims to summarize the different open and solved problems around our usage of GPU textures, constraints, experiments, known tradeoffs, etc.
Texture and buffer allocation is often slow, particularly on Intel GPUs.
On some Intel GPUs (Haswell and older) we observed that allocating a buffer and mapping another buffer causes a sync point.
When allocating a texture with glTexImage, it isn't required to specify up front whether a mip chain will be needed; one could be added at any time later. Drivers usually implement this by allocating the entire mip chain regardless of whether you use it or not. The texture_storage extension is one option for avoiding the allocation of unneeded mip chains, but
https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_texture_storage.txt
Texture uploads rely on some alignment properties of both the source data on the CPU and the destination on the GPU. This alignment restriction can be large (256 bytes!), which is a problem when uploading small texture cache items. Not satisfying the minimum alignment can trigger driver bugs or make us fall into a slow, synchronous path in the driver.
See:
- Bug 1603783.
- https://searchfox.org/mozilla-central/rev/4e228dc5f594340d35da7453829ad9f3c3cb8b58/gfx/wr/webrender/src/device/gl.rs#1528
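To make the cost of the row alignment concrete, here is a minimal sketch of the arithmetic, assuming the alignment applies to the byte stride of each row in the staging buffer. The helper names (`align_up`, `aligned_row_stride`, `staging_size`) are hypothetical and not taken from WebRender.

```rust
/// Round `value` up to the next multiple of `alignment` (which must be non-zero).
fn align_up(value: usize, alignment: usize) -> usize {
    assert!(alignment > 0);
    (value + alignment - 1) / alignment * alignment
}

/// Byte stride of one row once it is padded to the required alignment.
fn aligned_row_stride(width_px: usize, bytes_per_pixel: usize, alignment: usize) -> usize {
    align_up(width_px * bytes_per_pixel, alignment)
}

/// Total staging space needed for a `width_px` x `height_px` upload with padded rows.
fn staging_size(
    width_px: usize,
    height_px: usize,
    bytes_per_pixel: usize,
    alignment: usize,
) -> usize {
    aligned_row_stride(width_px, bytes_per_pixel, alignment) * height_px
}
```

With a 256-byte alignment, a 16x16 item at 4 bytes per pixel holds 1024 bytes of pixel data but needs `staging_size(16, 16, 4, 256)` = 4096 bytes of staging space, which is why small texture cache items are the painful case.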
Different platforms prefer different pixel formats.
OpenGL ES core only supports RGBA. Windows is very BGRA centric. Chrome supports this by choosing one of RGBA/BGRA at build time and using that everywhere.
Texture_storage doesn’t support BGRA on Mac.
Client storage prefers (requires) BGRA but wants the texture_rectangle extension, which doesn’t support texture arrays. https://github.com/jrmuizel/client-storage-rs is an exploration into what’s fast and what’s not.
This describes our current approach
We can lose our GPU-side data under some conditions that are out of our control, meaning we have to keep CPU-side copies. We also need to properly detect when GPU-side data is lost and re-upload it.
GLX_NV_robustness_video_memory_purge (NVIDIA binary Linux driver) causes framebuffer attachment content invalidation on resume-from-sleep:
Previously in WebGL: https://bugzilla.mozilla.org/show_bug.cgi?id=1492580
Previously in WR: https://bugzilla.mozilla.org/show_bug.cgi?id=1484782
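One way to picture the "keep a CPU-side copy and re-upload on loss" requirement is the sketch below. It assumes the device layer bumps a counter each time it detects that GPU memory was purged; `CachedTexture`, `pending_upload` and the `device_epoch` counter are all hypothetical names, not WebRender's.

```rust
/// Hypothetical cache entry that keeps the CPU-side copy of its pixels.
struct CachedTexture {
    cpu_copy: Vec<u8>,
    /// Device epoch at the time of the last upload, `None` if never uploaded.
    uploaded_epoch: Option<u64>,
}

impl CachedTexture {
    fn new(cpu_copy: Vec<u8>) -> Self {
        CachedTexture { cpu_copy, uploaded_epoch: None }
    }

    /// Returns the bytes to (re-)upload if the GPU-side copy is missing or
    /// was made before the last detected loss, and records the new epoch.
    fn pending_upload(&mut self, device_epoch: u64) -> Option<&[u8]> {
        if self.uploaded_epoch == Some(device_epoch) {
            return None;
        }
        self.uploaded_epoch = Some(device_epoch);
        Some(&self.cpu_copy)
    }
}
```

The detection step itself (for example through a robustness extension) is outside this sketch; it only shows why the CPU-side copy has to outlive the first upload.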
Most GPUs support non-power-of-two sizes; however, a lot of them will internally round the size up to a power of two. In addition, some operations are faster on power-of-two sized textures, at least on Intel hardware: https://software.intel.com/en-us/articles/opengl-performance-tips-power-of-two-textures-have-better-performance
Note: we do already use power-of-two sizes in texture atlases
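As a rough illustration of what internal rounding can cost, the sketch below assumes a driver that rounds both dimensions up independently. That is an assumption, since actual behavior varies by GPU and driver. The function names are hypothetical.

```rust
/// Size that a driver rounding up to powers of two would allocate
/// for a requested `width` x `height` texture.
fn rounded_up_size(width: u32, height: u32) -> (u32, u32) {
    (width.next_power_of_two(), height.next_power_of_two())
}

/// Fraction of the rounded-up allocation that the requested size does not cover.
fn unused_fraction(width: u32, height: u32) -> f64 {
    let (w, h) = rounded_up_size(width, height);
    1.0 - (width as f64 * height as f64) / (w as f64 * h as f64)
}
```

Under that assumption, a 1025x600 request becomes 2048x1024 and leaves roughly 70% of the allocation unused, while a 1024x1024 request wastes nothing.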
Non-power-of-two textures also have the following restrictions:
- You can't use mipmap filtering with them.
- You can use only these wrap modes: GL_CLAMP, GL_CLAMP_TO_EDGE, and GL_CLAMP_TO_BORDER.
- The texture cannot have a border.
- The texture uses non-normalized texture coordinates.
Texture arrays have some advantages such as resizing granularity, however:
- No GLES2 support.
- Various bugs on mobile (https://github.com/servo/webrender/wiki/Driver-issues#bug-1505508---glblitframebuffer-and-glcopytexsubimage-dont-work-correctly-with-texture-arrays-on-adreno).
- The number of layers affects the precision when sampling leading to unpredictable small fuzzy differences in reftests.
- Resizing large texture arrays is causing performance issues: https://bugzilla.mozilla.org/show_bug.cgi?id=1616901
- Mac bug with 64+ layers: https://github.com/servo/webrender/wiki/Driver-issues#bug-1505664---increasing-shared-texture-cache-to-64-layers-breaks-rendering-on-mac-intel-opengl-driver
On some Mali GPUs, there are issues when blitting into a texture without clearing it first, which causes problems when attempting partial updates of the texture cache from a staging texture. https://github.com/servo/webrender/wiki/Driver-issues#bug-1669960---using-glblitframebuffers-to-copy-picture-cache-textures-results-in-corruption-on-mali-gxx
There is a bug in fxc causing shader compilation times to explode on ANGLE with large UBOs (large being on the order of 30 elements, and it gets worse the larger the UBO is). https://bugs.chromium.org/p/angleproject/issues/detail?id=3682
When mapping a buffer, we often get write-combined memory. With this type of memory it is important to fill entire cache lines sequentially without leaving holes.
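A minimal sketch of a "no holes" fill: rows are copied in increasing address order and the padding between rows is written too, so every byte of the destination range is stored exactly once and never read back. Here `dst` is a plain slice standing in for a mapped buffer, and `fill_sequential` is a hypothetical helper. Whether the compiler and CPU actually emit full-cache-line writes is not something this code can guarantee.

```rust
/// Copy `rows` tightly packed rows of `row_bytes` bytes from `src` into `dst`,
/// where each destination row occupies `dst_stride` bytes.
/// Padding bytes are zeroed rather than skipped so that no holes are left.
fn fill_sequential(dst: &mut [u8], src: &[u8], row_bytes: usize, dst_stride: usize, rows: usize) {
    assert!(row_bytes > 0 && dst_stride >= row_bytes);
    assert!(src.len() >= row_bytes * rows);
    assert!(dst.len() >= dst_stride * rows);

    let dst_rows = dst.chunks_exact_mut(dst_stride);
    let src_rows = src.chunks_exact(row_bytes);
    for (dst_row, src_row) in dst_rows.zip(src_rows).take(rows) {
        let (pixels, padding) = dst_row.split_at_mut(row_bytes);
        pixels.copy_from_slice(src_row); // write the payload first...
        padding.fill(0);                 // ...then the padding that follows it
    }
}
```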
At the moment we use glTexSubImage on Windows/ANGLE, PBOs everywhere else.
Nothing prevents glTexSubImage from being fast in theory, but in practice a lot of drivers end up blocking until the destination texture is no longer in use. This is unlike D3D's UpdateSubresource, which tends to be efficiently implemented. So we generally don't want to use glTexSubImage when running on an actual GL implementation. On ANGLE it is fine as long as we land on the UpdateSubresource code path. However, there is a driver bug workaround in ANGLE that prevents us from hitting UpdateSubresource some of the time and makes it do something slow instead. It is related to some 128-bit texture formats such as RGBA_F32, which we use for the GPU cache.
On (low-end) Intel + Windows, the per-call overhead of glTexSubImage is high (making the upload of individual glyphs costly).
PBOs are the preferred way to upload data to the GPU when running on a real GL driver, except on Mac where client storage is better.
D3D11 doesn't have something that corresponds well to PBOs so ANGLE's PBO emulation is slow. This is why we use glTexSubImage on Windows.
Allocating a new PBO for each thing we need to upload causes performance issues. What we do now is allocate a large PBO at the beginning of command submission in the renderer and copy several things into it. We don't do this on Mac AMD because of a driver bug (https://github.com/servo/webrender/wiki/Driver-issues#bug-1603783-pbo-uploads-with-non-zero-broken-on-amd-macos-1015).
It would be better to recycle the PBOs but this isn't showing up as a major issue right now.
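The batching and recycling ideas can be sketched with a CPU-side model of one large staging buffer. This is not the renderer's actual code: `StagingBuffer`, `StagedRange`, and the bump-allocation scheme are assumptions made for illustration, and a `Vec<u8>` stands in for the mapped PBO.

```rust
/// Where one upload landed inside the staging buffer.
#[derive(Debug, PartialEq)]
struct StagedRange {
    offset: usize,
    len: usize,
}

/// CPU-side stand-in for one large PBO shared by several uploads.
struct StagingBuffer {
    data: Vec<u8>,
    cursor: usize,
    alignment: usize,
}

impl StagingBuffer {
    fn new(capacity: usize, alignment: usize) -> Self {
        assert!(alignment > 0);
        StagingBuffer { data: vec![0; capacity], cursor: 0, alignment }
    }

    /// Copy `bytes` to the next aligned offset, or return `None` if it doesn't fit.
    fn push(&mut self, bytes: &[u8]) -> Option<StagedRange> {
        let offset = (self.cursor + self.alignment - 1) / self.alignment * self.alignment;
        let end = offset.checked_add(bytes.len())?;
        if end > self.data.len() {
            return None;
        }
        self.data[offset..end].copy_from_slice(bytes);
        self.cursor = end;
        Some(StagedRange { offset, len: bytes.len() })
    }

    /// Make the whole buffer available again instead of allocating a new one.
    fn reset(&mut self) {
        self.cursor = 0;
    }
}
```

Each returned `StagedRange` would be the offset passed to the texture upload call. With a real PBO, `reset` is only safe once the GPU has finished reading the previous contents, which this sketch does not model.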
Mac-specific. We provide the driver with a pointer to our own memory which it can copy from asynchronously. Lots of restrictions (see constraints and issues).
https://github.com/jrmuizel/client-storage-rs contains code for testing different upload strategies on Mac.
The GPU cache uses an RGBA_F32 texture to store float data.
When updating the GPU cache, we often have small blocks to update scattered across the cache's texture. The PBO upload code path operates at row granularity, which can lead to large amounts of data being uploaded. The alternative "scatter" mode pushes the blocks into a contiguous buffer and issues a draw call to update the GPU cache's texture with point sprites. It also has the benefit of not hitting the bad code path related to RGBA_F32 textures and to the UpdateSubresource driver bug on ANGLE.
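The difference between the two modes can be estimated with a small sketch. It assumes 16 bytes per texel (four f32 channels) and one-row-tall dirty blocks, and it ignores any per-block destination coordinates that a scatter upload would also have to send. `DirtyBlock` and both functions are hypothetical.

```rust
/// Four f32 channels per texel.
const BYTES_PER_TEXEL: usize = 16;

/// A dirty run of `width` texels starting at (`x`, `y`) in the cache texture.
#[derive(Clone, Copy)]
struct DirtyBlock {
    x: u32,
    y: u32,
    width: u32,
}

/// Row-granular upload: every row touched by a dirty block is sent in full.
fn row_upload_bytes(blocks: &[DirtyBlock], texture_width: u32) -> usize {
    let mut rows: Vec<u32> = Vec::with_capacity(blocks.len());
    for block in blocks {
        assert!(block.x + block.width <= texture_width);
        rows.push(block.y);
    }
    rows.sort_unstable();
    rows.dedup();
    rows.len() * texture_width as usize * BYTES_PER_TEXEL
}

/// Scatter upload: only the dirty texels are sent, packed contiguously.
fn scatter_upload_bytes(blocks: &[DirtyBlock]) -> usize {
    blocks.iter().map(|b| b.width as usize).sum::<usize>() * BYTES_PER_TEXEL
}
```

For three 2-texel blocks spread over two rows of a 1024-texel-wide texture, the row path sends 32768 bytes where the scatter path sends 96 bytes of payload.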
We would like to stop using a texture for the GPU cache, however:
- UBOs must declare a fixed size in the shaders
- There is the fxc issue causing shader compilation time to explode on ANGLE.
- Not sure what proportion of users have SSBOs (ANGLE appears to support them).
- ANGLE team confirmed support for them, though can’t bind one buffer to multiple SSBOs in D3D11 (right now?)
Vertex instances use VBOs.
Cross-process texture sharing and synchronization
- DXGI on Windows
- Surface Texture on Android
- We need something on Linux+Wayland and Linux+X11 for WebGL, canvas, and video frames. Martin Stránský started adding support for dmabuf textures with Wayland, but we don't use it with WebRender at the moment.
- Gralloc on b2g (RIP)
TODO: what bugs and restrictions do we have with each of these?
TODO
- Texture array
  - pros: easy to grow the atlas without changing offsets
  - cons: driver issues on mobile, some low-end hardware doesn't support it
- We used to have a guillotine allocator
  - slow deallocation performance on the CPU
  - fragmentation issues.
  - Guillotiere: https://github.com/nical/guillotiere solves the deallocation perf
- We don't use a guillotine allocator anymore
- switched to a slab allocator
  - replaced the guillotine allocator
  - power-of-two square slab sizes and a few specific non-square slab sizes, one slab size per texture layer
  - no fragmentation issue
  - a lot of wasted space per allocation (typically ~50%)
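The wasted space of the slab approach follows from rounding each request up to a slab size. The sketch below covers only the power-of-two square slabs, ignores the few non-square sizes mentioned above, and assumes a minimum slab size of 16. That value and the function names are hypothetical.

```rust
/// Hypothetical smallest slab size, in texels.
const MIN_SLAB_SIZE: u32 = 16;

/// Side of the smallest power-of-two square slab that fits the request.
fn slab_size_for(width: u32, height: u32) -> u32 {
    width.max(height).max(MIN_SLAB_SIZE).next_power_of_two()
}

/// Fraction of the slab that the requested rectangle leaves unused.
fn wasted_fraction(width: u32, height: u32) -> f64 {
    let slab = slab_size_for(width, height) as f64;
    1.0 - (width as f64 * height as f64) / (slab * slab)
}
```

A 24x24 request lands in a 32x32 slab and wastes about 44% of it. Elongated requests do worse, which is presumably where the non-square slab sizes help.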
The large row alignment requirement is a big constraint here (we don't deal with it well at the moment).
- WebRender driver issues: https://github.com/servo/webrender/wiki/Driver-issues
- Texture cache reallocation perf bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1616901
- Arm Mali developer best practices: https://static.docs.arm.com/101897/0200/arm_mali_gpu_best_practices_developer_guide_101897_0200_00_en.pdf
- Apple texture best practices: https://developer.apple.com/library/archive/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/opengl_texturedata.html
- Intel and power-of-two texture sizes: https://software.intel.com/en-us/articles/opengl-performance-tips-power-of-two-textures-have-better-performance
- Intel Gen9 optimization guide: https://software.intel.com/sites/default/files/managed/49/2b/Graphics-API-Performance-Guide-2.5.pdf
- Intel Gen11 optimization guide: https://software.intel.com/en-us/articles/developer-and-optimization-guide-for-intel-processor-graphics-gen11-api
D3D11 has two methods for uploading textures: UpdateSubresource or Map + CopySubresourceRegion.
RenderCompositorD3D11SWGL::TileD3D11::Map supports 4 different upload strategies: UpdateSubresource via Upload_Immediate and a couple of variants of Map + CopySubresourceRegion via Upload_Staging, Upload_StagingNoBlock, and Upload_StagingPooled. The current default is Upload_StagingPooled.
The documentation for UpdateSubresource suggests that it shouldn't block waiting for the GPU but https://share.firefox.dev/3zWf1ZA is evidence that it does.