From b5c53845d099f502f40b51718bc88489f535869d Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Sat, 14 Oct 2023 14:52:42 +0200 Subject: [PATCH 01/14] WIP rendering performance news --- content/news/2023-10-21-bevy-0.12/index.md | 220 +++++++++++++++++++++ 1 file changed, 220 insertions(+) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 5321faf913..cd3ee07e02 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -19,6 +19,226 @@ Since our last release a few months ago we've added a _ton_ of new features, bug
authors: @author
+## Automatic Batching and Instancing, and the Road to GPU-driven Rendering + +
authors: Rob Swain (@superdump), @james-j-obrien, @JMS55, @inodentry, @robtfm, @nicopap, @teoxoy, @IceSentry, @Elabajaba
+ +Bevy's renderer performance for 2D and 3D meshes can improve a lot. Both CPU and graphics API / GPU bottlenecks can be removed to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2, through WebGPU, to the highest-end native discrete graphics cards. A solid foundation can support all of this. + +### What are the bottlenecks? + +One major bottleneck is the structure of the data used for rendering. + +* Mesh entity data is stored in one uniform buffer, but has to be rebound at different dynamic offsets per draw. +* Material type data (e.g. StandardMaterial uniform properties) are stored in individual uniform buffers that have to be rebound per draw if the material changes. +* Material textures are stored individually and have to be rebound per draw if the texture changes. +* Mesh index / vertex buffers are stored individually per-mesh and have to be rebound per draw. + +All of this rebinding has both CPU and graphics API / GPU performance impact. On the CPU, it means encoding of draw commands has many more steps and takes more time than necessary. In the graphics API and on the GPU, it means issuing many more rebinding and separate draw commands. + +Avoiding rebinding is both a big performance benefit for CPU-driven rendering, and is necessary to enable GPU-driven rendering. + +### What are CPU- and GPU-driven rendering? + +CPU-driven rendering is where draw commands are created on the CPU, in Bevy this means in Rust code, more specifically in render graph nodes. + +In GPU-driven rendering, the draw commands are encoded on the GPU by compute shaders. This leverages GPU parallelism, and unlocks more advanced culling optimisations that are infeasible to do on the CPU, among many other methods that bring large performance benefits. + +### Reorder Render Sets + +
authors: Rob Swain (@superdump), @james-j-obrien, @inodentry
+ +The order of draws needs to be known for some methods of instanced draws so that the data can be laid out, and looked up in order. For example, when per-instance data is stored in an instance-rate vertex buffer. + +The render set order before 0.12 caused some problems with this as data had to be prepared before knowing the draw order. The previous order of sets was: + +* Extract +* Prepare +* Queue +* Sort/Batch +* Render + +This constraint was most visible in the sprite batching implementation that skipped Prepare, sorted and prepared data in Queue, and then after being sorted again alongside 2D meshes and other entities in the Transparent2d render phase, possibly had its batches split to enable drawing of those other entities. + +The new render set order in 0.12 is: + +* Extract +* PrepareAssets +* Queue +* Sort +* Prepare/Batch +* Render + +PrepareAssets was introduced because we only want to queue entities for drawing once their assets have been prepared. Per-frame data preparation still happens in the Prepare set, but that is now after Queue and Sort, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. + +### BatchedUniformBuffer and GpuArrayBuffer + +OK, so we need to put many pieces of data of the same type into buffers in a way that we can bind them as few times as possible and draw multiple instances from them. How can we do that? + +Instance-rate vertex buffers are one way, but they are very constrained to having a specific order. They are/may be suitable for per-instance data like mesh entity transforms, but they can't be used for material data. + +The other main options are uniform and storage buffers. WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. Data textures are also an option, but are far more awkward for structured data and without support for linear data layouts on some platforms, they will perform less well. So, support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. + +#### BatchedUniformBuffer + +
authors: Rob Swain (@superdump), @JMS55, @teoxoy, @robtfm, @konsolas
+ +We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, MeshUniform requires 144 bytes per instance, which means 113 instances per 16kB binding. If we want to draw more than 113 entities, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what BatchedUniformBuffer is designed to solve. + +DEMO RUST CODE. + +DEMO SHADER CODE. + +PERFORMANCE IMPROVEMENT. + +#### GpuArrayBuffer + +
authors: Rob Swain (@superdump), @JMS55, @IceSentry, @mockersf
+ +If users have to care about supporting both batched uniform and storage buffers to store arrays of data for use in shaders, many may choose not to because their priority is not WebGL2. We want to make it simple and easy to support all users. + +GpuArrayBuffer was designed and implemented as an abstraction over BatchedUniformBuffer and using a StorageBuffer to store an array of T. + +DEMO RUST CODE. + +DEMO SHADER CODE. + +PERFORMANCE IMPROVEMENT. + +### 2D / 3D Mesh Entities using GpuArrayBuffer + +
authors: Rob Swain (@superdump), @robtfm, @Elabajaba
+ +The 2D and 3D mesh entity rendering was migrated to use GpuArrayBuffer for the mesh uniform data. + +DEMO RUST CODE + +DEMO SHADER CODE + +PERFORMANCE IMPROVEMENT. + +### Improved bevymark Example + +
authors: Rob Swain (@superdump), @IceSentry
+ +The bevymark example needed to be improved to enable benchmarking the batching / instanced draw changes. Modes were added to: + +* draw 2D quad meshes instead of sprites +* vary the per-instance color data instead of only varying the colour per wave of birds +* generate a number of material / sprite textures and randomly choose from them either per wave or per instance depending on the vary per instance setting + +This allows benchmarking of different situations for batching / instancing in the next section. + +### Automatic Batching/Instancing of Draw Commands + +
authors: Rob Swain (@superdump), @robtfm, @nicopap
+ +There are many operations that can be done to prepare a draw command in a render pass. If anything needs to change either in bindings or the draw itself, then the draws cannot be batched together into an instanced draw. Some of the main things that can change between draws are: + +* Pipeline +* BindGroup or its corresponding dynamic offsets +* Index/vertex buffer +* Index/vertex range +* Instance range + +Pipelines usually vary due to using different shaders in custom materials, or using variants of a material due to shader defs as the shader defs produce different shaders. Bind group bindings can change due to different material textures, different buffers, or needing to bind different parts of some buffers using dynamic offsets. Index/vertex buffers and/or ranges change per mesh asset. Instance range is what we want to leverage for instanced draws. + +#### Assumptions + +The design of the automatic batching/instanced draws in Bevy makes some assumptions to enable a reasonable solution: + +* Only entities with prepared assets are queued to render phases +* View bindings are constant across a render phase for a given draw function, as phases are per-view +* `batch_and_prepare_render_phase` is the only system that performs batching and has sole responsibility for preparing per-instance (i.e. mesh uniform) data + +If these assumptions do not work for your use case, then you can add the `NoAutomaticBatching` component to your entities to opt-out and do your own thing. Note that mesh uniform data will still be written to the GpuArrayBuffer and can be used in your own mesh bind groups. + +#### What 0.12 Enables + +We can batch draws into a single instanced draw in some situations now that per-instance mesh uniform data is in a GpuArrayBuffer. If the mesh entity is using the same mesh asset, and same material asset, then it can be batched! + +DEMO PERFORMANCE IMPROVEMENT. + +#### What is next for rendering performance? + +* Put material data into GpuArrayBuffer per material type (e.g. all StandardMaterial instances will be stored in one GpuArrayBuffer) - this enables batching of draws for entities with the same mesh, same material type and textures, but different material data! This is implemented on a branch. +* Put material textures into bindless texture arrays - this enables batching of draws for entities with the same mesh and same material type! +* Put mesh data into one big buffer per mesh attribute layout - this removes the need to rebind the index/vertex buffers per-draw, instead only vertex/index range needs to be passed to the draw command. A simple version of this is implemented on a branch. +* GPU-driven rendering + * @JMS55 is working on GPU-driven rendering already, using a meshlet approach + * Rob Swain (@superdump) also intends to implement an alternative method similar to what is done in rend3. + +## Rendering Performance Improvements + +### EntityHashMap + +
authors: Rob Swain (@superdump), @robtfm, @pcwalton, @jancespivo, @SkiFire13, @nicopap
+ +#### The Performance Problem + +Since Bevy 0.6, Bevy's renderer has used a separate render world to store an extracted snapshot of the simulated data from the main world to enable pipelined rendering of frame N in the render app, while the main app simulates frame N+1. + +Part of the design involves clearing the render world of all entities between frames. This enables consistent Entity mapping between the main and render worlds while still being able to spawn new entities in the render world that don't exist in the main world. + +Unfortunately, this ECS usage pattern also incurred some significant performance problems. Entities are cleared and respawned each frame, components are inserted across many systems and different parts of the render app schedule. + +The fastest ECS storage available is called table storage. A simplified concept for table storage is that it is a structure of arrays of component data. Each archetype has its own table for storage. Whenever a new component is inserted onto an entity that it didn't have before, its archetype is changed. This then requires that that entity's component data be moved from the table for the old archetype to the table for the new archetype. + +In practice this was very visible in profiles as long-running system commands throughout the render app schedule. + +DEMO PROFILE IMAGE + +As can be seen, this was unfortunately leaving a lot of performance on the table. Many ideas were discussed over a long period for how to improve this. The main two paths forward were: + +1. Persist render world entities and their component data across frames - this has the problem of memory leaks and Entity collisions +2. Stop using entities for storing component data in the render world + +We have decided to explore option 2 for Bevy 0.12 as persisting entities involves solving other problems that have no simple and satisfactory answers. + +#### Data Structures + +Ideally we would only ever need to iterate over dense arrays of data (e.g. `Vec`). CPUs are very good at this as it enables predictable data access that increases the cache hit rate and makes for very fast processing. + +The options, for component data `T`: + +* `Vec` +* `SparseSet` - contains a 'sparse' `Vec` and a dense `Vec` that is indexed by `Entity.index` and the `usize` value is the index into the sparse `Vec`. +* `HashMap` + +`Vec` cannot be used with the current renderer architecture. Mesh entities are extracted in extraction query iteration order. They are then queued to a render phase and sorted. They are iterated in phase order to prepare and batch into instanced draws. The renderer can be rearchitected to enable use of `Vec` but that is more intrusive than there was time to finalize for 0.12. + +`SparseSet` is a good option and performs well. Lookups for batching involve indexing into two `Vec`s, which for lookups are fast. It has the benefit that the dense `Vec` can be iterated directly if `Entity` order is irrelevant. + +However, `SparseSet` has the downside that the dense `Vec` has to be as large as the largest contained `Entity.index`. If you spawn a million entities, then despawn 999,999, leaving the millionth entity still spawned, every `SparseSet` for different `T` will have to have a `Vec` that is one million items large. + +`HashMap` is similar to `SparseSet`, and more familiar. It has good space complexity, with performance depending a lot on the hash function. 
Fast hash functions from the wild were tested AHash, FNV, FxHasher, SeaHasher, but all had quite a big performance drop compared to `SparseSet`. Ultimately, a hash function designed by @SkiFire13, inspired by `rustc-hash` was chosen that has strong performance and is robust enough for this usage. + +### Sprite Instancing + +Sprites were being rendered by generating a vertex buffer containing 4 vertices per sprite with position, UV, and possibly color data. This has proven to be very effective, but having to split batches of sprites into multiple draws because they use a different color is suboptimal. + +Sprite rendering now uses an instance-rate vertex buffer to store the per-instance data. It contains an affine transformation matrix that enables translation, scaling, and rotation in one transform. It contains per-instance color, and UV offset and scale. + +A quad is drawn by leveraging a rendering industry trick that leverages a special index buffer containing 6 indices. The indices encode the vertex position in their bits - the least significant bit is x, the next least significant bit is y. The vertices of the quad are then: + +```text +10 11 + +00 01 +``` + +This retains all the functionality of the previous method, enables the additional flexibility of any sprite being able to have a color tint and all still be drawn in the same batch, and uses a total of 80 bytes per sprite, versus 144 bytes previously. The practical result is a performance improvement of up to 40% versus the previous method! + +### Overall Performance vs 0.11 + +3D Meshes + +2D Meshes + +Sprites + +UI + ## What's Next? We have plenty of work that is pretty much finished and is therefore very likely to land in **Bevy 0.13**: From 057fc3e2785c0bd37ea821d3d6498d7f78d86358 Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Sat, 14 Oct 2023 15:11:16 +0200 Subject: [PATCH 02/14] More --- content/news/2023-10-21-bevy-0.12/index.md | 26 ++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index cd3ee07e02..cc22286412 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -159,14 +159,15 @@ We can batch draws into a single instanced draw in some situations now that per- DEMO PERFORMANCE IMPROVEMENT. -#### What is next for rendering performance? +#### What is next for batching/instancing and beyond? * Put material data into GpuArrayBuffer per material type (e.g. all StandardMaterial instances will be stored in one GpuArrayBuffer) - this enables batching of draws for entities with the same mesh, same material type and textures, but different material data! This is implemented on a branch. * Put material textures into bindless texture arrays - this enables batching of draws for entities with the same mesh and same material type! * Put mesh data into one big buffer per mesh attribute layout - this removes the need to rebind the index/vertex buffers per-draw, instead only vertex/index range needs to be passed to the draw command. A simple version of this is implemented on a branch. +* Put skinned mesh data into storage buffers if possible to enable instanced drawing of skinned mesh entities using the same mesh, skin, and material! * GPU-driven rendering - * @JMS55 is working on GPU-driven rendering already, using a meshlet approach - * Rob Swain (@superdump) also intends to implement an alternative method similar to what is done in rend3. 
+ * @JMS55 is working on GPU-driven rendering already, using a meshlet approach. + * Rob Swain (@superdump) intends to implement an alternative method similar to what is done in rend3. ## Rendering Performance Improvements @@ -211,7 +212,19 @@ The options, for component data `T`: However, `SparseSet` has the downside that the dense `Vec` has to be as large as the largest contained `Entity.index`. If you spawn a million entities, then despawn 999,999, leaving the millionth entity still spawned, every `SparseSet` for different `T` will have to have a `Vec` that is one million items large. -`HashMap` is similar to `SparseSet`, and more familiar. It has good space complexity, with performance depending a lot on the hash function. Fast hash functions from the wild were tested AHash, FNV, FxHasher, SeaHasher, but all had quite a big performance drop compared to `SparseSet`. Ultimately, a hash function designed by @SkiFire13, inspired by `rustc-hash` was chosen that has strong performance and is robust enough for this usage. +`HashMap` is similar to `SparseSet`, and more familiar. It has good space complexity, with performance depending a lot on the hash function. + +Fast hash functions from the wild were tested AHash, FNV, FxHasher, SeaHasher, but all had quite a big performance drop compared to `SparseSet`. Ultimately, a hash function designed by @SkiFire13, and inspired by `rustc-hash`, was chosen that has strong performance and is robust enough for this usage. This combination is called `EntityHashMap` and is the new way to store component data in the render world. + +The worst case performance with random Z-order spawning of 2D meshes/sprites in `bevymark` is similar to `SparseSet`. The best case performance with 2D meshes/sprites spawned in draw order + +#### EntityHashMap Helpers + +A helper plugin was added to make it simple and quick to extract main world data for use in the render world in the form of `ExtractInstancesPlugin`. You can extract all entities matching a query, or only those that are visible, extracting multiple components at once into one target type. + +It is a good idea to group component data that will be accessed together into one target type to avoid having to do multiple lookups. + +DEMO CODE ### Sprite Instancing @@ -239,6 +252,11 @@ Sprites UI +### What's next for rendering performance? + +* `EntityHashMap` is good, but imagine a world with only `Vec`, no lookups in hot loops, only in-order iteration, and maximum performance! +* Batching code already compares previous draw state (pipeline, bind groups, index/vertex buffers, etc) to current draw state. This is then repeated by `TrackedRenderPass` when encoding draws. This cost can be removed with a new API called `DrawStream`. + ## What's Next? We have plenty of work that is pretty much finished and is therefore very likely to land in **Bevy 0.13**: From a3da5782d5c01606c1f87888f8c374b398270726 Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Sat, 14 Oct 2023 15:25:28 +0200 Subject: [PATCH 03/14] More --- content/news/2023-10-21-bevy-0.12/index.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index cc22286412..fe8692f15f 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -23,18 +23,18 @@ Since our last release a few months ago we've added a _ton_ of new features, bug
authors: Rob Swain (@superdump), @james-j-obrien, @JMS55, @inodentry, @robtfm, @nicopap, @teoxoy, @IceSentry, @Elabajaba
-Bevy's renderer performance for 2D and 3D meshes can improve a lot. Both CPU and graphics API / GPU bottlenecks can be removed to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2, through WebGPU, to the highest-end native discrete graphics cards. A solid foundation can support all of this. +Bevy's renderer performance for 2D and 3D meshes can improve a lot. Both CPU and graphics API / GPU bottlenecks can be removed to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2 and mobile devices, to the highest-end native discrete graphics cards. A solid foundation can support all of this. ### What are the bottlenecks? One major bottleneck is the structure of the data used for rendering. -* Mesh entity data is stored in one uniform buffer, but has to be rebound at different dynamic offsets per draw. -* Material type data (e.g. StandardMaterial uniform properties) are stored in individual uniform buffers that have to be rebound per draw if the material changes. -* Material textures are stored individually and have to be rebound per draw if the texture changes. +* Mesh entity data is stored in one uniform buffer, but has to be rebound at different dynamic offsets for every single draw. +* Material type data (e.g. `StandardMaterial` uniform properties, but not textures) are stored in individual uniform buffers that have to be rebound per draw if the material changes. +* Material textures are stored individually and have to be rebound per draw if a mesh texture changes. * Mesh index / vertex buffers are stored individually per-mesh and have to be rebound per draw. -All of this rebinding has both CPU and graphics API / GPU performance impact. On the CPU, it means encoding of draw commands has many more steps and takes more time than necessary. In the graphics API and on the GPU, it means issuing many more rebinding and separate draw commands. +All of this rebinding has both CPU and graphics API / GPU performance impact. On the CPU, it means encoding of draw commands has many more steps to process and so takes more time than necessary. In the graphics API and on the GPU, it means many more rebinds, and separate draw commands. Avoiding rebinding is both a big performance benefit for CPU-driven rendering, and is necessary to enable GPU-driven rendering. @@ -60,6 +60,8 @@ The render set order before 0.12 caused some problems with this as data had to b This constraint was most visible in the sprite batching implementation that skipped Prepare, sorted and prepared data in Queue, and then after being sorted again alongside 2D meshes and other entities in the Transparent2d render phase, possibly had its batches split to enable drawing of those other entities. +The ordering of the sets also created some confusion about when bind groups should be created. Bind groups were meant to be created in Prepare, but sometimes they had to be created in Queue to ensure that some preparation had completed. + The new render set order in 0.12 is: * Extract @@ -67,9 +69,11 @@ The new render set order in 0.12 is: * Queue * Sort * Prepare/Batch + * PrepareResources + * PrepareBindGroups * Render -PrepareAssets was introduced because we only want to queue entities for drawing once their assets have been prepared. Per-frame data preparation still happens in the Prepare set, but that is now after Queue and Sort, so the order of draws is known. 
This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. +PrepareAssets was introduced because we only want to queue entities for drawing if their assets have been prepared. Per-frame data preparation still happens in the Prepare set, specifically in its PrepareResources subset. That is now after Queue and Sort, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. Bind groups now have a clear subset where they should be created - PrepareBindGroups. ### BatchedUniformBuffer and GpuArrayBuffer @@ -77,13 +81,13 @@ OK, so we need to put many pieces of data of the same type into buffers in a way Instance-rate vertex buffers are one way, but they are very constrained to having a specific order. They are/may be suitable for per-instance data like mesh entity transforms, but they can't be used for material data. -The other main options are uniform and storage buffers. WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. Data textures are also an option, but are far more awkward for structured data and without support for linear data layouts on some platforms, they will perform less well. So, support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. +The other main options are uniform and storage buffers. WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. Data textures are also an option, but are far more awkward for structured data, and without support for linear data layouts on some platforms, they will perform worse. We want to support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. #### BatchedUniformBuffer
authors: Rob Swain (@superdump), @JMS55, @teoxoy, @robtfm, @konsolas
-We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, MeshUniform requires 144 bytes per instance, which means 113 instances per 16kB binding. If we want to draw more than 113 entities, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what BatchedUniformBuffer is designed to solve. +We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, MeshUniform requires 144 bytes per instance, which means 113 instances per 16kB binding. If we want to draw more than 113 entities, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. DEMO RUST CODE. From 86ca41a0f82b087e7e7fb483631134ba21135a01 Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Mon, 16 Oct 2023 21:50:46 +0200 Subject: [PATCH 04/14] Feedback and more --- content/news/2023-10-21-bevy-0.12/index.md | 169 ++++++++++++++------- 1 file changed, 110 insertions(+), 59 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index fe8692f15f..d319d02585 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -36,7 +36,7 @@ One major bottleneck is the structure of the data used for rendering. All of this rebinding has both CPU and graphics API / GPU performance impact. On the CPU, it means encoding of draw commands has many more steps to process and so takes more time than necessary. In the graphics API and on the GPU, it means many more rebinds, and separate draw commands. -Avoiding rebinding is both a big performance benefit for CPU-driven rendering, and is necessary to enable GPU-driven rendering. +Avoiding rebinding is both a big performance benefit for CPU-driven rendering, including WebGL2, and is necessary to enable GPU-driven rendering. ### What are CPU- and GPU-driven rendering? @@ -89,37 +89,85 @@ The other main options are uniform and storage buffers. WebGL2 does not support We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, MeshUniform requires 144 bytes per instance, which means 113 instances per 16kB binding. If we want to draw more than 113 entities, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. -DEMO RUST CODE. - -DEMO SHADER CODE. - -PERFORMANCE IMPROVEMENT. - #### GpuArrayBuffer
authors: Rob Swain (@superdump), @JMS55, @IceSentry, @mockersf
If users have to care about supporting both batched uniform and storage buffers to store arrays of data for use in shaders, many may choose not to because their priority is not WebGL2. We want to make it simple and easy to support all users. -GpuArrayBuffer was designed and implemented as an abstraction over BatchedUniformBuffer and using a StorageBuffer to store an array of T. +`GpuArrayBuffer` was designed and implemented as an abstraction over `BatchedUniformBuffer` and using a `StorageBuffer` to store an array of `T`. + +```rust +#[derive(Clone, ShaderType)] +struct MyType { + x: f32, +} + +// Create a GPU array buffer +let mut buffer = GpuArrayBuffer::::new(&render_device.limits()); + +// Push some items into it +for i in 0..N { + // indices is a GpuArrayBufferIndex which contains a NonMaxU32 index into the array + // and an Option dynamic offset. If storage buffers are supported, it will be None, + // else Some with te dynamic offset that needs to be used when binding the bind group. indices + // should be stored somewhere for later lookup, often associated with an Entity. + let indices = buffer.push(MyType { x: i as f32 }); +} + +// Queue writing the buffer contents to VRAM +buffer.write_buffer(&render_device, &render_queue); + +// The bind group layout entry to use when creating the pipeline +let binding = 0; +let visibility = ShaderStages::VERTEX; +let bind_group_layout_entry = buffer.binding_layout( + binding, + visibility, + &render_device, +); + +// Get the binding resource to make a bind group entry to use when creating the bind group +let buffer_binding_resource = buffer.binding()?; + +// Get the batch size. This will be None if storage buffers are supported, else it is the +// maximum number of elements that could fit in a batch +let buffer_batch_size = GpuArrayBuffer::::batch_size(&render_device.limits()); + +// Set a shader def with the buffer batch size +if let Some(buffer_batch_size) = buffer_batch_size { + shader_defs.push(ShaderDefVal::UInt( + "BUFFER_BATCH_SIZE".into(), + buffer_batch_size, + )); +} +``` -DEMO RUST CODE. +```rust +#import bevy_render::instance_index get_instance_index -DEMO SHADER CODE. +struct MyType { + x: f32, +} -PERFORMANCE IMPROVEMENT. +// Declare the buffer binding +#ifdef BUFFER_BATCH_SIZE +@group(2) @binding(0) var data: array; +#else +@group(2) @binding(0) var data: array; +#endif + +// Access an instance +let my_type = data[get_instance_index(in.instance_index)]; +``` ### 2D / 3D Mesh Entities using GpuArrayBuffer
authors: Rob Swain (@superdump), @robtfm, @Elabajaba
-The 2D and 3D mesh entity rendering was migrated to use GpuArrayBuffer for the mesh uniform data. - -DEMO RUST CODE +The 2D and 3D mesh entity rendering was migrated to use `GpuArrayBuffer` for the mesh uniform data. -DEMO SHADER CODE - -PERFORMANCE IMPROVEMENT. +Just avoiding the rebinding of the mesh uniform data buffer gives about a 6% increase in frame rates. ### Improved bevymark Example @@ -127,9 +175,10 @@ PERFORMANCE IMPROVEMENT. The bevymark example needed to be improved to enable benchmarking the batching / instanced draw changes. Modes were added to: -* draw 2D quad meshes instead of sprites -* vary the per-instance color data instead of only varying the colour per wave of birds -* generate a number of material / sprite textures and randomly choose from them either per wave or per instance depending on the vary per instance setting +* draw 2D quad meshes instead of sprites: `--mode mesh2d` +* vary the per-instance color data instead of only varying the colour per wave of birds: `--vary-per-instance` +* generate a number of material / sprite textures and randomly choose from them either per wave or per instance depending on the vary per instance setting: `--material-texture-count 10` +* spawn the birds in random z order (new default), or in draw order: `--ordered-z` This allows benchmarking of different situations for batching / instancing in the next section. @@ -157,21 +206,29 @@ The design of the automatic batching/instanced draws in Bevy makes some assumpti If these assumptions do not work for your use case, then you can add the `NoAutomaticBatching` component to your entities to opt-out and do your own thing. Note that mesh uniform data will still be written to the GpuArrayBuffer and can be used in your own mesh bind groups. -#### What 0.12 Enables +#### Instanced Draw Performance + +We can batch draws into a single instanced draw in some situations now that per-instance mesh uniform data is in a `GpuArrayBuffer`. If the mesh entity is using the same mesh asset, and same material asset, then it can be batched! + +Using the same approach as 0.11 with one dynamic offset binding per mesh entity, and comparing to either storage buffers or batched uniform buffers: -We can batch draws into a single instanced draw in some situations now that per-instance mesh uniform data is in a GpuArrayBuffer. If the mesh entity is using the same mesh asset, and same material asset, then it can be batched! +2D meshes: `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z` which spawns 160 waves of 1000 2D quad meshes, producing 160 instanced draws of 1000 instances per draw enables up to a **160% increase in frame rate (2.6x)!** -DEMO PERFORMANCE IMPROVEMENT. +3D meshes: `many_cubes` which spawn 160,000 cubes, of which ~11,700 are visible in the view. These are drawn using a single instanced draw of all visible cubes which enables up to **100% increase in frame rate (2x)**! + +These performance benefits can be leveraged on all platforms, including WebGL2! #### What is next for batching/instancing and beyond? -* Put material data into GpuArrayBuffer per material type (e.g. all StandardMaterial instances will be stored in one GpuArrayBuffer) - this enables batching of draws for entities with the same mesh, same material type and textures, but different material data! This is implemented on a branch. +* Put material data into GpuArrayBuffer per material type (e.g. 
all StandardMaterial instances will be stored in one GpuArrayBuffer) - this enables batching of draws for entities with the same mesh, same material type and textures, but different material data! + * A prototype implementation of this shows enormous benefits because materials are currently always _one uniform buffer per material instance_ which means we can't just update the dynamic offset, rather the entire bind group has to be rebound! * Put material textures into bindless texture arrays - this enables batching of draws for entities with the same mesh and same material type! -* Put mesh data into one big buffer per mesh attribute layout - this removes the need to rebind the index/vertex buffers per-draw, instead only vertex/index range needs to be passed to the draw command. A simple version of this is implemented on a branch. -* Put skinned mesh data into storage buffers if possible to enable instanced drawing of skinned mesh entities using the same mesh, skin, and material! -* GPU-driven rendering +* Where bindless texture arrays are not supported (WebGL2, WebGPU, some native) we can leverage asset preprocessing to pack textures into texture atlas textures, and use array textures where each layer is a texture atlas. This is an alternative way of avoiding rebinding for changing textures. +* Put mesh data into one big buffer per mesh attribute layout - this removes the need to rebind the index/vertex buffers per-draw, instead only vertex/index range needs to be passed to the draw command. Prototypes showed this didn't give much/any improvement for CPU-drive rendering, but it does unlock GPU-driven nonetheless. +* Put skinned mesh data into storage buffers if possible to enable instanced drawing of skinned mesh entities using the same mesh, skin, and material! This was prototyped and enabled drawing about 25% more (1.25x) foxes! +* GPU-driven rendering for WebGPU and native * @JMS55 is working on GPU-driven rendering already, using a meshlet approach. - * Rob Swain (@superdump) intends to implement an alternative method similar to what is done in rend3. + * Rob Swain (@superdump) intends to implement an alternative method that does not require processing meshes into meshlets but that limits to drawing up to 256 instances per draw. ## Rendering Performance Improvements @@ -200,27 +257,7 @@ As can be seen, this was unfortunately leaving a lot of performance on the table We have decided to explore option 2 for Bevy 0.12 as persisting entities involves solving other problems that have no simple and satisfactory answers. -#### Data Structures - -Ideally we would only ever need to iterate over dense arrays of data (e.g. `Vec`). CPUs are very good at this as it enables predictable data access that increases the cache hit rate and makes for very fast processing. - -The options, for component data `T`: - -* `Vec` -* `SparseSet` - contains a 'sparse' `Vec` and a dense `Vec` that is indexed by `Entity.index` and the `usize` value is the index into the sparse `Vec`. -* `HashMap` - -`Vec` cannot be used with the current renderer architecture. Mesh entities are extracted in extraction query iteration order. They are then queued to a render phase and sorted. They are iterated in phase order to prepare and batch into instanced draws. The renderer can be rearchitected to enable use of `Vec` but that is more intrusive than there was time to finalize for 0.12. - -`SparseSet` is a good option and performs well. Lookups for batching involve indexing into two `Vec`s, which for lookups are fast. 
It has the benefit that the dense `Vec` can be iterated directly if `Entity` order is irrelevant. - -However, `SparseSet` has the downside that the dense `Vec` has to be as large as the largest contained `Entity.index`. If you spawn a million entities, then despawn 999,999, leaving the millionth entity still spawned, every `SparseSet` for different `T` will have to have a `Vec` that is one million items large. - -`HashMap` is similar to `SparseSet`, and more familiar. It has good space complexity, with performance depending a lot on the hash function. - -Fast hash functions from the wild were tested AHash, FNV, FxHasher, SeaHasher, but all had quite a big performance drop compared to `SparseSet`. Ultimately, a hash function designed by @SkiFire13, and inspired by `rustc-hash`, was chosen that has strong performance and is robust enough for this usage. This combination is called `EntityHashMap` and is the new way to store component data in the render world. - -The worst case performance with random Z-order spawning of 2D meshes/sprites in `bevymark` is similar to `SparseSet`. The best case performance with 2D meshes/sprites spawned in draw order +After consideration, we landed on using `HashMap` with a hash function designed by @SkiFire13, and inspired by `rustc-hash`. This configuration is called `EntityHashMap` and is the new way to store component data in the render world. #### EntityHashMap Helpers @@ -228,21 +265,34 @@ A helper plugin was added to make it simple and quick to extract main world data It is a good idea to group component data that will be accessed together into one target type to avoid having to do multiple lookups. -DEMO CODE +To extract two components from visible entities: -### Sprite Instancing +```rust +struct MyType { + a: ComponentA, + b: ComponentB, +} + +impl ExtractInstance for MyType { + type Query = (Read, Read); + type Filter = (); -Sprites were being rendered by generating a vertex buffer containing 4 vertices per sprite with position, UV, and possibly color data. This has proven to be very effective, but having to split batches of sprites into multiple draws because they use a different color is suboptimal. + fn extract((a, b): QueryItem<'_, Self::Query>) -> Option { + Some(MyType { + a: a.clone(), + b: b.clone(), + }) + } +} -Sprite rendering now uses an instance-rate vertex buffer to store the per-instance data. It contains an affine transformation matrix that enables translation, scaling, and rotation in one transform. It contains per-instance color, and UV offset and scale. +app.add_plugins(ExtractInstancesPlugin::::extract_visible()); +``` -A quad is drawn by leveraging a rendering industry trick that leverages a special index buffer containing 6 indices. The indices encode the vertex position in their bits - the least significant bit is x, the next least significant bit is y. The vertices of the quad are then: +### Sprite Instancing -```text -10 11 +Sprites were being rendered by generating a vertex buffer containing 4 vertices per sprite with position, UV, and possibly color data. This has proven to be very effective. However, having to split batches of sprites into multiple draws because they use a different color is suboptimal. -00 01 -``` +Sprite rendering now uses an instance-rate vertex buffer to store the per-instance data. Instance-rate vertex buffers are stepped when the instance index changes, rather than when the vertex index changes. 
The new buffer contains an affine transformation matrix that enables translation, scaling, and rotation in one transform. It contains per-instance color, and UV offset and scale. This retains all the functionality of the previous method, enables the additional flexibility of any sprite being able to have a color tint and all still be drawn in the same batch, and uses a total of 80 bytes per sprite, versus 144 bytes previously. The practical result is a performance improvement of up to 40% versus the previous method! @@ -258,7 +308,8 @@ UI ### What's next for rendering performance? -* `EntityHashMap` is good, but imagine a world with only `Vec`, no lookups in hot loops, only in-order iteration, and maximum performance! +* Rearchitecting the renderer data flow to enable use of `Vec` + * Ideally we would only ever need to iterate in-order over dense arrays of data and never do any random-access lookups. CPUs are very good at this as it enables predictable data access that increases the cache hit rate and makes for very fast processing. Ideas have come up for possible ways to rearchitect the renderer a little to enable dense arrays and no unnecessary lookups! * Batching code already compares previous draw state (pipeline, bind groups, index/vertex buffers, etc) to current draw state. This is then repeated by `TrackedRenderPass` when encoding draws. This cost can be removed with a new API called `DrawStream`. ## What's Next? From 27a9483be1beae5d2d20e196a1a3fcee2241dcc9 Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Mon, 16 Oct 2023 21:55:37 +0200 Subject: [PATCH 05/14] More --- content/news/2023-10-21-bevy-0.12/index.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index d319d02585..81ccc09b63 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -108,10 +108,11 @@ let mut buffer = GpuArrayBuffer::::new(&render_device.limits()); // Push some items into it for i in 0..N { - // indices is a GpuArrayBufferIndex which contains a NonMaxU32 index into the array - // and an Option dynamic offset. If storage buffers are supported, it will be None, - // else Some with te dynamic offset that needs to be used when binding the bind group. indices - // should be stored somewhere for later lookup, often associated with an Entity. + // indices is a GpuArrayBufferIndex which contains a NonMaxU32 + // index into the array and an Option dynamic offset. If storage + // buffers are supported, it will be None, else Some with the dynamic + // offset that needs to be used when binding the bind group. indices should + // be stored somewhere for later lookup, often associated with an Entity. let indices = buffer.push(MyType { x: i as f32 }); } @@ -127,11 +128,12 @@ let bind_group_layout_entry = buffer.binding_layout( &render_device, ); -// Get the binding resource to make a bind group entry to use when creating the bind group +// Get the binding resource to make a bind group entry to use when creating the +// bind group let buffer_binding_resource = buffer.binding()?; -// Get the batch size. This will be None if storage buffers are supported, else it is the -// maximum number of elements that could fit in a batch +// Get the batch size. 
This will be None if storage buffers are supported, else +// it is the maximum number of elements that could fit in a batch let buffer_batch_size = GpuArrayBuffer::::batch_size(&render_device.limits()); // Set a shader def with the buffer batch size From 5126d430c6781cb93d50271be2987da0e0d72145 Mon Sep 17 00:00:00 2001 From: Robert Swain Date: Thu, 19 Oct 2023 21:00:44 +0200 Subject: [PATCH 06/14] Diagrams and benchmark graphs --- .../2023-10-21-bevy-0.12/0.12-2DMeshes.svg | 3 + .../2023-10-21-bevy-0.12/0.12-3DMeshes.svg | 3 + .../BatchedUniformBuffer.svg | 3 + .../DynamicUniformBuffer.svg | 3 + .../2023-10-21-bevy-0.12/RenderSets-0.11.svg | 3 + .../2023-10-21-bevy-0.12/RenderSets-0.12.svg | 3 + .../2023-10-21-bevy-0.12/StorageBuffer.svg | 3 + content/news/2023-10-21-bevy-0.12/index.md | 60 ++++++++++++------- 8 files changed, 60 insertions(+), 21 deletions(-) create mode 100644 content/news/2023-10-21-bevy-0.12/0.12-2DMeshes.svg create mode 100644 content/news/2023-10-21-bevy-0.12/0.12-3DMeshes.svg create mode 100644 content/news/2023-10-21-bevy-0.12/BatchedUniformBuffer.svg create mode 100644 content/news/2023-10-21-bevy-0.12/DynamicUniformBuffer.svg create mode 100644 content/news/2023-10-21-bevy-0.12/RenderSets-0.11.svg create mode 100644 content/news/2023-10-21-bevy-0.12/RenderSets-0.12.svg create mode 100644 content/news/2023-10-21-bevy-0.12/StorageBuffer.svg diff --git a/content/news/2023-10-21-bevy-0.12/0.12-2DMeshes.svg b/content/news/2023-10-21-bevy-0.12/0.12-2DMeshes.svg new file mode 100644 index 0000000000..336d532ee7 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/0.12-2DMeshes.svg @@ -0,0 +1,3 @@ + + +
0.11
0.11
0.12
0.12
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/0.12-3DMeshes.svg b/content/news/2023-10-21-bevy-0.12/0.12-3DMeshes.svg new file mode 100644 index 0000000000..720f2d9d71 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/0.12-3DMeshes.svg @@ -0,0 +1,3 @@ + + +
0.11
0.11
0.12
0.12
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/BatchedUniformBuffer.svg b/content/news/2023-10-21-bevy-0.12/BatchedUniformBuffer.svg new file mode 100644 index 0000000000..e0f2bc298c --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/BatchedUniformBuffer.svg @@ -0,0 +1,3 @@ + + +
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/DynamicUniformBuffer.svg b/content/news/2023-10-21-bevy-0.12/DynamicUniformBuffer.svg new file mode 100644 index 0000000000..8cfafc64f3 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/DynamicUniformBuffer.svg @@ -0,0 +1,3 @@ + + +
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/RenderSets-0.11.svg b/content/news/2023-10-21-bevy-0.12/RenderSets-0.11.svg new file mode 100644 index 0000000000..7eef935951 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/RenderSets-0.11.svg @@ -0,0 +1,3 @@ + + +
Extract
Extract
Queue
Queue
Sort
/ Batch
Sort...
Prepare
Prepare
Render
Render
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/RenderSets-0.12.svg b/content/news/2023-10-21-bevy-0.12/RenderSets-0.12.svg new file mode 100644 index 0000000000..3609caa863 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/RenderSets-0.12.svg @@ -0,0 +1,3 @@ + + +
Extract
Extract
Queue
Queue
Sort
Sort
Prepare / Batch
Prepare / Batch
Prepare Resources
Prepare Resou...
Prepare
Bind Groups
Prepare...
Render
Render
Prepare
Assets
Prepare...
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/StorageBuffer.svg b/content/news/2023-10-21-bevy-0.12/StorageBuffer.svg new file mode 100644 index 0000000000..4c2a1024d0 --- /dev/null +++ b/content/news/2023-10-21-bevy-0.12/StorageBuffer.svg @@ -0,0 +1,3 @@ + + +
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
12
12
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 81ccc09b63..ea657b7244 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -52,42 +52,47 @@ The order of draws needs to be known for some methods of instanced draws so that The render set order before 0.12 caused some problems with this as data had to be prepared before knowing the draw order. The previous order of sets was: -* Extract -* Prepare -* Queue -* Sort/Batch -* Render +![RenderSets-0.11](RenderSets-0.11.svg) -This constraint was most visible in the sprite batching implementation that skipped Prepare, sorted and prepared data in Queue, and then after being sorted again alongside 2D meshes and other entities in the Transparent2d render phase, possibly had its batches split to enable drawing of those other entities. +This constraint was most visible in the sprite batching implementation that skipped `Prepare`, sorted and prepared data in `Queue`, and then after being sorted again alongside 2D meshes and other entities in the `Transparent2d` render phase, possibly had its batches split to enable drawing of those other entities. -The ordering of the sets also created some confusion about when bind groups should be created. Bind groups were meant to be created in Prepare, but sometimes they had to be created in Queue to ensure that some preparation had completed. +The ordering of the sets also created some confusion about when bind groups should be created. Bind groups were meant to be created in `Prepare`, but sometimes they had to be created in `Queue` to ensure that some preparation had completed. The new render set order in 0.12 is: -* Extract -* PrepareAssets -* Queue -* Sort -* Prepare/Batch - * PrepareResources - * PrepareBindGroups -* Render +![RenderSets-0.12](RenderSets-0.12.svg) -PrepareAssets was introduced because we only want to queue entities for drawing if their assets have been prepared. Per-frame data preparation still happens in the Prepare set, specifically in its PrepareResources subset. That is now after Queue and Sort, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. Bind groups now have a clear subset where they should be created - PrepareBindGroups. +`PrepareAssets` was introduced because we only want to queue entities for drawing if their assets have been prepared. Per-frame data preparation still happens in the `Prepare` set, specifically in its `PrepareResources` subset. That is now after `Queue` and `Sort`, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. Bind groups now have a clear subset where they should be created - `PrepareBindGroups`. ### BatchedUniformBuffer and GpuArrayBuffer OK, so we need to put many pieces of data of the same type into buffers in a way that we can bind them as few times as possible and draw multiple instances from them. How can we do that? -Instance-rate vertex buffers are one way, but they are very constrained to having a specific order. They are/may be suitable for per-instance data like mesh entity transforms, but they can't be used for material data. 
+In 0.11 per-instance `MeshUniform` data is stored in a uniform buffer with each instance's data aligned to a dynamic offset. When drawing each mesh entity, we update the dynamic offset, which is close to rebinding. It looks like this: -The other main options are uniform and storage buffers. WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. Data textures are also an option, but are far more awkward for structured data, and without support for linear data layouts on some platforms, they will perform worse. We want to support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. +![DynamicUniformBuffer](DynamicUniformBuffer.svg) +
Red arrows are 'rebinds' to update the dynamic offset, blue boxes are instance data, orange boxes are padding for dynamic offset alignment, which is a requirement of GPUs and graphics APIs.
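To make that per-draw cost concrete, here is a rough sketch, not Bevy's actual draw code, of what encoding against this layout looks like. The names (`DrawInstance`, `encode_draws_per_entity`, the bind group index) are illustrative assumptions:

```rust
use wgpu::{BindGroup, RenderPass};

// Hypothetical per-entity data for this sketch: the dynamic offset of this
// entity's MeshUniform in the uniform buffer, and how many indices its mesh has.
struct DrawInstance {
    mesh_uniform_offset: u32,
    index_count: u32,
}

// 0.11-style encoding: one dynamic offset update (effectively a rebind) and one
// non-instanced draw per mesh entity, even when the mesh and material repeat.
// Assumes the pipeline, other bind groups, and vertex/index buffers are already set.
fn encode_draws_per_entity<'a>(
    pass: &mut RenderPass<'a>,
    mesh_bind_group: &'a BindGroup,
    instances: &[DrawInstance],
) {
    for instance in instances {
        // Rebind the mesh uniform data at this entity's dynamic offset...
        pass.set_bind_group(2, mesh_bind_group, &[instance.mesh_uniform_offset]);
        // ...and draw a single instance.
        pass.draw_indexed(0..instance.index_count, 0, 0..1);
    }
}
```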
+ +Instance-rate vertex buffers are one way, but they are very constrained to having a specific order. They are/may be suitable for per-instance data like mesh entity transforms, but they can't be used for material data. The other main options are uniform buffers, storage buffers, and data textures. + +WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. + +Data textures are far more awkward for structured data, and without support for linear data layouts on some platforms, they will perform worse. + +We want to support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. #### BatchedUniformBuffer
authors: Rob Swain (@superdump), @JMS55, @teoxoy, @robtfm, @konsolas
-We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, MeshUniform requires 144 bytes per instance, which means 113 instances per 16kB binding. If we want to draw more than 113 entities, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. +We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, `MeshUniform` requires 144 bytes per instance, which means we can have a batch of 113 instances per 16kB binding. If we want to draw more than 113 entities in total, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. + +`BatchedUniformBuffer` looks like this: + +![BatchedUniformBuffer](BatchedUniformBuffer.svg) +
Red arrows are 'rebinds' to update the dynamic offset, blue boxes are instance data, orange boxes are padding for dynamic offset alignment.
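To make the numbers above concrete, here is a small, self-contained sketch of the batch-size arithmetic. It is not the actual BatchedUniformBuffer implementation; the constants are the example values quoted in this section:

```rust
// Sketch of the batching arithmetic, not the actual BatchedUniformBuffer code.
const MAX_UNIFORM_BINDING_SIZE: u32 = 16 * 1024; // 16kB minimum guarantee on WebGL2
const MESH_UNIFORM_SIZE: u32 = 144; // bytes per instance in this example

// How many instances fit in one binding: 16384 / 144 = 113.
fn instances_per_batch() -> u32 {
    MAX_UNIFORM_BINDING_SIZE / MESH_UNIFORM_SIZE
}

// The dynamic offset at which batch `batch_index` starts. Dynamic offsets must
// be aligned to the device's min_uniform_buffer_offset_alignment (often 256),
// which is where the orange alignment padding in the diagram comes from.
fn batch_dynamic_offset(batch_index: u32, offset_alignment: u32) -> u32 {
    let batch_bytes = instances_per_batch() * MESH_UNIFORM_SIZE; // 113 * 144 = 16272
    let aligned_batch_bytes =
        ((batch_bytes + offset_alignment - 1) / offset_alignment) * offset_alignment;
    batch_index * aligned_batch_bytes
}
```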
+ +Notice how the instance data can be packed much more tightly, fitting the same amount of used data in less space. Also, we only need to update the dynamic offset of the binding for each batch. #### GpuArrayBuffer @@ -97,6 +102,13 @@ If users have to care about supporting both batched uniform and storage buffers `GpuArrayBuffer` was designed and implemented as an abstraction over `BatchedUniformBuffer` and using a `StorageBuffer` to store an array of `T`. +The data in a `StorageBuffer` looks like this: + +![StorageBuffer](StorageBuffer.svg) +
Red arrows are 'rebinds', blue boxes are instance data.
+ +All the instance data can be placed directly one after the other, and we only have to bind once. There is no need for any dynamic offset binding, so there is no need for any padding for alignment. + ```rust #[derive(Clone, ShaderType)] struct MyType { @@ -206,7 +218,7 @@ The design of the automatic batching/instanced draws in Bevy makes some assumpti * View bindings are constant across a render phase for a given draw function, as phases are per-view * `batch_and_prepare_render_phase` is the only system that performs batching and has sole responsibility for preparing per-instance (i.e. mesh uniform) data -If these assumptions do not work for your use case, then you can add the `NoAutomaticBatching` component to your entities to opt-out and do your own thing. Note that mesh uniform data will still be written to the GpuArrayBuffer and can be used in your own mesh bind groups. +If these assumptions do not work for your use case, then you can add the `NoAutomaticBatching` component to your entities to opt-out and do your own thing. Note that mesh uniform data will still be written to the `GpuArrayBuffer` and can be used in your own mesh bind groups. #### Instanced Draw Performance @@ -214,10 +226,16 @@ We can batch draws into a single instanced draw in some situations now that per- Using the same approach as 0.11 with one dynamic offset binding per mesh entity, and comparing to either storage buffers or batched uniform buffers: -2D meshes: `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z` which spawns 160 waves of 1000 2D quad meshes, producing 160 instanced draws of 1000 instances per draw enables up to a **160% increase in frame rate (2.6x)!** +2D meshes: `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z` which spawns 160 waves of 1000 2D quad meshes, producing 160 instanced draws of 1000 instances per draw. + +![0.12-2DMeshes](0.12-2DMeshes.svg) +
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
3D meshes: `many_cubes` which spawns 160,000 cubes, of which ~11,700 are visible in the view. These are drawn using a single instanced draw of all visible cubes, which enables up to a **100% increase in frame rate (2x)**!

+![0.12-3DMeshes](0.12-3DMeshes.svg)
+
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
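For contrast with the per-entity loop sketched earlier, the batched path binds once per batch and issues one instanced draw. Again, this is a hypothetical sketch in raw `wgpu` terms, not Bevy's actual code, but it is where the frame rate gains above come from.

```rust
// Hypothetical sketch of the batched path: one (re)bind per batch, then a single
// draw covering every instance in the batch.
fn draw_batch<'a>(
    pass: &mut wgpu::RenderPass<'a>,
    mesh_bind_group: &'a wgpu::BindGroup,
    index_count: u32,
    batch_dynamic_offset: u32, // one offset per batch when using BatchedUniformBuffer
    instance_count: u32,
) {
    pass.set_bind_group(2, mesh_bind_group, &[batch_dynamic_offset]);
    // The shader indexes into the bound array using the instance index.
    pass.draw_indexed(0..index_count, 0, 0..instance_count);
}
```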
+ These performance benefits can be leveraged on all platforms, including WebGL2! #### What is next for batching/instancing and beyond? From cf4dcb3aa367b18aa4af31c5943e611750ce515b Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Tue, 31 Oct 2023 13:41:15 -0700 Subject: [PATCH 07/14] Update content/news/2023-10-21-bevy-0.12/index.md Co-authored-by: Alice Cecile --- content/news/2023-10-21-bevy-0.12/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index ea657b7244..4fa7ab0e96 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -23,7 +23,7 @@ Since our last release a few months ago we've added a _ton_ of new features, bug
authors: Rob Swain (@superdump), @james-j-obrien, @JMS55, @inodentry, @robtfm, @nicopap, @teoxoy, @IceSentry, @Elabajaba
-Bevy's renderer performance for 2D and 3D meshes can improve a lot. Both CPU and graphics API / GPU bottlenecks can be removed to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2 and mobile devices, to the highest-end native discrete graphics cards. A solid foundation can support all of this. +Bevy's renderer performance for 2D and 3D meshes can improve a lot. There are bottlenecks on both the CPU and GPU side, which can be lessened to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2 and mobile devices, to the highest-end native discrete graphics cards. A solid foundation can support all of this. ### What are the bottlenecks? From cbb0da894fc01a6a85a655d74b6eb2653abe065c Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Tue, 31 Oct 2023 13:42:30 -0700 Subject: [PATCH 08/14] Apply suggestions from code review Co-authored-by: Alice Cecile --- content/news/2023-10-21-bevy-0.12/index.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 4fa7ab0e96..286ed787cc 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -40,9 +40,9 @@ Avoiding rebinding is both a big performance benefit for CPU-driven rendering, i ### What are CPU- and GPU-driven rendering? -CPU-driven rendering is where draw commands are created on the CPU, in Bevy this means in Rust code, more specifically in render graph nodes. +CPU-driven rendering is where draw commands are created on the CPU. In Bevy this means in Rust code, more specifically in render graph nodes. -In GPU-driven rendering, the draw commands are encoded on the GPU by compute shaders. This leverages GPU parallelism, and unlocks more advanced culling optimisations that are infeasible to do on the CPU, among many other methods that bring large performance benefits. +In GPU-driven rendering, the draw commands are encoded on the GPU by [compute shaders](https://www.khronos.org/opengl/wiki/Compute_Shader). This leverages GPU parallelism, and unlocks more advanced culling optimizations that are infeasible to do on the CPU, among many other methods that bring large performance benefits. ### Reorder Render Sets @@ -79,7 +79,7 @@ WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers h Data textures are far more awkward for structured data, and without support for linear data layouts on some platforms, they will perform worse. -We want to support uniform buffers on WebGL2 or where storage buffers are not supported, and use storage buffers everywhere else. +We want to support uniform buffers where storage buffers are not supported (like WebGL2) and use storage buffers everywhere else. 
#### BatchedUniformBuffer From a54a5008c79f466cd4eae26180a1a1c8b1ec6bdd Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Tue, 31 Oct 2023 14:27:21 -0700 Subject: [PATCH 09/14] Condense into "whats next" --- content/news/2023-10-21-bevy-0.12/index.md | 23 ++++------------------ 1 file changed, 4 insertions(+), 19 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 286ed787cc..70105d6074 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -238,18 +238,6 @@ Using the same approach as 0.11 with one dynamic offset binding per mesh entity, These performance benefits can be leveraged on all platforms, including WebGL2! -#### What is next for batching/instancing and beyond? - -* Put material data into GpuArrayBuffer per material type (e.g. all StandardMaterial instances will be stored in one GpuArrayBuffer) - this enables batching of draws for entities with the same mesh, same material type and textures, but different material data! - * A prototype implementation of this shows enormous benefits because materials are currently always _one uniform buffer per material instance_ which means we can't just update the dynamic offset, rather the entire bind group has to be rebound! -* Put material textures into bindless texture arrays - this enables batching of draws for entities with the same mesh and same material type! -* Where bindless texture arrays are not supported (WebGL2, WebGPU, some native) we can leverage asset preprocessing to pack textures into texture atlas textures, and use array textures where each layer is a texture atlas. This is an alternative way of avoiding rebinding for changing textures. -* Put mesh data into one big buffer per mesh attribute layout - this removes the need to rebind the index/vertex buffers per-draw, instead only vertex/index range needs to be passed to the draw command. Prototypes showed this didn't give much/any improvement for CPU-drive rendering, but it does unlock GPU-driven nonetheless. -* Put skinned mesh data into storage buffers if possible to enable instanced drawing of skinned mesh entities using the same mesh, skin, and material! This was prototyped and enabled drawing about 25% more (1.25x) foxes! -* GPU-driven rendering for WebGPU and native - * @JMS55 is working on GPU-driven rendering already, using a meshlet approach. - * Rob Swain (@superdump) intends to implement an alternative method that does not require processing meshes into meshlets but that limits to drawing up to 256 instances per draw. - ## Rendering Performance Improvements ### EntityHashMap @@ -326,18 +314,15 @@ Sprites UI -### What's next for rendering performance? - -* Rearchitecting the renderer data flow to enable use of `Vec` - * Ideally we would only ever need to iterate in-order over dense arrays of data and never do any random-access lookups. CPUs are very good at this as it enables predictable data access that increases the cache hit rate and makes for very fast processing. Ideas have come up for possible ways to rearchitect the renderer a little to enable dense arrays and no unnecessary lookups! -* Batching code already compares previous draw state (pipeline, bind groups, index/vertex buffers, etc) to current draw state. This is then repeated by `TrackedRenderPass` when encoding draws. This cost can be removed with a new API called `DrawStream`. - ## What's Next? 
-We have plenty of work that is pretty much finished and is therefore very likely to land in **Bevy 0.13**: +We have plenty of work in progress! Some of this will likely land in **Bevy 0.13**. Check out the [**Bevy 0.13 Milestone**](https://github.com/bevyengine/bevy/milestone/17) for an up-to-date list of current work being considered for **Bevy 0.13**. +* **More Batching/Instancing Improvements**: Put skinned mesh data into storage buffers to enable instanced drawing of skinned mesh entities with the same mesh/skin/material. Put material data in the new GpuArrayBuffer to enable batching of draws of entities with the same mesh, material type, and textures, but different material data. +* **GPU driven rendering**: We plan on driving rendering via the GPU by creating draw calls in compute shaders (on platforms that support it). We have [experiments using meshlets](https://github.com/bevyengine/bevy/pull/10164) and plan to explore other approaches as well. This will involve putting textures into bindless texture arrays and putting meshes in one big buffer to avoid rebinds. + ## Support Bevy Sponsorships help make our work on Bevy sustainable. If you believe in Bevy's mission, consider [sponsoring us](/community/donate) ... every bit helps! From 129b1f00000b6f1afeac7e989a40b094ca4a706e Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Tue, 31 Oct 2023 19:12:28 -0700 Subject: [PATCH 10/14] Slim down, rephrase for clarity, reorder --- content/news/2023-10-21-bevy-0.12/index.md | 275 ++++++++------------- 1 file changed, 105 insertions(+), 170 deletions(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 70105d6074..15a14d96fe 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -19,56 +19,100 @@ Since our last release a few months ago we've added a _ton_ of new features, bug
authors: @author
-## Automatic Batching and Instancing, and the Road to GPU-driven Rendering +## Automatic Batching/Instancing of Draw Commands -
authors: Rob Swain (@superdump), @james-j-obrien, @JMS55, @inodentry, @robtfm, @nicopap, @teoxoy, @IceSentry, @Elabajaba
+
authors: Rob Swain (@superdump)
-Bevy's renderer performance for 2D and 3D meshes can improve a lot. There are bottlenecks on both the CPU and GPU side, which can be lessened to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2 and mobile devices, to the highest-end native discrete graphics cards. A solid foundation can support all of this. +**Bevy 0.12** now automatically batches/instances draw commands where possible. This cuts down the number of draw calls, which yields significant performance wins! + +This required a number of architectural changes, including how we store and access per-entity mesh data (more on this later). + +Here are some benches of the old unbatched approach (0.11) to the new batched approach (0.12): + +### 2D Mesh Bevymark (frames per second, more is better) + +This renders 160,000 entities with textured quad meshes (160 groups of 1,000 entities each, each group sharing a material). This means we can batch each group, resulting in only 160 instanced draw calls when batching is enabled. + +![0.12-2DMeshes](0.12-2DMeshes.svg) +
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
+
+### 3D Mesh "Many Cubes" (frames per second, more is better)
+
+This renders 160,000 cubes, of which ~11,700 are visible in the view. These are drawn using a single instanced draw of all visible cubes, which enables up to a **100% increase in frame rate (2x)**!
+
+![0.12-3DMeshes](0.12-3DMeshes.svg)
+
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
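On the app side, nothing new is required to benefit from this: entities just need to share the same mesh and material handles. The following is a rough sketch (not the actual `many_cubes` example source) of spawning a grid of cubes that the automatic batching can collapse into a single instanced draw.

```rust
use bevy::prelude::*;

// Illustrative sketch: entities that share the same mesh and material handles
// (and therefore the same pipeline and bind groups) can be drawn as one
// instanced draw; only their per-instance transforms differ.
fn spawn_batchable_cubes(
    mut commands: Commands,
    mut meshes: ResMut<Assets<Mesh>>,
    mut materials: ResMut<Assets<StandardMaterial>>,
) {
    // One mesh asset and one material asset, added once...
    let mesh = meshes.add(Mesh::from(shape::Cube { size: 0.25 }));
    let material = materials.add(StandardMaterial::from(Color::WHITE));

    for x in 0..100 {
        for z in 0..100 {
            // ...and cloned handles everywhere else.
            commands.spawn(PbrBundle {
                mesh: mesh.clone(),
                material: material.clone(),
                transform: Transform::from_xyz(x as f32, 0.0, z as f32),
                ..default()
            });
        }
    }
}
```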
+ +These performance benefits can be leveraged on all platforms, including WebGL2! + +### What can be batched? + +Batching/Instancing can only happen for GPU data that doesn't require "rebinding". This means if something like a pipeline (shaders), bind group (shader-accessible bound data), vertex / index buffer (mesh) is different, it cannot be batched. -### What are the bottlenecks? +From a high level, currently entities with the same material and mesh can be batched. -One major bottleneck is the structure of the data used for rendering. +We are investigating ways to make more data accessible without rebinds, such as bindless textures, combining meshes into larger buffers, etc. -* Mesh entity data is stored in one uniform buffer, but has to be rebound at different dynamic offsets for every single draw. -* Material type data (e.g. `StandardMaterial` uniform properties, but not textures) are stored in individual uniform buffers that have to be rebound per draw if the material changes. -* Material textures are stored individually and have to be rebound per draw if a mesh texture changes. -* Mesh index / vertex buffers are stored individually per-mesh and have to be rebound per draw. +### Opting Out + +If you would like to opt out an entity from automatic batching, you can add the new [`NoAutomaticBatching`] component to it. + +This is generally for cases where you are doing custom, non-standard renderer features that don't play nicely with batching's assumptions. For example, it assumes view bindings are constant across draws and that Bevy's-built-in entity batching logic is used. + +[`NoAutomaticBatching`]: https://dev-docs.bevyengine.org/bevy/render/batching/struct.NoAutomaticBatching.html + +## The Road to GPU-driven Rendering + +
authors: Rob Swain (@superdump), @james-j-obrien, @JMS55, @inodentry, @robtfm, @nicopap, @teoxoy, @IceSentry, @Elabajaba
-All of this rebinding has both CPU and graphics API / GPU performance impact. On the CPU, it means encoding of draw commands has many more steps to process and so takes more time than necessary. In the graphics API and on the GPU, it means many more rebinds, and separate draw commands. +Bevy's renderer performance for 2D and 3D meshes can improve a lot. There are bottlenecks on both the CPU and GPU side, which can be lessened to give significantly higher frame rates. As always with Bevy, we want to make the most of the platforms you use, from the constraints of WebGL2 and mobile devices, to the highest-end native discrete graphics cards. A solid foundation can support all of this. -Avoiding rebinding is both a big performance benefit for CPU-driven rendering, including WebGL2, and is necessary to enable GPU-driven rendering. +In **Bevy 0.12** we have started reworking rendering data structures, data flow, and draw patterns to unlock new optimizations. This enabled the Automatic Batching/Instancing we landed in **Bevy 0.12** and also helps pave the way for other significant wins in the future, such as GPU-driven rendering. ### What are CPU- and GPU-driven rendering? -CPU-driven rendering is where draw commands are created on the CPU. In Bevy this means in Rust code, more specifically in render graph nodes. +CPU-driven rendering is where draw commands are created on the CPU. In Bevy this means "in Rust code", more specifically in render graph nodes. This is how Bevy currently kicks off draws. In GPU-driven rendering, the draw commands are encoded on the GPU by [compute shaders](https://www.khronos.org/opengl/wiki/Compute_Shader). This leverages GPU parallelism, and unlocks more advanced culling optimizations that are infeasible to do on the CPU, among many other methods that bring large performance benefits. +### What needs to change? + +Historically Bevy's general GPU data pattern has been to bind each piece of data per-entity and issue a draw call per-entity. We did store data in arrays and accessed with dynamic offsets in some cases, but this still resulted in rebinding at each offset. + +All of this rebinding has performance implications, both on the CPU and the GPU. On the CPU, it means encoding draw commands has many more steps to process and so takes more time than necessary. In the graphics API and on the GPU, it means many more rebinds and separate draw commands. + +Avoiding rebinding is both a big performance benefit for CPU-driven rendering and is necessary to enable GPU-driven rendering. + +To avoid rebinds, the general data pattern we are aiming for is: + +* For each data type (meshes, materials, transforms, textures), create a single array (or a small number of arrays) containing all of the items of that data type +* Bind these arrays a small number of times (ideally once), avoiding per-entity/per-draw rebinds + +In **Bevy 0.12** we've started this process in earnest! We've made a number of architectural changes that are already yielding fruit. Thanks to these changes, we can now [automatically batch and instance draws](#automatic-batching-instancing-of-draw-commands) for entities with the exact same mesh and material. And as we progress further down this path, we can batch/instance across a wider variety of cases, cutting out more and more CPU work until eventually we are "fully GPU-driven". + ### Reorder Render Sets
authors: Rob Swain (@superdump), @james-j-obrien, @inodentry
The order of draws needs to be known for some methods of instanced draws so that the data can be laid out, and looked up in order. For example, when per-instance data is stored in an instance-rate vertex buffer. -The render set order before 0.12 caused some problems with this as data had to be prepared before knowing the draw order. The previous order of sets was: +The render set order before **Bevy 0.12** caused some problems with this as data had to be prepared (written to the GPU) before knowing the draw order. Not ideal when our plan is to have an ordered list of entity data on the GPU! The previous order of sets was: ![RenderSets-0.11](RenderSets-0.11.svg) -This constraint was most visible in the sprite batching implementation that skipped `Prepare`, sorted and prepared data in `Queue`, and then after being sorted again alongside 2D meshes and other entities in the `Transparent2d` render phase, possibly had its batches split to enable drawing of those other entities. - -The ordering of the sets also created some confusion about when bind groups should be created. Bind groups were meant to be created in `Prepare`, but sometimes they had to be created in `Queue` to ensure that some preparation had completed. +This caused friction (and suboptimal instancing) in a number of current (and planned) renderer features. Most notably in previous versions of Bevy, it caused these problems for sprite batching. The new render set order in 0.12 is: ![RenderSets-0.12](RenderSets-0.12.svg) -`PrepareAssets` was introduced because we only want to queue entities for drawing if their assets have been prepared. Per-frame data preparation still happens in the `Prepare` set, specifically in its `PrepareResources` subset. That is now after `Queue` and `Sort`, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. Bind groups now have a clear subset where they should be created - `PrepareBindGroups`. +`PrepareAssets` was introduced because we only want to queue entities for drawing if their assets have been prepared. Per-frame data preparation still happens in the `Prepare` set, specifically in its `PrepareResources` subset. That is now after `Queue` and `Sort`, so the order of draws is known. This also made a lot more sense for batching, as it is now known at the point of batching whether an entity that is of another type in the render phase needs to be drawn. Bind groups now have a clear subset where they should be created ... `PrepareBindGroups`. ### BatchedUniformBuffer and GpuArrayBuffer OK, so we need to put many pieces of data of the same type into buffers in a way that we can bind them as few times as possible and draw multiple instances from them. How can we do that? -In 0.11 per-instance `MeshUniform` data is stored in a uniform buffer with each instance's data aligned to a dynamic offset. When drawing each mesh entity, we update the dynamic offset, which is close to rebinding. It looks like this: +In previous versions of Bevy, per-instance `MeshUniform` data is stored in a uniform buffer with each instance's data aligned to a dynamic offset. When drawing each mesh entity, we update the dynamic offset, which is close to rebinding. It looks like this: ![DynamicUniformBuffer](DynamicUniformBuffer.svg)
Red arrows are 'rebinds' to update the dynamic offset, blue boxes are instance data, orange boxes are padding for dynamic offset alignment, which is a requirement of GPUs and graphics APIs.
@@ -77,15 +121,15 @@ Instance-rate vertex buffers are one way, but they are very constrained to havin WebGL2 does not support storage buffers, only uniform buffers. Uniform buffers have a minimum guaranteed size per binding of 16kB on WebGL2. Storage buffers, where available, have a minimum guaranteed size of 128MB. -Data textures are far more awkward for structured data, and without support for linear data layouts on some platforms, they will perform worse. +Data textures are far more awkward for structured data. And on platforms that don't support linear data layouts, they will perform worse. -We want to support uniform buffers where storage buffers are not supported (like WebGL2) and use storage buffers everywhere else. +Given these constraints, we want to use storage buffers on platforms where they are supported, and we want to use uniform buffers on platforms where they are not supported (ex: WebGL 2). #### BatchedUniformBuffer
authors: Rob Swain (@superdump), @JMS55, @teoxoy, @robtfm, @konsolas
-We have to assume that on WebGL2, we may only be able to access 16kB of data at a time. Taking an example, `MeshUniform` requires 144 bytes per instance, which means we can have a batch of 113 instances per 16kB binding. If we want to draw more than 113 entities in total, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. +For uniform buffers, we have to assume that on WebGL2 we may only be able to access 16kB of data at a time. Taking an example, `MeshUniform` requires 144 bytes per instance, which means we can have a batch of 113 instances per 16kB binding. If we want to draw more than 113 entities in total, we need a way of managing a uniform buffer of data that can be bound at a dynamic offset per batch of instances. This is what `BatchedUniformBuffer` is designed to solve. `BatchedUniformBuffer` looks like this: @@ -98,178 +142,60 @@ Notice how the instance data can be packed much more tightly, fitting the same a
authors: Rob Swain (@superdump), @JMS55, @IceSentry, @mockersf
-If users have to care about supporting both batched uniform and storage buffers to store arrays of data for use in shaders, many may choose not to because their priority is not WebGL2. We want to make it simple and easy to support all users.
+Given that we need to support both uniform and storage buffers for a given data type, this increases the level of complexity required to implement new low-level renderer features (both in Rust code and in shaders). When confronted with this complexity, some developers might instead choose to only use storage buffers (effectively dropping support for WebGL 2).

-`GpuArrayBuffer` was designed and implemented as an abstraction over `BatchedUniformBuffer` and using a `StorageBuffer` to store an array of `T`.
+To make it as easy as possible to support both storage types, we developed [`GpuArrayBuffer`]. This is a generic collection of `T` values that abstracts over `BatchedUniformBuffer` and [`StorageBuffer`]. It will use the right storage for the current platform / GPU.

-The data in a `StorageBuffer` looks like this:
+The data in a [`StorageBuffer`] looks like this:

 ![StorageBuffer](StorageBuffer.svg)
Red arrows are 'rebinds', blue boxes are instance data.
All the instance data can be placed directly one after the other, and we only have to bind once. There is no need for any dynamic offset binding, so there is no need for any padding for alignment. -```rust -#[derive(Clone, ShaderType)] -struct MyType { - x: f32, -} - -// Create a GPU array buffer -let mut buffer = GpuArrayBuffer::::new(&render_device.limits()); - -// Push some items into it -for i in 0..N { - // indices is a GpuArrayBufferIndex which contains a NonMaxU32 - // index into the array and an Option dynamic offset. If storage - // buffers are supported, it will be None, else Some with the dynamic - // offset that needs to be used when binding the bind group. indices should - // be stored somewhere for later lookup, often associated with an Entity. - let indices = buffer.push(MyType { x: i as f32 }); -} - -// Queue writing the buffer contents to VRAM -buffer.write_buffer(&render_device, &render_queue); - -// The bind group layout entry to use when creating the pipeline -let binding = 0; -let visibility = ShaderStages::VERTEX; -let bind_group_layout_entry = buffer.binding_layout( - binding, - visibility, - &render_device, -); - -// Get the binding resource to make a bind group entry to use when creating the -// bind group -let buffer_binding_resource = buffer.binding()?; - -// Get the batch size. This will be None if storage buffers are supported, else -// it is the maximum number of elements that could fit in a batch -let buffer_batch_size = GpuArrayBuffer::::batch_size(&render_device.limits()); - -// Set a shader def with the buffer batch size -if let Some(buffer_batch_size) = buffer_batch_size { - shader_defs.push(ShaderDefVal::UInt( - "BUFFER_BATCH_SIZE".into(), - buffer_batch_size, - )); -} -``` - -```rust -#import bevy_render::instance_index get_instance_index - -struct MyType { - x: f32, -} - -// Declare the buffer binding -#ifdef BUFFER_BATCH_SIZE -@group(2) @binding(0) var data: array; -#else -@group(2) @binding(0) var data: array; -#endif +[Check out this annotated code example](https://gist.github.com/cart/3a9f190bd5e789a7d42317c28843ffca) that illustrates using [`GpuArrayBuffer`] to support both uniform and storage buffer bindings. -// Access an instance -let my_type = data[get_instance_index(in.instance_index)]; -``` +[`GpuArrayBuffer`]: https://dev-docs.bevyengine.org/bevy/render/render_resource/enum.GpuArrayBuffer.html +[`StorageBuffer`]: https://dev-docs.bevyengine.org/bevy/render/render_resource/struct.StorageBuffer.html ### 2D / 3D Mesh Entities using GpuArrayBuffer
authors: Rob Swain (@superdump), @robtfm, @Elabajaba
-The 2D and 3D mesh entity rendering was migrated to use `GpuArrayBuffer` for the mesh uniform data. +The 2D and 3D mesh entity rendering was migrated to use [`GpuArrayBuffer`] for the mesh uniform data. -Just avoiding the rebinding of the mesh uniform data buffer gives about a 6% increase in frame rates. +Just avoiding the rebinding of the mesh uniform data buffer gives about a 6% increase in frame rates! -### Improved bevymark Example - -
authors: Rob Swain (@superdump), @IceSentry
- -The bevymark example needed to be improved to enable benchmarking the batching / instanced draw changes. Modes were added to: - -* draw 2D quad meshes instead of sprites: `--mode mesh2d` -* vary the per-instance color data instead of only varying the colour per wave of birds: `--vary-per-instance` -* generate a number of material / sprite textures and randomly choose from them either per wave or per instance depending on the vary per instance setting: `--material-texture-count 10` -* spawn the birds in random z order (new default), or in draw order: `--ordered-z` - -This allows benchmarking of different situations for batching / instancing in the next section. - -### Automatic Batching/Instancing of Draw Commands - -
authors: Rob Swain (@superdump), @robtfm, @nicopap
- -There are many operations that can be done to prepare a draw command in a render pass. If anything needs to change either in bindings or the draw itself, then the draws cannot be batched together into an instanced draw. Some of the main things that can change between draws are: - -* Pipeline -* BindGroup or its corresponding dynamic offsets -* Index/vertex buffer -* Index/vertex range -* Instance range - -Pipelines usually vary due to using different shaders in custom materials, or using variants of a material due to shader defs as the shader defs produce different shaders. Bind group bindings can change due to different material textures, different buffers, or needing to bind different parts of some buffers using dynamic offsets. Index/vertex buffers and/or ranges change per mesh asset. Instance range is what we want to leverage for instanced draws. - -#### Assumptions - -The design of the automatic batching/instanced draws in Bevy makes some assumptions to enable a reasonable solution: - -* Only entities with prepared assets are queued to render phases -* View bindings are constant across a render phase for a given draw function, as phases are per-view -* `batch_and_prepare_render_phase` is the only system that performs batching and has sole responsibility for preparing per-instance (i.e. mesh uniform) data - -If these assumptions do not work for your use case, then you can add the `NoAutomaticBatching` component to your entities to opt-out and do your own thing. Note that mesh uniform data will still be written to the `GpuArrayBuffer` and can be used in your own mesh bind groups. - -#### Instanced Draw Performance - -We can batch draws into a single instanced draw in some situations now that per-instance mesh uniform data is in a `GpuArrayBuffer`. If the mesh entity is using the same mesh asset, and same material asset, then it can be batched! - -Using the same approach as 0.11 with one dynamic offset binding per mesh entity, and comparing to either storage buffers or batched uniform buffers: - -2D meshes: `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z` which spawns 160 waves of 1000 2D quad meshes, producing 160 instanced draws of 1000 instances per draw. - -![0.12-2DMeshes](0.12-2DMeshes.svg) -
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
- -3D meshes: `many_cubes` which spawn 160,000 cubes, of which ~11,700 are visible in the view. These are drawn using a single instanced draw of all visible cubes which enables up to **100% increase in frame rate (2x)**! - -![0.12-3DMeshes](0.12-3DMeshes.svg) -
Tested on an M1 Max, limiting the BatchedUniformBuffer batch size to 1 versus how it works in 0.12.
- -These performance benefits can be leveraged on all platforms, including WebGL2! - -## Rendering Performance Improvements - -### EntityHashMap +## EntityHashMap Renderer Optimization
authors: Rob Swain (@superdump), @robtfm, @pcwalton, @jancespivo, @SkiFire13, @nicopap
-#### The Performance Problem - -Since Bevy 0.6, Bevy's renderer has used a separate render world to store an extracted snapshot of the simulated data from the main world to enable pipelined rendering of frame N in the render app, while the main app simulates frame N+1. +Since **Bevy 0.6**, Bevy's renderer has extracted data from the "app world" into a separate "render world". This enables [Pipelined Rendering](/news/bevy-0-6/#pipelined-rendering-extract-prepare-queue-render), which renders frame N in the render app, while the main app simulates frame N+1. Part of the design involves clearing the render world of all entities between frames. This enables consistent Entity mapping between the main and render worlds while still being able to spawn new entities in the render world that don't exist in the main world. -Unfortunately, this ECS usage pattern also incurred some significant performance problems. Entities are cleared and respawned each frame, components are inserted across many systems and different parts of the render app schedule. +Unfortunately, this ECS usage pattern also incurred some significant performance problems. To get good "linear iteration read performance", we wanted to use "table storage" (Bevy's default ECS storage model). However in the renderer, entities are cleared and respawned each frame, components are inserted across many systems and different parts of the render app schedule. This resulted in a lot of "archetype moves" as new components were inserted from various renderer contexts. When an entity moves to a new archetype, all of its "table storage" components are copied into the new archetype's table. This can be expensive across many archetype moves and/or large table moves. -The fastest ECS storage available is called table storage. A simplified concept for table storage is that it is a structure of arrays of component data. Each archetype has its own table for storage. Whenever a new component is inserted onto an entity that it didn't have before, its archetype is changed. This then requires that that entity's component data be moved from the table for the old archetype to the table for the new archetype. +This was unfortunately leaving a lot of performance on the table. Many ideas were discussed over a long period for how to improve this. -In practice this was very visible in profiles as long-running system commands throughout the render app schedule. +### The Path Forward -DEMO PROFILE IMAGE +The main two paths forward were: -As can be seen, this was unfortunately leaving a lot of performance on the table. Many ideas were discussed over a long period for how to improve this. The main two paths forward were: +1. Persist render world entities and their component data across frames +2. Stop using entity table storage for storing component data in the render world -1. Persist render world entities and their component data across frames - this has the problem of memory leaks and Entity collisions -2. Stop using entities for storing component data in the render world +We have decided to explore option (2) for **Bevy 0.12** as persisting entities involves solving other problems that have no simple and satisfactory answers (ex: how do we keep the worlds perfectly in sync without leaking data). We may find those answers eventually, but for now we chose the path of least resistance! -We have decided to explore option 2 for Bevy 0.12 as persisting entities involves solving other problems that have no simple and satisfactory answers. 
+
+We landed on using `HashMap<Entity, T>` with an optimized hash function designed by @SkiFire13, and inspired by [`rustc-hash`](https://github.com/rust-lang/rustc-hash). This is exposed as [`EntityHashMap`] and is the new way to store component data in the render world.

-After consideration, we landed on using `HashMap<Entity, T>` with a hash function designed by @SkiFire13, and inspired by `rustc-hash`. This configuration is called `EntityHashMap` and is the new way to store component data in the render world.
+This [yielded significant performance wins](https://github.com/bevyengine/bevy/pull/9903).

-#### EntityHashMap Helpers
+[`EntityHashMap`]: https://dev-docs.bevyengine.org/bevy/utils/type.EntityHashMap.html

-A helper plugin was added to make it simple and quick to extract main world data for use in the render world in the form of `ExtractInstancesPlugin`. You can extract all entities matching a query, or only those that are visible, extracting multiple components at once into one target type.
+### Usage
+
+The easiest way to use it is via the new [`ExtractInstancesPlugin`]. This will extract all entities matching a query, or only those that are visible, extracting multiple components at once into one target type.

 It is a good idea to group component data that will be accessed together into one target type to avoid having to do multiple lookups.

@@ -296,23 +222,32 @@ impl ExtractInstance for MyType {
 app.add_plugins(ExtractInstancesPlugin::<MyType>::extract_visible());
 ```

-### Sprite Instancing
+[`ExtractInstancesPlugin`]: https://dev-docs.bevyengine.org/bevy/render/extract_instances/struct.ExtractInstancesPlugin.html
+
+## Sprite Instancing
+
authors: Rob Swain (@superdump)
-Sprites were being rendered by generating a vertex buffer containing 4 vertices per sprite with position, UV, and possibly color data. This has proven to be very effective. However, having to split batches of sprites into multiple draws because they use a different color is suboptimal. +In previous versions of Bevy, Sprites were rendered by generating a vertex buffer containing 4 vertices per sprite with position, UV, and possibly color data. This has proven to be very effective. However, having to split batches of sprites into multiple draws because they use a different color is suboptimal. Sprite rendering now uses an instance-rate vertex buffer to store the per-instance data. Instance-rate vertex buffers are stepped when the instance index changes, rather than when the vertex index changes. The new buffer contains an affine transformation matrix that enables translation, scaling, and rotation in one transform. It contains per-instance color, and UV offset and scale. -This retains all the functionality of the previous method, enables the additional flexibility of any sprite being able to have a color tint and all still be drawn in the same batch, and uses a total of 80 bytes per sprite, versus 144 bytes previously. The practical result is a performance improvement of up to 40% versus the previous method! +This retains all the functionality of the previous method, enables the additional flexibility of any sprite being able to have a color tint and all still be drawn in the same batch, and uses a total of 80 bytes per sprite, versus 144 bytes previously. -### Overall Performance vs 0.11 +This resulted in a performance improvement of up to **40%** versus the previous method! -3D Meshes +## Improved bevymark Example -2D Meshes +
authors: Rob Swain (@superdump), @IceSentry
-Sprites +The bevymark example needed to be improved to enable benchmarking the batching / instanced draw changes. Modes were added to: -UI +* draw 2D quad meshes instead of sprites: `--mode mesh2d` +* vary the per-instance color data instead of only varying the colour per wave of birds: `--vary-per-instance` +* generate a number of material / sprite textures and randomly choose from them either per wave or per instance depending on the vary per instance setting: `--material-texture-count 10` +* spawn the birds in random z order (new default), or in draw order: `--ordered-z` + +This allows benchmarking of different situations for batching / instancing in the next section. ## What's Next? From d25fec6115790ae67451bc3c42ab71f48689bbc9 Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Wed, 1 Nov 2023 11:53:36 -0700 Subject: [PATCH 11/14] app world -> main world --- content/news/2023-10-21-bevy-0.12/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 15a14d96fe..1da0fa77e6 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -170,7 +170,7 @@ Just avoiding the rebinding of the mesh uniform data buffer gives about a 6% inc
authors: Rob Swain (@superdump), @robtfm, @pcwalton, @jancespivo, @SkiFire13, @nicopap
-Since **Bevy 0.6**, Bevy's renderer has extracted data from the "app world" into a separate "render world". This enables [Pipelined Rendering](/news/bevy-0-6/#pipelined-rendering-extract-prepare-queue-render), which renders frame N in the render app, while the main app simulates frame N+1. +Since **Bevy 0.6**, Bevy's renderer has extracted data from the "main world" into a separate "render world". This enables [Pipelined Rendering](/news/bevy-0-6/#pipelined-rendering-extract-prepare-queue-render), which renders frame N in the render app, while the main app simulates frame N+1. Part of the design involves clearing the render world of all entities between frames. This enables consistent Entity mapping between the main and render worlds while still being able to spawn new entities in the render world that don't exist in the main world. From 03b7ab6f19768479cbe8366187d5933a48c66bc2 Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Wed, 1 Nov 2023 11:57:24 -0700 Subject: [PATCH 12/14] clarify "array" --- content/news/2023-10-21-bevy-0.12/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 1da0fa77e6..657f1bed10 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -77,7 +77,7 @@ In GPU-driven rendering, the draw commands are encoded on the GPU by [compute sh ### What needs to change? -Historically Bevy's general GPU data pattern has been to bind each piece of data per-entity and issue a draw call per-entity. We did store data in arrays and accessed with dynamic offsets in some cases, but this still resulted in rebinding at each offset. +Historically Bevy's general GPU data pattern has been to bind each piece of data per-entity and issue a draw call per-entity. In some cases we did store data in uniform buffers in "array style" and accessed with dynamic offsets, but this still resulted in rebinding at each offset. All of this rebinding has performance implications, both on the CPU and the GPU. On the CPU, it means encoding draw commands has many more steps to process and so takes more time than necessary. In the graphics API and on the GPU, it means many more rebinds and separate draw commands. From cb304862e961ec34a6def32d2adcced6dddaeab3 Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Wed, 1 Nov 2023 12:01:09 -0700 Subject: [PATCH 13/14] "close in cost" --- content/news/2023-10-21-bevy-0.12/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 657f1bed10..1f1755f906 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -112,7 +112,7 @@ The new render set order in 0.12 is: OK, so we need to put many pieces of data of the same type into buffers in a way that we can bind them as few times as possible and draw multiple instances from them. How can we do that? -In previous versions of Bevy, per-instance `MeshUniform` data is stored in a uniform buffer with each instance's data aligned to a dynamic offset. When drawing each mesh entity, we update the dynamic offset, which is close to rebinding. It looks like this: +In previous versions of Bevy, per-instance `MeshUniform` data is stored in a uniform buffer with each instance's data aligned to a dynamic offset. 
When drawing each mesh entity, we update the dynamic offset, which can be close in cost to rebinding. It looks like this: ![DynamicUniformBuffer](DynamicUniformBuffer.svg)
Red arrows are 'rebinds' to update the dynamic offset, blue boxes are instance data, orange boxes are padding for dynamic offset alignment, which is a requirement of GPUs and graphics APIs.
From 2b7464769e9db8a84c88791f9363a6d121803af5 Mon Sep 17 00:00:00 2001 From: Carter Anderson Date: Wed, 1 Nov 2023 12:08:09 -0700 Subject: [PATCH 14/14] Define "binding" --- content/news/2023-10-21-bevy-0.12/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/news/2023-10-21-bevy-0.12/index.md b/content/news/2023-10-21-bevy-0.12/index.md index 1f1755f906..9bb3437c3d 100644 --- a/content/news/2023-10-21-bevy-0.12/index.md +++ b/content/news/2023-10-21-bevy-0.12/index.md @@ -47,7 +47,7 @@ These performance benefits can be leveraged on all platforms, including WebGL2! ### What can be batched? -Batching/Instancing can only happen for GPU data that doesn't require "rebinding". This means if something like a pipeline (shaders), bind group (shader-accessible bound data), vertex / index buffer (mesh) is different, it cannot be batched. +Batching/Instancing can only happen for GPU data that doesn't require "rebinding" (binding is making data available to shaders / pipelines, which incurs a runtime cost). This means if something like a pipeline (shaders), bind group (shader-accessible bound data), vertex / index buffer (mesh) is different, it cannot be batched. From a high level, currently entities with the same material and mesh can be batched.
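A deliberately simplified sketch of what that rule boils down to. The types and fields below are hypothetical, not Bevy's internal ones, but the batching system makes an equivalent comparison between consecutive draws in a render phase: two draws can share one instanced draw only if nothing that would force a rebind differs.

```rust
// Hypothetical sketch of the "can these two draws batch?" test.
// Anything that would require a rebind must be identical between the draws.
#[derive(PartialEq, Eq)]
struct DrawKey {
    pipeline_id: usize,            // shaders / pipeline state
    material_bind_group_id: usize, // shader-accessible bound data
    mesh_buffer_id: usize,         // vertex / index buffers
}

fn can_batch(previous: &DrawKey, current: &DrawKey) -> bool {
    // If all of these match, the second draw is folded into the first as one
    // more instance; otherwise a new (instanced) draw must be started.
    previous == current
}

fn main() {
    let a = DrawKey { pipeline_id: 1, material_bind_group_id: 7, mesh_buffer_id: 3 };
    let b = DrawKey { pipeline_id: 1, material_bind_group_id: 7, mesh_buffer_id: 3 };
    let c = DrawKey { pipeline_id: 1, material_bind_group_id: 8, mesh_buffer_id: 3 };
    assert!(can_batch(&a, &b));  // same mesh + material: batched
    assert!(!can_batch(&a, &c)); // different material bind group: new draw
}
```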