Use EntityHashMap<Entity, T> for render world entity storage for better performance #9903
Conversation
@cart just to clearly state that this PR does introduce performance regressions for sprites and UI as it switches them from `SparseSet<Entity, T>` to `EntityHashMap<Entity, T>`.
These benchmarks are likely to change once we switch the render phase sorting to be by pipeline, right?
These benchmarks in particular wouldn’t change as they’re all using one pipeline each. Beyond that, we would only implement that kind of sorting for opaque phases where draw order is about maximising performance and not about visual correctness. That means only many_cubes would be affected as 2D and UI only have transparent phases at the moment.
crates/bevy_utils/src/lib.rs
```rust
#[inline]
fn write_u32(&mut self, i: u32) {
    self.hash = i as u64;
}
```
This is probably a performance footgun. `hashbrown` uses the top 7 bits of the hash to do SIMD probing inside a group, and this will set all of them to 0, essentially nullifying that optimization.
Also note that `Entity` has two `u32` fields, so it will call `write_u32` twice and the second one will overwrite the result of the first one. Luckily the `index` is the second field of `Entity`, but we're not documenting this anywhere and it would be easy to break (especially considering the ongoing work to refactor identifiers in general).
Ah, I didn't know it worked like that. What do you suggest?
The generation in the `Entity` doesn't really matter for these uses, so I did try using `Entity.index() -> u32` as noted in the description but in some cases it was slower than using `Entity` for some reason. Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.
Ah, I didn't know it worked like that. What do you suggest?
You could try with a very simple hasher like `rustc_hash`. If you want to manually implement it for the `u32` case, it's just `(i as u64).wrapping_mul(0x517cc1b727220a95)`. The only concern I have with this is that it kinda randomizes the hash, so it pessimizes the case where entities are sorted.
An alternative for better memory locality might be keeping the entity index in the lower 32 bits while using the hash for the upper 32 bits, which should allow some useful top 7 bits. This could be implemented with `((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)`.
For 32-bit targets (e.g. wasm) this is a bit more complicated because `hashbrown` only considers the bottom 32 bits. Unfortunately the top 7 bits of the entity index are all zeros most of the time (you need 2^25 ~ 33M entities for them to start being non-zero). My guess is that you could get away with the previous method but done on a 16-16 bit split, e.g. `(i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)`.
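For concreteness, a small sketch of the two suggested formulas as free functions (the function names are just illustrative; `i` is the entity index):

```rust
// 64-bit targets: multiplicative hash in the upper 32 bits, raw index in the
// lower 32 bits, so the top 7 bits are useful for hashbrown's SIMD probing
// while memory locality is preserved.
fn hash_index_64(i: u32) -> u64 {
    ((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)
}

// 32-bit targets (hashbrown only looks at the bottom 32 bits): same idea on a
// 16/16-bit split.
fn hash_index_32(i: u32) -> u32 {
    (i.wrapping_mul(0x9e3779b9) << 16) | (i & 0xffff)
}
```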
Just noting that the index is the bit I'm interested in using as a key for cases where the map is cleared each frame, in case that makes things simpler to improve.
Then using only `index` should be fine for your uses, but how do you ensure that whoever changes `Entity` in the future makes sure that the `index` field will be hashed last?
Yeah I don't like how implicit this is. `PassHasher` exists to pass a single full `u64` hash. Giving it support for a `u32` and then using it on each of the fields of `Entity` is a confusing misuse of that design imo.
Some ideas:
- A custom `Entity` `Hash` impl that combines the index and generation into a single `u64`.
- Probably very bad for some safety reason (in addition to cleanliness / maintainability), but could we transmute `&self` (`&Entity`) to `&u64`?
- A custom `EntityMap` type that does the work for users (ex: if we really want an entity.index -> X map, we could just wrap an `Entity` api over `PassHashMap<u64, X, PassHash>` that casts entity.index to `u64`; see the rough sketch below).
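A rough sketch of that last idea, assuming `PassHash` is the pass-through `BuildHasher` from `bevy_utils` (the wrapper name, its methods, and the `PassHashMap` alias here are illustrative, not an existing API):

```rust
use bevy_ecs::entity::Entity;
use bevy_utils::PassHash; // assumed: pass-through BuildHasher from bevy_utils
use std::collections::HashMap;

// Illustrative alias; not necessarily the real bevy_utils type.
type PassHashMap<K, V> = HashMap<K, V, PassHash>;

/// Wraps an `Entity`-flavoured API over a map keyed by `entity.index()` as u64.
struct EntityIndexMap<X>(PassHashMap<u64, X>);

impl<X> EntityIndexMap<X> {
    fn insert(&mut self, entity: Entity, value: X) -> Option<X> {
        self.0.insert(entity.index() as u64, value)
    }

    fn get(&self, entity: Entity) -> Option<&X> {
        self.0.get(&(entity.index() as u64))
    }
}
```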
I tried a suggestion from @nicopap about doing a `wrapping_mul` by `u32::MAX / golden ratio`. I was thinking that as the hash is a `u64`, I should instead use `u64::MAX / phi`. But this made performance really really bad. Not sure what I did wrong. And ahash and others are not as bad as that was, which seems odd. My current guess without having had time to dig deeper is that `wrapping_mul` on a `u64` uses `u128` maths that just has really bad performance.
@cart options 1 and 2 don’t address skifire’s comment about the upper 7 bits being used for some simd optimisation. Option 2 would always have the upper 32 bits as all zeros, and option 1 would be mostly zeros unless people do a lot of respawning of entities.
I removed `write_u32`, and implemented `Hash` for `Entity` manually by using `.to_bits()`, which shifts the generation into the upper 32 bits. This seems to work fine, and I don't see why it would be a problem anywhere. Shout if you think of something.
I also tried `FxHasher` (from the `rustc-hash` crate) and `SeaHasher` as they seemed to be claimed to be the fastest hashers I could find, and neither are close to being the speed of `PassHash`. As we are concerned about performance, it seems that the hashing cost of `FxHasher` or similar dominates performance versus the SIMD optimisation in hashbrown's hash map lookup. Maybe we can find something better in the future, but this seems to be fastest now. And I did also try `.rotate_right(7)` to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.
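A minimal sketch of what that looks like, with illustrative stand-in types (not the actual `bevy_ecs` code), assuming `to_bits()` packs the generation into the upper 32 bits and the index into the lower 32 bits:

```rust
use std::hash::{Hash, Hasher};

// Illustrative stand-in for bevy_ecs's Entity.
struct Entity {
    generation: u32,
    index: u32,
}

impl Entity {
    fn to_bits(&self) -> u64 {
        ((self.generation as u64) << 32) | self.index as u64
    }
}

impl Hash for Entity {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed the whole Entity as one u64 so a pass-through hasher receives
        // a single full 64-bit value instead of two separate u32 writes.
        state.write_u64(self.to_bits());
    }
}
```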
You could copy the 7 bottom bits to the top with a bitshift operation.
The problem with this approach is that the 7 top bits are used to differentiate values that end up in the same bucket (or other nearby buckets), but if they end up in the same bucket they very likely have the same bottom bits, and hence if you copy the 7 bottom bits to the 7 top bits they will be useless to distinguish different values.
Ideally you would copy the bottom 7 bits that aren't included in the bucket mask, but this depends on the size of the hashmap which is not a constant.
I also tried `FxHasher` (from the `rustc-hash` crate) and `SeaHasher` as they seemed to be claimed to be the fastest hashers I could find, and neither are close to being the speed of `PassHash`. As we are concerned about performance, it seems that the hashing cost of `FxHasher` or similar dominates performance versus the SIMD optimisation in hashbrown's hash map lookup.
Did you try using just those hashers or my version which stores the hash in the top 32 bits only (the one with `((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)`)?
If you're using the hashers from those crates you may be observing the issue with memory locality I was mentioning previously. And if this is the issue I wonder how often it comes up in practice: this may just be a consequence of spawning a lot of entities with the same component at the same moment. Maybe the entity distribution is much more random if a lot of other systems are running, spawning/despawning entities and changing their archetypes.
Did you hash just the `index` or the generation too? Hashing the generation too takes more time (for example `rustc_hash` is just a multiplication to hash a single `u32`/`u64`, but to hash a second `u32` it starts requiring a rotation, a xor and another multiplication).
And I did also try .rotate_right(7) to shift the bottom 7 bits up to being the top 7 bits. That made performance seriously tank. Not sure why.
If you `.rotate_right(7)` then you "lose" the bottom 7 bits, which makes every chunk of 128 entities map to the same bucket, which is horrible for performance. I think @nicopap meant doing something like `((i as u64) << 57) | (i as u64)`, which makes the 7 top bits equal to the 7 bottom bits without losing them, but as I mentioned above this will likely be useless for the SIMD optimization.
I tested this PR as-is with `Entity.to_bits()` and `PassHash`, versus Skifire's suggestion of using only the index, not the generation, and calculating the hash as `((i as u64).wrapping_mul(0x517cc1b727220a95) << 32) | (i as u64)`, and a suggestion from nicopap to try `i | (i << (64 - 7))` as a cheap way of getting the low 7 bits into the high 7 bits.
Ordered z spawning
With entities spawned with ordered z (which means after sorting, the batching and main pass 2d are doing a linear scan in terms of entity spawn order):
PassHash (yellow) vs Skifire's (red) frame times:
PassHash (yellow) vs nicopap's (red) frame times:
nicopap's median frame time is 2.7% less than the PR, and Skifire's is 1.2%.
PR: queuing: 7.43ms, batching: 7ms, main_pass_2d: 0.288ms
nicopap's: queuing: 7.22ms, batching: 6.92ms, main_pass_2d: 0.293ms
Skifire's: queuing: 7.46ms, batching: 6.94ms, main_pass_2d: 0.276ms
So, for ordered, it looks like nicopap's suggestion is marginally better.
Random z spawning
This mode spawns the entities with random z, which causes their order after sorting to be random, so the lookups in batching and main pass 2d are random.
PassHash (yellow) vs Skifire's (red) frame times:
PassHash (yellow) vs nicopap's (red) frame times:
nicopap's median frame time is 2.7% less than the PR, and Skifire's is 3.8%.
PR: queuing: 7.46ms, batching: 25.85ms, main_pass_2d: 88.28ms
nicopap's: queuing: 7.36ms, batching: 24.39ms, main_pass_2d: 86.57ms
Skifire's: queuing: 7.44ms, batching: 23.88ms, main_pass_2d: 85.53ms
So for random z and random lookup orders, Skifire's suggestion is marginally faster.
I suspect Skifire's suggestion is going to be more robust generally: it has the best worst case (random access) performance and is close for the best case (ordered access) performance, so I'm inclined to choose it.
@SkiFire13 how exactly does one calculate / arrive at `0x517cc1b727220a95`? As I had seen @nicopap's suggestion of using something golden ratio-based, I have seen things like `u64::MAX / phi` being done. But your constant seems to be `u64::MAX / π`. However, if I try to calculate it in Rust code using `(u64::MAX as f64 * std::f64::consts::FRAC_1_PI) as u64` or with various rounding, I can't get the same number. Is it that such a calculation was used as a starting point and then tests were done to identify that this particular value was good? I want to add a comment explaining its origin.
Unfortunately I don't know precisely how `0x517cc1b727220a95` was found; I just tweaked what `rustc_hash` does (which contains that exact constant), which in turn documents that it is a port of Firefox's internal hasher. You're right though that it is related to pi: it seems to be the number N for which `2^64 / N` is closest to pi.
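A quick sanity check of that relation in Rust (this just verifies the property; it is not how the constant was originally derived):

```rust
fn main() {
    let n: u64 = 0x517cc1b727220a95;
    // 2^64 / N should come out very close to π.
    println!("{}", 2f64.powi(64) / n as f64); // ≈ 3.14159265...
}
```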
Generally on board for this direction. But I agree that the current Entity passhash behavior we're relying on isn't ideal.
Fixes PassHash only using Entity.index.
Yup, fixing.
crates/bevy_asset/src/id.rs
```diff
@@ -13,7 +14,7 @@ use std::{
 /// For an identifier tied to the lifetime of an asset, see [`Handle`].
 ///
 /// For an "untyped" / "generic-less" id, see [`UntypedAssetId`].
-#[derive(Reflect)]
+#[derive(Component, Reflect)]
```
This shouldn't be a `Component` though. I'd prefer if it was newtyped:

```rust
#[derive(Component)]
struct RenderAssetId<A: Asset>(AssetId<A>);
```

Making this a component may be misleading for users: they might accidentally insert an `AssetId` when they mean to insert a `Handle`.
I had discussed this with Cart and we thought it should become a component. The only reason I needed to change it was because `ExtractComponent` is implemented for `Handle<T>` and is supposed to extract to a `Handle<T>` at the moment, and we rather thought it should extract to an `AssetId<T>`.
I still think it should be a newtype. I can see the disadvantages but not the advantages (apart from expediency; I'd be glad to open a PR after this one is merged if that's the only problem). Maybe there is an advantage I'm missing. Could you link me to the conversation?
I can't as it was in DMs on Discord. On the batching PR, Cart mentioned that `AssetId<T>` should be used if a weak handle is desired, such as how I was using it there. It felt appropriate to also switch to `AssetId<T>` as part of this PR.
I am happy to wrap `AssetId<T>` in a newtype, but I don't understand why I would do it. I hear that you think a user might insert an `AssetId<T>` when they should have used a `Handle<T>`, but I don't understand why wrapping it in a newtype would make much difference to that.
You mentioned on Discord that it's easy for someone to see `Handle` and `AssetId` are both `Component`s and think they can use either, whereas a newtype wrapper around `AssetId` gives a bit more friction. I decided to revert the change that made `AssetId` a `Component` and revert the change to the `ExtractComponent` impl for `Handle`. I think it's better taken in a separate PR. :) Thanks for noticing it!
Right, that's what I meant would have a negative performance impact on the batching because it would have to do two additional lookups per mesh instance.
```rust
&ViewVisibility,
&GlobalTransform,
&Mesh2dHandle,
Has<NoAutomaticBatching>,
```
When is `NoAutomaticBatching` added to a main world entity?
Is this rhetorical and suggesting I should document it?
Main world entities that have morphs or skins will have it added. Otherwise users can choose to add it if they want to opt out of batching.
For `mesh2d/mesh.rs` here I don't think morphs/skins are relevant.
But what I was asking is: the idea is to make automatic batching user-controllable on an entity basis. But why? Isn't it a property of the specific material? If so, is a separate component justified? I mean, non-blocking, but worth thinking about.
The component was introduced in the batching PR. The point was to allow people to opt-out of automatic batching and instancing if for some reason it was causing performance issues, or if they could do their own batching/instancing in some way that they want to. It's an escape hatch for user control.
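For illustration, opting an entity out would look roughly like this, assuming `NoAutomaticBatching` is the marker component from the batching PR, re-exported under `bevy::render::batching` (the asset path is made up):

```rust
use bevy::prelude::*;
use bevy::render::batching::NoAutomaticBatching;

fn spawn_unbatched_sprite(mut commands: Commands, asset_server: Res<AssetServer>) {
    commands.spawn((
        SpriteBundle {
            texture: asset_server.load("my_sprite.png"), // hypothetical asset
            ..default()
        },
        // Escape hatch: exclude this entity from automatic batching/instancing.
        NoAutomaticBatching,
    ));
}
```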
…sier for other components to adopt. Currently, the only way for custom components that participate in rendering to opt into the higher-performance extraction method in bevyengine#9903 is to implement the `RenderInstances` data structure and the extraction logic manually. This is inconvenient compared to the `ExtractComponent` API. This commit creates a new `ExtractToRenderInstance` trait that mirrors the existing `ExtractComponent` method but uses the higher-performance approach that bevyengine#9903 uses. This makes high-performance rendering components essentially as easy to write as the existing ones based on component extraction.
# Objective - After #9903, example `mesh2d_manual` doesn't render anything ## Solution - Fix the example using the new `RenderMesh2dInstances`
…ther components to adopt. (#10002) # Objective Currently, the only way for custom components that participate in rendering to opt into the higher-performance extraction method in #9903 is to implement the `RenderInstances` data structure and the extraction logic manually. This is inconvenient compared to the `ExtractComponent` API. ## Solution This commit creates a new `RenderInstance` trait that mirrors the existing `ExtractComponent` method but uses the higher-performance approach that #9903 uses. Additionally, `RenderInstance` is more flexible than `ExtractComponent`, because it can extract multiple components at once. This makes high-performance rendering components essentially as easy to write as the existing ones based on component extraction. --- ## Changelog ### Added A new `RenderInstance` trait is available mirroring `ExtractComponent`, but using a higher-performance method to extract one or more components to the render world. If you have custom components that rendering takes into account, you may consider migration from `ExtractComponent` to `RenderInstance` for higher performance.
# Objective Related to #10472. Not having a hardcoded scale factor makes comparing results from these stress tests difficult. Contributors using high dpi screens may be rendering 4x as many pixels as others (or more). Stress tests may have different behavior when moved from one monitor in a dual setup to another. At very high resolutions, different parts of the engine / hardware are being stressed. 1080p is also a far more common resolution for gaming. ## Solution Use a consistent 1080p with `scale_factor_override: 1.0` everywhere. In #9903, this sort of change was added specifically to `bevymark` and `many_cubes` but it makes sense to do it everywhere. ## Discussion - Maybe we should have a command line option, environment variable, or `CI_TESTING_CONFIG` option for 1080p / 1440p / 4k. - Will these look odd (small text?) when screenshotted and shown in the example showcase? The aspect ratio is the same, but they will be downscaled from 1080p instead of ~720p. - Maybe there are other window properties that should be consistent across stress tests. e.g. `resizable: false`. - Should we add a `stress_test_window(title)` helper or something? - Bevymark (pre-10472) was intentionally 800x600 to match "bunnymark", I believe. I don't personally think this is very important.
# Objective Keep essentially the same structure of `EntityHasher` from #9903, but rephrase the multiplication slightly to save an instruction. cc @superdump Discord thread: https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756 ## Solution Today, the hash is ```rust self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32); ``` with `i` being `(generation << 32) | index`. Expanding things out, we get ```rust i | ( (i * CONST) << 32 ) = (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32) = (generation << 32) | index | ((index * CONST) << 32) // because the generation overflowed = (index * CONST | generation) << 32 | index ``` What if we do the same thing, but with `+` instead of `|`? That's almost the same thing, except that it has carries, which are actually often better in a hash function anyway, since it doesn't saturate. (`|` can be dangerous, since once something becomes `-1` it'll stay that, and there's no mixing available.) ```rust (index * CONST + generation) << 32 + index = (CONST << 32 + 1) * index + generation << 32 = (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything = (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient = (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32 = (CONST << 32 + 1) * index +((CONST << 32 + 1) * (generation << 32) = (CONST << 32 + 1) * (index + generation << 32) = (CONST << 32 + 1) * (generation << 32 | index) = (CONST << 32 + 1) * i ``` So we can do essentially the same thing using a single multiplication instead of doing multiply-shift-or. LLVM was already smart enough to merge the shifting into a multiplication, but this saves the extra `or`: ![image](https://github.com/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace) <https://rust.godbolt.org/z/MEvbz4eo4> It's a very small change, and often will disappear in load latency anyway, but it's a couple percent faster in lookups: ![image](https://github.com/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75) (There was more of an improvement here before #10558, but with `to_bits` being a single `qword` load now, keeping things mostly as it is turned out to be better than the bigger changes I'd tried in #10605.) --- ## Changelog (Probably skip it) ## Migration Guide (none needed)
…er performance (bevyengine#9903) # Objective - Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does. ## Solution - Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks for @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower. - Define `EntityHashMap` type that uses the `EntityHashser` - Use `EntityHashMap<Entity, T>` for render world entity storage, including: - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D. - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D. - `SkinIndices` and `MorphIndices` - contains the skin and morph index for an entity, respectively - `ExtractedSprites` - `ExtractedUiNodes` ## Benchmarks All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none. ### 2D Meshes `bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d` #### `--ordered-z` This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1112" alt="Screenshot 2023-09-27 at 07 50 45" src="https://github.com/bevyengine/bevy/assets/302146/e140bc98-7091-4a3b-8ae1-ab75d16d2ccb"> -39.1% median frame time. #### Random This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable. <img width="1108" alt="Screenshot 2023-09-27 at 07 51 28" src="https://github.com/bevyengine/bevy/assets/302146/29c2e813-645a-43ce-982a-55df4bf7d8c4"> -7.2% median frame time. ### 3D Meshes `many_cubes --benchmark` <img width="1112" alt="Screenshot 2023-09-27 at 07 51 57" src="https://github.com/bevyengine/bevy/assets/302146/1a729673-3254-4e2a-9072-55e27c69f0fc"> -7.7% median frame time. ### Sprites **NOTE: On `main` sprites are using `SparseSet<Entity, T>`!** `bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite` #### `--ordered-z` This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate. <img width="1116" alt="Screenshot 2023-09-27 at 07 52 31" src="https://github.com/bevyengine/bevy/assets/302146/bc8eab90-e375-4d31-b5cd-f55f6f59ab67"> +13.0% median frame time. #### Random This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses. <img width="1109" alt="Screenshot 2023-09-27 at 07 53 01" src="https://github.com/bevyengine/bevy/assets/302146/22073f5d-99a7-49b0-9584-d3ac3eac3033"> +0.6% median frame time. 
### UI **NOTE: On `main` UI is using `SparseSet<Entity, T>`!** `many_buttons` <img width="1111" alt="Screenshot 2023-09-27 at 07 53 26" src="https://github.com/bevyengine/bevy/assets/302146/66afd56d-cbe4-49e7-8b64-2f28f6043d85"> +15.1% median frame time. ## Alternatives - Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage that `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`. - I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different. - The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved. - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems. - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense. --- ## Changelog - Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in a `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts. ## Migration Guide Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data. Before: ```rust fn queue_custom( material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>, ) { ... for (entity, mesh_transforms, mesh_handle) in &material_meshes { ... } } ``` After: ```rust fn queue_custom( render_mesh_instances: Res<RenderMeshInstances>, instance_entities: Query<Entity, With<InstanceMaterialData>>, ) { ... 
for entity in &instance_entities { let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; }; // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now // be found in `mesh_instance` which is a `RenderMeshInstance` ... } } ``` --------- Co-authored-by: robtfm <[email protected]>
Objective
- Improve rendering performance, particularly by avoiding the large system commands costs of using the ECS in the way that the render world does.

Solution
- Define `EntityHasher` that calculates a hash from the `Entity.to_bits()` by `i | (i.wrapping_mul(0x517cc1b727220a95) << 32)`. `0x517cc1b727220a95` is something like `u64::MAX / N` for N that gives a value close to π and that works well for hashing. Thanks to @SkiFire13 for the suggestion and to @nicopap for alternative suggestions and discussion. This approach comes from `rustc-hash` (a.k.a. `FxHasher`) with some tweaks for the case of hashing an `Entity`. `FxHasher` and `SeaHasher` were also tested but were significantly slower.
- Define an `EntityHashMap` type that uses the `EntityHasher` (a condensed sketch of both follows this list).
- Use `EntityHashMap<Entity, T>` for render world entity storage, including:
  - `RenderMaterialInstances` - contains the `AssetId<M>` of the material associated with the entity. Also for 2D.
  - `RenderMeshInstances` - contains mesh transforms, flags and properties about mesh entities. Also for 2D.
  - `SkinIndices` and `MorphIndices` - contain the skin and morph index for an entity, respectively.
  - `ExtractedSprites`
  - `ExtractedUiNodes`
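A condensed sketch of the hasher and map type described above; the exact module layout and generics in `bevy_utils` may differ from this illustration:

```rust
use std::hash::{BuildHasherDefault, Hasher};

/// Expects a single `write_u64` of `Entity::to_bits()`: the entity index stays
/// in the low 32 bits for locality, while a multiplicative hash fills the high
/// 32 bits so hashbrown's top-7-bit SIMD probing has useful bits to work with.
#[derive(Debug, Default)]
pub struct EntityHasher {
    hash: u64,
}

impl Hasher for EntityHasher {
    #[inline]
    fn write_u64(&mut self, i: u64) {
        self.hash = i | (i.wrapping_mul(0x517cc1b727220a95) << 32);
    }

    #[inline]
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, _bytes: &[u8]) {
        panic!("EntityHasher should only be fed u64 values via write_u64");
    }
}

/// HashMap keyed by `Entity`, using the pass-through-style `EntityHasher`.
pub type EntityHashMap<K, V> =
    std::collections::HashMap<K, V, BuildHasherDefault<EntityHasher>>;
```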
Benchmarks
All benchmarks have been conducted on an M1 Max connected to AC power. The tests are run for 1500 frames. The 1000th frame is captured for comparison to check for visual regressions. There were none.
2D Meshes
bevymark --benchmark --waves 160 --per-wave 1000 --mode mesh2d
--ordered-z
This test spawns the 2D meshes with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate.
-39.1% median frame time.
Random
This test spawns the 2D meshes with random z. This not only makes the batching and transparent 2D pass lookups get a lot of cache misses, it also currently means that the meshes are almost certain to not be batchable.
-7.2% median frame time.
3D Meshes
many_cubes --benchmark
-7.7% median frame time.
Sprites
NOTE: On `main` sprites are using `SparseSet<Entity, T>`!
`bevymark --benchmark --waves 160 --per-wave 1000 --mode sprite`
--ordered-z
This test spawns the sprites with z incrementing back to front, which is the ideal arrangement allocation order as it matches the sorted render order which means lookups have a high cache hit rate.
+13.0% median frame time.
Random
This test spawns the sprites with random z. This makes the batching and transparent 2D pass lookups get a lot of cache misses.
+0.6% median frame time.
UI
NOTE: On `main` UI is using `SparseSet<Entity, T>`!
`many_buttons`
+15.1% median frame time.
Alternatives
- Cart originally suggested trying out `SparseSet<Entity, T>` and indeed that is slightly faster under ideal conditions. However, `PassHashMap<Entity, T>` has better worst case performance when data is randomly distributed, rather than in sorted render order, and does not have the worst case memory usage of `SparseSet`'s dense `Vec<usize>` that maps from the `Entity` index to a sparse index into `Vec<T>`. This dense `Vec` has to be as large as the largest Entity index used with the `SparseSet`.
- I also tested `PassHashMap<u32, T>`, intending to use `Entity.index()` as the key, but this proved to sometimes be slower and mostly no different.
- The only outstanding approach that has not been implemented and tested is to _not_ clear the render world of its entities each frame. That has its own problems, though they could perhaps be solved.
  - Performance-wise, if the entities and their component data were not cleared, then they would incur table moves on spawn, and should not thereafter, rather just their component data would be overwritten. Ideally we would have a neat way of either updating data in-place via `&mut T` queries, or inserting components if not present. This would likely be quite cumbersome to have to remember to do everywhere, but perhaps it only needs to be done in the more performance-sensitive systems.
  - The main problem to solve however is that we want to both maintain a mapping between main world entities and render world entities, be able to run the render app and world in parallel with the main app and world for pipelined rendering, and at the same time be able to spawn entities in the render world in such a way that those Entity ids do not collide with those spawned in the main world. This is potentially quite solvable, but could well be a lot of ECS work to do it in a way that makes sense.

Changelog
- Changed: Component data for entities to be drawn are no longer stored on entities in the render world. Instead, data is stored in an `EntityHashMap<Entity, T>` in various resources. This brings significant performance benefits due to the way the render app clears entities every frame. Resources of most interest are `RenderMeshInstances` and `RenderMaterialInstances`, and their 2D counterparts.

Migration Guide
Previously the render app extracted mesh entities and their component data from the main world and stored them as entities and components in the render world. Now they are extracted into essentially `EntityHashMap<Entity, T>` where `T` are structs containing an appropriate group of data. This means that while extract set systems will continue to run extract queries against the main world, they will store their data in hash maps. Also, systems in later sets will either need to look up entities in the available resources such as `RenderMeshInstances`, or maintain their own `EntityHashMap<Entity, T>` for their own data.
Before:
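```rust
fn queue_custom(
    material_meshes: Query<(Entity, &MeshTransforms, &Handle<Mesh>), With<InstanceMaterialData>>,
) {
    ...
    for (entity, mesh_transforms, mesh_handle) in &material_meshes {
        ...
    }
}
```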
After:
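```rust
fn queue_custom(
    render_mesh_instances: Res<RenderMeshInstances>,
    instance_entities: Query<Entity, With<InstanceMaterialData>>,
) {
    ...
    for entity in &instance_entities {
        let Some(mesh_instance) = render_mesh_instances.get(&entity) else { continue; };
        // The mesh handle in `AssetId<Mesh>` form, and the `MeshTransforms` can now
        // be found in `mesh_instance` which is a `RenderMeshInstance`
        ...
    }
}
```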