-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Extract parallel queue abstraction #7348
Conversation
#6817 similarly abstracts the deferred mutation pattern, but seems to do so in a fully compatible way. |
The main difference here is that this is a standalone type. It's used in |
This could be avoided by replacing the |
This makes it difficult to work with several of these together, with a triple nested scope in that case. We can have both, but only having the scope is probably not 100% viable. |
Fair enough, didn't consider such a case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a very useful abstraction, it's a nice way of making thread locals usable. I like the name too -- it feels like a very natural way to describe a set of values stored in parallel across threads.
I am a bit wary of ParRef<>
, though. Requiring the drop impl to apply changes to a smart pointer feels strange and overengineered, and it is potentially footgunny as Nile mentioned. Perhaps we should replace .get()
with .scope()
, and just take the L on nested scopes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally in favor of this. I was considering a pr to do this exact thing. But I'd like to see some benches before approving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much nicer now that the API is based on scope
. I'm on board now, but there's a few API details I'm not sure about.
Moving to 0.12 as there are still open conversations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main thing for me to approve is just to see a bench that check_visibility still performs the same.
# Objective `extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it! ## Solution Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide. ## Performance Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR. Red is main. `extract_meshes`: ![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b) An average reduction from 1.2ms to 770us is seen, a 41.6% improvement. Note: this is still not including #9950's changes, so this may actually result in even faster speedups once that's merged in.
# Objective `extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it! ## Solution Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide. ## Performance Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR. Red is main. `extract_meshes`: ![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b) An average reduction from 1.2ms to 770us is seen, a 41.6% improvement. Note: this is still not including bevyengine#9950's changes, so this may actually result in even faster speedups once that's merged in.
# Objective `extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it! ## Solution Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide. ## Performance Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR. Red is main. `extract_meshes`: ![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b) An average reduction from 1.2ms to 770us is seen, a 41.6% improvement. Note: this is still not including bevyengine#9950's changes, so this may actually result in even faster speedups once that's merged in.
Co-authored-by: Joseph <[email protected]>
Comparing |
let cell = self.locals.get_or_default(); | ||
let mut value = cell.take(); | ||
let ret = f(&mut value); | ||
cell.set(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a panic occurs in the scope and gets caught, any commands added during the scope will be lost. This is probably desirable but I think it's worth mentioning in a comment.
Co-authored-by: Joseph <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice little bit of abstraction :)
# Objective #7348 added `bevy_utils::Parallel` and replaced the usage of the `ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also using it in `extract_meshes`. ## Solution Refactor the system to use `Parallel` instead.
# Objective There's a repeating pattern of `ThreadLocal<Cell<Vec<T>>>` which is very useful for low overhead, low contention multithreaded queues that have cropped up in a few places in the engine. This pattern is surprisingly useful when building deferred mutation across multiple threads, as noted by it's use in `ParallelCommands`. However, `ThreadLocal<Cell<Vec<T>>>` is not only a mouthful, it's also hard to ensure the thread-local queue is replaced after it's been temporarily removed from the `Cell`. ## Solution Wrap the pattern into `bevy_utils::Parallel<T>` which codifies the entire pattern and ensures the user follows the contract. Instead of fetching indivdual cells, removing the value, mutating it, and replacing it, `Parallel::get` returns a `ParRef<'a, T>` which contains the temporarily removed value and a reference back to the cell, and will write the mutated value back to the cell upon being dropped. I would like to use this to simplify the remaining part of bevyengine#4899 that has not been adopted/merged. --- ## Changelog TODO --------- Co-authored-by: Joseph <[email protected]>
# Objective bevyengine#7348 added `bevy_utils::Parallel` and replaced the usage of the `ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also using it in `extract_meshes`. ## Solution Refactor the system to use `Parallel` instead.
# Objective There's a repeating pattern of `ThreadLocal<Cell<Vec<T>>>` which is very useful for low overhead, low contention multithreaded queues that have cropped up in a few places in the engine. This pattern is surprisingly useful when building deferred mutation across multiple threads, as noted by it's use in `ParallelCommands`. However, `ThreadLocal<Cell<Vec<T>>>` is not only a mouthful, it's also hard to ensure the thread-local queue is replaced after it's been temporarily removed from the `Cell`. ## Solution Wrap the pattern into `bevy_utils::Parallel<T>` which codifies the entire pattern and ensures the user follows the contract. Instead of fetching indivdual cells, removing the value, mutating it, and replacing it, `Parallel::get` returns a `ParRef<'a, T>` which contains the temporarily removed value and a reference back to the cell, and will write the mutated value back to the cell upon being dropped. I would like to use this to simplify the remaining part of bevyengine#4899 that has not been adopted/merged. --- ## Changelog TODO --------- Co-authored-by: Joseph <[email protected]>
# Objective bevyengine#7348 added `bevy_utils::Parallel` and replaced the usage of the `ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also using it in `extract_meshes`. ## Solution Refactor the system to use `Parallel` instead.
Objective
There's a repeating pattern of
ThreadLocal<Cell<Vec<T>>>
which is very useful for low overhead, low contention multithreaded queues that have cropped up in a few places in the engine. This pattern is surprisingly useful when building deferred mutation across multiple threads, as noted by it's use inParallelCommands
.However,
ThreadLocal<Cell<Vec<T>>>
is not only a mouthful, it's also hard to ensure the thread-local queue is replaced after it's been temporarily removed from theCell
.Solution
Wrap the pattern into
bevy_utils::Parallel<T>
which codifies the entire pattern and ensures the user follows the contract. Instead of fetching indivdual cells, removing the value, mutating it, and replacing it,Parallel::get
returns aParRef<'a, T>
which contains the temporarily removed value and a reference back to the cell, and will write the mutated value back to the cell upon being dropped.I would like to use this to simplify the remaining part of #4899 that has not been adopted/merged.
Changelog
TODO