Run the multi-threaded executor at the end of each system task. #11906
Conversation
I will profile this later today. Are you sure the formatting of your SAFETY comments is correct?
CI failure on run-examples looks real.
I think I understand the bug that caused the … The existing code handled that case by running … I'm not sure what the cleanest way to fix that is. I should have time to work on it tonight, but not before then. Sorry!
Thanks, but it's not actually working yet, so no need just yet. The CI caught a race condition I had missed.
Nope, not at all! … Yup, the ones in … I'll fix those. Were those the ones you meant, or are there other issues, too?
Okay, I fixed the bug the CI caught, and added a unit test for it! ("Dependencies / check-bans" seems to be failing for every PR and doesn't mention anything I changed, so I plan to ignore that failure.)
Seems promising. Ran the many_foxes example with … The Main schedule is saving 200us, extract is 70us, and render is around 70us, over an 8ms frame time. I'm going to run more of the examples and see how things look. I think running 3d_scene will be interesting to see how the overhead is reduced. The render schedule seeing less reduction makes me think that this helped mostly with reducing some context switching. Rendering is more bottlenecked by long-running systems.

edit: Note for myself to check cpu usage. Also panic handling.
@@ -202,7 +202,7 @@ impl FakeTask {
/// For more information, see [`TaskPool::scope`].
#[derive(Debug)]
pub struct Scope<'scope, 'env: 'scope, T> {
executor: &'env async_executor::LocalExecutor<'env>,
why did you need to change these lifetimes?
The futures passed to `scope.spawn` may run the executor and spawn more futures, so they need `&scope`. That only lives for `'scope`, which is shorter than `'env`.

The change to this file makes the API match the one in `task_pool.rs`.

(The difference between the API in `task_pool` and `single_threaded_task_pool` led to some fun and confusing compiler errors. I somehow had rust-analyzer using a configuration with `task_pool` but `cargo build` using a configuration with `single_threaded_task_pool`, so it worked in the IDE but failed on the command line. And the lifetime errors it reports don't point at code in `single_threaded_task_pool`, so it took me a while to figure out that this file even existed.)
Might be worth pulling this out of this PR, then. The rest of the PR is a riskier change, and if for some reason it gets reverted we should still keep the changes to this file.
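To make the lifetime reasoning above concrete, here is a minimal sketch of the kind of signature being described (simplified, hypothetical types, not the actual diff): an executor borrowed for `'scope` can accept futures that themselves capture `&Scope`, which only lives for `'scope`.

```rust
use std::future::Future;
use std::marker::PhantomData;

// Hypothetical, simplified Scope: the executor reference uses the shorter
// 'scope lifetime so that spawned futures may capture `&Scope` itself.
pub struct Scope<'scope, 'env: 'scope, T> {
    executor: &'scope async_executor::LocalExecutor<'scope>,
    _marker: PhantomData<(&'scope T, &'env ())>,
}

impl<'scope, 'env: 'scope, T: 'scope> Scope<'scope, 'env, T> {
    /// Accepting `F: 'scope` (rather than `F: 'env`) is what lets a task
    /// capture `&self` and spawn further tasks from inside the executor.
    pub fn spawn<F: Future<Output = T> + 'scope>(&self, f: F) {
        self.executor.spawn(f).detach();
    }
}
```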
A more complete review is forthcoming. I'm particularly concerned about the soundness of the changes, so I may need some time to go through it more thoroughly.
With that said, I am seeing similar performance gains to what @hymm is seeing, so this is definitely looking pretty promising by itself.
Generally looks good to me, though the potential for aliasing on `Conditions` is a bit concerning.
Ideally, we wouldn't yield back to the async executor or the OS, but we can leave that for another PR.
@@ -483,12 +555,14 @@ impl MultiThreadedExecutor {
}

// Evaluate the system set's conditions.
// SAFETY: We have exclusive access to ExecutorState, so no other thread is touching these conditions
This is not guaranteed given the function signature and the safety invariants; this likely needs to be propagated through the invariants of this function.
Force-pushed from 543215c to 2ba70eb.
I added two commits to try to address @james7132's review comments, then rebased the whole thing to resolve merge conflicts.
This likely requires another check to ensure the performance hasn't degraded from the extra mutex. We can come back and replace it with the old SyncUnsafeCells if we can reasonably prove out the safety invariants, but this otherwise looks good to me.
systems,
mut conditions,
} = SyncUnsafeSchedule::new(schedule);
let environment = &Environment::new(self, schedule, world);
Suggested change:
- let environment = &Environment::new(self, schedule, world);
+ let environment = Environment::new(self, schedule, world);
Is this borrow needed?
`Context` needs a borrow because `Environment` owns the `Mutex<Conditions>`, and so that we only copy one pointer into each task instead of the whole environment. And doing the borrow here avoids having to spell out `environment: &environment` when constructing the `Context`.
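A rough sketch of the ownership shape being described (the names follow the discussion, but these are not the exact Bevy definitions): `Environment` owns the mutex-protected conditions, and each task's `Context` is just a copyable reference to it.

```rust
use std::marker::PhantomData;
use std::sync::Mutex;

// Stand-in for the per-schedule run-condition storage.
struct Conditions<'env> {
    _marker: PhantomData<&'env ()>,
}

// Owns everything the worker tasks need, including the Mutex<Conditions>.
struct Environment<'env> {
    conditions: Mutex<Conditions<'env>>,
    // ... executor, schedule metadata, and world access live here too
}

// Copying a Context into each spawned task copies one pointer,
// not the whole environment.
#[derive(Clone, Copy)]
struct Context<'env> {
    environment: &'env Environment<'env>,
}
```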
Checked this against main; the new mutex does seem to eat into the improvements we saw earlier. Tested on many_foxes. Yellow is this PR, red is main.

There are still notable gains, however. As @hymm noted, this most heavily impacts schedules made up primarily of small systems with little work to do.

What's interesting here is that there are almost 4x as many calls into the multi-threaded executor, but the overall distribution of time spent is significantly improved: on average the time spent is cut by 50%, and most of the long-tail instances that take 100+us per run are eliminated.

Overall, LGTM. We can try to remove the mutex and see if we can make any improvements in a follow-up PR, but I'm interested in trying to minimize yielding to the task executor.
system: &BoxedSystem,
) {
// tell the executor that the system finished
self.environment
This doesn't have to be done in this PR, but we have the SystemResult here, and we may be able to lock the executor directly. Sending this through the completion queue when we can just pass it in via a function argument may be wasteful and may contribute to contention on the queue.
Perhaps we could add the result to the queue only if we failed to get the lock on the executor state.
Nevermind, just tried this myself, seems to deadlock.
Yeah, there's a race condition if one thread fails to get the lock, then the other thread exits the executor, and then the first thread pushes to the queue.
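For illustration, a self-contained sketch of that racy pattern with stand-in types (not the real executor code): queueing the result only after a failed `try_lock` can strand a completion if the lock holder drains the queue and exits in between.

```rust
use concurrent_queue::ConcurrentQueue;
use std::sync::Mutex;

struct SystemResult {
    system_index: usize,
}

struct SharedExecutor {
    queue: ConcurrentQueue<SystemResult>,
    state: Mutex<Vec<usize>>, // stand-in for the mutable executor state
}

fn system_completed(executor: &SharedExecutor, result: SystemResult) {
    match executor.state.try_lock() {
        // Fast path: we hold the lock, so process the completion directly.
        Ok(mut completed) => completed.push(result.system_index),
        // Racy path: between this failed try_lock and the push below, the
        // lock holder may drain the queue and exit the executor, leaving
        // this completion unprocessed until something else wakes it up.
        Err(_) => {
            let _ = executor.queue.push(result);
        }
    }
}
```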
This allows it to run immediately instead of needing to wait for the main thread to wake up. Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.
… store a single pointer to it in Context.
Co-authored-by: James Liu <[email protected]>
Force-pushed from 74c03e4 to 8b5a37c.
Rebased to fix merge conflicts.
Oh, wow, I wasn't expecting an uncontended mutex to actually cost anything! One option to get back to one mutex with only safe code would be to create a …
Even if it's not under contention, there is still going to be a syscall to make sure that on the OS side there is no issue.
Agreed. These are wins even with the impact from the extra mutex, and we can incrementally improve the results in a later PR.
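For anyone who wants to sanity-check the cost of an uncontended lock locally, here is a rough micro-benchmark sketch (not a rigorous measurement; results vary by platform and standard-library mutex implementation):

```rust
use std::sync::Mutex;
use std::time::Instant;

fn main() {
    let m = Mutex::new(0u64);
    let start = Instant::now();
    for _ in 0..1_000_000 {
        // Lock and unlock with no other thread contending.
        *m.lock().unwrap() += 1;
    }
    let per_iter = start.elapsed().as_nanos() as f64 / 1_000_000.0;
    println!("~{per_iter:.1} ns per uncontended lock/unlock (plus the add)");
}
```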
Nice job. The changes are pretty conservative since it's mostly just yeeting most of the logic behind a mutex. I took the opportunity to do a pass over the safety comments. Some of them weren't quite correct before.
One thing to note is that I think this will deadlock when there aren't any taskpool threads now, since we're not awaiting in the ticking code anymore. Not sure it's relevant, but should maybe be added to the migration guide.
I want to do a little more perf testing with varying thread counts before approving. Should get to that this weekend.
sender: Sender<SystemResult>,
/// Receives system completion events.
receiver: Receiver<SystemResult>,
/// The running state, protected by a mutex so that the .
This sentence is incomplete.
let systems = environment.systems;

let state = self.state.get_mut().unwrap();
if state.apply_final_deferred {
I don't think `apply_final_deferred` needs to be in the state. We should just be able to keep it on `MultiThreadedExecutor`.
Sure, I'll move that.
(I'm surprised that's the only one; I wasn't looking at whether the fields were used mutably, just whether they were read outside the lock. I figured there would be something else immutable in there, but even `num_systems` gets changed.)
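A small sketch of the layout change being discussed (hypothetical field set beyond the two names mentioned above): values that are only read during a run can live on the executor itself, while anything mutated stays behind the lock.

```rust
use std::sync::Mutex;

struct ExecutorState {
    num_systems: usize, // mutated during a run, so it stays behind the lock
    // ... ready/running bookkeeping also lives here
}

struct MultiThreadedExecutor {
    apply_final_deferred: bool, // configured up front and only read while running
    state: Mutex<ExecutorState>,
}
```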
*panic_payload = Some(payload);
}
}
self.tick_executor();
Is this recursive? If it is, we should maybe have a comment, since it's not obvious. I think it might be possible to hit the recursion limit in a large enough schedule. Probably unlikely.
No, it spawns new tasks for each system and returns.
if self.exclusive_running {
return;
}

let mut conditions = context
Would it work to add a safety condition that no other borrows of conditions exist, instead of using a mutex? I think that's trivially true because conditions are only evaluated behind the mutex.

It should also be noted that `evaluate_and_fold_conditions` is missing a safety comment saying that "The conditions that are run cannot have any active borrows."
That should be possible. @james7132 said "We can try to remove the mutex and see if we can make any improvements in a follow-up PR", so I'm not planning to change that in this PR.
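For reference, a hedged sketch of the shape that follow-up could take (hypothetical types and signature, not the actual `evaluate_and_fold_conditions`): drop the mutex and make exclusivity a documented safety contract instead.

```rust
use std::cell::UnsafeCell;

/// Stand-in for a boxed run condition.
type Condition = Box<dyn FnMut() -> bool + Send>;

/// Conditions shared across worker threads without a Mutex (sketch only).
struct SharedConditions(UnsafeCell<Vec<Condition>>);

// SAFETY (sketch): access is externally synchronized by the executor's
// state lock, so the cell is never touched from two threads at once.
unsafe impl Sync for SharedConditions {}

impl SharedConditions {
    /// # Safety
    /// No other borrows of the conditions may exist for the duration of this
    /// call; in the executor that holds because conditions are only evaluated
    /// while the state lock is held.
    unsafe fn evaluate_and_fold(&self) -> bool {
        let conditions = &mut *self.0.get();
        conditions.iter_mut().all(|condition| condition())
    }
}
```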
self.ready_systems.set(system_index, false);

// SAFETY: `can_run` returned true, which means that:
// - It must have called `update_archetype_component_access` for each run condition.
Is calling `update_archetype_component_access` really a safety condition? It seems more like it just needs to be done for correctness, i.e. if you don't call it the run conditions won't evaluate correctly, but they also won't access data they're not allowed to.

This doesn't need to be fixed in this PR. There are a bunch of safety comments like this.
`update_archetype_component_access` is used to validate that the world matches, so it needs to be called before `run_unsafe` if you haven't already verified that the world is the same one used to initialize the system.
let system_meta = &self.system_task_metadata[system_index];

#[cfg(feature = "trace")]
let system_span = system_meta.system_task_span.clone();
I wonder if we even need this span anymore. We'd see this task take longer if there was contention on the channel, but would there still be significant overhead here? What does ConcurrentQueue do if there's contention? Will it park the thread like a channel would? We should check this in tracy.
Oh, I see, I made this span less useful by pulling the sending of the completion message out of it. Sorry; I was trying to exclude `tick_executor()`, and I had put those in the same function.

The existing code was doing a non-blocking `try_send` on the channel, which calls `ConcurrentQueue::push` and then does some bookkeeping. So the thread wouldn't have parked before and won't park now. It looks like there is a busy-wait in `push`, though, so that may be slow under contention.

Do you want any changes here? I could take the `tick_executor()` call out of `system_completed()` and then put `system_completed()` back in the tracing span. Or I could remove the span.
I think it's fine to leave as is for now. If we're not seeing much difference here compared to the system run time, we should probably remove it later. I originally added the span because there were sometimes gaps between systems running, so adding it showed what was happening in those gaps.
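For context on the non-blocking send path described above, a tiny sketch (assuming the executor's channel is an unbounded async-channel, which is what the `try_send` / `ConcurrentQueue::push` description implies): `try_send` never parks the sending thread, it either pushes or returns an error.

```rust
fn main() {
    // An unbounded channel: try_send only fails if the channel is closed.
    let (sender, receiver) = async_channel::unbounded::<u32>();
    sender.try_send(7).expect("channel is open and unbounded");
    assert_eq!(receiver.try_recv().unwrap(), 7);
}
```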
&mut self,
scope: &Scope<'_, 'scope, ()>,
context: &Context<'_, '_, '_>,
Stuffing all of this into `Context` is a nice cleanup of the function signature. Even if for some reason we don't merge this PR, we should still do this part.
Co-authored-by: Mike <[email protected]>
Co-authored-by: Mike <[email protected]>
…ied while running the schedule.
…tch the multi-threaded version. (#12073) # Objective `Scope::spawn`, `Scope::spawn_on_external`, and `Scope::spawn_on_scope` have different signatures depending on whether the `multi-threaded` feature is enabled. The single-threaded version has a stricter signature that prevents sending the `Scope` itself to spawned tasks. ## Solution Changed the lifetime constraints in the single-threaded signatures from `'env` to `'scope` to match the multi-threaded version. This was split off from #11906.
Ran the 3d_scene example and saw a somewhat ambiguous result for the Render schedule. The other schedules seemed faster, but rendering is often the bottleneck, so I want to be careful here. I think the regression is just due to there no longer being the weird fast path hump, so we're probably ok here. Tested with …

Looks like I was wrong about it deadlocking if there weren't any compute threads. I added

```rust
.add_plugins(DefaultPlugins.set(TaskPoolPlugin {
    task_pool_options: TaskPoolOptions {
        min_total_threads: 0,
        max_total_threads: 0,
        compute: TaskPoolThreadAssignmentPolicy {
            min_threads: 0,
            max_threads: 0,
            percent: 0.0,
        },
        ..default()
    }
}))
```

and it ran ok. I confirmed with tracy that there weren't any compute threads.
Objective
The multi-threaded executor currently runs in a dedicated task on a single thread. When a system finishes running, it needs to notify that task and wait for the thread to be available and running before the executor can process the completion.
See #8304
Solution
Run the multi-threaded executor at the end of each system task. This allows it to run immediately instead of needing to wait for the main thread to wake up. Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.
While this should be faster in theory, I don't actually know how to measure the performance impact myself.