Next Task Optimization for the Multithreaded Executor #12869
base: main
Conversation
Results with benchmark: #12873
# Objective

Improve performance scalability when adding new event types to a Bevy app. Currently, just using Bevy in the default configuration, all apps spend upwards of 100 µs in the `First` schedule every app tick evaluating whether events should be updated, even if events are not being used that particular frame, and this cost scales with the number of event types registered in the app.

## Solution

`Events::update` is guaranteed `O(1)`: it just checks a resource's value, swaps two Vecs, and clears one of them, so the actual cost of running `event_update_system` is *very* cheap. The overhead of doing system dependency injection, task scheduling, and the multithreaded executor outweighs the cost of running the system by a large margin.

Create an `EventRegistry` resource that keeps a number of function pointers that update each event. Replace the per-event-type `event_update_system` with a single exclusive system that uses the `EventRegistry` to update all events instead. Update `SubApp::add_event` to use `EventRegistry` as well.

## Performance

This reduces the cost of the `First` schedule in both many_foxes and many_cubes by over 80%. Note this is with system spans on. The majority of the remaining cost is now context-switching overhead from launching `time_system`, which should be mostly eliminated with #12869.

![image](https://github.com/bevyengine/bevy/assets/3137680/037624be-21a2-4dc2-a42f-9d0bfa3e9b4a)

The actual `event_update_system` is usually *very* short, taking only a few microseconds on average.

![image](https://github.com/bevyengine/bevy/assets/3137680/01ff1689-3595-49b6-8f09-5c44bcf903e8)

---

## Changelog

TODO

## Migration Guide

TODO

---------

Co-authored-by: Josh Matthews <[email protected]>
Co-authored-by: Alice Cecile <[email protected]>
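For illustration, here is a minimal, self-contained sketch of the "registry of update functions" idea described above. The `World`, `Events`, and `EventRegistry` types below are toy stand-ins, not Bevy's actual API; they only show why a single exclusive pass over stored function pointers is cheap compared to scheduling one system per event type.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Double-buffered event storage: `update` swaps the two buffers and clears
// one of them, which is very cheap per registered event type.
struct Events<E> {
    front: Vec<E>,
    back: Vec<E>,
}

impl<E> Events<E> {
    fn new() -> Self {
        Self { front: Vec::new(), back: Vec::new() }
    }
    fn send(&mut self, event: E) {
        self.front.push(event);
    }
    fn update(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
        self.front.clear();
    }
}

// Toy "world": type-erased storage keyed by TypeId.
#[derive(Default)]
struct World {
    resources: HashMap<TypeId, Box<dyn Any>>,
}

impl World {
    fn insert_events<E: 'static>(&mut self) {
        self.resources
            .insert(TypeId::of::<Events<E>>(), Box::new(Events::<E>::new()));
    }
    fn events_mut<E: 'static>(&mut self) -> &mut Events<E> {
        self.resources
            .get_mut(&TypeId::of::<Events<E>>())
            .and_then(|any| any.downcast_mut::<Events<E>>())
            .expect("event type not registered")
    }
}

// The registry keeps one update function pointer per event type; a single
// exclusive system can walk this list instead of running N separate systems.
#[derive(Default)]
struct EventRegistry {
    updaters: Vec<fn(&mut World)>,
}

impl EventRegistry {
    fn register<E: 'static>(&mut self, world: &mut World) {
        world.insert_events::<E>();
        // A non-capturing closure coerces to a plain `fn(&mut World)` pointer.
        self.updaters
            .push(|world: &mut World| world.events_mut::<E>().update());
    }
    // The single "event update" pass over all registered event types.
    fn update_all(&self, world: &mut World) {
        for update in &self.updaters {
            update(world);
        }
    }
}

fn main() {
    let mut world = World::default();
    let mut registry = EventRegistry::default();
    registry.register::<u32>(&mut world);
    registry.register::<String>(&mut world);

    world.events_mut::<u32>().send(7);
    // One exclusive pass replaces N per-event-type systems.
    registry.update_all(&mut world);
}
```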
I experimented with something similar to this recently, and found a few potential issues when looking at the traces more closely. On my 4-core machine, most of the benefit came from the fact that there were extra threads running. The main and render threads currently sleep while running a schedule, but with this change they will run systems alongside the compute task pool threads. My understanding is that some architectures need to send rendering tasks to the main thread, so running more of the main schedule there could delay rendering (until #9122 is merged). Also, if you have tasks saturating the […]

The other potential issue is that system tasks sometimes get picked up by the scope running a […]
// data stored in the world, that the system has the permission to run, and that the system's
// archetypes are updated.
unsafe { cached_system.run_system(self) };
}
if self.environment.executor.system_completion.is_empty() {
If other threads failed to lock the executor, running the task before looping will delay spawning dependent systems until the cached system is run. For example, if you have `add_systems(Update, ((a, b).chain(), (c, d).chain()))`, then `b` and `d` can run concurrently. But if `c` completes while `a` holds the executor lock, then the thread that completed `a` will run `b` before running the executor to spawn `d`.
You may want to do something like
let mut cached_system = None;
loop {
    // Tick the executor if this thread can take the lock; `tick` may hand
    // back one ready system through `cached_system` for this thread to run.
    if let Ok(mut guard) = self.environment.executor.state.try_lock() {
        guard.tick(self, &mut cached_system);
        drop(guard);
        if !self.environment.executor.system_completion.is_empty() {
            // Completed systems are queued: tick again first so their
            // dependents get spawned before we run the cached system.
            continue;
        }
    }
    // Run the withheld system on this thread, or stop if there is none.
    match cached_system.take() {
        Some(cached_system) => unsafe { cached_system.run_system(self) },
        None => break,
    }
}
Objective
Whenever the multithreaded executor spawns system tasks, it always starts a new task when a system is supposed to run. This presents a potential context switch if the task is stolen by another thread. Even with #11906, it's still pretty common to see systems and the multithreaded executor ping-pong between multiple threads, particularly when systems become serialized due to conflicting accesses or explicit ordering, with each bounce incurring context-switch overhead. This PR aims to address that overhead.
Solution
This PR ensures that at least one thread is running a scheduled system or the multithreaded executor at any given time. This is done by withholding one of the systems slated to run and, instead of spawning a task for it, running it directly inside the same task the multithreaded executor is running in, after releasing the lock on the executor. As long as there is at least one system the executor can run and the task can grab the lock on the multithreaded executor, it will continue to run in a loop without yielding to the async task executor.

In strictly serialized schedules, there should now be close to zero OS-level overhead between running the single-threaded executor and the multithreaded executor.
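As a rough sketch of that "withhold one system and run it inline" loop, here is a hypothetical, self-contained model. `ExecutorState`, `System`, and the `spawn` callback are illustrative stand-ins, not Bevy's real multithreaded executor internals, and a plain `Vec` stands in for the task pool.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct System(&'static str);

impl System {
    fn run(&self) {
        println!("ran {}", self.0);
    }
}

struct ExecutorState {
    // Systems whose dependencies are already satisfied.
    ready_systems: VecDeque<System>,
}

impl ExecutorState {
    // Spawn tasks for every ready system except one, which is handed back to
    // the caller so it can run on the current thread without a task spawn
    // (and therefore without a potential context switch).
    fn tick(&mut self, spawn: &mut dyn FnMut(System)) -> Option<System> {
        let run_inline = self.ready_systems.pop_front();
        while let Some(system) = self.ready_systems.pop_front() {
            spawn(system);
        }
        run_inline
    }
}

fn main() {
    let state = Mutex::new(ExecutorState {
        ready_systems: VecDeque::from([System("a"), System("b"), System("c")]),
    });
    let mut spawned = Vec::new();

    // The executor task loops: tick while it can take the lock, then run the
    // withheld system inline instead of yielding back to the async executor.
    loop {
        let next = match state.try_lock() {
            Ok(mut guard) => guard.tick(&mut |s| spawned.push(s)),
            Err(_) => None, // another thread is currently ticking the executor
        };
        match next {
            Some(system) => system.run(), // "a" runs inline on this thread
            None => break,
        }
    }

    // In Bevy, these would have been picked up by task pool threads.
    for system in &spawned {
        system.run();
    }
}
```

In a strictly serialized schedule, the held-back system is the only ready system each tick, so under this model the loop never spawns a task at all, which is where the claimed reduction in OS-level overhead comes from.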
Performance
The performance difference here is more noticeable in schedules with a large number of serialized or short-lived systems. Tested on `many_foxes` (yellow is main, red is this PR), there is a consistent improvement of about 5% on the main schedule.

On the Render schedule, this yields a more meager 1% improvement on the entire schedule: