Next Task Optimization for the Multithreaded Executor #12869

Open · wants to merge 1 commit into main

Conversation

james7132 (Member)

Objective

Whenever the multithreaded executor schedules systems, it always spawns a new task for each system that is supposed to run. This introduces a potential context switch if the task is stolen by another thread. Even with #11906, it's still pretty common to see systems and the multithreaded executor ping-pong between multiple threads, particularly when systems become serialized due to conflicting accesses or explicit ordering, with each bounce incurring context-switch overhead. This PR aims to address that overhead.

Solution

This PR ensures that at least one thread is running either a scheduled system or the multithreaded executor at any given time. This is done by withholding one of the systems slated to run and, instead of spawning a task for it, running that system directly inside the same task the multithreaded executor is running in, after releasing the lock on the executor. So long as there is at least one system the executor can run, and the task can grab the lock on the multithreaded executor, it will continue to run in a loop without yielding to the async task executor.

In strictly serialized schedules, there should now be close to zero difference in OS-level overhead between running the single-threaded executor and the multithreaded executor.
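
As a rough sketch of the control flow (a toy model for illustration only, not Bevy's actual executor code; `Executor`, the `ready` queue, and the `spawn` callback are stand-ins):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Toy model: a lock-guarded queue of systems that are ready to run.
type System = Box<dyn FnOnce() + Send>;

struct Executor {
    ready: Mutex<VecDeque<System>>,
}

impl Executor {
    /// Runs inside the multithreaded executor's task.
    fn executor_task(&self, spawn: impl Fn(System)) {
        loop {
            // If another thread holds the lock, that thread is driving the
            // executor and this task can simply stop.
            let Ok(mut ready) = self.ready.try_lock() else { return };

            // Withhold one ready system to run inline on this thread...
            let cached_system = ready.pop_front();
            // ...and spawn tasks for the rest, as before this PR.
            while let Some(system) = ready.pop_front() {
                spawn(system);
            }
            // Release the executor lock before running the withheld system.
            drop(ready);

            match cached_system {
                // Running inline avoids a task spawn and the context switch
                // a stolen task can incur.
                Some(system) => system(),
                // Nothing runnable right now: yield back to the task
                // executor (the real executor awaits further completions).
                None => break,
            }
        }
    }
}
```

The key point is that a chain of serialized systems keeps executing on the thread that already holds the executor's state, rather than hopping threads between each system.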

Performance

The performance difference here is more noticeable in schedules with a large number of serialized or short-lived systems. Tested on many_foxes (yellow is main, red is this PR), there is a consistent ~5% performance improvement on the main schedule.

[image: many_foxes main schedule trace comparison]

On the Render schedule, this yields a more meager 1% improvement on the entire schedule:

[image: Render schedule trace comparison]

@james7132 requested review from hymm and maniwani on April 4, 2024 at 07:52
@james7132 added the A-ECS and C-Performance labels on Apr 4, 2024
@re0312 (Contributor) commented Apr 4, 2024:

Results with benchmark #12873:

[image: benchmark results]

github-merge-queue bot pushed a commit that referenced this pull request on Apr 13, 2024:
# Objective
Improve performance scalability when adding new event types to a Bevy
app. Currently, even in the default configuration, apps spend upwards of
100 µs in the `First` schedule every tick just evaluating whether events
should be updated, even if no events are used that frame, and this cost
scales with the number of event types registered in the app.

## Solution
As `Events::update` is guaranteed `O(1)`, just checking a resource's
value, swapping two `Vec`s, and then clearing one of them, the actual
cost of running `event_update_system` is *very* cheap. The overhead of
system dependency injection, task scheduling, and the multithreaded
executor outweighs the cost of running the system by a large margin.
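
As a rough illustration of why that update is `O(1)` (a simplified sketch of the double-buffering, not Bevy's actual `Events<E>` internals):

```rust
// Minimal double-buffered event storage: `update` is just a swap plus a
// clear, independent of how many events were sent.
struct BufferedEvents<E> {
    front: Vec<E>, // events written during the current frame
    back: Vec<E>,  // events written during the previous frame
}

impl<E> BufferedEvents<E> {
    fn update(&mut self) {
        // The previous frame's buffer is now two frames old: clear it and
        // reuse it as the write buffer for the next frame.
        std::mem::swap(&mut self.front, &mut self.back);
        self.front.clear();
    }
}
```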

Create an `EventRegistry` resource that keeps a number of function
pointers that update each event type. Replace the per-event-type
`event_update_system` with a single exclusive system that uses the
`EventRegistry` to update all events instead. Update `SubApp::add_event`
to use `EventRegistry` as well.
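
The shape of that registry, as a hedged sketch (the names `EventRegistrySketch`, `update_events`, and `update_all_events` are illustrative, not the exact API added by this commit):

```rust
use bevy_ecs::prelude::*;

// One type-erased updater per registered event type.
#[derive(Resource, Default)]
struct EventRegistrySketch {
    updaters: Vec<fn(&mut World)>,
}

// Roughly what `SubApp::add_event::<E>()` would register for a concrete
// event type.
fn update_events<E: Event>(world: &mut World) {
    if let Some(mut events) = world.get_resource_mut::<Events<E>>() {
        events.update();
    }
}

// A single exclusive system walks the registry and updates every event
// type, replacing the many generic `event_update_system` instances.
fn update_all_events(world: &mut World) {
    // Copy the fn pointers out so the `&mut World` borrow is free.
    let updaters = world.resource::<EventRegistrySketch>().updaters.clone();
    for update in updaters {
        update(world);
    }
}
```

The per-type work stays the same; what changes is that only one exclusive system is scheduled, so the scheduling overhead no longer scales with the number of registered event types.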

## Performance
This reduces the cost of the `First` schedule in both many_foxes and
many_cubes by over 80%. Note this is with system spans on. The majority
of what remains is now context-switching cost from launching
`time_system`, which should be mostly eliminated with #12869.

![image](https://github.com/bevyengine/bevy/assets/3137680/037624be-21a2-4dc2-a42f-9d0bfa3e9b4a)

The actual `event_update_system` is usually *very* short, using only a
few microseconds on average.

![image](https://github.com/bevyengine/bevy/assets/3137680/01ff1689-3595-49b6-8f09-5c44bcf903e8)

---

## Changelog
TODO

## Migration Guide
TODO

---------

Co-authored-by: Josh Matthews <[email protected]>
Co-authored-by: Alice Cecile <[email protected]>
@chescock (Contributor) commented:

I experimented with something similar to this recently, and found a few potential issues when looking at the traces more closely.

On my 4-core machine, most of the benefit came from the fact that there were extra threads running. The main and render threads currently sleep while running a schedule, but with this change they will run systems alongside the compute task pool threads.

My understanding is that some architectures need to send rendering tasks to the main thread, so running more of the main schedule there could delay rendering (until #9122 is merged).

Also, if you have tasks saturating the Io and AsyncCompute task pools, then keeping the main thread running would give you more threads than cores and cause thrashing. That won't show up in many_foxes since it doesn't do much on those pools, although it might become a problem if #1907 is resolved.

The other potential issue is that system tasks sometimes get picked up by the scope running a par_iter. With this change, system tasks can run much longer and could block the scope for longer. It would be possible for a par_iter in the Render schedule to block while it runs the entire PostUpdate schedule, or vice versa!

An inline review comment was left on these lines of the diff:

// data stored in the world, that the system has the permission to run, and that the system's
// archetypes are updated.
unsafe { cached_system.run_system(self) };
}
if self.environment.executor.system_completion.is_empty() {

If other threads fail to lock the executor, running the cached system before looping back to the executor will delay spawning dependent systems until that system finishes.

For example, if you have `add_systems(Update, ((a, b).chain(), (c, d).chain()))`, then b and d can run concurrently. But if c completes while the thread that ran a holds the executor lock, then that thread will run b before running the executor again to spawn d.

You may want to do something like

let mut cached_system = None;
loop {
    if let Ok(mut guard) = self.environment.executor.state.try_lock() {
        // Tick the executor: spawn newly ready systems and hand one back
        // through `cached_system` to run inline on this thread.
        guard.tick(self, &mut cached_system);
        drop(guard);
        // If completions are still queued, tick again so dependent systems
        // get spawned before this thread commits to running the cached one.
        if !self.environment.executor.system_completion.is_empty() {
            continue;
        }
    }
    match cached_system.take() {
        // SAFETY: as in the diff above, the system has permission to run
        // and its archetypes are updated.
        Some(cached_system) => unsafe { cached_system.run_system(self) },
        None => break,
    }
}

@janhohenheim added the S-Needs-Review label on Jul 17, 2024
@BenjaminBrienen added the S-Waiting-on-Author label and removed the S-Needs-Review label on Oct 31, 2024