Next Task Optimization for the Multithreaded Executor #12869

Open · wants to merge 1 commit into main

Conversation

james7132 (Member)

Objective

Whenever the multithreaded executor schedules systems, it always spawns a new task for each system that is supposed to run. This introduces a potential context switch if the task is stolen by another thread. Even with #11906, it's still pretty common to see systems and the multithreaded executor ping-pong between multiple threads, particularly when systems become serialized due to conflicting accesses or explicit ordering, with each bounce incurring context-switch overhead. This PR aims to address that overhead.

Solution

This PR ensures that at least one thread is running either a scheduled system or the multithreaded executor at any given time. This is done by withholding one of the systems slated to run and, instead of spawning a task for it, running that system directly inside the same task the multithreaded executor is running in, after releasing the lock on the executor. So long as there is at least one system the executor can run, and the task can grab the lock on the multithreaded executor, it will continue to run in a loop without yielding to the async task executor.

In strictly serialized schedules, there should now be close to zero difference in OS-level overhead between running the single-threaded executor and the multithreaded executor.
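
As a rough sketch of the control flow (a toy model for illustration only, not Bevy's actual executor code; `Executor`, the `ready` queue, and the `spawn` callback are stand-ins):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Toy model: a lock-guarded queue of systems that are ready to run.
type System = Box<dyn FnOnce() + Send>;

struct Executor {
    ready: Mutex<VecDeque<System>>,
}

impl Executor {
    /// Runs inside the multithreaded executor's task.
    fn executor_task(&self, spawn: impl Fn(System)) {
        loop {
            // If another thread holds the lock, that thread is driving the
            // executor and this task can simply stop.
            let Ok(mut ready) = self.ready.try_lock() else { return };

            // Withhold one ready system to run inline on this thread...
            let cached_system = ready.pop_front();
            // ...and spawn tasks for the rest, as before this PR.
            while let Some(system) = ready.pop_front() {
                spawn(system);
            }
            // Release the executor lock before running the withheld system.
            drop(ready);

            match cached_system {
                // Running inline avoids a task spawn and the context switch
                // a stolen task can incur.
                Some(system) => system(),
                // Nothing runnable right now: yield back to the task
                // executor (the real executor awaits further completions).
                None => break,
            }
        }
    }
}
```

The key point is that a chain of serialized systems keeps executing on the thread that already holds the executor's state, rather than hopping threads between each system.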

Performance

The performance difference here is more noticeable in schedules with a large number of serialized or short-lived systems. Tested on many_foxes (yellow is main, red is this PR), there is a consistent ~5% performance improvement on the main schedule.

[image: many_foxes main schedule trace comparison]

On the Render schedule, this yields a more meager 1% improvement on the entire schedule:

[image: Render schedule trace comparison]

@james7132 requested review from hymm and maniwani on April 4, 2024 at 07:52
@james7132 added the A-ECS and C-Performance labels on Apr 4, 2024
@re0312 (Contributor) commented Apr 4, 2024:

Results with benchmark #12873:

[image: benchmark results]

github-merge-queue bot pushed a commit that referenced this pull request on Apr 13, 2024:
# Objective
Improve performance scalability when adding new event types to a Bevy
app. Currently, even in the default configuration, apps spend upwards of
100 µs in the `First` schedule every tick just evaluating whether events
should be updated, even if no events are used that frame, and this cost
scales with the number of event types registered in the app.

## Solution
As `Events::update` is guaranteed `O(1)`, just checking a resource's
value, swapping two `Vec`s, and then clearing one of them, the actual
cost of running `event_update_system` is *very* cheap. The overhead of
system dependency injection, task scheduling, and the multithreaded
executor outweighs the cost of running the system by a large margin.
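
As a rough illustration of why that update is `O(1)` (a simplified sketch of the double-buffering, not Bevy's actual `Events<E>` internals):

```rust
// Minimal double-buffered event storage: `update` is just a swap plus a
// clear, independent of how many events were sent.
struct BufferedEvents<E> {
    front: Vec<E>, // events written during the current frame
    back: Vec<E>,  // events written during the previous frame
}

impl<E> BufferedEvents<E> {
    fn update(&mut self) {
        // The previous frame's buffer is now two frames old: clear it and
        // reuse it as the write buffer for the next frame.
        std::mem::swap(&mut self.front, &mut self.back);
        self.front.clear();
    }
}
```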

Create an `EventRegistry` resource that keeps a number of function
pointers that update each event type. Replace the per-event-type
`event_update_system` with a single exclusive system that uses the
`EventRegistry` to update all events instead. Update `SubApp::add_event`
to use `EventRegistry` as well.
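
The shape of that registry, as a hedged sketch (the names `EventRegistrySketch`, `update_events`, and `update_all_events` are illustrative, not the exact API added by this commit):

```rust
use bevy_ecs::prelude::*;

// One type-erased updater per registered event type.
#[derive(Resource, Default)]
struct EventRegistrySketch {
    updaters: Vec<fn(&mut World)>,
}

// Roughly what `SubApp::add_event::<E>()` would register for a concrete
// event type.
fn update_events<E: Event>(world: &mut World) {
    if let Some(mut events) = world.get_resource_mut::<Events<E>>() {
        events.update();
    }
}

// A single exclusive system walks the registry and updates every event
// type, replacing the many generic `event_update_system` instances.
fn update_all_events(world: &mut World) {
    // Copy the fn pointers out so the `&mut World` borrow is free.
    let updaters = world.resource::<EventRegistrySketch>().updaters.clone();
    for update in updaters {
        update(world);
    }
}
```

The per-type work stays the same; what changes is that only one exclusive system is scheduled, so the scheduling overhead no longer scales with the number of registered event types.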

## Performance
This reduces the cost of the `First` schedule in both many_foxes and
many_cubes by over 80%. Note this is with system spans on. The majority
of what remains is now context-switching cost from launching
`time_system`, which should be mostly eliminated with #12869.

![image](https://github.com/bevyengine/bevy/assets/3137680/037624be-21a2-4dc2-a42f-9d0bfa3e9b4a)

The actual `event_update_system` is usually *very* short, using only a
few microseconds on average.

![image](https://github.com/bevyengine/bevy/assets/3137680/01ff1689-3595-49b6-8f09-5c44bcf903e8)

---

## Changelog
TODO

## Migration Guide
TODO

---------

Co-authored-by: Josh Matthews <[email protected]>
Co-authored-by: Alice Cecile <[email protected]>
@chescock (Contributor) commented:

I experimented with something similar to this recently, and found a few potential issues when looking at the traces more closely.

On my 4-core machine, most of the benefit came from the fact that there were extra threads running. The main and render threads currently sleep while running a schedule, but with this change they will run systems alongside the compute task pool threads.

My understanding is that some architectures need to send rendering tasks to the main thread, so running more of the main schedule there could delay rendering (until #9122 is merged).

Also, if you have tasks saturating the Io and AsyncCompute task pools, then keeping the main thread running would give you more threads than cores and cause thrashing. That won't show up in many_foxes since it doesn't do much on those pools, although it might become a problem if #1907 is resolved.

The other potential issue is that system tasks sometimes get picked up by the scope running a par_iter. With this change, system tasks can run much longer and could block the scope for longer. It would be possible for a par_iter in the Render schedule to block while it runs the entire PostUpdate schedule, or vice versa!

An inline review comment was left on these lines of the diff:

// data stored in the world, that the system has the permission to run, and that the system's
// archetypes are updated.
unsafe { cached_system.run_system(self) };
}
if self.environment.executor.system_completion.is_empty() {

If other threads fail to lock the executor, running the cached system before looping back to the executor will delay spawning dependent systems until that system finishes.

For example, if you have `add_systems(Update, ((a, b).chain(), (c, d).chain()))`, then b and d can run concurrently. But if c completes while the thread that ran a holds the executor lock, then that thread will run b before running the executor again to spawn d.

You may want to do something like

let mut cached_system = None;
loop {
    if let Ok(mut guard) = self.environment.executor.state.try_lock() {
        // Tick the executor: spawn newly ready systems and hand one back
        // through `cached_system` to run inline on this thread.
        guard.tick(self, &mut cached_system);
        drop(guard);
        // If completions are still queued, tick again so dependent systems
        // get spawned before this thread commits to running the cached one.
        if !self.environment.executor.system_completion.is_empty() {
            continue;
        }
    }
    match cached_system.take() {
        // SAFETY: as in the diff above, the system has permission to run
        // and its archetypes are updated.
        Some(cached_system) => unsafe { cached_system.run_system(self) },
        None => break,
    }
}

@janhohenheim added the S-Needs-Review label on Jul 17, 2024
@BenjaminBrienen added the S-Waiting-on-Author label and removed the S-Needs-Review label on Oct 31, 2024