FD ring buffer is stalling too often #15221
Comments
I think it's behaving correctly. We start with 4 launch message buffer spaces, but had to go down to queuing at most 3 commands because the watcher looks at the previously executed ring buffer space. We also behave differently when first caching commands: I didn't modify assemble_stall_commands to take in the sync_count, so it always syncs to expected_num_workers_completed. We could change that if you want, but it's probably not that performance critical. In this example this->expected_num_workers_completed is the number of workers completed before this command (if you sync on that value, you sync on the previous command). If I modify the code to print out this->expected_num_workers_completed + num_workers, I get this:
stall: 0 0 56 0 is_cached 0
The first program doesn't stall. We could tweak assemble_stall_commands to take a sync count if you want; it's a bit cleaner, but I wanted to avoid unnecessary changes in my original patch. Making the watcher not look at the previously executed command is probably more difficult.
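(A minimal sketch of the suggested tweak, using simplified, hypothetical types; the real assemble_stall_commands in EnqueueProgramCommand has a different signature. The only point is that the wait value becomes an explicit sync_count parameter instead of always reading this->expected_num_workers_completed.)

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a dispatch wait command.
struct WaitCommand {
    uint32_t wait_count;  // dispatcher waits until this many workers have completed
};

// Sketch: take the sync point explicitly so callers can pass either
// expected_num_workers_completed (sync on the previous command) or
// expected_num_workers_completed + num_workers (sync on this command).
std::vector<WaitCommand> assemble_stall_commands(uint32_t sync_count) {
    return {WaitCommand{sync_count}};
}
```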
Hmm, I wonder why what you got is one step further back than what I got.
I've got a tracy trace now (of a run with tracing enabled); it looks like there's a
[tracy trace screenshot]
so there is more of a gap after the first fast kernel; I'll have to see why.
It looks like it takes around 535 ns (or cycles) from when dispatcher_s starts sending the go to when the worker receives it. It's also around 500 ns from when the active worker sends its completion increment to when dispatcher_s receives it; after the slow kernel that drops to 236 ns, which suggests the rest of the time is congestion from all of the idle workers sending their acks at the same time. The active worker takes around 650 ns from go received to ack, while idle workers take around 510 ns. So most of the time (1035 ns) is spent in the round trip from the worker to dispatcher_s and back again. We should be able to hide around 500 ns if we preload the next kernel. We should probably also make the noc_fast_atomic_increment at completion posted, to avoid extra traffic and save time there.
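(A sketch of what a posted completion increment looks like on the worker side; the stub below uses assumed, simplified parameters rather than the real noc_fast_atomic_increment signature in tt-metal. The key property is that a posted NoC transaction does not request an acknowledgement, and since the dispatcher only polls the incremented semaphore and never waits on those acks, dropping them removes the burst of return traffic when all workers finish at once.)

```cpp
#include <cstdint>

// Illustrative stub only; the real NoC API takes more parameters (noc index,
// command buffer, virtual channel, wrap, linked, ...).
void noc_fast_atomic_increment(uint64_t addr, uint32_t incr, bool posted) {
    // Hardware would issue an atomic increment to `addr`; when `posted` is
    // true, no acknowledgement is routed back to the sender.
}

void notify_dispatcher_of_completion(uint64_t dispatch_sem_addr) {
    // Posted increment: dispatcher_s still sees the semaphore go up, but no
    // ack is generated, avoiding contention when every worker replies at once.
    noc_fast_atomic_increment(dispatch_sem_addr, /*incr=*/1, /*posted=*/true);
}
```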
Just tested switching noc_fast_atomic_increment to posted, and it seems to reduce the time for the dispatcher to receive the message to around 250 ns, and reduces the time per fast kernel from 1700 ns to 1450 ns.
When you measure the total duration of dispatch with 1, 2, 3, 4, 5 shadow kernels, do you see a knee from 3 to 4? When I ran it, the first was ~free and there was no other knee, with each additional kernel taking ~6K cycles IIRC. Is that not an issue at present?
We never wait on the acks from these completion messages, so make them posted to avoid contention from a lot of replies being sent at once. In the case where every worker is sending them at the same time, this can halve the latency from 500 ns to 250 ns (on wormhole).
The data in this sheet (Dispatch~0 tab) shows we aren't queuing up kernels as expected with low-latency dispatch: https://docs.google.com/spreadsheets/d/1nN1ZPUNc2f0d6YEenBOHS7BdkpGlBysZ6SCArinGOrE/edit?pli=1&gid=84505338#gid=84505338
Running this command:
build/test/tt_metal/perf_microbenchmark/dispatch/test_pgm_dispatch -w 0 -i 1 -s 1024 -rs 40000 -nf 9
And putting a print at the top of EnqueueProgramCommand::process() after calculating the stall values like this:
fprintf(stderr, "stall: %d %d %d %d\n", stall_first, stall_before_program, this->expected_num_workers_completed, sync_count);
Shows:
Notably, the stall is running just one kernel ahead (as if the launch message buffer were sized to 2) after the first 4 kernels.
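(A toy model of the expected stall schedule, not tt-metal code; the names and the exact stall policy here are simplifying assumptions. With a 4-entry launch message ring buffer and the watcher still reading the previously executed entry, at most 3 programs should be in flight, so enqueueing program i should only need to wait on program i-3, rather than waiting on the immediately preceding kernel as the data shows.)

```cpp
#include <cstdio>

int main() {
    constexpr int launch_msg_buffer_entries = 4;
    constexpr int max_in_flight = launch_msg_buffer_entries - 1;  // watcher holds one slot
    constexpr int num_programs = 9;  // e.g. test_pgm_dispatch with -nf 9

    for (int i = 0; i < num_programs; i++) {
        if (i < max_in_flight) {
            printf("program %d: no stall\n", i);
        } else {
            // Expected: wait only for the program 3 slots back, not for i-1.
            printf("program %d: stall until program %d completes\n", i, i - max_in_flight);
        }
    }
    return 0;
}
```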