FD ring buffer is stalling too often #15221

Open
pgkeller opened this issue Nov 19, 2024 · 6 comments
@pgkeller
Contributor

The data on this sheet (Dispatch~0 tab) shows we aren't queuing up kernels as expected with low-latency dispatch: https://docs.google.com/spreadsheets/d/1nN1ZPUNc2f0d6YEenBOHS7BdkpGlBysZ6SCArinGOrE/edit?pli=1&gid=84505338#gid=84505338

Running this command:

```
build/test/tt_metal/perf_microbenchmark/dispatch/test_pgm_dispatch -w 0 -i 1 -s 1024 -rs 40000 -nf 9
```

And putting a print at the top of EnqueueProgramCommand::process() after calculating the stall values like this:

```cpp
fprintf(stderr, "stall: %d %d %d %d\n", stall_first, stall_before_program, this->expected_num_workers_completed, sync_count);
```

Shows:

```
stall: 0 0 0 0
stall: 0 0 56 0
stall: 0 0 112 0
stall: 0 0 168 0
stall: 1 0 224 112
stall: 1 0 280 168
stall: 1 0 336 224
stall: 1 0 392 280
stall: 1 0 448 336
stall: 1 0 504 392
```

Notably, after the first 4 kernels the stall only runs one kernel ahead (as if the launch message buffer were sized to 2).
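
For reference, the gap between expected_num_workers_completed and sync_count in the stalling rows above is a constant 112, i.e. two programs' worth of workers. A quick check of that arithmetic (a sketch only; the 56-workers-per-program figure is inferred from the +56 steps in the printed counts, not confirmed elsewhere):

```cpp
// Quick arithmetic check on the stalling rows above (numbers copied from the
// output; 56 workers per program is an inference from the printed counts).
#include <cstdio>

int main() {
    const int workers_per_program = 56;
    const int expected[] = {224, 280, 336, 392, 448, 504};  // before the command
    const int sync[]     = {112, 168, 224, 280, 336, 392};  // value stalled on
    for (int i = 0; i < 6; i++) {
        // sync trails expected by exactly 112 = 2 * 56, i.e. each stalling
        // command waits on the completion of the program two behind it.
        printf("expected=%3d sync=%3d -> waiting on program %d\n",
               expected[i], sync[i], sync[i] / workers_per_program);
    }
    return 0;
}
```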

@jbaumanTT
Contributor

I think it's behaving correctly. We start with 4 launch message buffer spaces, but we had to drop to queuing at most 3 commands because the watcher looks at the previously executed ring buffer slot. We also behave differently when commands are first being cached; I didn't modify assemble_stall_commands to take in the sync_count, so it always syncs to expected_num_workers_completed. We could change that if you want, but it's probably not that performance critical.

In this example this->expected_num_workers_completed is the number of workers completed before this command (if you sync on that value, you sync on the previous command). If I modify the code to print out this->expected_num_workers_completed + num_workers (and whether the command is cached), I get this:

```
stall: 0 0 56 0 is_cached 0
stall: 0 0 112 0 is_cached 0
stall: 0 0 168 0 is_cached 1
stall: 0 0 224 0 is_cached 1
stall: 1 0 280 112 is_cached 1
stall: 1 0 336 168 is_cached 1
stall: 1 0 392 224 is_cached 1
stall: 1 0 448 280 is_cached 1
stall: 1 0 504 336 is_cached 1
stall: 1 0 560 392 is_cached 1
```
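
The modified print would look something like this (a sketch of the one-line change; the names used for the per-command worker count and the cached flag are assumptions, the real code in EnqueueProgramCommand::process() may differ):

```cpp
// Sketch of the modified print used for the output above. num_workers and
// is_cached are placeholder names for the per-command worker count and the
// cached flag; the actual variable names may differ.
fprintf(stderr, "stall: %d %d %d %d is_cached %d\n",
        stall_first, stall_before_program,
        this->expected_num_workers_completed + num_workers,
        sync_count, is_cached);
```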

  • The first program doesn't stall.
  • The second program stalls (on 56) because it's not cached.
  • Programs 3 and 4 don't stall.
  • Program 5 stalls on 112, the completion count of the second command. When it starts, messages 3, 4, and 5 are in the launch message buffer, so the buffer is full at that point.
  • It continues on like that.

We could tweak assemble_stall_commands to take a sync count if you want; it's a bit cleaner, but I wanted to avoid unnecessary changes in my original patch. Making the watcher not look at the previously executed command is probably more difficult.
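
A small standalone sketch of the steady-state stall rule described above (the 4-entry buffer size, the slot effectively reserved for the watcher's view of the previously executed command, and the 56-workers-per-program figure all come from this thread; the names are made up for illustration):

```cpp
// Standalone sketch of the steady-state stall rule: with a 4-entry launch
// message ring buffer and one slot effectively reserved so the watcher can
// still inspect the previously executed command, at most 3 programs are
// outstanding, so program k waits for program k-3 to complete.
// (The first few programs behave differently while commands are first being
// cached, so this only models the steady state from program 5 onward.)
#include <cstdio>

int main() {
    const int launch_msg_entries  = 4;                       // current size, per the discussion
    const int max_outstanding     = launch_msg_entries - 1;  // one slot kept for the watcher
    const int workers_per_program = 56;                      // inferred from the printed counts

    for (int k = 5; k <= 10; k++) {
        int sync_count = (k - max_outstanding) * workers_per_program;
        int expected   = k * workers_per_program;  // expected_num_workers_completed + num_workers
        printf("program %2d: expected=%3d sync=%3d\n", k, expected, sync_count);
    }
    return 0;
}
```

Under this rule, growing the buffer to 8 entries (as suggested below) would allow up to 7 outstanding programs instead of 3.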

@pgkeller
Contributor Author

Hmm, I wonder why what you got is one further back than what I got.
The data I gathered in the spreadsheet showed that we get one fast kernel after a dispatch, which isn't adequate, so something odd is still going on.
If changing the watcher is hard, we could address this by making the launch_message ring buffer bigger, say 8 entries. We definitely need to queue at least ~3 into the future for resnet.
I'll take a look at my test/data again and make sure nothing is going wonky there.

@jbaumanTT
Contributor

I've got a Tracy trace now (of a run with tracing enabled); it looks like there's a:

  • gap of 574 ns after the slow kernel
  • 937 ns kernel invocation (all fast kernels take about this long)
  • gap of 735 ns after that kernel
  • 709 ns gap after the next kernel
  • 755 ns gap after the next kernel
  • 700 ns gap after the next kernel
  • 1.4 us gap after the next kernel
  • more 1.4 us gaps

so there is more of a gap after the first fast kernel; I'll have to see why.

@jbaumanTT
Contributor

It looks like it takes around 535 ns (or cycles) from when dispatcher_s starts sending the go to when the worker receives it. It's also around 500 ns from when the active worker sends its completion increment to when dispatcher_s receives it; after the slow kernel that drops to 236 ns, which suggests that the rest of the time is congestion from all of the idle workers sending their acks at the same time. The active worker takes around 650 ns from go received to ack, while idle workers take around 510 ns.

So most of the time (1035 ns) is spent in the round trip from the worker to dispatcher_s and back again. We should be able to hide around 500 ns if we preload the next kernel. We should probably also make the noc_fast_atomic_increment at completion be posted to avoid extra traffic and save time there.
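
Putting those numbers together (a rough budget only; all figures are the approximate measurements quoted above):

```cpp
// Rough per-fast-kernel messaging budget from the measurements above
// (wormhole, all numbers approximate).
#include <cstdio>

int main() {
    const int go_delivery_ns              = 535;  // dispatcher_s sends go -> worker receives it
    const int completion_ns               = 500;  // completion increment -> dispatcher_s, with ack contention
    const int completion_no_contention_ns = 236;  // same path right after the slow kernel

    printf("round trip: ~%d ns\n", go_delivery_ns + completion_ns);                 // ~1035 ns
    printf("preloading the next kernel could hide ~%d ns (the go leg)\n",
           go_delivery_ns);
    printf("a posted completion increment could save ~%d ns of ack contention\n",
           completion_ns - completion_no_contention_ns);
    return 0;
}
```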

@jbaumanTT
Contributor

Just tested switching noc_fast_atomic_increment to posted; it seems to reduce the time for the dispatcher to receive the message to roughly 250 ns, and reduces the time per fast kernel from 1700 ns to 1450 ns.

@pgkeller
Contributor Author

When you measure the total duration of dispatch with 1, 2, 3, 4, 5 shadow kernels, do you see a knee from 3 to 4? When I ran it, the first was ~free and there was no other knee, with each additional kernel taking ~6K cycles IIRC. Is that not an issue at present?

jbaumanTT added a commit that referenced this issue Dec 3, 2024
We never wait on the acks from these completion messages, so make them posted
to avoid contention from a lot of replies being sent at once. In the case where
every worker is sending them at the same time, this can halve the latency from
500ns to 250ns (on wormhole).
jbaumanTT added a commit that referenced this issue Dec 19, 2024
We never wait on the acks from these completion messages, so make them posted
to avoid contention from a lot of replies being sent at once. In the case where
every worker is sending them at the same time, this can halve the latency from
500ns to 250ns (on wormhole).