FD ring buffer is stalling too often #15221

Open
pgkeller opened this issue Nov 19, 2024 · 6 comments
@pgkeller
Contributor

The data on this sheet (Dispatch~0 tab) shows we aren't queuing up kernels as expected with low-latency dispatch: https://docs.google.com/spreadsheets/d/1nN1ZPUNc2f0d6YEenBOHS7BdkpGlBysZ6SCArinGOrE/edit?pli=1&gid=84505338#gid=84505338

Running this command:

```
build/test/tt_metal/perf_microbenchmark/dispatch/test_pgm_dispatch -w 0 -i 1 -s 1024 -rs 40000 -nf 9
```

And putting a print at the top of EnqueueProgramCommand::process() after calculating the stall values like this:

```cpp
fprintf(stderr, "stall: %d %d %d %d\n", stall_first, stall_before_program, this->expected_num_workers_completed, sync_count);
```

Shows:

```
stall: 0 0 0 0
stall: 0 0 56 0
stall: 0 0 112 0
stall: 0 0 168 0
stall: 1 0 224 112
stall: 1 0 280 168
stall: 1 0 336 224
stall: 1 0 392 280
stall: 1 0 448 336
stall: 1 0 504 392
```

Notably, after the first 4 kernels the stall only runs one kernel ahead (as if the launch message buffer were sized to 2).
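
For reference, the gap between expected_num_workers_completed and sync_count in the stalling rows above is a constant 112, i.e. two programs' worth of workers. A quick check of that arithmetic (a sketch only; the 56-workers-per-program figure is inferred from the +56 steps in the printed counts, not confirmed elsewhere):

```cpp
// Quick arithmetic check on the stalling rows above (numbers copied from the
// output; 56 workers per program is an inference from the printed counts).
#include <cstdio>

int main() {
    const int workers_per_program = 56;
    const int expected[] = {224, 280, 336, 392, 448, 504};  // before the command
    const int sync[]     = {112, 168, 224, 280, 336, 392};  // value stalled on
    for (int i = 0; i < 6; i++) {
        // sync trails expected by exactly 112 = 2 * 56, i.e. each stalling
        // command waits on the completion of the program two behind it.
        printf("expected=%3d sync=%3d -> waiting on program %d\n",
               expected[i], sync[i], sync[i] / workers_per_program);
    }
    return 0;
}
```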

@jbaumanTT
Contributor

I think it's behaving correctly. We start with 4 launch message buffer spaces, but we had to drop to queuing at most 3 commands because the watcher looks at the previously executed ring buffer slot. We also behave differently when commands are first being cached; I didn't modify assemble_stall_commands to take in the sync_count, so it always syncs to expected_num_workers_completed. We could change that if you want, but it's probably not that performance critical.

In this example this->expected_num_workers_completed is the number of workers completed before this command (if you sync on that value, you sync on the previous command). If I modify the code to print out this->expected_num_workers_completed + num_workers (and whether the command is cached), I get this:

```
stall: 0 0 56 0 is_cached 0
stall: 0 0 112 0 is_cached 0
stall: 0 0 168 0 is_cached 1
stall: 0 0 224 0 is_cached 1
stall: 1 0 280 112 is_cached 1
stall: 1 0 336 168 is_cached 1
stall: 1 0 392 224 is_cached 1
stall: 1 0 448 280 is_cached 1
stall: 1 0 504 336 is_cached 1
stall: 1 0 560 392 is_cached 1
```
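
The modified print would look something like this (a sketch of the one-line change; the names used for the per-command worker count and the cached flag are assumptions, the real code in EnqueueProgramCommand::process() may differ):

```cpp
// Sketch of the modified print used for the output above. num_workers and
// is_cached are placeholder names for the per-command worker count and the
// cached flag; the actual variable names may differ.
fprintf(stderr, "stall: %d %d %d %d is_cached %d\n",
        stall_first, stall_before_program,
        this->expected_num_workers_completed + num_workers,
        sync_count, is_cached);
```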

  • The first program doesn't stall.
  • The second program stalls (on 56) because it's not cached.
  • Programs 3 and 4 don't stall.
  • Program 5 stalls on 112, the completion count of the second command. When it starts, messages 3, 4, and 5 are in the launch message buffer, so the buffer is full at that point.
  • It continues on like that.

We could tweak assemble_stall_commands to take a sync count if you want; it's a bit cleaner, but I wanted to avoid unnecessary changes in my original patch. Making the watcher not look at the previously executed command is probably more difficult.
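
A small standalone sketch of the steady-state stall rule described above (the 4-entry buffer size, the slot effectively reserved for the watcher's view of the previously executed command, and the 56-workers-per-program figure all come from this thread; the names are made up for illustration):

```cpp
// Standalone sketch of the steady-state stall rule: with a 4-entry launch
// message ring buffer and one slot effectively reserved so the watcher can
// still inspect the previously executed command, at most 3 programs are
// outstanding, so program k waits for program k-3 to complete.
// (The first few programs behave differently while commands are first being
// cached, so this only models the steady state from program 5 onward.)
#include <cstdio>

int main() {
    const int launch_msg_entries  = 4;                       // current size, per the discussion
    const int max_outstanding     = launch_msg_entries - 1;  // one slot kept for the watcher
    const int workers_per_program = 56;                      // inferred from the printed counts

    for (int k = 5; k <= 10; k++) {
        int sync_count = (k - max_outstanding) * workers_per_program;
        int expected   = k * workers_per_program;  // expected_num_workers_completed + num_workers
        printf("program %2d: expected=%3d sync=%3d\n", k, expected, sync_count);
    }
    return 0;
}
```

Under this rule, growing the buffer to 8 entries (as suggested below) would allow up to 7 outstanding programs instead of 3.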

@pgkeller
Contributor Author

Hmm, I wonder why what you got is one further back than what I got.
The data I gathered in the spreadsheet showed that we get one fast kernel after a dispatch, which isn't adequate, so something odd is still going on.
If changing the watcher is hard, we could address this by making the launch_message ring buffer bigger, say 8 entries. We definitely need to queue at least ~3 into the future for resnet.
I'll take a look at my test/data again and make sure nothing is going wonky there.

@jbaumanTT
Contributor

I've got a Tracy trace now (of a run with tracing enabled); it looks like there's a:

  • gap of 574 ns after the slow kernel
  • 937 ns kernel invocation (all fast kernels take about this long)
  • gap of 735 ns after that kernel
  • 709 ns gap after the next kernel
  • 755 ns gap after the next kernel
  • 700 ns gap after the next kernel
  • 1.4 us gap after the next kernel
  • more 1.4 us gaps

so there is more of a gap after the first fast kernel; I'll have to see why.

@jbaumanTT
Contributor

It looks like it takes around 535 ns (or cycles) from when dispatcher_s starts sending the go to when the worker receives it. It's also around 500 ns from when the active worker sends its completion increment to when dispatcher_s receives it; after the slow kernel that drops to 236 ns, which suggests that the rest of the time is congestion from all of the idle workers sending their acks at the same time. The active worker takes around 650 ns from go received to ack, while idle workers take around 510 ns.

So most of the time (1035 ns) is spent in the round trip from the worker to dispatcher_s and back again. We should be able to hide around 500 ns if we preload the next kernel. We should probably also make the noc_fast_atomic_increment at completion be posted to avoid extra traffic and save time there.
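
Putting those numbers together (a rough budget only; all figures are the approximate measurements quoted above):

```cpp
// Rough per-fast-kernel messaging budget from the measurements above
// (wormhole, all numbers approximate).
#include <cstdio>

int main() {
    const int go_delivery_ns              = 535;  // dispatcher_s sends go -> worker receives it
    const int completion_ns               = 500;  // completion increment -> dispatcher_s, with ack contention
    const int completion_no_contention_ns = 236;  // same path right after the slow kernel

    printf("round trip: ~%d ns\n", go_delivery_ns + completion_ns);                 // ~1035 ns
    printf("preloading the next kernel could hide ~%d ns (the go leg)\n",
           go_delivery_ns);
    printf("a posted completion increment could save ~%d ns of ack contention\n",
           completion_ns - completion_no_contention_ns);
    return 0;
}
```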

@jbaumanTT
Contributor

Just tested switching noc_fast_atomic_increment to posted; it seems to reduce the time for the dispatcher to receive the message to roughly 250 ns, and reduces the time per fast kernel from 1700 ns to 1450 ns.

@pgkeller
Contributor Author

When you measure the total duration of dispatch with 1, 2, 3, 4, 5 shadow kernels, do you see a knee from 3 to 4? When I ran it, the first was ~free and there was no other knee, with each additional kernel taking ~6K cycles IIRC. Is that not an issue at present?

jbaumanTT added a commit that referenced this issue Dec 3, 2024
We never wait on the acks from these completion messages, so make them posted
to avoid contention from a lot of replies being sent at once. In the case where
every worker is sending them at the same time, this can halve the latency from
500ns to 250ns (on wormhole).
jbaumanTT added a commit that referenced this issue Dec 19, 2024
We never wait on the acks from these completion messages, so make them posted
to avoid contention from a lot of replies being sent at once. In the case where
every worker is sending them at the same time, this can halve the latency from
500ns to 250ns (on wormhole).