Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack kernel binary memory spans into one #12977

Merged
merged 4 commits into from
Sep 29, 2024
Merged

pack kernel binary memory spans into one #12977

merged 4 commits into from
Sep 29, 2024

Conversation

pgkeller
Copy link
Contributor

Ticket

#11789

Problem description

Kernel binaries are parsed into multiple memory address ranges (spans) and sent to the device individually which is inefficient. Also, the data section ends up at a magic address which is used prior to copying the data to local memory.
The kernel ring buffer needs the kernel in one blob in the ring buffer, not at multiple locations. This PR merges the spans into one in prep for the ring buffer.

What's changed

NOTE: slow dispatch doesn't check for kernel text+data overflowing
the firmware hole size. This will be fixed w/ the ring buffer

Split linker script builds into 2, fw and kernel
Bump firmware hole size up by local sizes, kernel data now goes here
Create scratch area for firmware locals, used only during init
Check fw_size+kernel_size+kernel_local_size against max at runtime since now we
could overflow after compile succeeds (only done for FD, will fix w/ ring buffer for SD)
Use symbol for kernel data address, const for firmware data address
Pack data span into text span

pgm_dispatch perf gains of 5-30%

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@pgkeller pgkeller requested a review from nathan-TT September 22, 2024 20:37
@@ -191,15 +191,15 @@ int main(int argc, char **argv) {
JitBuildProcessorType::DATA_MOVEMENT,
0,
get_latest_kernel_binary_path(mask, riscv0_kernel));
ll_api::memory brisc_binary = llrt::get_risc_binary(brisc_hex_path);
ll_api::memory brisc_binary = llrt::get_risc_binary(brisc_hex_path, 0, true);
Copy link
Member

@ayerofieiev-tt ayerofieiev-tt Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from api standpoint I recommend to replace bool arg to an enum, as it makes semantics of this true clear.
https://github.com/tenstorrent/tt-metal/blob/main/best_practices.md#14-avoid-bool-arguments-in-apis

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integer constant should be an enum too, AFAICT it's an 'id'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change the bool. The riscv ID is entrenched throughtout the code base w/ many different definitions. It will get cleaned up when we start using the HAL to abstract this away.

#define MEM_IERISC_MAILBOX_SIZE 3072
#define MEM_IERISC_MAILBOX_END (MEM_IERISC_MAILBOX_BASE + MEM_IERISC_MAILBOX_SIZE)
// TODO: this should be tightly packed against MAILBOX_END but that causes
// problems w/ back to back runs, leaving the padding for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding TODO comment @ line 104, 109.
That is because there is data in 4K block @ 0x1000 that syseng tools read back when creating ethernet map.
idle erisc code has to start at 0x2000

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, is that for Blackhole? Then I am not sure. Maybe syseng has the same assumptions/requirements for BH.
What I said above is for WH.

#define MEM_IERISC_MAILBOX_SIZE 3072
#define MEM_IERISC_MAILBOX_END (MEM_IERISC_MAILBOX_BASE + MEM_IERISC_MAILBOX_SIZE)
// TODO: this should be tightly packed against MAILBOX_END but that causes
// problems w/ back to back runs, leaving the padding for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments above in blackhole/dev_mem_map.h.

@@ -307,6 +328,7 @@ void wait_until_cores_done(
// Print not-done cores
if (loop_count % 1000 == 0) {
log_debug(tt::LogMetal, "Not done phys cores: {}", fmt::join(not_done_phys_cores, " "));
usleep(100000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sleep intended?

NOTE: slow dispatch doesn't check for kernel text+data overflowing
the firmware hole size.  This will be fixed w/ the ring buffer

Split linker script builds into 2, fw and kernel
Bump firmware hole size up by local sizes, kernel data now goes here
Create scratch area for firmware locals, used only during init
Check fw_size+kernel_size+kernel_local_size against max at runtime since now we
could overflow after compile succeeds
Use symbol for kernel data address, const for firmware data address
Pack data span into text span
Some bad values just happened to work, sizes now line up
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants