Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UMD] Removed set_*_params calls and constants #15908

Merged
merged 3 commits into from
Dec 17, 2024
Merged

Conversation

broskoTT
Copy link
Contributor

@broskoTT broskoTT commented Dec 11, 2024

Ticket

Related to #13948

Problem description

Related to UMD change: tenstorrent/tt-umd#395
These parameters are defined by hardware.
The only parameters to be defined by the client are membars. This change adds set_barrier_address_params accordingly.

What's changed

Removed set_device_dram_address_params and set_device_l1_address_params, and related constants.
Replace with call to set_barrier_address_params.

Checklist

@@ -36,7 +36,6 @@ HalCoreInfoType create_idle_eth_mem_map() {

mem_map_bases.resize(static_cast<std::size_t>(HalL1MemAddrType::COUNT));
mem_map_bases[static_cast<std::size_t>(HalL1MemAddrType::BASE)] = MEM_ETH_BASE;
mem_map_bases[static_cast<std::size_t>(HalL1MemAddrType::BARRIER)] = MEM_L1_BARRIER;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pgkeller why is this 12 in tt_metal, but here it takes some other value? Is this barrier a fixed address, or can it be chosen arbitrarily?
static constexpr std::uint32_t L1_BARRIER_BASE = 0x16dfc0;
https://github.com/tenstorrent/tt-umd/blob/main/src/firmware/riscv/blackhole/l1_address_map.h#L45

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using 0x16dfc0 in metal would cause problems, that could potentially overwrite tensor data
metal puts all its reserved addresses low in memory for tensix (eth is a little different due to the base firmware)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to add, the barrier doesn't have to be at a fixed address. It is up to the memory map definers to allocate space for it

Copy link
Contributor Author

@broskoTT broskoTT Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so there are two sets of parameters:

  • Parameters which are constant for our hardware, and UMD's client doesn't have freedom to choose it, but has to use that one value.
  • Parameters which can freely be chosen by the user, like how l1 memory is layed out.

I'd obviously want to remove the first category but not the latter. If l1_barrier can be chosen, I'll change this PR.
There is probably also a third category, something you might want to change theoretically, but some value is highly suggested, or it would require a power user to make sure our device works properly.

So I obviously need help with this :)
Can you comment on which category do these parameters fall into?
The tt_driver_host_address_params was already removed (just want to recheck)
The tt_driver_eth_interface_params was already removed (just want to recheck)
Fw_Base was removed from l1_address_params because it wasn't used in UMD.

struct tt_device_dram_address_params {
std::uint32_t DRAM_BARRIER_BASE = 0;
};

/**

  • Struct encapsulating all L1 Address Map parameters required by UMD.
  • These parameters are passed to the constructor.
    */
    struct tt_device_l1_address_params {
    std::uint32_t ncrisc_fw_base = 0;
    std::uint32_t fw_base = 0;
    std::uint32_t trisc0_size = 0;
    std::uint32_t trisc1_size = 0;
    std::uint32_t trisc2_size = 0;
    std::uint32_t trisc_base = 0;
    std::uint32_t tensix_l1_barrier_base = 0;
    std::uint32_t eth_l1_barrier_base = 0;
    std::uint32_t fw_version_addr = 0;
    };

/**

  • Struct encapsulating all Host Address Map parameters required by UMD.
  • These parameters are passed to the constructor and are needed for non-MMIO transactions.
    */
    struct tt_driver_host_address_params {
    std::uint32_t eth_routing_block_size = 0;
    std::uint32_t eth_routing_buffers_start = 0;
    };

/**

  • Struct encapsulating all ERISC Firmware parameters required by UMD.
  • These parameters are passed to the constructor and are needed for non-MMIO transactions.
    */
    struct tt_driver_eth_interface_params {
    std::uint32_t noc_addr_local_bits = 0;
    std::uint32_t noc_addr_node_id_bits = 0;
    std::uint32_t eth_rack_coord_width = 0;
    std::uint32_t cmd_buf_size_mask = 0;
    std::uint32_t max_block_size = 0;
    std::uint32_t request_cmd_queue_base = 0;
    std::uint32_t response_cmd_queue_base = 0;
    std::uint32_t cmd_counters_size_bytes = 0;
    std::uint32_t remote_update_ptr_size_bytes = 0;
    std::uint32_t cmd_data_block = 0;
    std::uint32_t cmd_wr_req = 0;
    std::uint32_t cmd_wr_ack = 0;
    std::uint32_t cmd_rd_req = 0;
    std::uint32_t cmd_rd_data = 0;
    std::uint32_t cmd_buf_size = 0;
    std::uint32_t cmd_data_block_dram = 0;
    std::uint32_t eth_routing_data_buffer_addr = 0;
    std::uint32_t request_routing_cmd_queue_base = 0;
    std::uint32_t response_routing_cmd_queue_base = 0;
    std::uint32_t cmd_buf_ptr_mask = 0;
    std::uint32_t cmd_ordered = 0;
    std::uint32_t cmd_broadcast = 0;
    };

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tt_device_dram_address_params and tt_device_l1_address_params are freely chosen by users.

tt_device_l1_address_params

  1. Curious to know if triscX_size/base and ncrisc_fw_base are used by UMD
  2. I also am not clear on need/usage of fw_version_addr. Metal has never really used this.

tt_driver_host_address_params/tt_driver_eth_interface_params

  1. This should only be required for WH
  2. I don't see Metal calling set_driver_host_address_params or set_driver_eth_interface_params anywhere. This seems to pull out addresses needed by base eth routing FW so it falls into "constant for our hardware" category

I like Joel's idea of user supplying barrier address.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • You probably missed the info in my huge comment, but yeah the set_driver_host_address_params and set_driver_eth_interface_params were removed.
  • Trisc and ncrisc fw_base have been removed indeed from l1_address_params
  • fw_version_addr is used, it is used for verifying eth_firmware. But it obviously falls into "constant for our hardware", at least for now.

**If I understood correctly all the comments:

  • All barriers should be set by the client, DRAM_BARRIER_BASE, tensix_l1_barrier_base, eth_l1_barrier_base
  • All other constants fall under "constant for our hardware", or at least into "hard to come by to a scenario where you'd want to change this"**

Regarding @joelsmithTT 's comment:

pass barrier addresses when they are used
Not sure this is possible, there are many entry points which internally use these addresses, not always transparent to the client.

I'll make the changes accordingly, thanks all for feedback.

Copy link
Contributor

@pgkeller pgkeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have Almeet review as well.
How do L1 barriers work now? I think we initiate barriers from our device, where do they write to?

@@ -36,7 +36,6 @@ HalCoreInfoType create_idle_eth_mem_map() {

mem_map_bases.resize(static_cast<std::size_t>(HalL1MemAddrType::COUNT));
mem_map_bases[static_cast<std::size_t>(HalL1MemAddrType::BASE)] = MEM_ETH_BASE;
mem_map_bases[static_cast<std::size_t>(HalL1MemAddrType::BARRIER)] = MEM_L1_BARRIER;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using 0x16dfc0 in metal would cause problems, that could potentially overwrite tensor data
metal puts all its reserved addresses low in memory for tensix (eth is a little different due to the base firmware)

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm to remove the set_*_params but I think UMD would need to accept the barrier address from Metal because UMD shouldn't enforce a particular L1 structure

@blozano-tt
Copy link
Contributor

blozano-tt commented Dec 11, 2024

@broskoTT what about a script that can read tt_metal dev_mem_map.h and generate the L1 address map you use on the UMD side? This way we have single source of truth. I think it could be a simple python script.

@joelsmithTT
Copy link
Contributor

I can suggest two different proposals for how to deal with this.

  1. Application-level implementation: simplify the UMD by having Metal implement its own memory barrier patterns. UMD exposes the basic IO primitives and memory fence operations. Metal to implement the equivalent of Cluster::set_membar_flag (plus convenience wrappers for L1/DRAM/whatever).
  2. Point-of-use address passing: remove memory map state from UMD and instead pass barrier addresses when they are used. This makes the interface more explicit and removes the need to maintain/synchronize memory maps across different codebases.

@abhullar-tt
Copy link
Contributor

2. interface more explicit and removes the need to maintain/synchronize memory maps across different codebases.

I am in favour of option 2

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks

broskoTT added a commit to tenstorrent/tt-umd that referenced this pull request Dec 16, 2024
### Issue
Fixes #106 

### Description
Part of this was done previously in
#260
#261 and
#234

### List of the changes
- Added default l1_address_params and dram_address_params
- Remove set_device_l1_address_params, set_device_dram_address_params,
set_driver_host_address_params, set_driver_eth_interface_params
- Remove unnecessary code from tests

### Testing
CI tests, but more importantly CI tests on the related tt_metal PR.

### API Changes
This PR has API changes:
- [x] tt_metal approved PR pointing to this branch:
tenstorrent/tt-metal#15908
- [X] No breaking changes in tt_debuda
@broskoTT broskoTT merged commit 5381e43 into main Dec 17, 2024
277 of 285 checks passed
@broskoTT broskoTT deleted the brosko/set_remove branch December 17, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants