Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for multi-device system descriptor #220

Merged
merged 2 commits into from
Aug 10, 2024
Merged

Support for multi-device system descriptor #220

merged 2 commits into from
Aug 10, 2024

Conversation

jnie-TT
Copy link
Contributor

@jnie-TT jnie-TT commented Jul 23, 2024

#162
Example of system descriptor for n300:
{'version': {'major': 0, 'minor': 0, 'patch': 39}, 'ttmlir_git_hash': '484963fb271e0817d5aa5fd6e6c4c3626787c2f2', 'product_identifier': 'unknown', 'system_desc': {'chip_descs': [{'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}, {'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}], 'chip_desc_indices': [0, 1], 'chip_capabilities': [3, 0], 'chip_coords': [{'rack': 0, 'shelf': 0, 'y': 0, 'x': 0}, {'rack': 0, 'shelf': 0, 'y': 0, 'x': 1}], 'chip_channels': [{'endpoint0': 1, 'endpoint1': -1}, {'endpoint0': 0, 'endpoint1': -1}]}}

This requires metal API updates, specifically to support querying the chip location. I added it locally for testing, but currently this API is missing so the build will fail. @nsmithtt What's the process for adding such APIs to metal?

@nsmithtt
Copy link
Contributor

This requires metal API updates, specifically to support querying the chip location. I added it locally for testing, but currently this API is missing so the build will fail. @nsmithtt What's the process for adding such APIs to metal?

I think we can just create a PR and it shouldn't be an issue. Can you send me your branch name and I'll take a look?

chipCapabilities.push_back(chipCapability);
// Extract chip location
int x, y, rack, shelf;
std::tie(x, y, rack, shelf) = device.get_chip_location();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is get_chip_location the only new API you added to metal? If so let's file an issue on metal (to have an issue number in the commit) something like "Add device chip location API".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool sounds good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::tie(x, y, rack, shelf) = device.get_chip_location();
chipCoords.emplace_back(rack, shelf, y, x);
// Extract chip connected channels
std::pair<int, int> connectedChips = getConnectedChips(device);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't there be more than 1 connect chips? Seems like a remote chip in the middle of a galaxy could have up to 4 connections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm that's a good point, I'll update the chipChannels schema to be an array of ints then instead of having only 2 endpoints hardcoded

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention for chipChannels was to be just a flat list of connections, not tied to the other arrays, but maybe this is more cumbersome to query. If you think of it like a graph you have a list of nodes chipIds and a list of edges chipChannels where each edge is just a pair of node ids (but could be extended with new attributes in the future, like number of streams).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsmithtt Ahhh I see, I misinterpreted what chipChannels was intended to do. I pushed an update that now constructs a flat list of connections - also ported to a metal device mesh API that handles opening/closing devices safely.

Updated system descriptor looks like:

{'version': {'major': 0, 'minor': 0, 'patch': 40}, 'ttmlir_git_hash': '1fdd4a0970b17925db82c74c163834ccb24c737f', 'product_identifier': 'unknown', 'system_desc': {'chip_descs': [{'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}, {'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}], 'chip_desc_indices': [0, 1], 'chip_capabilities': [3, 0], 'chip_coords': [{'rack': 0, 'shelf': 0, 'y': 0, 'x': 0}, {'rack': 0, 'shelf': 0, 'y': 0, 'x': 1}], 'chip_channels': [{'endpoint0': 0, 'endpoint1': 1}]}}

@jnie-TT
Copy link
Contributor Author

jnie-TT commented Aug 6, 2024

@tapspatel I had to comment out the strict system descriptor check or else the assertion always fires. I'll open an issue for the compiler to properly generate a system descriptor.

@nsmithtt does this match what you were expecting? I ignored chip coords and updated chip channels to include the two endpoint devices and their connected ethernet coords.

Multi-device system descriptor looks like this for an x2 card:

{
  "chip_descs": [
    {
      "arch": "Wormhole_b0",
      "grid_size": {
        "y": 8,
        "x": 8
      },
      "l1_size": 1499136,
      "num_dram_channels": 12,
      "dram_channel_size": 1073741824,
      "noc_l1_address_align_bytes": 16,
      "pcie_address_align_bytes": 32,
      "noc_dram_address_align_bytes": 32
    },
    {
      "arch": "Wormhole_b0",
      "grid_size": {
        "y": 8,
        "x": 8
      },
      "l1_size": 1499136,
      "num_dram_channels": 12,
      "dram_channel_size": 1073741824,
      "noc_l1_address_align_bytes": 16,
      "pcie_address_align_bytes": 32,
      "noc_dram_address_align_bytes": 32
    }
  ],
  "chip_desc_indices": [0, 1],
  "chip_capabilities": [3, 0],
  "chip_coords": [],
  "chip_channels": [
    {
      "device_id0": 0,
      "ethernet_core_coord0": {
        "y": 8,
        "x": 0
      },
      "device_id1": 1,
      "ethernet_core_coord1": {
        "y": 0,
        "x": 0
      }
    }
  ]
}

@tapspatel
Copy link
Contributor

@jnie-TT yeah, I added that assert since we were technically generating for n300 but running on n150. I have an issue open that I'm working on right now to address that: #306 (although this is more from test POV). Be default, compiler will generate for n300.

}

inline ::tt::target::Dim2d toFlatbuffer(FlatbufferObjectCache &cache,
GridAttr arch) {
const GridAttr &arch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking the review, but typically we only pass the MLIR types, i.e. Value, *Attr, *Type, by value. MLIR already enforces that they are wrappers around some underlying storage and it is convention to treat them as value types. It doesn't hurt to pass it by const&, but it's not necessary. All MLIR types are also immutable, so they are truly value types in this sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I see, thanks for the explanation, I'll undo these changes!

std::vector<::tt::target::ChipChannel> allConnections;
allConnections.resize(connectionSet.size());

std::transform(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@nsmithtt nsmithtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Couple comments inline


for (const ::ttnn::Device *device : devices) {
// Construct chip descriptor
::tt::target::Dim2d deviceGrid = toFlatbuffer(device->logical_grid_size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device->logical_grid_size() -> device.compute_with_storage_grid_size()

@@ -153,7 +153,8 @@ def run(args):

# execution
print("executing action for all provided flatbuffers")
device = ttrt.runtime.open_device(device_ids)
system_desc, device_ids = ttrt.runtime.get_current_system_desc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already called earlier on in the function, you can remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! This was leftover after the rebase, removed it

@tapspatel
Copy link
Contributor

ttrt code lgtm!

@jnie-TT jnie-TT force-pushed the jnie/sys_desc branch 2 times, most recently from 71cde8c to 1bf8d22 Compare August 8, 2024 19:42
Copy link
Contributor

@nsmithtt nsmithtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome refactor! Both runtimes have a shared system desc! 🙌

@jnie-TT jnie-TT merged commit 0b393ec into main Aug 10, 2024
7 checks passed
@kmabeeTT
Copy link
Contributor

Hey @jnie-TT , am I nuts or is there some issue here at the merged commit 0b393ec in main?

Looks good in this format:

% ttrt query --save-artifacts --system-desc
<snip>

    "chip_capabilities": [
      3,
      0
    ],

Not so good here though, empty brackets causes error:

% ./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline --ttir-load-system-desc="path=./ttrt-artifacts/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer

<stdin>:6:572: error: expected valid keyword
module attributes {tt.device = #tt.device<#tt.grid<8x8, (d0, d1) -> (0, d0, d1)>, [0]>, tt.system_desc = #tt.system_desc<[{arch = <wormhole_b0>, grid = 7x8, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32}, {arch = <wormhole_b0>, grid = 7x8, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32}], [0, 1], [<pcie|host_mmio>, <>], [<0, 0, 0, 0>]>} {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ```
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

@jnie-TT
Copy link
Contributor Author

jnie-TT commented Aug 13, 2024

@kmabeeTT Seems like chip_capabilities=0 is causing an issue, I think it's an issue on the mlir side, I'll take a closer look at this. In the meantime you can try running on an n150 machine, the issue should go away.

@jnie-TT
Copy link
Contributor Author

jnie-TT commented Aug 13, 2024

Hey @jnie-TT , am I nuts or is there some issue here at the merged commit 0b393ec in main?

Looks good in this format:

% ttrt query --save-artifacts --system-desc
<snip>

    "chip_capabilities": [
      3,
      0
    ],

Not so good here though, empty brackets causes error:

% ./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline --ttir-load-system-desc="path=./ttrt-artifacts/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer

<stdin>:6:572: error: expected valid keyword
module attributes {tt.device = #tt.device<#tt.grid<8x8, (d0, d1) -> (0, d0, d1)>, [0]>, tt.system_desc = #tt.system_desc<[{arch = <wormhole_b0>, grid = 7x8, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32}, {arch = <wormhole_b0>, grid = 7x8, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32}], [0, 1], [<pcie|host_mmio>, <>], [<0, 0, 0, 0>]>} {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ```
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

Created a ticket #380

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants