
update all-reduce to enable basic n300 support #15802

Merged: 1 commit into main on Dec 7, 2024

Conversation

@SeanNijjar (Contributor) commented on Dec 6, 2024

Ticket

Link to Github Issue

Problem description

N300 support wasn't enabled for all-reduce, so there were some basic issues with it.

What's changed

This commit resolves the basic issue, which was that on n300 all-reduce was:

a) trying to invoke a ring on an n300 (which is really a line), and
b) using the reduce-scatter + all-gather implementation, which is buggy on line topology due to a buggy line reduce-scatter implementation. All-reduce now invokes the naive all-gather + local-reduce implementation in this case (see the sketch below).
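
A minimal sketch of the resulting dispatch logic, with illustrative names (Topology, AllReduceStrategy, and choose_all_reduce_strategy are not the actual ttnn types or API):

    #include <cstdint>

    enum class Topology { Ring, Linear };
    enum class AllReduceStrategy { ReduceScatterAllGather, AllGatherLocalReduce };

    // Pick the all-reduce implementation for a given device count and topology.
    AllReduceStrategy choose_all_reduce_strategy(uint32_t num_devices, Topology topology) {
        // A 2-device system (n300) is physically a line, not a ring.
        if (num_devices == 2) {
            topology = Topology::Linear;
        }
        // The optimized reduce-scatter + all-gather path is unreliable on a
        // line, so fall back to the naive all-gather + local-reduce path.
        if (topology == Topology::Linear) {
            return AllReduceStrategy::AllGatherLocalReduce;
        }
        return AllReduceStrategy::ReduceScatterAllGather;
    }

The real code folds this into the all-reduce op itself, but the decision order is the same as described above: force line topology for 2 devices first, then disable the optimized path on any line.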

Note this is old CCL code that will be deprecated soon. I'm hoping to put as few man-hours into it as possible.

Checklist

@SeanNijjar SeanNijjar closed this Dec 6, 2024
@SeanNijjar SeanNijjar reopened this Dec 6, 2024
@SeanNijjar SeanNijjar marked this pull request as ready for review December 6, 2024 23:37
auto shape = input_tensor.get_logical_shape();
auto rank = shape.rank();

// -1 wraps to UINT32_MAX and acts as a "not yet chosen" sentinel.
uint32_t all_reduce_dim = -1;
bool optimized_version = false;

if (num_devices == 2) {
    // 2 devices == n300 == linear topology
jvegaTT (Contributor) commented:

Probably here because of debug, but delete this before merging.

SeanNijjar (Contributor, Author) replied:

@jvegaTT - can you please clarify your concern here? Is it specifically the comment or the entire conditional block? If the latter, then this change is required for correctness. I kept it intentionally separate from the later block

    if (topology == ttnn::ccl::Topology::Linear) {
        // reduce scatter doesn't reliably support line topology yet
        optimized_version = false;
    }

because they apply those modifications for different reasons:

This block in this comment is related to topology: if we are on a 2-device system, it implies a line (at least today; in the future we will need to generalize so the op infra more thoroughly detects line vs. ring by checking ethernet link connectivity or querying the fabric).

The second block is temporarily there because reduce scatter isn't stable on line topology, and it's used for the optimized version of all-reduce, so we disable the optimized version if we are on a line.
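
For illustration, here is a sketch of what that future line-vs-ring detection could look like, assuming a caller-supplied connectivity predicate (detect_topology and linked are hypothetical, not part of the current ttnn/tt-metal API):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    enum class Topology { Ring, Linear };

    // `linked` answers "is there an ethernet link between these two devices?";
    // in practice this would come from the fabric / cluster descriptor.
    Topology detect_topology(
        const std::vector<uint32_t>& device_ids,
        const std::function<bool(uint32_t, uint32_t)>& linked) {
        const std::size_t n = device_ids.size();
        // Every consecutive pair must be linked for either topology.
        for (std::size_t i = 0; i + 1 < n; ++i) {
            if (!linked(device_ids[i], device_ids[i + 1])) {
                return Topology::Linear;  // broken chain: treat conservatively as a line
            }
        }
        // A ring additionally needs the wrap-around link from last back to
        // first; a 2-device system stays a line, matching the logic above.
        return (n > 2 && linked(device_ids.back(), device_ids.front()))
                   ? Topology::Ring
                   : Topology::Linear;
    }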

jvegaTT (Contributor) replied:

OK, it was just the // 2 devices == n300 == linear topology comment. But on second look I think it's good to clarify this, so let's keep it.

@jvegaTT (Contributor) left a review:

I left a few very minor comments, but it overall looks good, so I'm just accepting in advance.

@SeanNijjar SeanNijjar merged commit 10eeea8 into main Dec 7, 2024
150 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/issue-15789 branch December 7, 2024 04:09