
update all-reduce to enable basic n300 support #15802

Merged: 1 commit into main on Dec 7, 2024

Conversation

@SeanNijjar (Contributor) commented on Dec 6, 2024

Ticket

Link to Github Issue

Problem description

N300 support wasn't enabled for all-reduce, so there were some basic issues with it.

What's changed

This commit resolves the basic issue, which was that on n300 all-reduce was:

a) trying to invoke a ring on an n300 (which is really a line), and
b) using the reduce-scatter + all-gather implementation, which is buggy on line topology due to a buggy line reduce-scatter implementation. All-reduce now invokes the naive all-gather + local-reduce implementation in this case (see the sketch below).
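
A minimal sketch of the resulting dispatch logic, with illustrative names (Topology, AllReduceStrategy, and choose_all_reduce_strategy are not the actual ttnn types or API):

    #include <cstdint>

    enum class Topology { Ring, Linear };
    enum class AllReduceStrategy { ReduceScatterAllGather, AllGatherLocalReduce };

    // Pick the all-reduce implementation for a given device count and topology.
    AllReduceStrategy choose_all_reduce_strategy(uint32_t num_devices, Topology topology) {
        // A 2-device system (n300) is physically a line, not a ring.
        if (num_devices == 2) {
            topology = Topology::Linear;
        }
        // The optimized reduce-scatter + all-gather path is unreliable on a
        // line, so fall back to the naive all-gather + local-reduce path.
        if (topology == Topology::Linear) {
            return AllReduceStrategy::AllGatherLocalReduce;
        }
        return AllReduceStrategy::ReduceScatterAllGather;
    }

The real code folds this into the all-reduce op itself, but the decision order is the same as described above: force line topology for 2 devices first, then disable the optimized path on any line.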

Note this is old CCL code that will be deprecated soon. I'm hoping to put as few man-hours into it as possible.

Checklist

@SeanNijjar SeanNijjar closed this Dec 6, 2024
@SeanNijjar SeanNijjar reopened this Dec 6, 2024
@SeanNijjar SeanNijjar marked this pull request as ready for review December 6, 2024 23:37
auto shape = input_tensor.get_logical_shape();
auto rank = shape.rank();

// -1 wraps to UINT32_MAX and acts as a "not yet chosen" sentinel.
uint32_t all_reduce_dim = -1;
bool optimized_version = false;

if (num_devices == 2) {
    // 2 devices == n300 == linear topology
jvegaTT (Contributor) commented:

Probably here because of debug, but delete this before merging.

SeanNijjar (Contributor, Author) replied:

@jvegaTT - can you please clarify your concern here? Is it specifically the comment or the entire conditional block? If the latter, then this change is required for correctness. I kept it intentionally separate from the later block

    if (topology == ttnn::ccl::Topology::Linear) {
        // reduce scatter doesn't reliably support line topology yet
        optimized_version = false;
    }

because they apply those modifications for different reasons:

This block in this comment is related to topology: if we are on a 2-device system, it implies a line (at least today; in the future we will need to generalize so the op infra more thoroughly detects line vs. ring by checking ethernet link connectivity or querying the fabric).

The second block is temporarily there because reduce scatter isn't stable on line topology, and it's used for the optimized version of all-reduce, so we disable the optimized version if we are on a line.
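
For illustration, here is a sketch of what that future line-vs-ring detection could look like, assuming a caller-supplied connectivity predicate (detect_topology and linked are hypothetical, not part of the current ttnn/tt-metal API):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    enum class Topology { Ring, Linear };

    // `linked` answers "is there an ethernet link between these two devices?";
    // in practice this would come from the fabric / cluster descriptor.
    Topology detect_topology(
        const std::vector<uint32_t>& device_ids,
        const std::function<bool(uint32_t, uint32_t)>& linked) {
        const std::size_t n = device_ids.size();
        // Every consecutive pair must be linked for either topology.
        for (std::size_t i = 0; i + 1 < n; ++i) {
            if (!linked(device_ids[i], device_ids[i + 1])) {
                return Topology::Linear;  // broken chain: treat conservatively as a line
            }
        }
        // A ring additionally needs the wrap-around link from last back to
        // first; a 2-device system stays a line, matching the logic above.
        return (n > 2 && linked(device_ids.back(), device_ids.front()))
                   ? Topology::Ring
                   : Topology::Linear;
    }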

jvegaTT (Contributor) replied:

OK, it was just the // 2 devices == n300 == linear topology comment. But on second look I think it's good to clarify this, so let's keep it.

@jvegaTT (Contributor) left a review:

I left a few very minor comments, but it overall looks good, so I'm just accepting in advance.

@SeanNijjar SeanNijjar merged commit 10eeea8 into main Dec 7, 2024
150 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/issue-15789 branch December 7, 2024 04:09