update all-reduce to enable basic n300 support #15802
Conversation
```cpp
auto shape = input_tensor.get_logical_shape();
auto rank = shape.rank();

// Note: -1 assigned to a uint32_t wraps to UINT32_MAX (an "unset" sentinel).
uint32_t all_reduce_dim = -1;
bool optimized_version = false;

if (num_devices == 2) {
    // 2 devices == n300 == linear topology
```
Probably here because of debugging, but delete this before merging.
@jvegaTT - can you please clarify your concern here? Is it specifically the comment or the entire conditional block? If the latter, then this change is required for correctness. I kept it intentionally separate from the later

```cpp
if (topology == ttnn::ccl::Topology::Linear) {
    // reduce scatter doesn't reliably support line topology yet
    optimized_version = false;
}
```

because the two blocks apply those modifications for different reasons:

- The block in this comment is about topology: a 2-device system implies a line (at least today - in the future we will need to generalize the op infra to detect line vs. ring more thoroughly by checking ethernet link connectivity or querying the fabric).
- The second block is a temporary workaround: reduce scatter isn't stable on line topology, and it's used by the optimized version of all-reduce, so we disable the optimized version when we are on a line (see the sketch below).
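For concreteness, here is a minimal sketch of how the two blocks compose. This is a paraphrase assuming the variable names from the snippets above (`num_devices`, `topology`, `optimized_version`), not the verbatim file contents:

```cpp
// Block 1 -- topology: a 2-device system (n300) is physically a line,
// so select linear topology. (Future work: detect line vs. ring by
// checking ethernet link connectivity or querying the fabric.)
if (num_devices == 2) {
    topology = ttnn::ccl::Topology::Linear;
}

// Block 2 -- temporary workaround: reduce scatter isn't yet stable on
// line topology, and the optimized all-reduce path depends on it, so
// the optimized version is disabled whenever the topology is linear.
if (topology == ttnn::ccl::Topology::Linear) {
    // reduce scatter doesn't reliably support line topology yet
    optimized_version = false;
}
```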
OK, it was just the `// 2 devices == n300 == linear topology` comment. But on second look I think it's good to clarify this, so let's keep it.
ttnn/cpp/ttnn/operations/experimental/ccl/all_reduce/device/all_reduce_op.cpp
I left a few very minor comments, but overall it looks good, so I'm accepting in advance.
Ticket
Link to Github Issue
Problem description
N300 support wasn't enabled with all-gather, so all-reduce had some basic issues on that platform.
What's changed
This commit resolves the basic issues, which were that on n300, all-reduce was:

a) trying to invoke a ring on an n300 (which is really a line), and
b) using the reduce-scatter + all-gather implementation, which is buggy on line topology due to the buggy line reduce-scatter implementation. In this case, all-reduce now falls back to the naive all-gather + local-reduce implementation (sketched below).
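For illustration only, here is a minimal sketch of the naive decomposition from (b). The helper names `naive_all_reduce`, `all_gather`, and `local_sum` and their signatures are assumptions for readability, not the actual API in all_reduce_op.cpp:

```cpp
// Hypothetical sketch: all-reduce as all-gather followed by a local reduce.
Tensor naive_all_reduce(const Tensor& input_tensor, uint32_t gather_dim) {
    // Gather every device's shard along gather_dim, so each device
    // ends up holding all of the partial results.
    Tensor gathered = all_gather(input_tensor, gather_dim);
    // Sum locally over the gathered dimension; every device now holds
    // the identical, fully reduced tensor.
    return local_sum(gathered, gather_dim);
}
```

This path trades extra memory and bandwidth (every device materializes all shards) for correctness, which is why it is only the fallback when the optimized reduce-scatter + all-gather path can't be used.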
Note that this is old CCL code that will be deprecated soon; I'm hoping to put as few man-hours into it as possible.
Checklist