s4p44: all-reduce implementation #6

alanwaketan · 2024-07-16T00:31:32Z

In the slides, you mention that an all-reduce is decomposed into a reduce-scatter followed by an all-gather, so it costs roughly twice as much as either of those ops. In XLA on TPU, however, reduce-scatter is often implemented as an all-reduce followed by a dynamic-slice, which suggests the opposite: that all-reduce is much faster than reduce-scatter. Can you explain the difference?
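To make the two decompositions concrete, here is a minimal pure-Python sketch that simulates the collectives on lists, one list per device. It is not XLA or TPU code: every function name is hypothetical, and `ring_elements_sent` assumes a simple ring algorithm, which may not be what XLA actually uses on TPU. It only shows that both rewrites are functionally equivalent, and what the per-device traffic would be under that ring assumption.

```python
def reduce_scatter(shards):
    """shards[d] is device d's full-length vector; device d gets summed chunk d."""
    n = len(shards)
    chunk = len(shards[0]) // n
    total = [sum(col) for col in zip(*shards)]  # elementwise sum across devices
    return [total[d * chunk:(d + 1) * chunk] for d in range(n)]


def all_gather(chunks):
    """Every device ends up with the concatenation of all devices' chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


def all_reduce(shards):
    """Every device ends up with the full elementwise sum."""
    total = [sum(col) for col in zip(*shards)]
    return [list(total) for _ in shards]


def dynamic_slice(vec, start, size):
    """Stand-in for a dynamic-slice op: take `size` elements from `start`."""
    return vec[start:start + size]


def all_reduce_via_rs_ag(shards):
    # Decomposition described in the slides: reduce-scatter, then all-gather.
    return all_gather(reduce_scatter(shards))


def reduce_scatter_via_ar_slice(shards):
    # Decomposition described for XLA on TPU: all-reduce, then each device
    # keeps only its own slice of the result.
    n = len(shards)
    chunk = len(shards[0]) // n
    reduced = all_reduce(shards)
    return [dynamic_slice(reduced[d], d * chunk, chunk) for d in range(n)]


def ring_elements_sent(n, size):
    """Elements each device sends, ASSUMING a ring: n - 1 steps of size / n."""
    rs = (n - 1) * size // n
    return {"reduce_scatter": rs, "all_gather": rs, "all_reduce": 2 * rs}


# Four simulated devices, each holding an 8-element vector.
inputs = [[d + i for i in range(8)] for d in range(4)]
print(all_reduce_via_rs_ag(inputs) == all_reduce(inputs))
print(reduce_scatter_via_ar_slice(inputs) == reduce_scatter(inputs))
print(ring_elements_sent(4, 8))
```

Under the ring assumption, an all-reduce followed by a slice moves twice the data of a direct reduce-scatter, which is why the XLA behaviour looks surprising; the sketch does not model latency, hardware support for specific collectives, or compiler fusion, any of which could change the picture.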
