Question about dist.barrier() use in diff_model_train.py #1933

cbe135 · 2025-01-31T01:35:13Z

Hi, there are two calls to dist.all_reduce() in diff_model_train.py.
One reduces scale_factor with ReduceOp.AVG, and the other reduces loss_torch with ReduceOp.SUM.
However, the scale_factor call is preceded by dist.barrier(), while the loss_torch call is not.

I have three short questions about this usage:
1. Why is dist.barrier() not needed before the loss_torch all_reduce with SUM?
2. Is there a reason why SUM is used instead of AVG for the loss?
3. If iteration-level losses are added, should dist.barrier() be called before the iteration-level loss dist.all_reduce(), and should that reduction use SUM or AVG?
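
For context on question 2, here is a minimal sketch of how I understand the two reduce ops to relate arithmetically. It is a pure-Python model and not code from diff_model_train.py: the all_reduce helper, the op strings, and the per-rank loss values are all made up for illustration, and it says nothing about synchronization or what the script does with the reduced loss afterwards.

```python
# Pure-Python model of the arithmetic of an all-reduce.
# This is NOT torch.distributed; names and values are hypothetical.

def all_reduce(values, op):
    """Return what each rank would hold after reducing `values` (one entry per rank)."""
    total = sum(values)
    if op == "SUM":
        reduced = total
    elif op == "AVG":
        reduced = total / len(values)
    else:
        raise ValueError(f"unsupported op: {op}")
    # After an all-reduce, every rank holds the same reduced value.
    return [reduced] * len(values)


# Hypothetical per-rank losses for a 4-rank job.
per_rank_loss = [0.5, 0.25, 1.0, 0.25]
world_size = len(per_rank_loss)

summed = all_reduce(per_rank_loss, "SUM")
averaged = all_reduce(per_rank_loss, "AVG")

# Dividing the SUM result by the world size reproduces the AVG result.
sum_then_divide = [v / world_size for v in summed]
```

If that understanding is right, SUM followed by a division by the world size would give the same number as AVG, so I would like to know whether the choice of SUM here is deliberate.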

Thank you.
