Data parallel helper #1407

Merged: 3 commits into main, Sep 17, 2024
Conversation

angeloskath (Member):

This PR adds float16 and bfloat16 types to MPI, as well as a helper to average gradients over a distributed group. It is a bit ugly, but it helps nicely with ml-explore/mlx-examples#821, so I am not sure if it belongs in nn.utils 🤔.

Here is a simple scaling plot for a Llama 8B finetuning.

[Figure: llama-8b-finetuning, scaling plot for Llama 8B fine-tuning]
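For context, here is a minimal sketch of how the helper is meant to be used in a data-parallel training step. The import path, model, optimizer, and data are placeholders for illustration and are not part of this PR's diff:

```python
# Sketch only: the exact import location of average_gradients is assumed.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.nn.utils import average_gradients  # assumed location

model = nn.Linear(1024, 1024)
optimizer = optim.SGD(learning_rate=1e-2)


def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)


loss_and_grad = nn.value_and_grad(model, loss_fn)


def step(x, y):
    loss, grads = loss_and_grad(model, x, y)
    # Average the gradients across the distributed group so that every rank
    # applies the same update (plain data parallelism).
    grads = average_gradients(grads)
    optimizer.update(model, grads)
    return loss


# Each rank would feed its own shard of the batch into step(); launched
# under MPI, the per-rank gradients are averaged before the update.
x = mx.random.normal((32, 1024))
y = mx.random.normal((32, 1024))
print(step(x, y))
```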

@angeloskath requested a review from awni, September 12, 2024 23:04
```python
        return tree_map(_average, gradients)
    else:
        flat_grads = sorted(tree_flatten(gradients), key=lambda x: x[0])
```
awni (Member):

What is the purpose of the sort here?

angeloskath (Member, Author):

Yeah, this should either be removed or the code just above is wrong. Basically, for this to work, tree_map needs to traverse the tree in a deterministic order across machines (because all_sum needs to be called on the equivalent arrays on every rank).

So this comes down to the iteration order of dicts in Python. Looking into it a bit more: since Python 3.7, dict iteration is guaranteed to follow insertion order, which is why the code in line 110 works as well. So the conclusion is that I will remove the sort :-)
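A small sketch of the ordering argument (not from the diff; the toy trees just stand in for real gradient trees on two ranks):

```python
from mlx.utils import tree_flatten

# Since Python 3.7, dicts iterate in insertion order. Two ranks that build
# their gradient trees identically therefore flatten them to the same key
# order, so the collective all_sum calls line up without an explicit sort.
grads_rank0 = {"layers": {"w": 1.0, "b": 2.0}, "head": {"w": 3.0}}
grads_rank1 = {"layers": {"w": 10.0, "b": 20.0}, "head": {"w": 30.0}}

keys0 = [k for k, _ in tree_flatten(grads_rank0)]
keys1 = [k for k, _ in tree_flatten(grads_rank1)]
assert keys0 == keys1  # identical traversal order on both "ranks"
```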

awni (Member):

I hadn't thought of the ordering across machines. That makes sense!

```diff
@@ -68,3 +69,93 @@ def wrapped_checkpointed_fn(*args, **kwargs):
         return checkpointed_fn(module.trainable_parameters(), *args, **kwargs)

     return wrapped_checkpointed_fn
+
+
+def average_gradients(
```
awni (Member):

I'm not sure it matters, but this function has two different return modes, which could lead to some confusion:

  • It returns a reference to the input tree if the call is a no-op
  • It returns a copied tree structure otherwise

So if you do:

avg_grads = average_gradients(grads)
grads[0] = ...

the behavior would be different in the two cases. It seems like a really odd usage pattern... but nevertheless it might be worth making a copy of the tree structure in all cases.
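A small sketch of that aliasing difference, assuming a single-process run where the distributed group has size 1 and the call is therefore a no-op (import path assumed):

```python
import mlx.core as mx
from mlx.nn.utils import average_gradients  # assumed location

grads = {"w": mx.zeros((4,)), "b": mx.zeros((4,))}
avg_grads = average_gradients(grads)

# No-op case: the input tree itself is returned, so a later mutation of
# `grads` is also visible through `avg_grads`.
assert avg_grads is grads
grads["w"] = mx.ones((4,))  # avg_grads["w"] now sees the ones as well

# With more than one rank, a new tree is built, so the same assignment
# would leave avg_grads unchanged.
```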

angeloskath (Member, Author):

Hm, I hadn't thought of that. One reason to keep it as is: in the no-op case we have absolutely no overhead (except a Python function call, which is nanoseconds), whereas making a copy would add overhead proportional to the number of parameters.

Let me know what you think. It could be fine, as it is likely hidden behind the computation anyway.

awni (Member):

Yeah, I don't like adding overhead for no reason, and those trees can get pretty large. So let's just keep it the way it is and deal with it in the future if needed.

@awni (Member) left a comment:

Looks really nice. Of all the places this could go, I think nn.utils is probably the best. If you think there is a world in which we want to do more distributed stuff in nn, we could make a new sub-package, nn.distributed, and put it there (which may also be a good home for the distributed layers in #1270).

awni (Member) commented on Sep 16, 2024:

The benchmark scaling is pretty remarkable, btw... so much potential for fast distributed fine-tuning!

@angeloskath merged commit 914409f into main on Sep 17, 2024
4 checks passed
@angeloskath deleted the data-parallel-helper branch on September 17, 2024