So PyTorch has torch.distributed. Right now it feels like MLX depends on quantized models + vertical scaling (M-series chips getting more powerful). What about horizontal scaling?
I made simple-federated-learning to enable training over multiple Macs on the same network. The problem is that network costs are too high, so this ends up 5x slower than running training on one Mac lol. That's why I pivoted this project to be about federated learning.
Any ideas on making mlx.distributed?
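For the aggregation half of this, here is a minimal sketch of FedAvg-style parameter averaging with MLX. The transport (how parameter trees get between Macs) is left out, and `TinyNet`, `fedavg`, and `local_models` are made-up names for illustration, not part of simple-federated-learning:

```python
import mlx.nn as nn
from mlx.utils import tree_map

class TinyNet(nn.Module):
    """Stand-in model; any mlx.nn.Module works the same way."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def __call__(self, x):
        return self.fc(x)

def fedavg(param_trees):
    """Element-wise average of a list of parameter trees (one per node)."""
    n = len(param_trees)
    total = param_trees[0]
    for tree in param_trees[1:]:
        total = tree_map(lambda a, b: a + b, total, tree)
    return tree_map(lambda a: a / n, total)

# Pretend three Macs each trained a copy locally.
local_models = [TinyNet() for _ in range(3)]
global_params = fedavg([m.parameters() for m in local_models])

# Broadcast the averaged parameters back to every node.
for m in local_models:
    m.update(global_params)
```

Only parameter trees would cross the network here; moving them between machines (sockets, gRPC, whatever) is the part an mlx.distributed would need to standardize.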
Replies: 2 comments

-

Hi. I wonder why the distributed learning was slower. Was it the cost of initially partitioning and distributing a big training dataset to the nodes, where the relatively short training time then did not offset that initial cost? I take it that in federated learning each node already has its own dataset and only the parameters are communicated to a central server. Will try to take a closer look at the project.
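To make that trade-off concrete, a back-of-envelope sketch; every number below is an assumption for illustration, not a measurement from the project:

```python
# Illustrative only: all sizes and speeds are assumptions.
dataset_bytes = 10 * 1024**3   # assume a 10 GiB training set to partition
param_bytes   = 100 * 1024**2  # assume ~100 MiB of parameters per exchange
link_bytes_s  = 125 * 1024**2  # assume ~1 Gb/s LAN, ~125 MiB/s usable

# One-time cost of sharding and shipping the dataset to the nodes
# (distributed training pays this; federated learning does not).
data_ship_s = dataset_bytes / link_bytes_s

# Recurring cost of one parameter exchange (upload + download).
sync_s = 2 * param_bytes / link_bytes_s

print(f"shipping the dataset: ~{data_ship_s:.0f} s, one time")
print(f"one parameter sync:   ~{sync_s:.1f} s, per round")
```

With these assumed numbers the up-front shipping cost (~80 s) dominates a short training run, while frequent syncs (every step, in typical data-parallel setups) dominate a long one.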
-
High network costs are the problem; Thunderbolt actually solves that, making Apple-based AI clusters possible.
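For a rough sense of scale (the model size is an assumption, and these are nominal link rates, not measured throughput):

```python
# Back-of-envelope: time to move one full set of fp16 parameters over
# different links. Real throughput will be lower than the nominal rate.
param_count = 7e9                         # assume a 7B-parameter model
payload_gbit = param_count * 2 * 8 / 1e9  # fp16 -> ~112 Gbit per exchange

links_gbps = {
    "Gigabit Ethernet": 1,
    "10GbE": 10,
    "Thunderbolt 4 (nominal)": 40,
}
for name, gbps in links_gbps.items():
    print(f"{name:>24}: ~{payload_gbit / gbps:.0f} s per full exchange")
```

The payload per sync is fixed, so the link speed sets the floor on sync time: minutes over gigabit Ethernet versus seconds over Thunderbolt.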