So PyTorch has torch.distributed. Right now it feels like MLX depends on quantized models + vertical scaling (M-series chips getting more powerful). What about horizontal scaling?
I made simple-federated-learning to enable training over multiple Macs on the same network. The problem is that network costs are too high, so this ends up 5x slower than running training on one Mac lol. That's why I pivoted this project to be about federated learning.
Any ideas on making mlx.distributed?
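For the aggregation half of this, here is a minimal sketch of FedAvg-style parameter averaging with MLX. The transport (how parameter trees get between Macs) is left out, and `TinyNet`, `fedavg`, and `local_models` are made-up names for illustration, not part of simple-federated-learning:

```python
import mlx.nn as nn
from mlx.utils import tree_map

class TinyNet(nn.Module):
    """Stand-in model; any mlx.nn.Module works the same way."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def __call__(self, x):
        return self.fc(x)

def fedavg(param_trees):
    """Element-wise average of a list of parameter trees (one per node)."""
    n = len(param_trees)
    total = param_trees[0]
    for tree in param_trees[1:]:
        total = tree_map(lambda a, b: a + b, total, tree)
    return tree_map(lambda a: a / n, total)

# Pretend three Macs each trained a copy locally.
local_models = [TinyNet() for _ in range(3)]
global_params = fedavg([m.parameters() for m in local_models])

# Broadcast the averaged parameters back to every node.
for m in local_models:
    m.update(global_params)
```

Only parameter trees would cross the network here; moving them between machines (sockets, gRPC, whatever) is the part an mlx.distributed would need to standardize.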
Replies: 2 comments

-

Hi. I wonder why the distributed learning was slower. Was it the cost of initially partitioning and distributing a big training dataset to the nodes, where the relatively short training time then did not offset that initial cost? I take it that in federated learning each node already has its own dataset and only the parameters are communicated to a central server. Will try to take a closer look at the project.
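To make that trade-off concrete, a back-of-envelope sketch; every number below is an assumption for illustration, not a measurement from the project:

```python
# Illustrative only: all sizes and speeds are assumptions.
dataset_bytes = 10 * 1024**3   # assume a 10 GiB training set to partition
param_bytes   = 100 * 1024**2  # assume ~100 MiB of parameters per exchange
link_bytes_s  = 125 * 1024**2  # assume ~1 Gb/s LAN, ~125 MiB/s usable

# One-time cost of sharding and shipping the dataset to the nodes
# (distributed training pays this; federated learning does not).
data_ship_s = dataset_bytes / link_bytes_s

# Recurring cost of one parameter exchange (upload + download).
sync_s = 2 * param_bytes / link_bytes_s

print(f"shipping the dataset: ~{data_ship_s:.0f} s, one time")
print(f"one parameter sync:   ~{sync_s:.1f} s, per round")
```

With these assumed numbers the up-front shipping cost (~80 s) dominates a short training run, while frequent syncs (every step, in typical data-parallel setups) dominate a long one.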
-
High network costs are the problem; Thunderbolt actually solves that, making Apple-based AI clusters possible.
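For a rough sense of scale (the model size is an assumption, and these are nominal link rates, not measured throughput):

```python
# Back-of-envelope: time to move one full set of fp16 parameters over
# different links. Real throughput will be lower than the nominal rate.
param_count = 7e9                         # assume a 7B-parameter model
payload_gbit = param_count * 2 * 8 / 1e9  # fp16 -> ~112 Gbit per exchange

links_gbps = {
    "Gigabit Ethernet": 1,
    "10GbE": 10,
    "Thunderbolt 4 (nominal)": 40,
}
for name, gbps in links_gbps.items():
    print(f"{name:>24}: ~{payload_gbit / gbps:.0f} s per full exchange")
```

The payload per sync is fixed, so the link speed sets the floor on sync time: minutes over gigabit Ethernet versus seconds over Thunderbolt.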