I used the default configs provided in TorchTitan to try to replicate the reported performance. I have modified your launch script, but I am unable to achieve reasonable multi-node performance: 4 nodes are no faster than 1 node in aggregate throughput. There is one related issue: #676.
Dependencies:
CUDA 12.6
torch 2.6.0 nightly 20241121
Things I tried:
- Eliminating pyxis/enroot (and with it the Docker images) and running directly from a conda environment; this did not solve it.
- Varying SBATCH and NCCL environment variables, but I ultimately couldn't make it perform any better.
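For reference, this is the kind of NCCL diagnostic setup I experimented with in the sbatch script. The values are guesses for debugging, not a known-good configuration, and the interface names are cluster-specific placeholders:

```shell
#!/bin/bash
# Sketch of the NCCL debug environment tried in the sbatch script.
# Diagnostic settings only, not a known-good configuration.

# Surface NCCL's transport selection in the logs, so you can see whether
# it picks RDMA (IB/RoCE) or silently falls back to TCP sockets.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# If NCCL falls back to sockets, multi-node bandwidth collapses.
# Pinning interfaces can help; names below are placeholders to adapt.
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_IB_HCA=mlx5
```

With `NCCL_DEBUG=INFO` the rank-0 log shows which network plugin and interfaces each communicator uses, which is usually enough to tell an RDMA path from a TCP fallback.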
Performance
All nodes are on DGX Cloud through OCI, with 3.2 Tb/s of inter-node bandwidth per node.
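One way to check whether the fabric actually delivers that bandwidth, independently of TorchTitan, is a raw all-reduce benchmark from NVIDIA's nccl-tests. The srun flags below are illustrative for a 4-node, 8-GPU-per-node allocation and would need adapting to the actual cluster:

```shell
# Raw inter-node bandwidth check with nccl-tests
# (https://github.com/NVIDIA/nccl-tests). Flags are illustrative
# for 4 nodes x 8 GPUs; adjust to your allocation.
srun --nodes=4 --ntasks-per-node=8 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```

A healthy RDMA fabric should report busbw in the hundreds of GB/s for large message sizes; numbers in the single-digit GB/s range would point at a TCP fallback or misconfigured NICs rather than a TorchTitan problem.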
multi-node (32x H100)
1: [rank0]:2024-11-27 14:28:39,087 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 32, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10 loss: 1.5568 memory: 26.47GiB(33.47%) wps: 1,786 mfu: 10.46%
3: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10 loss: 1.5568 memory: 26.47GiB(33.47%) wps: 1,786 mfu: 10.46%
1: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10 loss: 1.5568 memory: 26.47GiB(33.47%) wps: 1,786 mfu: 10.46%
2: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10 loss: 1.5568 memory: 26.47GiB(33.47%) wps: 1,786 mfu: 10.46%
single-node (8x H100)
0: [rank0]:2024-11-27 14:38:23,621 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 8, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:39:07,085 - root - INFO - step: 20 loss: 1.5182 memory: 37.87GiB(47.87%) wps: 7,357 mfu: 43.08%
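To quantify the gap from the two logs above (a back-of-the-envelope check, assuming wps and mfu in TorchTitan's logs are per-GPU figures):

```python
# Back-of-the-envelope scaling check from the logs above.
# Assumption: wps in TorchTitan's log line is a per-GPU figure.

single_node_wps = 7357   # 8x H100 run
multi_node_wps = 1786    # 32x H100 run

# Per-GPU efficiency of the 4-node run relative to the 1-node run.
efficiency = multi_node_wps / single_node_wps
print(f"per-GPU scaling efficiency: {efficiency:.1%}")   # ~24.3%

# Aggregate tokens/s: 4 nodes do not beat 1 node.
print(f"1 node aggregate:  {single_node_wps * 8:,} tok/s")   # 58,856
print(f"4 nodes aggregate: {multi_node_wps * 32:,} tok/s")   # 57,152
```

The 32-GPU aggregate is actually slightly below the 8-GPU aggregate, matching the "4 nodes is as fast as 1 node" observation and pointing at communication, not compute, as the bottleneck.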
Launch script