
Add a new resnet50+convergence+spmd test #1013

Closed

Conversation

@yeounoh (Contributor) commented Nov 11, 2023:

Description

Add a new convergence test for ResNet50 + SPMD, using batch sharding.
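
For context, batch sharding with PyTorch/XLA SPMD is typically set up along these lines. This is a minimal sketch against the torch_xla SPMD API as it stood around late 2023 (the module has since moved); the mesh, tensor shapes, and variable names are illustrative, not taken from the actual test script:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharding import Mesh

# Enable SPMD mode before creating any XLA tensors.
xr.use_spmd()

# Build a 1-D device mesh over all runtime devices.
num_devices = xr.global_runtime_device_count()
mesh = Mesh(np.arange(num_devices), (num_devices,), ('batch',))

# Shard an input batch along dim 0 (the batch axis); other dims stay replicated.
images = torch.randn(128, 3, 224, 224).to(xm.xla_device())
xs.mark_sharding(images, mesh, (0, None, None, None))
```

Sharding only the batch dimension keeps the per-device computation equivalent to the data-parallel MP setup while a single process drives all devices.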

@yeounoh yeounoh requested review from jonb377 and ManfeiBai November 11, 2023 01:40
@yeounoh yeounoh self-assigned this Nov 11, 2023
@yeounoh yeounoh requested a review from alanwaketan November 11, 2023 01:40
@jonb377 (Collaborator) commented Nov 12, 2023:

@yeounoh Do you have a oneshot run for this?

@jonb377 (Collaborator) commented Nov 12, 2023:

I tried a local run; it's very slow:

Epoch 1 train begin 22:36:40
| Training Device=xla:0/0 Epoch=1 Step=0 Loss=7.11059 Rate=9.68 GlobalRate=9.68 Time=22:38:26
| Training Device=xla:0/0 Epoch=1 Step=200 Loss=6.48232 Rate=67.85 GlobalRate=101.57 Time=23:10:26

We might need to optimize our resnet training script some more to avoid timeouts.

EDIT: It seems like dataloading is the bottleneck; the PjRt convergence tests are also pretty slow: http://shortn/_whmMQ4flXg

@jonb377 (Collaborator) previously approved these changes Nov 13, 2023 and left a comment:

LGTM, thanks Yeounoh!

@@ -209,5 +209,6 @@ local tpus = import 'templates/tpus.libsonnet';
     // SPMD
     resnet50 + functional + v4_8 + timeouts.Hours(2) + spmd(['batch']),
     resnet50 + functional + v4_8 + timeouts.Hours(2) + spmd(['spatial']),
+    resnet50 + convergence + v4_8 + timeouts.Hours(2) + spmd(['batch']),
@jonb377 (Collaborator) commented on this diff, Nov 13, 2023:
Let's make the timeout 14 hours to match the PjRt tests. I've kicked off a oneshot here; let's merge if it passes: http://shortn/_b8rAgZ0mkY

cloudtop [~/d/x/1/ml-testing-accelerators] % git diff
diff --git a/tests/pytorch/nightly/resnet50-mp.libsonnet b/tests/pytorch/nightly/resnet50-mp.libsonnet
index b24e5bc5..ba86abab 100644
--- a/tests/pytorch/nightly/resnet50-mp.libsonnet
+++ b/tests/pytorch/nightly/resnet50-mp.libsonnet
@@ -209,5 +209,6 @@ local tpus = import 'templates/tpus.libsonnet';
     // SPMD
     resnet50 + functional + v4_8 + timeouts.Hours(2) + spmd(['batch']),
     resnet50 + functional + v4_8 + timeouts.Hours(2) + spmd(['spatial']),
+    resnet50 + convergence + v4_8 + timeouts.Hours(14) + spmd(['batch']),
   ],
 }

@yeounoh (Contributor, Author) replied:

Thanks, it failed due to RESOURCE_EXHAUSTED after an epoch. Is this the same error you were running into for the convergence test?

@jonb377 (Collaborator) replied:

I would guess this is related to the eval loop's dataloader sharding. I also noticed that our rate is still ~1/8 that of the regular PjRt tests - probably because data processing for all devices happens in a single thread.
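
For reference, one way to avoid handing the device a fully replicated global batch is to shard inputs as they are loaded. This is a minimal sketch assuming the input_sharding option on MpDeviceLoader and reusing the mesh and a train_loader from the sharding sketch above; it is not the exact code used by this test:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.experimental.xla_sharding as xs

# Shard each batch along dim 0 while it is transferred to the device,
# instead of materializing the full unsharded batch on one device.
sharding_spec = xs.ShardingSpec(mesh, (0, None, None, None))
train_device_loader = pl.MpDeviceLoader(
    train_loader, xm.xla_device(), input_sharding=sharding_spec)

for step, (images, labels) in enumerate(train_device_loader):
    ...  # forward/backward as usual; inputs arrive already batch-sharded
```

If the eval loop builds its loader without an equivalent sharding spec, that would be consistent with the RESOURCE_EXHAUSTED guess above, though this is only speculation.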

@alanwaketan (Collaborator) commented:

> I tried a local run; it's very slow:
>
> Epoch 1 train begin 22:36:40
> | Training Device=xla:0/0 Epoch=1 Step=0 Loss=7.11059 Rate=9.68 GlobalRate=9.68 Time=22:38:26
> | Training Device=xla:0/0 Epoch=1 Step=200 Loss=6.48232 Rate=67.85 GlobalRate=101.57 Time=23:10:26
>
> We might need to optimize our resnet training script some more to avoid timeouts.
>
> EDIT: It seems like dataloading is the bottleneck; the PjRt convergence tests are also pretty slow: http://shortn/_whmMQ4flXg

If that's the case, can we use bert or some other language models?

@yeounoh (Contributor, Author) commented Nov 14, 2023:

> > I tried a local run; it's very slow:
> >
> > Epoch 1 train begin 22:36:40
> > | Training Device=xla:0/0 Epoch=1 Step=0 Loss=7.11059 Rate=9.68 GlobalRate=9.68 Time=22:38:26
> > | Training Device=xla:0/0 Epoch=1 Step=200 Loss=6.48232 Rate=67.85 GlobalRate=101.57 Time=23:10:26
> >
> > We might need to optimize our resnet training script some more to avoid timeouts.
> > EDIT: It seems like dataloading is the bottleneck; the PjRt convergence tests are also pretty slow: http://shortn/_whmMQ4flXg
>
> If that's the case, can we use bert or some other language models?

Yeah, we can just use llama, but it would require us to emit the correct TensorBoard metrics. Will follow up with @will-cromar @jonb377.

@jonb377 jonb377 self-requested a review November 14, 2023 01:13
@jonb377 jonb377 dismissed their stale review November 14, 2023 01:13 (reason: "Convergence too slow")

@yeounoh yeounoh marked this pull request as draft November 14, 2023 20:28
@alanwaketan (Collaborator) commented:

Curious: are we going to use resnet for the convergence test or not?

@jonb377 (Collaborator) commented Dec 11, 2023:

> Curious: are we going to use resnet for the convergence test or not?

I chatted with @will-cromar - there's still hope for resnet. We can increase the number of worker threads to better match the MP PjRt performance without reinventing the SPMD dataloader.

I'll test this out later today (run with --num_workers 64; it's still pretty slow).
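
As a rough sketch of the worker-count change being suggested (train_dataset and global_batch_size are placeholders; 64 is simply the value mentioned above, not a verified setting):

```python
from torch.utils.data import DataLoader

# With SPMD a single host process feeds the global batch for every device,
# so it needs far more loader workers than a per-device MP process would.
train_loader = DataLoader(
    train_dataset,
    batch_size=global_batch_size,
    shuffle=True,
    num_workers=64,   # --num_workers 64 from the comment above
    drop_last=True)
```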

@yeounoh yeounoh closed this Mar 15, 2024