
Sync 4 layer norms - bf16, fp32, optimizer states on restart #274

Open · wants to merge 40 commits into main
Conversation

@tjruwase (Collaborator) commented Mar 28, 2022

This PR uses microsoft/DeepSpeed#1801 @ d911e67 to sync layer norms:

  1. for the bf16 weights
  2. for the fp32 master weights in the bf16 optimizer
  3. for the 2 optimizer states

all_reduce with ReduceOp.AVG is used in all 3 cases.

This automatically works for all layers and all 4 types of layer norms, covering both their weights and biases. A rough sketch of the idea is shown below.
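For anyone following along, here is a minimal sketch of the general idea, not the actual DeepSpeed patch: average each layer-norm parameter, its fp32 master copy, and its two optimizer states across the data-parallel group with all_reduce/ReduceOp.AVG. The `sync_layer_norms` helper, the `fp32_of` mapping, and the Adam-style `exp_avg`/`exp_avg_sq` keys are illustrative assumptions; DeepSpeed's bf16 optimizer keeps these in flattened partitions, so the real code lives in microsoft/DeepSpeed#1801.

```python
import torch
import torch.distributed as dist

def sync_layer_norms(model, optimizer, fp32_of, group=None):
    """Average layer-norm weights/biases, their fp32 master copies, and their
    two optimizer states across the data-parallel group (illustrative sketch)."""
    for module in model.modules():
        if not isinstance(module, torch.nn.LayerNorm):
            continue
        for p in module.parameters():  # layer-norm weight and bias
            # 1. bf16 model weights
            # (ReduceOp.AVG requires a recent PyTorch + NCCL backend)
            dist.all_reduce(p.data, op=dist.ReduceOp.AVG, group=group)

            # 2. fp32 master weights held by the bf16 optimizer
            # (fp32_of is a hypothetical param -> fp32 copy mapping)
            master = fp32_of.get(p)
            if master is not None:
                dist.all_reduce(master.data, op=dist.ReduceOp.AVG, group=group)

            # 3. the two optimizer states (Adam first/second moments assumed)
            state = optimizer.state.get(p, {})
            for key in ("exp_avg", "exp_avg_sq"):
                if key in state:
                    dist.all_reduce(state[key], op=dist.ReduceOp.AVG, group=group)
```

Because every rank contributes its own values and receives the same average, a single call on restart is enough to bring all replicas back to identical layer-norm parameters and optimizer states.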


This has been successfully applied to the live model and brought the layers back in sync, but after some iterations some layers went out of sync again, so there is some other bug still to figure out.

@stas00 stas00 changed the title Olruwase/sync layer norms Sync 4 layer norms - bf16, fp32, optimizer states on restart Mar 29, 2022
@stas00 stas00 mentioned this pull request May 24, 2022