Add UL2 data sampling and pretraining #358

janEbert · 2022-12-13T14:24:36Z

This adds pretraining using UL2 for both encoder-decoder, non-causal decoder-only, and causal decoder-only models.
I have not yet run large-scale tests to see if it yields the desired training improvements, but I wanted to give others the option to take a look at the code already.

Since we create them in the T5 data loader, why not use them?

Handles backward-compatibility, so the rest of the code base does not need to change.

Namely sampling from uniform and normal distributions.

... which also improve error messages.

Instead, the user should choose a larger maximum sequence length, which an error warns them about.

janEbert · 2022-12-14T08:22:41Z

Previously, I truncated sequences so the maximum amount of duplicated extra_id tokens would fit in and still be accepted by the model, losing a bit of data most of the time. I now changed it so the program just errors out and asks the user to put in a longer sequence length for the model.

This is probably a worse/undesired solution, so I kept the other code in for now (but commented).

Note that erroring out is also how the T5Dataset does it.

Instead of concatenating arrays and lists to get a certain dtype.

Muennighoff · 2022-12-28T14:04:11Z

megatron/data/dataset_utils.py

        # Note(mingdachen):
        # By default, we set the probilities to favor shorter ngram sequences.
        pvals = 1. / np.arange(1, max_ngrams + 1)
        pvals /= pvals.sum(keepdims=True)
        if favor_longer_ngram:
            pvals = pvals[::-1]
+    elif sampling_style is SamplingStyle.NORMAL:
+        normal_mean = (max_ngrams + 1) / 2


normal_mean is not used it seems

It's used here. :)
d8db189#diff-e1d14be32f4489a01cb8d571804fbba003f7f90715ef3cb3a27d9099e0245d6fR298

For small sequence lengths or low probability/mean ngram values, we could get `max_ngrams` < 1 and `max_predictions_per_seq` < 1, causing no masking to be done.

Now same as in the UL2 paper code snippet.

janEbert · 2023-01-03T12:50:33Z

There were several issues still remaining in the UL2 implementation, most notably that I only tested for micro batch sizes of 1, which when increased made the decoder-only models fail. :p
Also most notably in terms of the UL2 sampling, there was an issue regarding the S-denoisers, in which the mean was not correctly positioned, leading to shorter masks than desired.

The implementation also more closely follows the seqio implementation in the UL2 paper now, which omits the single extra_id token for the Prefix-LM task, which we previously added.

As in the T5 codebase. This could have highly detrimental effects on performance of TorchScript cannot easily type-dispatch the `bias_dropout_add` function.

More code reuse, change some methods to functions and change their visibility.

For readability.

By pre-allocating more data.

janEbert · 2023-04-06T14:24:16Z

I can finally report results... Comparing standard T5 training vs training with UL2 or UL2R, results in lm-eval-harness were almost always better with UL2/UL2R. Which should mean this code does improve evaluation results. :)

DS = DeepSpeed No idea why this happens, I couldn't explain it after briefly looking into the DeepSpeed source.

That is, the reproduced objective token.

Was missing `max_seq_length_dec`.

This was already the case for encoder-decoders, but is now also the case for decoder-only models.

This also fixes problems with decoder-only attention masks.

When using the custom fused softmax kernel.

janEbert added 5 commits December 13, 2022 12:12

Fix PretrainedFromHF tokenizer with T5 training

b2fc665

Allow passing existing casual attention masks

13becf1

Since we create them in the T5 data loader, why not use them?

Refactor masked LM sampling style selection

7f50532

Handles backward-compatibility, so the rest of the code base does not need to change.

Add more masked LM sampling styles

d8db189

Namely sampling from uniform and normal distributions.

Allow Prefix-LM style masked LM

006c4e9

janEbert force-pushed the ul2 branch 3 times, most recently from db95ce8 to 4d9ff77 Compare December 13, 2022 17:24

janEbert added 2 commits December 13, 2022 18:25

Add UL2 pretraining for T5 model

f802317

Refactor span merging

deed87f

janEbert force-pushed the ul2 branch from 4d9ff77 to ab85833 Compare December 13, 2022 17:25

Support UL2 for decoder-only models

728e076

janEbert force-pushed the ul2 branch from ab85833 to 728e076 Compare December 13, 2022 18:39

janEbert added 4 commits December 13, 2022 20:25

Unconditionally use safe maximum sequence length

42ece6b

Add custom exceptions

d18f84e

... which also improve error messages.

Error out on too long sequences

fa5aa68

Remove additional sequence truncation

c7d8a8b

Instead, the user should choose a larger maximum sequence length, which an error warns them about.

Prefer array-from-list creation

c722516

Instead of concatenating arrays and lists to get a certain dtype.

Muennighoff reviewed Dec 28, 2022

View reviewed changes

janEbert added 8 commits January 2, 2023 11:51

Remove redundant imports

69f6e70

Fix not inserting prefixes

f08a104

For small sequence lengths or low probability/mean ngram values, we could get `max_ngrams` < 1 and `max_predictions_per_seq` < 1, causing no masking to be done.

Do not insert extra_id tokens for PrefixLM task

d2fd03e

Now same as in the UL2 paper code snippet.

Document max_seq_length_dec argument

daf52cc

Skip redundant computations

04be590

Fix PrefixLM mean location

7bc5a87

Pad decoder-only inputs to same length

775e99d

Fix decoder-only attention mask shape

538c30b

Document index set selection for PrefixLM masking

ba4476c

janEbert added 18 commits March 9, 2023 09:19

Do not use bias for 2nd MLP layer if using T5 GLU

482f0ea

As in the T5 codebase. This could have highly detrimental effects on performance of TorchScript cannot easily type-dispatch the `bias_dropout_add` function.

Fix T5 GLU constructor arguments

4385f7b

Refactor samples dict creation

2d24b13

More code reuse, change some methods to functions and change their visibility.

Move callees under caller

bd461f5

For readability.

Handle empty context

35b2956

Handle more possible model types

f0171e0

Fix fully truncated contexts with prefix tokens

92158d8

Make T5 GLU checks safer

3b7692f

Improve import code style

b37d3ee

Refactor dummy barriers

5959e89

Refactor file name creation

ce8c1a5

Allow packing only full documents

3e52966

Use full-doc packing for T5-style datasets

23efa88

Fix trying to all-reduce non-existent bias

88eb98a

Fix truncating packed sequences without padding

59e8451

Speed up packed dataset indexing

24d46ff

By pre-allocating more data.

Try to exit padding removal early

600542d

Fix xPos embedding

58831d2

janEbert added 11 commits April 13, 2023 16:25

Fix padding loss mask

fe45cea

Handle failure mode regarding non-DS checkpoints

15e7b98

DS = DeepSpeed No idea why this happens, I couldn't explain it after briefly looking into the DeepSpeed source.

Fix decoder-only and no-mask-tokens seq lengths

ae45a9e

Omit second objective token if without mask tokens

0c91b96

That is, the reproduced objective token.

Fix NumPy deprecations

0c246c4

Fix supplied arguments

7ce8635

Was missing `max_seq_length_dec`.

Do not add separator if S-denoising

7290181

This was already the case for encoder-decoders, but is now also the case for decoder-only models.

Fix caching error

628d847

Fix number of labels calculation for decoder-only

9c727e7

Do not automatically add <EOS> token when packing

4ffa951

This also fixes problems with decoder-only attention masks.

Allow silently ignoring causal attention mask

ff5787e

When using the custom fused softmax kernel.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add UL2 data sampling and pretraining #358

Add UL2 data sampling and pretraining #358

janEbert commented Dec 13, 2022

janEbert commented Dec 14, 2022 •

edited

Loading

Muennighoff Dec 28, 2022

janEbert Jan 2, 2023

janEbert commented Jan 3, 2023

janEbert commented Apr 6, 2023

Add UL2 data sampling and pretraining #358

Are you sure you want to change the base?

Add UL2 data sampling and pretraining #358

Conversation

janEbert commented Dec 13, 2022

janEbert commented Dec 14, 2022 • edited Loading

Muennighoff Dec 28, 2022

Choose a reason for hiding this comment

janEbert Jan 2, 2023

Choose a reason for hiding this comment

janEbert commented Jan 3, 2023

janEbert commented Apr 6, 2023

janEbert commented Dec 14, 2022 •

edited

Loading