Add UL2 data sampling and pretraining #358

Since we create them in the T5 data loader, why not use them?

Handles backward-compatibility, so the rest of the code base does not need to change.

Namely sampling from uniform and normal distributions.

... which also improve error messages.

Instead, the user should choose a larger maximum sequence length, which an error warns them about.

Instead of concatenating arrays and lists to get a certain dtype.

For small sequence lengths or low probability/mean ngram values, we could get `max_ngrams` < 1 and `max_predictions_per_seq` < 1, causing no masking to be done.
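
A minimal sketch of the kind of clamping this implies, assuming both limits are derived by rounding a product or a mean. The helper name `clamp_masking_budget` and its parameters are hypothetical and not taken from the code base.

```python
def clamp_masking_budget(seq_length, masked_lm_prob, mean_ngram_size):
    """Hypothetical helper: derive masking limits that never drop below 1."""
    # Without the `max(1, ...)` guard, short sequences or small
    # probabilities round down to 0 and nothing would be masked.
    max_predictions_per_seq = max(1, round(masked_lm_prob * seq_length))
    max_ngrams = max(1, round(mean_ngram_size))
    return max_ngrams, max_predictions_per_seq


# A very short sequence still gets at least one masked prediction.
short_limits = clamp_masking_budget(4, 0.15, 0.5)
long_limits = clamp_masking_budget(512, 0.15, 3)
```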

Now same as in the UL2 paper code snippet.

Since the normal distribution is unbounded, we cannot have `max_ngrams` set to a bounded value.
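
To illustrate why a fixed upper bound does not fit here, the sketch below draws n-gram lengths from a normal distribution and limits them only by 1 and the number of available tokens. The helper `sample_ngram_length`, its parameters, and the clipping to `num_tokens` are assumptions for illustration, not the code in this PR.

```python
import random


def sample_ngram_length(rng, mean, std, num_tokens):
    """Hypothetical helper: draw one n-gram length from a normal distribution."""
    # A normal draw has no upper bound, so no fixed `max_ngrams` can cap it.
    # The only limits applied here are 1 and the number of available tokens.
    n = round(rng.gauss(mean, std))
    return min(max(n, 1), num_tokens)


rng = random.Random(1234)
lengths = [
    sample_ngram_length(rng, mean=3.0, std=1.0, num_tokens=16)
    for _ in range(1000)
]
```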

Filtered means not `cls_id` or `sep_id` tokens. This slightly improves calculated statistics for long sequences and greatly for very short sequences.
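
As a rough illustration of the filtering described above, the following sketch counts only tokens that are neither `cls_id` nor `sep_id`. The function name and the example IDs are hypothetical and not taken from the code base.

```python
def count_filtered_tokens(tokens, cls_id, sep_id):
    """Hypothetical helper: count tokens, ignoring `cls_id` and `sep_id`."""
    return sum(1 for token in tokens if token != cls_id and token != sep_id)


# For a very short sequence, the two special tokens make up a large
# share of the length, which is why filtering matters most there.
num_filtered = count_filtered_tokens([101, 5, 6, 102, 7], cls_id=101, sep_id=102)
```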

Usually we do not iterate through all indices, so we can save quite some time if `max_ngrams` is large.

Via an extra "private" argument.

The GPT tokenizer does not handle the difference between UL2 tokens and other special tokens well. This should be fine as UL2 tokens being distinct from other special tokens is never assumed at the moment (although other tokenizers implement it like that). In general, `additional_special_token_ids` is new for the GPT tokenizer, so there is no backward compatibility trouble.

With this, we also adjust the `additional_special_token_ids` to only return extra ID tokens.

Personally, I think this makes the model more holistic, and we never inherited correctly anyway, changing the public API. This also allows usage of tokenizers without `cls_id`, which was previously queried redundantly due to the mentioned incorrect inheritance. Finally, the inheritance never saved much repetition to begin with.

Removing all inheritance from the class was a bit too eager.

For readability.

It could make sense in the future to even allow different tokens for the same denoising objective. (E.g. one R-denoiser has token `[R]`, another R-denoiser has `[R+]`.)

Backward-compatible since passing `sentinel_tokens=None` would have resulted in an error previously.

Backward-incompatible change as we put this before an existing positional argument.

Was wrong for decoder-only case.

Previously the model didn't know _where_ the data was actually inserted. Now it repeats the input sequence and inserts the masked data in the correct place. See the example in Fig. 1 of the AlexaTM 20B paper (arXiv:2208.01448).

This wording was confusing and basically stated the wrong thing. The number/amount of n-grams is not bounded by `max_ngrams`, even though the variable name sounds like it. Instead, `max_ngrams` bounds n.
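
A small sketch of the distinction, assuming n-gram lengths are drawn uniformly until a masking budget is used up. The function `sample_ngram_lengths` and its parameters are hypothetical. Note how the number of n-grams exceeds `max_ngrams` while no single n-gram is longer than it.

```python
import random


def sample_ngram_lengths(num_to_mask, max_ngrams, seed=0):
    """Hypothetical helper: split a masking budget into n-gram lengths."""
    rng = random.Random(seed)
    lengths = []
    remaining = num_to_mask
    while remaining > 0:
        # `max_ngrams` bounds n (the length of one n-gram),
        # not how many n-grams are sampled.
        n = rng.randint(1, min(max_ngrams, remaining))
        lengths.append(n)
        remaining -= n
    return lengths


ngram_lengths = sample_ngram_lengths(num_to_mask=20, max_ngrams=3)
```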

It's just too ugly to leave it like the original.

Expecting the user to supply a sequence length greater than any data point is ridiculous. So now we greedily truncate the sequence based on the maximum number of `extra_id`s, which wastes a lot of data. An alternative would be going a statistical route with significance attached to it: allowing the expected number of tokens with some leeway, while handling the unlikely error of exceeding the length. This only handles the decoder-only case, while the encoder-decoder case is left as is. This is because errors are much less likely for the encoder-decoder case unless massive corruption is configured or the decoder has a smaller sequence length than the encoder.
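
The greedy truncation could look roughly like the sketch below. It assumes that, in the worst case, every available `extra_id` is used once on the input side and once on the target side. The helper `greedy_truncate`, its parameters, and this cost model are assumptions for illustration, not the exact logic of this PR.

```python
def greedy_truncate(tokens, max_seq_length, max_num_extra_ids,
                    num_special_tokens=0):
    """Hypothetical helper: truncate so the worst case still fits."""
    # Reserve room for every sentinel that could appear (assumed twice:
    # once in the input part, once in the target part) plus any other
    # special tokens. This is pessimistic and wastes data when few
    # spans are actually masked.
    budget = max_seq_length - 2 * max_num_extra_ids - num_special_tokens
    if budget < 1:
        raise ValueError('maximum sequence length is too small')
    return tokens[:budget]


truncated = greedy_truncate(list(range(100)), max_seq_length=64,
                            max_num_extra_ids=10, num_special_tokens=2)
untouched = greedy_truncate([1, 2, 3], max_seq_length=64,
                            max_num_extra_ids=10, num_special_tokens=2)
```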

Backward-incompatible change due to positional argument without default, inserted before another positional argument.

Forgot to apply fixes here.

Accordingly, rename to `get_samples`.

Now we won't exceed the desired seq length.

Just pull them out of the other ones (and add separating whitespace/join lines).

Did not include additional special tokens.

Useful for evaluation.

"lambada" was renamed to "lambada_openai" in the upstream lm-eval-harness repo.

This corrupts the targets. There is no good reason for this.

Previously we always gave the whole sequence as context, even though it also includes the answer. This is obviously not desired. We only want to give enough context to reach the answer.
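
A minimal sketch of the intended split, assuming the answer occupies the last tokens of the sequence. The function name `split_context_and_answer` and the token IDs are hypothetical.

```python
def split_context_and_answer(tokens, answer_length):
    """Hypothetical helper: keep the answer out of the context."""
    if not 0 < answer_length < len(tokens):
        raise ValueError('answer must be shorter than the sequence')
    # The context stops right before the answer starts.
    context = tokens[:-answer_length]
    answer = tokens[-answer_length:]
    return context, answer


context, answer = split_context_and_answer([11, 12, 13, 14, 15], answer_length=2)
```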

These models have moved into DeepSpeed but were never properly replaced here after they were removed.

When indexing into `False` or `None`.

At worst, these may be mapped to the wrong tokens. However, the chance that the number of unknown tokens is no greater than the small number of UL2 tokens is very low. And if there are more unknown tokens than UL2 tokens, we'll get errors.

`XPos` → `XPosEmbedding`

As in the T5 codebase. This could have highly detrimental effects on performance if TorchScript cannot easily type-dispatch the `bias_dropout_add` function.

More code reuse, change some methods to functions and change their visibility.

For readability.

By pre-allocating more data.

DS = DeepSpeed. No idea why this happens; I couldn't explain it after briefly looking into the DeepSpeed source.

That is, the reproduced objective token.

Was missing `max_seq_length_dec`.

This was already the case for encoder-decoders, but is now also the case for decoder-only models.

This also fixes problems with decoder-only attention masks.

When using the custom fused softmax kernel.

Commits on Jan 2, 2023

Remove redundant imports

janEbert committed Jan 2, 2023

69f6e70

Commits on Mar 2, 2023

Fix context batch size padding

janEbert committed Mar 2, 2023

9c4c718

Commits on Apr 3, 2023

Try to exit padding removal early

janEbert committed Apr 3, 2023

600542d

Commits on Apr 4, 2023

Fix xPos embedding

janEbert committed Apr 4, 2023

58831d2

Commits on Dec 13, 2022

Commits on Dec 14, 2022

Commits on Jan 2, 2023

Commits on Jan 3, 2023

Commits on Jan 23, 2023

Commits on Jan 24, 2023

Commits on Feb 14, 2023

Commits on Feb 15, 2023

Commits on Feb 16, 2023

Commits on Feb 17, 2023

Commits on Feb 22, 2023

Commits on Feb 23, 2023

Commits on Feb 24, 2023

Commits on Feb 27, 2023

Commits on Feb 28, 2023

Commits on Mar 1, 2023

Commits on Mar 2, 2023

Commits on Mar 7, 2023

Commits on Mar 9, 2023

Commits on Mar 10, 2023

Commits on Mar 20, 2023

Commits on Mar 21, 2023

Commits on Mar 24, 2023

Commits on Apr 3, 2023

Commits on Apr 4, 2023

Commits on Apr 13, 2023

Commits on Jun 7, 2023

Commits on Jun 26, 2023

Commits on Jun 29, 2023
