Add UL2 data sampling and pretraining #358
base: main
Commits on Dec 13, 2022
- b2fc665
- Allow passing existing causal attention masks (13becf1)
  Since we create them in the T5 data loader, why not use them?
- Refactor masked LM sampling style selection (7f50532)
  Handles backward compatibility, so the rest of the code base does not need to change.
- Add more masked LM sampling styles (d8db189)
  Namely sampling from uniform and normal distributions.
- 006c4e9
- f802317
- deed87f
- 728e076
- 42ece6b
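To illustrate the uniform and normal sampling styles from d8db189, here is a minimal sketch of drawing an n-gram length under each style. The helper `sample_ngram_length`, its parameters, and the standard deviation of 1.0 are assumptions made for illustration, not the PR's actual code.

```python
import random


def sample_ngram_length(style, mean_ngram, max_ngrams, rng):
    """Draw one n-gram length n >= 1 (hypothetical helper, not the PR's API)."""
    if style == "uniform":
        # Every length in [1, max_ngrams] is equally likely.
        return rng.randint(1, max_ngrams)
    if style == "normal":
        # Round a Gaussian draw centred on the mean. The draw is only
        # clamped from below, so max_ngrams is not used for this style.
        # The standard deviation of 1.0 is an arbitrary assumption.
        return max(1, round(rng.gauss(mean_ngram, 1.0)))
    raise ValueError(f"unknown sampling style: {style!r}")


rng = random.Random(0)
uniform_draws = [sample_ngram_length("uniform", 3, 5, rng) for _ in range(1000)]
normal_draws = [sample_ngram_length("normal", 3, 5, rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which tends to matter for data loaders that must be deterministic across restarts.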
Commits on Dec 14, 2022
- d18f84e
- fa5aa68
- Remove additional sequence truncation (c7d8a8b)
  Instead, the user should choose a larger maximum sequence length; an error warns them about this.
- Prefer array-from-list creation (c722516)
  Instead of concatenating arrays and lists to get a certain dtype.
Commits on Jan 2, 2023
- 69f6e70
Commits on Jan 3, 2023
- f08a104
  For small sequence lengths or low probability/mean n-gram values, we could get `max_ngrams` < 1 and `max_predictions_per_seq` < 1, causing no masking to be done.
- Do not insert `extra_id` tokens for PrefixLM task (d2fd03e)
  Now same as in the UL2 paper code snippet.
- daf52cc
- 04be590
- 7bc5a87
- 775e99d
- 538c30b
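The failure described in f08a104 can be sketched as follows. The formulas deriving the two limits are assumptions chosen only to show how both can truncate to zero; the function name `masking_limits` and its parameters are hypothetical. Only the clamping to at least 1 reflects the commit description.

```python
def masking_limits(seq_length, masked_lm_prob, mean_ngram):
    """Return (max_ngrams, max_predictions_per_seq), each at least 1."""
    # Assumed derivations: both truncate toward zero for small inputs.
    raw_max_ngrams = int(2 * mean_ngram - 1)
    raw_max_predictions = int(seq_length * masked_lm_prob)
    # Without the clamp, a value of 0 here would disable masking entirely.
    return max(1, raw_max_ngrams), max(1, raw_max_predictions)


short_limits = masking_limits(seq_length=4, masked_lm_prob=0.15, mean_ngram=0.4)
long_limits = masking_limits(seq_length=512, masked_lm_prob=0.15, mean_ngram=3)
```

With the short sequence both raw values truncate to 0, and the clamp lifts them to 1 so that at least one token can still be masked.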
Commits on Jan 23, 2023
- ba4476c
- Fix `max_ngrams` for normal sampling style (678fbdc)
  Since the normal distribution is unbounded, we cannot have `max_ngrams` set to a bounded value.
- 00479e5
- Calculate and use amount of filtered tokens (795caef)
  Filtered means not `cls_id` or `sep_id` tokens. This slightly improves the calculated statistics for long sequences and greatly improves them for very short sequences.
- 689e15f
- e44d0e4
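The effect of counting only filtered tokens (795caef) is easiest to see on a very short sequence. This is a minimal sketch; `count_filtered_tokens` and the token IDs are made up for illustration.

```python
def count_filtered_tokens(tokens, cls_id, sep_id):
    """Count tokens that are neither `cls_id` nor `sep_id`."""
    return sum(1 for token in tokens if token != cls_id and token != sep_id)


CLS_ID, SEP_ID = 101, 102  # hypothetical special token IDs
tokens = [CLS_ID, 7, 8, SEP_ID]

num_all = len(tokens)
num_filtered = count_filtered_tokens(tokens, CLS_ID, SEP_ID)

# A masking budget based on all tokens overestimates what can be masked.
masked_lm_prob = 0.5
budget_all = int(num_all * masked_lm_prob)
budget_filtered = int(num_filtered * masked_lm_prob)
```

Here half of the four tokens are special tokens, so the unfiltered budget is twice the filtered one. For long sequences the two special tokens barely change the ratio, which matches the commit description.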
Commits on Jan 24, 2023
- 075f05f
- Calculate n-gram indices lazily (6bc7471)
  Usually we do not iterate through all indices, so we can save quite some time if `max_ngrams` is large.
- a105f32
- f0fe282
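The idea behind 6bc7471 can be sketched with a generator: index windows are only built for the start positions the consumer actually visits. The function `iter_ngram_indices` is a hypothetical stand-in, not the PR's implementation.

```python
from itertools import islice


def iter_ngram_indices(num_tokens, max_ngrams):
    """Lazily yield, per start position, the index lists of its n-grams."""
    for start in range(num_tokens):
        # Built only when the consumer asks for this start position.
        yield [
            list(range(start, start + n))
            for n in range(1, max_ngrams + 1)
            if start + n <= num_tokens
        ]


# Only the first two start positions are ever materialized here.
first_two = list(islice(iter_ngram_indices(num_tokens=5, max_ngrams=3), 2))
```

An eager version would build roughly `num_tokens * max_ngrams` index lists up front, even if masking stops after a few positions because the prediction budget is exhausted.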
Commits on Feb 14, 2023
- 11bd6db
- Support UL2 tokens for all tokenizers (43eee93)
  The GPT tokenizer does not handle the difference between UL2 tokens and other special tokens well. This should be fine, as nothing currently assumes that UL2 tokens are distinct from other special tokens (although other tokenizers implement it like that). In general, `additional_special_token_ids` is new for the GPT tokenizer, so there is no backward-compatibility trouble.
- Support `<extra_id>` tokens for GPT tokenizer (6686f04)
  With this, we also adjust the `additional_special_token_ids` to only return extra ID tokens.
- f6128c6
- Revert inheriting from `T5Dataset` (8f48763)
  Personally, this makes the model more holistic and we never inherited correctly anyway, changing the public API. Additionally, this allows usage of tokenizers without `cls_id`, which was previously redundantly queried due to the mentioned incorrect inheritance. Finally, the inheritance never saved much repetition to begin with.
- 7f99a12
- Do inherit from `torch.utils.data.Dataset` (535a306)
  Removing all inheritance from the class was a bit too eager.
- db623b3
- Allow selectively disabling denoiser token (ef72280)
  Could make sense in the future to even allow different tokens for the same denoising objective (e.g. one R-denoiser has token `[R]`, another R-denoiser has `[R+]`).
- Allow not replacing masks with sentinel tokens (001b50c)
  Backward-compatible since passing `sentinel_tokens=None` would have resulted in an error previously.
- Support not adding mask tokens in span corruption (23c052f)
  Backward-incompatible change as we put this before an existing positional argument.
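The two sentinel-related commits (001b50c, 23c052f) are easier to follow with a toy span corruption routine. This is a simplified sketch under assumed conventions (sorted, non-overlapping spans with exclusive ends); `corrupt_spans` and the token IDs are hypothetical and do not mirror the PR's function signatures.

```python
def corrupt_spans(tokens, spans, sentinel_tokens=None):
    """Split `tokens` into (inputs, targets) by cutting out `spans`.

    With `sentinel_tokens=None`, masked spans are removed from the inputs
    without any placeholder, and the targets carry no sentinels either.
    """
    sentinels = iter(sentinel_tokens) if sentinel_tokens is not None else None
    inputs, targets = [], []
    prev_end = 0
    for start, end in spans:
        inputs.extend(tokens[prev_end:start])
        if sentinels is not None:
            # The same sentinel marks the gap in the inputs and
            # introduces the matching span in the targets.
            sentinel = next(sentinels)
            inputs.append(sentinel)
            targets.append(sentinel)
        targets.extend(tokens[start:end])
        prev_end = end
    inputs.extend(tokens[prev_end:])
    return inputs, targets


tokens = list(range(10, 20))
spans = [(2, 4), (7, 8)]
with_sentinels = corrupt_spans(tokens, spans, sentinel_tokens=[1000, 1001])
without_sentinels = corrupt_spans(tokens, spans)
```

In this sketch, running out of sentinels surfaces as a `StopIteration` from `next`; a real implementation would likely raise a clearer error.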
Commits on Feb 15, 2023
- Fix expected number of added tokens (0f4fd3f)
  Was wrong for decoder-only case.
Commits on Feb 16, 2023
- da1f4e9
  Previously the model didn't know _where_ the data was actually inserted. Now it repeats the input sequence and inserts the masked data in the correct place. See example in Fig. 1 of AlexaTM 20B paper (arXiv:2208.01448).
- 55320ea
  This wording was confusing and basically stated the wrong thing. The number/amount of n-grams is not bounded by `max_ngrams`, even though the variable name sounds like it. Instead, `max_ngrams` bounds n.
Commits on Feb 17, 2023
- 5d27b27
- 23181ab
- 6032cc6
- Automatically truncate sequences for decoder-only (c9c336f)
  Expecting the user to supply a sequence length greater than any data point is ridiculous. So now we greedily truncate the sequence based on the maximum amount of `extra_id`s, which wastes a lot of data. An alternative would be a statistical route with a significance level attached: allow the expected amount of tokens plus some leeway, and handle the unlikely error of exceeding the length. This only handles the decoder-only case; the encoder-decoder case is left as is, because errors are much less likely there unless massive corruption is configured or the decoder has a smaller sequence length than the encoder.
- b8003cb
- e3d91a6
- e61e78f
- Add sample packing to T5 dataset (c3b0a55)
  Backward-incompatible change due to positional argument without default, inserted before another positional argument.
- c4d748b
- 689b57e
- af204e7
- 78eb035
- c03eed4
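Sample packing (c3b0a55) concatenates several short samples into one sequence so that less of the maximum sequence length is spent on padding. The greedy strategy below is an assumption for illustration; `pack_samples` is a hypothetical name, and the PR may pack differently.

```python
def pack_samples(samples, max_seq_length):
    """Greedily concatenate consecutive samples up to `max_seq_length`."""
    packed, current = [], []
    for sample in samples:
        if len(sample) > max_seq_length:
            raise ValueError("sample is longer than max_seq_length")
        if len(current) + len(sample) > max_seq_length:
            # The next sample does not fit, so close the current sequence.
            packed.append(current)
            current = []
        current = current + list(sample)
    if current:
        packed.append(current)
    return packed


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_samples(samples, max_seq_length=5)
```

A real implementation would also need to keep track of sample boundaries, for example to build attention masks that stop tokens from attending across packed samples.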
Commits on Feb 22, 2023
- Refactor `get_sample` to return a list (9e84f06)
  Accordingly, rename to `get_samples`.
- 5e2b4f5
- e2a0c36
- c2884c8
Commits on Feb 23, 2023
- 7eb7923
- dd4c0d0
- 58148f8
- c41fecd
- Refactor sample packing functions (057bb47)
  Just pull them out of the other ones (and add separating whitespace/join lines).
- e2062b7
- d31b89f
Commits on Feb 24, 2023
- Fix GPT tokenizer vocab size query (17dca4f)
  Did not include additional special tokens.
- bf9b1eb
Commits on Feb 27, 2023
- c4aa4cd
- Allow full prefix Prefix-LM attention sampling (8d7a0df)
  Useful for evaluation.
- 9bd6e1e
- ba4ab49
- 9f53171
  "lambada" was renamed to "lambada_openai" in the upstream lm-eval-harness repo.
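For readers unfamiliar with Prefix-LM attention (8d7a0df), the sketch below builds a boolean mask in which the prefix attends bidirectionally and the remainder attends causally. One plausible reading of "full prefix" is that the whole context is treated as the prefix instead of a sampled part of it; that reading, the function `prefix_lm_mask`, and the convention that `True` means "may attend" are all assumptions.

```python
def prefix_lm_mask(seq_length, prefix_length):
    """Return mask[i][j], True when position i may attend to position j."""
    return [
        # Causal attention (j <= i), plus full visibility of the prefix.
        [j <= i or j < prefix_length for j in range(seq_length)]
        for i in range(seq_length)
    ]


partial_prefix = prefix_lm_mask(seq_length=4, prefix_length=2)
full_prefix = prefix_lm_mask(seq_length=4, prefix_length=4)
causal_only = prefix_lm_mask(seq_length=4, prefix_length=0)
```

With `prefix_length` equal to `seq_length` every position sees every other one, and with a prefix length of 0 the mask reduces to an ordinary causal mask. Some codebases use the opposite convention, where `True` marks positions that are masked out.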
Commits on Feb 28, 2023
- 5b63d0b
  This corrupts the targets. There is no good reason for this.
- 639b71d
  Previously we always gave the whole sequence as context, even though it also includes the answer. This is obviously not desired. We only want to give enough context to reach the answer.
- 127d1e4
  These models have moved into DeepSpeed but were never properly replaced here after they were removed.
- 1bb788d
- cf5965a
- a538238
  At worst, these may be mapped to the wrong tokens. However, the chance that there are as many or fewer unknown tokens than the few UL2 tokens is very low. And if there are more unknown tokens than UL2 tokens, we'll get errors.
Commits on Mar 1, 2023
- 3a8bc35
- 6f0e33a
Commits on Mar 2, 2023
- 9c4c718
Commits on Mar 7, 2023
- 754cf21
- 08b0eaf
- 15622d2
Commits on Mar 9, 2023
- e5a6169
- d583fe9
- ad7de7e
- 81a68f7
- 46e145d
- Do not use bias for 2nd MLP layer if using T5 GLU (482f0ea)
  As in the T5 codebase. This could have highly detrimental effects on performance if TorchScript cannot easily type-dispatch the `bias_dropout_add` function.
- 4385f7b
- Refactor samples dict creation (2d24b13)
  More code reuse, change some methods to functions and change their visibility.
- bd461f5
Commits on Mar 10, 2023
- 35b2956
- f0171e0
- 92158d8
- 3b7692f
Commits on Mar 20, 2023
- b37d3ee
- 5959e89
- ce8c1a5
- 3e52966
- 23efa88
- 88eb98a
Commits on Mar 21, 2023
- 59e8451
Commits on Mar 24, 2023
- 24d46ff
Commits on Apr 3, 2023
- 600542d
Commits on Apr 4, 2023
- 58831d2
Commits on Apr 13, 2023
- fe45cea
- Handle failure mode regarding non-DS checkpoints (15e7b98)
  DS = DeepSpeed. No idea why this happens; I couldn't explain it after briefly looking into the DeepSpeed source.
Commits on Jun 7, 2023
- ae45a9e
- Omit second objective token if without mask tokens (0c91b96)
  That is, the reproduced objective token.
- 0c246c4
Commits on Jun 26, 2023
- 7ce8635
- Do not add separator if S-denoising (7290181)
  This was already the case for encoder-decoders, but is now also the case for decoder-only models.
- 628d847
Commits on Jun 29, 2023
- 9c727e7
- Do not automatically add <EOS> token when packing (4ffa951)
  This also fixes problems with decoder-only attention masks.
- Allow silently ignoring causal attention mask (ff5787e)
  When using the custom fused softmax kernel.