feature: add aligned samples for completion prompt strategy #687
By default, data is chopped into samples, with the last sample right-padded to fill up the context. This often produces a final sample for each text that contains only a short stretch of text followed by a long run of pads. While this is fine from a training perspective, real-life completion tasks far more commonly give you only a short amount of starting text. This PR flips the padding side, when possible, so that the very first sample begins with padding tokens followed by the starting text, and the remaining samples are then perfectly aligned with the sequence length so that no further padding is required.
In other words, we end up with filled-up context sizes for all samples except the first one in each input, which now looks like
[PAD] [PAD] [PAD] ... [PAD] Once upon a time, there was a
whereas we would normally end up with filled-up context sizes for all samples except the last one in each input, which looks like
happily ever after.[PAD] [PAD] [PAD] ...[PAD] [PAD] [PAD]
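To make the alignment concrete, here is a minimal sketch of the idea (not the actual code in this PR): the token stream is padded on the left so its length becomes a multiple of the sequence length, and then every subsequent chunk lines up exactly with the context size. The names `chop_aligned`, `chop_right_padded`, and `PAD_ID` are illustrative assumptions, not identifiers from this repository.

```python
PAD_ID = 0  # assumed pad token id, for illustration only


def chop_aligned(token_ids, seq_len):
    """Split token_ids into seq_len-sized samples, left-padding the first one."""
    remainder = len(token_ids) % seq_len
    if remainder:
        # Prepend pads so the total length is a multiple of seq_len;
        # only the first sample carries padding, all later samples are full.
        token_ids = [PAD_ID] * (seq_len - remainder) + list(token_ids)
    return [token_ids[i : i + seq_len] for i in range(0, len(token_ids), seq_len)]


def chop_right_padded(token_ids, seq_len):
    """The default behaviour: right-pad the last sample instead."""
    remainder = len(token_ids) % seq_len
    if remainder:
        token_ids = list(token_ids) + [PAD_ID] * (seq_len - remainder)
    return [token_ids[i : i + seq_len] for i in range(0, len(token_ids), seq_len)]


if __name__ == "__main__":
    tokens = list(range(1, 11))  # 10 fake token ids
    print(chop_aligned(tokens, 4))       # [[0, 0, 1, 2], [3, 4, 5, 6], [7, 8, 9, 10]]
    print(chop_right_padded(tokens, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
```

Under the aligned scheme only the first chunk contains pads, which mirrors the "short starting text" shape seen at inference time for completion tasks.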
I should note that this approach would most likely benefit instruction-format prompt strategies as well, with the caveat that each sample would need to reach at least the Response part.