
feature: add aligned samples for completion prompt strategy #687

Closed

Conversation

kallewoof
Contributor

@kallewoof kallewoof commented Oct 6, 2023

By default, data is chopped into samples, with the last sample being right-padded to fill up the context. This often results in the last sample of a text holding a short piece of text followed by a bunch of pads. While this is fine from a training perspective, it is far more common in real life to encounter a completion task where you are given only a very short amount of the starting text. This PR flips the padding side, when possible, so that the very first sample begins with padding tokens followed by the starting text, and the remaining samples are then perfectly aligned with the sequence length so that no further padding is required.

In other words, we end up with filled-up context sizes for all samples except the first one in each input, which now looks like

[PAD] [PAD] [PAD] ... [PAD] Once upon a time, there was a

whereas we would normally end up with filled-up context sizes for all except the last one in each input, which looks like

happily ever after.[PAD] [PAD] [PAD] ...[PAD] [PAD] [PAD]

I should note that this approach would most likely benefit instruction-format prompt strategies as well, with the caveat that each sample must at minimum reach the Response part.
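
To make the idea concrete, here is a minimal sketch of the left-aligned chunking, assuming a flat list of token ids and a hypothetical `pad_token_id`; it is illustrative only, not the exact implementation in this PR:

```python
def chunk_with_leading_pad(token_ids, seq_len, pad_token_id):
    """Split token_ids into seq_len-sized samples, left-padding the first
    sample so that every later sample fills the context exactly."""
    remainder = len(token_ids) % seq_len
    if remainder:
        # Pad on the left so the *first* sample starts with pads and the
        # remaining samples align with the sequence length.
        token_ids = [pad_token_id] * (seq_len - remainder) + token_ids
    return [token_ids[i : i + seq_len] for i in range(0, len(token_ids), seq_len)]

# 13 tokens with seq_len=8 -> the first sample gets 3 leading pads,
# and the second sample is a full, unpadded context window.
samples = chunk_with_leading_pad(list(range(1, 14)), seq_len=8, pad_token_id=0)
```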

@kallewoof kallewoof force-pushed the 202310-aligned-completion branch 4 times, most recently from 02f217a to 91a80b3 Compare October 6, 2023 09:36
@kallewoof kallewoof marked this pull request as ready for review October 6, 2023 13:52
@kallewoof
Contributor Author

Tested. This code works as intended. It is giving me [PAD] [PAD] [PAD] [PAD] [PAD] Once upon a time for starting samples and everything else looks unpadded.

@kallewoof kallewoof force-pushed the 202310-aligned-completion branch from 91a80b3 to 76ce0e5 Compare October 6, 2023 13:59
@winglian
Collaborator

winglian commented Oct 6, 2023

would you be able to add a unit test to validate this behavior?
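
For reference, the kind of test being requested could look roughly like the following (an illustrative sketch reusing the hypothetical `chunk_with_leading_pad` helper from the sketch above, not the PR's actual test code):

```python
def test_first_sample_is_left_padded():
    pad_id = 0
    seq_len = 8
    tokens = list(range(1, 14))  # 13 tokens -> one partial + one full window
    samples = chunk_with_leading_pad(tokens, seq_len, pad_id)

    assert len(samples) == 2
    assert samples[0][:3] == [pad_id] * 3           # pads sit at the start of the first sample
    assert all(len(s) == seq_len for s in samples)  # every sample fills the context
    assert pad_id not in samples[1]                 # later samples need no padding
```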

@kallewoof kallewoof force-pushed the 202310-aligned-completion branch from 1600413 to 2ef1f90 Compare October 7, 2023 14:04
@kallewoof
Contributor Author

kallewoof commented Oct 7, 2023

Done. Converting to draft though, as I am seeing some odd losses when using this in training. Digging.

@kallewoof kallewoof marked this pull request as draft October 7, 2023 14:20
@kallewoof
Contributor Author

kallewoof commented Oct 8, 2023

The odd losses I was seeing were unrelated to this PR (verified both by seeing the weird loss on main branch, and by seeing the weird loss go away when I fixed my settings). I think this is RFM.

@kallewoof kallewoof marked this pull request as ready for review October 8, 2023 12:58
@kallewoof kallewoof closed this Oct 12, 2023
@kallewoof kallewoof deleted the 202310-aligned-completion branch October 19, 2023 05:10