feature: add overlap_len option to completion strategy #668
Conversation
Force-pushed from 933545c to 2c80e9b
great idea! I think we could implement this without needing to modify all the other prompt strategies. Right now completion is pretty isolated to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/completion.py. You'll notice on line 81:

def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):

there is an optional ds_cfg var. So with a dataset entry like

datasets:
  - path: ...
    type: completion
    overlap_len: 128

you can then reference that from ds_cfg.
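As a minimal sketch of what that could look like (the CompletionStrategy class below is only a stand-in; of the real code, only the load signature and the ds_cfg argument come from completion.py):

```python
from typing import Any, Dict, Optional


# Illustrative placeholder for whatever strategy object completion.py
# actually constructs; the real class name and arguments may differ.
class CompletionStrategy:
    def __init__(self, tokenizer, sequence_len: int, overlap_len: int = 0):
        self.tokenizer = tokenizer
        self.sequence_len = sequence_len
        self.overlap_len = overlap_len


def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
    # ds_cfg carries the per-dataset options from the YAML entry above,
    # so overlap_len can be read here without touching other strategies.
    ds_cfg = ds_cfg or {}
    overlap_len = int(ds_cfg.get("overlap_len", 0))  # default: no overlap
    return CompletionStrategy(tokenizer, cfg.sequence_len, overlap_len)
```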
Sounds good! Edit: looking at your suggestion again, I think that can actually be used in the other types as well, so never mind.
We still do want to include the overlap length in the MD5 sum for the prepared data cache. Isolating it inside the completion strategy makes that a bit less straightforward. Still thinking of ways to do that cleanly.

Edit: I think I have a solution; it does mean all existing dataset caches are invalidated, but that was the case with my initial approach as well.
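One possible shape for that cache key, as a rough sketch (this is not axolotl's actual hashing code; the helper and field names are illustrative):

```python
import hashlib
import json


def dataset_cache_key(ds_cfg: dict, tokenizer_name: str, sequence_len: int) -> str:
    # Hash every field that changes tokenization output, so adding or
    # changing overlap_len invalidates the prepared-data cache.
    fingerprint = {
        "path": ds_cfg.get("path"),
        "type": ds_cfg.get("type"),
        "overlap_len": ds_cfg.get("overlap_len", 0),
        "tokenizer": tokenizer_name,
        "sequence_len": sequence_len,
    }
    raw = json.dumps(fingerprint, sort_keys=True).encode("utf-8")
    return hashlib.md5(raw).hexdigest()
```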
Force-pushed from 2c80e9b to f40a0ac
Force-pushed from f40a0ac to 01528e7
thanks, looks good. Can you run …?
Done.
Is this merge-ready, or is there something else needed?
This is useful with smaller datasets, where the default is to split the data into context-size chunks (thus only showing each part of the data a single time).
This PR only adds support to the completion prompt strategy, as that's the only one I've used so far.
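For illustration, a standalone sketch of what overlapping chunking means (not the code in this PR): each chunk starts chunk_len - overlap_len tokens after the previous one, so tokens near chunk boundaries are seen in two chunks instead of one.

```python
def chunk_with_overlap(token_ids: list[int], chunk_len: int, overlap_len: int = 0) -> list[list[int]]:
    # overlap_len == 0 reproduces the default "split into context-size chunks"
    # behavior; overlap_len > 0 makes consecutive chunks share that many tokens.
    if not 0 <= overlap_len < chunk_len:
        raise ValueError("overlap_len must be in [0, chunk_len)")
    step = chunk_len - overlap_len
    return [token_ids[i : i + chunk_len] for i in range(0, len(token_ids), step)]


# Example: 10 tokens, chunk_len=4, overlap_len=2
# -> [0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9], [8,9]
print(chunk_with_overlap(list(range(10)), chunk_len=4, overlap_len=2))
```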