Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: add overlap_len option to completion strategy #668

Closed
wants to merge 2 commits into from

Conversation

kallewoof
Copy link
Contributor

This is useful with smaller datasets, where the default to split the data into context size length chunks (thus only showing each part of the data a single time).

This PR only adds support to the completion prompt strategy, as that's the only one I've used so far.

@kallewoof kallewoof changed the title feature: add overlap_len to prompt strategies feature: add overlap_len option to prompt strategies Oct 3, 2023
@winglian
Copy link
Collaborator

winglian commented Oct 3, 2023

great idea! I think we could implement this without needing to modify all the other prompt strategies. Right now completion is pretty isolated to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/completion.py

You'll notice on line 81:

def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):

there is an optional ds_cfg var.
so if your yml was

datasets:
  - path: ...
     type: completion
     overlap_len: 128

then you can reference that from ds_cfg.overlap_len

@kallewoof
Copy link
Contributor Author

kallewoof commented Oct 4, 2023

Sounds good! Are you sure this is not a desired feature for the other types? It would seem that even instruction sets with responses larger than the context size could benefit as well.,

Edit: looking at your suggestion again, I think that actually can be used in the other types as well, so nevermind.

@kallewoof
Copy link
Contributor Author

kallewoof commented Oct 4, 2023

We still do want to include the overlap length in the MD5 sum for the prepared data cache. Isolating it inside the completion strat makes that a bit less straightforward. Still thinking of ways to do that cleanly.

Edit: I think I have a solution; it does mean all existing dataset caches are invalidated though, but that was the case with my initial approach as well.

@kallewoof kallewoof changed the title feature: add overlap_len option to prompt strategies feature: add overlap_len option to completion strategy Oct 4, 2023
This is useful with smaller datasets, where the default to split the data into context size length chunks (thus only showing each part of the data a single time).
@winglian
Copy link
Collaborator

winglian commented Oct 4, 2023

thanks, looks good. Can you run pre-commit run --all-files to lint the changes? thanks!

@kallewoof
Copy link
Contributor Author

Done.

@kallewoof
Copy link
Contributor Author

Is this merge-ready, or is there something else needed?

@kallewoof kallewoof closed this Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants