feature: add overlap_len option to completion strategy #668

kallewoof · 2023-10-03T06:42:57Z

This is useful with smaller datasets, where the default is to split the data into chunks of context-size length (thus showing each part of the data only a single time).

This PR only adds support to the completion prompt strategy, as that's the only one I've used so far.
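To illustrate the chunking behavior described above, here is a minimal sketch of splitting a token sequence into context-size chunks with an optional overlap. The function name `chunk_with_overlap` and its parameters are hypothetical and are not taken from the axolotl code base; the PR's actual implementation may differ.

```python
from typing import List


def chunk_with_overlap(
    token_ids: List[int], max_len: int, overlap_len: int = 0
) -> List[List[int]]:
    """Split token_ids into chunks of at most max_len tokens.

    Hypothetical helper for illustration only. Each chunk starts
    (max_len - overlap_len) tokens after the previous one, so consecutive
    chunks share overlap_len tokens. With overlap_len=0 every token is
    shown exactly once, which is the default behavior described above.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap_len < max_len:
        raise ValueError("overlap_len must satisfy 0 <= overlap_len < max_len")

    step = max_len - overlap_len
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start : start + max_len])
        # Stop once a chunk reaches the end, to avoid emitting tail
        # chunks that contain only already-seen tokens.
        if start + max_len >= len(token_ids):
            break
    return chunks


tokens = list(range(10))
print(chunk_with_overlap(tokens, max_len=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunk_with_overlap(tokens, max_len=4, overlap_len=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With an overlap, a small dataset yields more training chunks, and text near a chunk boundary is also seen with more surrounding context.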

winglian · 2023-10-03T19:08:14Z

great idea! I think we could implement this without needing to modify all the other prompt strategies. Right now completion is pretty isolated to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/completion.py

You'll notice on line 81:

def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):

there is an optional ds_cfg var.
so if your yml was

datasets:
  - path: ...
    type: completion
    overlap_len: 128

then you can reference that from ds_cfg.overlap_len
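As a rough sketch of how the option could be read inside a strategy's `load` function, the helper below pulls `overlap_len` out of the per-dataset config. The name `get_overlap_len` is hypothetical, and the code assumes `ds_cfg` behaves like a plain mapping (as the `Optional[Dict[str, Any]]` annotation suggests); attribute access such as `ds_cfg.overlap_len` only works if the config object supports it.

```python
from typing import Any, Dict, Optional


def get_overlap_len(ds_cfg: Optional[Dict[str, Any]] = None) -> int:
    """Return the overlap_len set for a dataset entry, or 0 if absent.

    Hypothetical helper for illustration; not part of axolotl.
    """
    if not ds_cfg:
        return 0
    overlap_len = ds_cfg.get("overlap_len") or 0
    # Reject booleans explicitly, since bool is a subclass of int.
    if isinstance(overlap_len, bool) or not isinstance(overlap_len, int):
        raise ValueError("overlap_len must be an integer")
    if overlap_len < 0:
        raise ValueError("overlap_len must be non-negative")
    return overlap_len


# Mirrors the yml entry above once it has been parsed into a dict.
ds_cfg = {"path": "...", "type": "completion", "overlap_len": 128}
print(get_overlap_len(ds_cfg))
```

Defaulting to 0 keeps the existing behavior for dataset entries that do not set the option.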

kallewoof · 2023-10-04T00:30:03Z

Sounds good! ~~Are you sure this is not a desired feature for the other types? It would seem that even instruction sets with responses larger than the context size could benefit as well.~~

Edit: looking at your suggestion again, I think that actually can be used in the other types as well, so never mind.

kallewoof · 2023-10-04T00:39:03Z

We still do want to include the overlap length in the MD5 sum for the prepared data cache. Isolating it inside the completion strat makes that a bit less straightforward. Still thinking of ways to do that cleanly.

Edit: I think I have a solution; it does mean all existing dataset caches are invalidated though, but that was the case with my initial approach as well.
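A minimal sketch of the idea, assuming the cache key is an MD5 digest over a string that describes the run's sequence length, datasets, and tokenizer. The function names and the key layout here are hypothetical and are not the actual axolotl format.

```python
import hashlib
from typing import List


def dataset_descriptor(path: str, ds_type: str, overlap_len: int = 0) -> str:
    """Describe one dataset entry, including its overlap length.

    Hypothetical layout for illustration only.
    """
    return f"{path}:{ds_type}:{overlap_len}"


def prepared_cache_key(
    sequence_len: int, descriptors: List[str], tokenizer_name: str
) -> str:
    """Return an MD5 hex digest identifying a prepared-dataset cache."""
    raw = f"{sequence_len}@" + "|".join(sorted(descriptors)) + f"|{tokenizer_name}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


no_overlap = prepared_cache_key(
    2048, [dataset_descriptor("data.jsonl", "completion")], "my-tokenizer"
)
with_overlap = prepared_cache_key(
    2048, [dataset_descriptor("data.jsonl", "completion", 128)], "my-tokenizer"
)
print(no_overlap != with_overlap)  # True: a different overlap gives a different cache
```

Because the descriptor now carries an extra field even when the overlap is 0, every key changes relative to a layout without that field, which is why existing caches would be invalidated.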

winglian · 2023-10-04T13:32:36Z

thanks, looks good. Can you run `pre-commit run --all-files` to lint the changes? thanks!

kallewoof · 2023-10-04T13:41:21Z

Done.

kallewoof · 2023-10-06T13:55:56Z

Is this merge-ready, or is there something else needed?

kallewoof changed the title ~~feature: add overlap_len to prompt strategies~~ feature: add overlap_len option to prompt strategies Oct 3, 2023

kallewoof force-pushed the 202310-overlap-len branch from 933545c to 2c80e9b Compare October 3, 2023 07:18

kallewoof force-pushed the 202310-overlap-len branch from 2c80e9b to f40a0ac Compare October 4, 2023 01:10

kallewoof changed the title ~~feature: add overlap_len option to prompt strategies~~ feature: add overlap_len option to completion strategy Oct 4, 2023

feature: add overlap_len option to completion strategy

01528e7

This is useful with smaller datasets, where the default is to split the data into chunks of context-size length (thus showing each part of the data only a single time).

kallewoof force-pushed the 202310-overlap-len branch from f40a0ac to 01528e7 Compare October 4, 2023 01:18

pre-commit hook tweaks

5e071b2

kallewoof closed this Oct 12, 2023
