Replies: 2 comments
-
We ran the longest pre-training for SFT-8, but unfortunately more did not mean better in this case - the eval results were pretty bad, see https://tju01.github.io/ilm-eval/. The idea behind adding red_pajama was to continue basic language modelling. It was also used in the 2nd stage in the hope of reducing overfitting, but the effect was not clear and the end result was "mediocre". Just for reference, the old SFT-8 stage-1 dataset config was (bad, don't use):
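(Not the actual config values; just a rough sketch of how the fraction-per-dataset idea plays out when building the sample mix. The dataset names, numbers, and helper below are placeholders, not the SFT-8 stage-1 settings.)

```python
import random

# Placeholder fractions of the final training mix -- NOT the SFT-8 values,
# only an illustration of assigning each dataset a share of the mix.
fractions = {
    "instruction_data": 0.95,
    "red_pajama": 0.05,
}

def build_mix(datasets, fractions, total_size, seed=42):
    """Subsample each dataset to its target share of the final mix."""
    rng = random.Random(seed)
    mix = []
    for name, frac in fractions.items():
        n = int(round(frac * total_size))
        data = datasets[name]
        if n <= len(data):
            mix.extend(rng.sample(data, n))    # sample without replacement
        else:
            mix.extend(rng.choices(data, k=n))  # dataset smaller than its share
    rng.shuffle(mix)
    return mix
```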
-
Thanks for your reply. Also, what is a good fraction of pre-training data to include in the SFT stage? Right now I use 15% pre-training data and 85% instruction-tuning data for SFT. Is there any suggestion for the split between these two parts of the data?
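A minimal sketch of what I mean by that split, assuming both sources are plain lists of examples (the function and names are hypothetical, only the 15% value is my current setting):

```python
import random

def sample_batch(pretrain_data, instruction_data, batch_size,
                 pretrain_frac=0.15, seed=None):
    """Draw a batch where roughly `pretrain_frac` of the examples come from
    the pre-training corpus and the rest from the instruction-tuning data."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = pretrain_data if rng.random() < pretrain_frac else instruction_data
        batch.append(rng.choice(source))
    return batch

# e.g. sample_batch(pretrain_examples, instruction_examples, batch_size=32)
```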
-
Hi, I noticed that in the sft-8-datasets config, 5% red_pajama is added during SFT training.
So there are 3 questions I am confused about: