
Unable to load ORPO dataset in a *.json file #1868

Open

SicariusSicariiStuff opened this issue Aug 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@SicariusSicariiStuff

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

A local dataset should load the same way as a dataset loaded from the HF hub.

Current behaviour

FileNotFoundError: Couldn't find a dataset script at

Steps to reproduce

If you have:

rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  - path: HF_username/Dataset_name
    type: chat_template.argilla
    chat_template: chatml

Replace the path with the same file stored locally (Parquet or JSON, it doesn't matter), and you'll get:

FileNotFoundError: Couldn't find a dataset script at

Config yaml

base_model: SicariusSicariiStuff/2B-ad
output_dir: /home/sicarius/test/
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
save_safetensors: true

num_epochs: 2
saves_per_epoch: 1
save_total_limit: 2

learning_rate: 4e-6
lora_r: 16
lora_alpha: 32

sequence_len: 1024

lora_target_modules:

rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  - path: /home/sicarius/test/orpo1.json
    type: chat_template.argilla
    chat_template: chatml

remove_unused_columns: false
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false

val_set_size: 0.0


adapter: qlora
lora_dropout: 0
lora_target_linear: true
load_in_8bit: false
load_in_4bit: true
strict: false

gradient_accumulation_steps: 1
micro_batch_size: 1

#optimizer: adamw_torch
optimizer: adamw_bnb_8bit
lr_scheduler: cosine


train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 0
#warmup_ratio: 0.1
evals_per_epoch:
eval_table_size:
eval_max_new_tokens: 128

debug:
#deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
lora_modules_to_save: [embed_tokens, lm_head]
special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|end_of_text|>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

Use the same processing logic for local files as when loading a dataset from the hub.
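In the meantime, a possible workaround is to tell axolotl which HF builder to use for the local file, so it doesn't go looking for a dataset script at that path. This sketch assumes the config supports a ds_type key for local files (an assumption here, not confirmed anywhere in this thread):

```yaml
rl: orpo
orpo_alpha: 0.1
chat_template: chatml
datasets:
  # Assumption: ds_type selects the HF "json" builder directly,
  # bypassing the dataset-script lookup that raises FileNotFoundError.
  - path: /home/sicarius/test/orpo1.json
    ds_type: json
    type: chat_template.argilla
    chat_template: chatml
```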

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

latest release

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@SicariusSicariiStuff SicariusSicariiStuff added the bug Something isn't working label Aug 26, 2024
@RishabhMaheshwary

I am also facing the same issue with the following trace:

Traceback (most recent call last):
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/rishabh/axolotl/src/axolotl/cli/preprocess.py", line 103, in <module>
    fire.Fire(do_cli)
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/rishabh/axolotl/src/axolotl/cli/preprocess.py", line 74, in do_cli
    load_rl_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/mnt/rishabh/axolotl/src/axolotl/cli/__init__.py", line 445, in load_rl_datasets
    train_dataset, eval_dataset = load_prepare_dpo_datasets(cfg)
  File "/mnt/rishabh/axolotl/src/axolotl/utils/data/rl.py", line 131, in load_prepare_dpo_datasets
    train_dataset = load_split(cfg.datasets, cfg)
  File "/mnt/rishabh/axolotl/src/axolotl/utils/data/rl.py", line 110, in load_split
    split_datasets[i] = map_dataset(
  File "/mnt/rishabh/axolotl/src/axolotl/utils/data/rl.py", line 67, in map_dataset
    data_set = data_set.map(
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/mnt/rishabh/anaconda3/envs/axolotl1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/mnt/rishabh/axolotl/src/axolotl/prompt_strategies/orpo/chat_template.py", line 272, in transform_fn
    [msg.model_dump() for msg in dataset_parser.get_prompt(sample).messages],
  File "/mnt/rishabh/axolotl/src/axolotl/prompt_strategies/orpo/chat_template.py", line 111, in get_prompt
    content=prompt["chosen"][i * 2 + 1]["content"],
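The traceback dies while indexing prompt["chosen"][i * 2 + 1]["content"], which suggests the argilla chat_template parser expects each record's chosen/rejected fields to be lists of alternating role/content messages. Below is a minimal stdlib sketch for sanity-checking a local JSON file before training; the field names ("chosen", "rejected", "role", "content") are inferred from the trace, not from axolotl documentation:

```python
import json


def check_orpo_records(path):
    """Count records in a JSON file; raise on the first malformed one.

    Field names are inferred from the traceback's indexing
    (prompt["chosen"][i * 2 + 1]["content"]) and may differ from what
    your axolotl version actually expects.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # assumes a top-level JSON array
    for n, rec in enumerate(records):
        for key in ("chosen", "rejected"):
            msgs = rec.get(key)
            if not isinstance(msgs, list) or len(msgs) < 2:
                raise ValueError(f"record {n}: {key!r} must be a list of messages")
            # The parser indexes msgs[i * 2 + 1], i.e. it assumes alternating
            # user/assistant turns where every entry has "role" and "content".
            if not all(isinstance(m, dict) and "role" in m and "content" in m
                       for m in msgs):
                raise ValueError(f"record {n}: {key!r} messages need role/content")
    return len(records)
```

Running this over the local file before pointing axolotl at it would at least distinguish a schema problem from the path-handling problem reported above.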

@SicariusSicariiStuff
Author

Any update?

@bursteratom
Collaborator

Hi @SicariusSicariiStuff, thank you for following up! Any chance you can provide us with the dataset you are using so we can do deeper testing on it?

@bursteratom bursteratom self-assigned this Nov 13, 2024