CPT training is giving pretty unstable results with the learning rate 2e-5. #149

shamanez opened this issue Apr 4, 2024 · 1 comment

shamanez commented Apr 4, 2024

I am trying to run CPT with Mistral-Instruct-v2, but every time I notice the grad norm overshooting. I tried different datasets and managed to reproduce the same issue.

I am using 8 × 80 GB GPUs, and my effective batch size is 1024.

My config:

# Model arguments
model_name_or_path: mistralai/Mistral-7B-Instruct-v0.2
model_revision: main
torch_dtype: bfloat16

# Data training arguments
dataset_mixer:
  arcee-ai/sec-data-full: 1.0
dataset_splits:
  - train
preprocessing_num_workers: 12
text_column: "text"

# SFT trainer config
bf16: true
do_eval: False
evaluation_strategy: "no"
gradient_accumulation_steps: 64
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: arcee-ai/mistral-instruct-v2-sec
hub_strategy: every_save
learning_rate: 2.0e-04
log_level: info
logging_steps: 1  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 4096
max_steps: -1
num_train_epochs: 1
output_dir: data/mistral-instruct-v2-sec-expanded
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 4
push_to_hub: true 
remove_unused_columns: true
report_to:
- wandb
save_strategy: 'steps'
save_steps: 50
save_total_limit: 2
seed: 42
warmup_ratio: 0.1
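
For context on the batch-size figure mentioned above, the effective batch size implied by a config like this is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A minimal sketch of that arithmetic, assuming plain data parallelism across all 8 GPUs (the actual value depends on the launcher/DeepSpeed setup used):

# Rough check of the effective batch size implied by the config above.
# Assumes plain data parallelism over the 8 GPUs mentioned in the issue.
per_device_train_batch_size = 4
gradient_accumulation_steps = 64
num_gpus = 8

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 2048 under these assumptions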
shamanez commented Apr 5, 2024

[image attachment]
