Why is the epoch in the log different from the progress? #58

Open
jimmy-walker opened this issue Dec 20, 2023 · 0 comments

Comments

@jimmy-walker

Thanks for your work.
I want to ask why the epoch shown in the log is different from the progress bar.

I used the following command to run LoRA tuning with 8 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python chatglm2_lora_tuning.py \
    --tokenized_dataset resultcombine \
    --lora_rank 8 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --max_steps 50753 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 1000 \
    --output_dir weights/resultcombine 
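
As far as I understand, the Trainer derives the logged "epoch" value from the length of its train dataloader. This is my simplified sketch of that logic (not the actual transformers code), showing the numbers I expected:

def logged_epoch(global_step, dataloader_len, gradient_accumulation_steps=1):
    # One optimizer step consumes gradient_accumulation_steps dataloader batches.
    steps_per_epoch = max(dataloader_len // gradient_accumulation_steps, 1)
    return global_step / steps_per_epoch

# What I expected: with batches of 4 * 8 = 32 samples, my dataset
# (162410 examples, details below) gives ceil(162410 / 32) = 5076
# batches per epoch, so logged_epoch(50711, 5076) ≈ 9.99, i.e. ~10 epochs.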

My dataset has 162410 examples. The effective batch size is per_device_train_batch_size * devices = 4 * 8 = 32, so one pass over the dataset takes 162410 / 32 = 5075.3125 steps.
That is why I set max_steps to 50753: to train for 10 epochs.
But although the progress bar is almost finished (50711/50753), the logged epoch is still only 1.26:

{'loss': 0.0807, 'learning_rate': 1.1368786081610939e-05, 'epoch': 1.13}
{'loss': 0.079, 'learning_rate': 9.398459204382007e-06, 'epoch': 1.15}
{'loss': 0.08, 'learning_rate': 7.428132327153076e-06, 'epoch': 1.18}
{'loss': 0.0806, 'learning_rate': 5.459775776801371e-06, 'epoch': 1.21}
{'loss': 0.0805, 'learning_rate': 3.4894488995724393e-06, 'epoch': 1.23}
{'loss': 0.079, 'learning_rate': 1.5210923492207357e-06, 'epoch': 1.26}
100%|█████████████████████████████████████████████████▉| 50711/50753 [36:56:34<01:53,  2.69s/it]
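
Working backwards from those log lines (my own arithmetic, assuming all 162410 examples are in the dataloader):

dataset_size = 162410

# Across the six log lines the epoch goes 1.13 -> 1.26 over
# 5000 steps (logging_steps=1000).
epoch_per_step = (1.26 - 1.13) / 5000            # = 2.6e-05
steps_per_epoch = 1 / epoch_per_step             # ≈ 38462
implied_batch = dataset_size / steps_per_epoch   # ≈ 4.2

# The epoch counter behaves as if each step consumes ~4 samples
# (just per_device_train_batch_size), not the 32 I expected from 8 GPUs.
print(round(steps_per_epoch), round(implied_batch, 1))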

So I want to ask: is this normal?
Does the epoch value only reflect a single GPU's progress, or is something wrong?
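
If it helps, I could add a diagnostic like this right after the Trainer is constructed in chatglm2_lora_tuning.py (hypothetical snippet; trainer is the Trainer instance from the script):

print("n_gpu:", trainer.args.n_gpu)                        # GPUs visible to DataParallel
print("world_size:", trainer.args.world_size)              # distributed process count
print("train_batch_size:", trainer.args.train_batch_size)  # samples per step in this process
print("train batches per epoch:", len(trainer.get_train_dataloader()))
# If the last number is ~40600 (162410 / 4) instead of ~5076 (162410 / 32),
# that would match the epoch counter I am seeing.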
Thanks for your response.
