
How to resume training between GPTModel() checkpoint and GPTModelPipe() checkpoint? #405

Open
tiggerwu opened this issue Jun 27, 2024 · 0 comments

Comments

@tiggerwu

Hello, I trained a checkpoint without the '--no-pipeline-parallel' flag, so the model constructed in pretrain_gpt.py is GPTModelPipe(). I then converted the checkpoint to a universal checkpoint and manually changed the layer names. Now I want to resume training with '--no-pipeline-parallel', which constructs the model as GPTModel(). However, the resumed training loss jumps significantly (from 1.7 to 3.0).
Is there a way to solve this problem?
Also, what is the difference between GPTModel() and GPTModelPipe()?
Thanks.
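
For reference, the renaming I did was roughly like the sketch below. The key patterns and the layer-index offset here are illustrative assumptions, not the exact names from my checkpoint; pipeline-module checkpoints prefix parameters with a numeric stage/layer index, while GPTModel expects nested module names.

```python
# Sketch: remap GPTModelPipe-style keys (numeric layer prefixes) to the
# nested names GPTModel expects. All key patterns and the index offset
# below are illustrative assumptions, not verified checkpoint contents.
import re
import torch

def remap_pipe_to_gpt(pipe_state_dict):
    remapped = {}
    for key, tensor in pipe_state_dict.items():
        # e.g. "3.self_attention.query_key_value.weight" ->
        # "language_model.encoder.layers.1.self_attention.query_key_value.weight"
        m = re.match(r"^(\d+)\.(.*)$", key)
        if m is None:
            # Keys without a numeric prefix are passed through unchanged.
            remapped[key] = tensor
            continue
        pipe_idx, suffix = int(m.group(1)), m.group(2)
        # Assumed offset: the first pipeline stages hold embeddings, so
        # transformer layers start at index 2 (varies by model config).
        layer_idx = pipe_idx - 2
        remapped[f"language_model.encoder.layers.{layer_idx}.{suffix}"] = tensor
    return remapped

state = torch.load("pipe_checkpoint.pt", map_location="cpu")
torch.save(remap_pipe_to_gpt(state), "gpt_checkpoint.pt")
```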
