Hello, I trained a checkpoint without the '--no-pipeline-parallel' flag, which constructs the model GPTModelPipe() in pretrain_gpt.py. I then converted the checkpoint to a universal checkpoint and manually renamed the layers. When I resume training with '--no-pipeline-parallel', which constructs GPTModel() instead, the training loss jumps significantly (from 1.7 to 3.0).
Is there a way to resume training correctly across this model change?
Also, could you explain the difference between GPTModel() and GPTModelPipe()?
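For reference, the manual renaming step I did looks roughly like the sketch below. The exact key names are assumptions on my side and depend on the Megatron-DeepSpeed version: GPTModelPipe stores weights under flat pipeline-module indices (e.g. "module.<N>.…"), while GPTModel uses a nested hierarchy (e.g. "language_model.encoder.layers.<N>.…").

```python
# Hypothetical sketch of the manual layer-name remapping described above.
# The key patterns ("module.<N>." from GPTModelPipe's flat LayerSpec list,
# "language_model.encoder.layers.<N>." from GPTModel) are assumptions and
# must be checked against the actual checkpoint contents.
import re

def remap_pipe_keys(state_dict, num_embedding_stages=2):
    """Map GPTModelPipe-style flat module indices to GPTModel-style names."""
    remapped = {}
    for key, value in state_dict.items():
        m = re.match(r"module\.(\d+)\.(.+)", key)
        if m is None:
            # Key already in the target naming scheme; keep it unchanged.
            remapped[key] = value
            continue
        idx, rest = int(m.group(1)), m.group(2)
        # Assumption: the first pipeline modules hold embeddings, so the
        # transformer layer index is shifted by that offset.
        layer = idx - num_embedding_stages
        remapped[f"language_model.encoder.layers.{layer}.{rest}"] = value
    return remapped
```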
Thanks.