
fix warmup lr when using deepspeed #2125

Closed
wants to merge 1 commit

Conversation

@Eikor commented Nov 6, 2023

What does this PR do?

Fixes #2124
Currently, DeepSpeedEngineWrapper.backward() calls DeepSpeedEngine._take_model_step(), which performs the optimizer step first, followed by the lr_scheduler step:

self.engine.step()
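
For context, the call path looks roughly like this (a paraphrased sketch of accelerate's wrapper, not the verbatim source; exact code varies by version):

class DeepSpeedEngineWrapper:
    def __init__(self, engine):
        self.engine = engine

    def backward(self, loss, **kwargs):
        # run backpropagation through the DeepSpeed engine
        self.engine.backward(loss, **kwargs)
        # DeepSpeedEngine.step() -> _take_model_step():
        #   1. optimizer.step()      (parameters are updated first...)
        #   2. lr_scheduler.step()   (...and only then is the lr advanced)
        self.engine.step()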

However, in the prepare function the DeepSpeed optimizer is initialized with the maximum learning rate, so the optimizer updates the model parameters with the maximum learning rate on the very first step, which causes unexpected behavior:

engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
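
To see why the step order matters, here is a minimal standalone illustration in plain PyTorch. The WarmupLRSketch class and all values are hypothetical stand-ins; it mimics (per the description above) a warmup scheduler that only writes the lr into the optimizer when step() is called:

import torch

max_lr, warmup_min_lr, warmup_num_steps = 1e-3, 1e-7, 100

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=max_lr)  # optimizer is created at the max lr

class WarmupLRSketch:
    # Linear warmup; writes the lr into the optimizer only on step(),
    # which (per the PR description) matches DeepSpeed's behavior.
    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.num_steps = 0

    def step(self):
        self.num_steps += 1
        frac = min(self.num_steps, warmup_num_steps) / warmup_num_steps
        lr = warmup_min_lr + (max_lr - warmup_min_lr) * frac
        for group in self.optimizer.param_groups:
            group["lr"] = lr

sched = WarmupLRSketch(opt)
param.grad = torch.ones(1)

print(opt.param_groups[0]["lr"])  # 0.001 -- the FIRST update runs at max_lr
opt.step()                        # parameters have already moved at max_lr
sched.step()                      # warmup only takes effect from here on
print(opt.param_groups[0]["lr"])  # ~1.01e-05, the first warmup value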

This PR fixes the problem by initializing the optimizer with warmup_min_lr when a warmup lr scheduler is active.
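
A sketch of the idea behind the fix; the helper adjust_initial_lr is hypothetical (the actual one-commit patch lives in accelerate's DeepSpeed preparation path), while the scheduler/type/params/warmup_min_lr keys and the WarmupLR/WarmupDecayLR names follow the DeepSpeed config schema:

def adjust_initial_lr(optimizer, ds_config):
    # If the DeepSpeed config requests a warmup scheduler, start the
    # optimizer at the warmup floor instead of the maximum lr, so the
    # very first optimizer.step() cannot run at max lr.
    scheduler_cfg = ds_config.get("scheduler", {})
    if scheduler_cfg.get("type") in ("WarmupLR", "WarmupDecayLR"):
        warmup_min_lr = scheduler_cfg.get("params", {}).get("warmup_min_lr", 0.0)
        for group in optimizer.param_groups:
            group["lr"] = warmup_min_lr

# called before:
# engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)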

@pacman100

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@muellerzr (Collaborator)

@Eikor can you please do pip install -e .[quality]; make style; make quality? This should fix the failing test, thanks!

@pacman100 (Contributor)

I don't think this should be the case. Let me deep dive into this.

github-actions bot commented Dec 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Dec 14, 2023

@mjbommar commented Jan 7, 2024

Was there a reason this was never merged, @pacman100 and @muellerzr?

Pretty sure it's still happening on transformers 4.36.2 / accelerate 0.25.0:

(identical values were logged by all five nodes: 10.19.136.148, 10.19.140.225, 10.19.140.13, 10.19.139.205, 10.19.138.33; one copy of each step is shown below)

{'loss': 10.375, 'learning_rate': 1e-07, 'epoch': 0.0}
{'loss': 10.375, 'learning_rate': 3.772122370810605e-05, 'epoch': 0.0}
{'loss': 10.375, 'learning_rate': 5.972822880858981e-05, 'epoch': 0.0}
{'loss': 10.0375, 'learning_rate': 7.53424474162121e-05, 'epoch': 0.0}
{'loss': 9.9891, 'learning_rate': 8.745377629189394e-05, 'epoch': 0.0}
{'loss': 9.875, 'learning_rate': 9.734945251669586e-05, 'epoch': 0.0}
{'loss': 9.8125, 'learning_rate': 0.00010571612755078175, 'epoch': 0.0}
{'loss': 9.6984, 'learning_rate': 0.00011296367112431813, 'epoch': 0.0}
{'loss': 9.5625, 'learning_rate': 0.00011935645761717962, 'epoch': 0.0}

Linked issue: #2124, "The warmup lr schedule are not working as expected when using Deepspeed."