Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770

Open
traincheck-team opened this issue Nov 20, 2024 · 4 comments

Comments

@traincheck-team
Copy link

traincheck-team commented Nov 20, 2024

Describe the bug

related to #3718

An KeyError is thrown inside deepspeed.initialize at runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping, due to inconsistent usage of model parameters and parameters managed by the optimizer.

Full Traceback (Click to Show)
```log
Traceback (most recent call last):
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 53, in 
    expose_bug()
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 50, in expose_bug
    model_engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 553, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: tensor([-0.1179,  0.0190, -0.2361, -0.0330, -0.1097,  0.0685, -0.2788, -0.0424,
         0.1805, -0.1327,  0.2295, -0.1572,  0.2908, -0.2656, -0.2262,  0.2296,
        -0.0720, -0.1561, -0.0781, -0.2091,  0.1213,  0.2913,  0.0560, -0.1072,
         0.0990, -0.2922, -0.1741, -0.0852,  0.1493,  0.0510, -0.3086,  0.0510,
         0.0163,  0.1622, -0.1566,  0.2947, -0.1843,  0.2590,  0.1528, -0.0735,
        -0.1443, -0.1165, -0.3152,  0.1250, -0.0102,  0.1953,  0.2947,  0.2566,
         0.2942,  0.1985, -0.2688, -0.1689,  0.2786, -0.0039,  0.0989,  0.2617,
         0.0778, -0.1381,  0.0307, -0.1530, -0.0360,  0.1461, -0.0746,  0.1142,
        -0.2235, -0.2544, -0.1772, -0.1136, -0.0287,  0.3057,  0.1761,  0.0782,
         0.1699,  0.0997, -0.1385, -0.0923, -0.1219, -0.2313, -0.0925,  0.1703,
         0.0032,  0.3071, -0.0467,  0.0065, -0.2251,  0.2949, -0.2507,  0.2847,
        -0.0638, -0.1945,  0.1630,  0.2463, -0.0249, -0.0586, -0.1923, -0.0291,
         0.0769,  0.2839,  0.0655, -0.3152,  0.2118, -0.0918,  0.0764,  0.2585,
        -0.0240,  0.1981, -0.1708, -0.2991,  0.1741,  0.0652,  0.2600, -0.2397,
         0.0431,  0.0839,  0.1979,  0.0051, -0.2389,  0.0657,  0.1400,  0.3115,
         0.2091, -0.2200,  0.2610, -0.0994, -0.2996, -0.2710, -0.0466,  0.0237,
        -0.0053,  0.2881,  0.0077,  0.1194,  0.0026,  0.1493,  0.2361,  0.2915,
         0.1206,  0.2198, -0.2668, -0.2032, -0.2406,  0.2112,  0.1519,  0.1636,
        -0.1826, -0.2490,  0.2637,  0.2380, -0.2703, -0.2249,  0.0025,  0.1674,
        -0.2751,  0.0442, -0.1255,  0.0972, -0.2766,  0.0444,  0.0058,  0.0765,
         0.2798,  0.0579,  0.2499, -0.2095,  0.0173,  0.0000], device='cuda:0',
       requires_grad=True)
[2024-11-20 13:00:07,916] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3765479
```

Suspected Root Cause

deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

This issue can be triggered in any case if in the arguments to deepspeed.initialize, parameters in optimizer.param_groups is not a subset of model.parameters.

At the fault location, the code is trying to access parameter's names stored in self.param_names using tensors in self.bit16_groups.

location-of-the-bug

self.bit16_groups is populated from optimizer.param_groups, while
bit16-params-is-from-optimizer

self.param_names is populated from the model itself.

param-names-source-is-the-model

Thus, if the optimizer's parameters are not exactly a subset of the model, a KeyError will be thrown.
The case where optimizer's parameters are not exactly a subset of the model is quite common, due to optimization techniques like Parameter Grouping and ZeRO Optimization.

To Reproduce

We prepared a rather simple reproduction script to reproduce this error. In this script, deepspeed.initialize is accidently called twice. After the first deepspeed.initialize, optimizer.param_groups was consolidated into one single parameter, and causes key error in the second deepspeed.initialize.

  1. Install deepspeed 0.15.4

  2. run bug.py using deepspeed --num_gpus=1 bug.py

# bug.py
import torch
import deepspeed
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 5)
    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)

# Main function to expose the bug
def expose_bug():
    # Initialize model
    model = SimpleModel()
    # Initialize DeepSpeed configurations for fp16
    ds_config_fp16 = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True,},
        "zero_optimization": {"stage": 2}
    }

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    # optimizer have 4 params now
    print(optimizer.param_groups)
    # Initialize DeepSpeed engine
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
    # optimizer have 1 params now
    print(optimizer.param_groups)
    # EXCEPTION!!!
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

if __name__ == "__main__":
    expose_bug()
  1. Notice that the second deepspeed.initialize throws the KeyError exception.

  2. Also notice that the first print of optimizer.param_groups shows 4 params, while the second print shows only one param (the content of one param is the merge of the 4 param).

prior to deepspeed.initialize
params-not-merged

After deepspeed.initialize
merged-params

Since in the second deepspeed.initialize, the merged param actually does not exist in the model, an KeyError will be thrown.

Expected behavior / Suggested Fix

We expect two behaviors here from DeepSpeed

  1. Forbid deepspeed.initialize on models / optimizers that have already been used in another deepspeed.initialize.
  2. Check for "optimizer.param_group should be a subset of model.parameters()" explicitly and throw a more user-friendly exception or warning.

ds_report output

Click to Show
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB

I will be more than happy to contribute to the two suggested fixes, let me know what you think!

@tjruwase
Copy link
Contributor

tjruwase commented Dec 4, 2024

@traincheck-team, thanks for the detailed report. Are you able to provide a fix?

@traincheck-team
Copy link
Author

traincheck-team commented Dec 4, 2024

@tjruwase
Sure! We can work on that. Do you prefer to implement 1 or 2 or both in our suggested fix?
The accelerate library is doing a similar thing to 2 at huggingface/accelerate#3213

Personally, I think both should be done.

1 should be implemented to raise an exception when invoking deepspeed.initialize on models / optimizers that have already been used in another deepspeed.initialize. This also closes issues caused by user-side repeated initialization.

2 can be implemented to raise a friendly warning if optimizer.param_group is not a subset of model.parameters().

On the downside, 1 limits API flexibility while 2 adds to already crowded log messages. Let me know what you think!

@tjruwase
Copy link
Contributor

tjruwase commented Dec 4, 2024

@traincheck-team, thanks for agreeing to help and sharing the trade-offs. I prefer option 1 because nested deepspeed.initialize is not something that we want supported at this point in time. I don't see a benefit for such feature, and it could result in strange behaviors. Also, it would be great if you could convert your repro into a unit test.

@traincheck-team
Copy link
Author

Sure. Will ship the PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants