[BUG] params_in_ipg_bucket AssertionError in backward if gradient_checkpointing is enabled #4505

Closed
bcol23 opened this issue Oct 12, 2023 · 3 comments
Labels
bug Something isn't working deepspeed-chat Related to DeepSpeed-Chat

Comments

@bcol23

bcol23 commented Oct 12, 2023

Describe the bug
During Step 2 (Reward Model) of DeepSpeed-Chat, an AssertionError is raised during the backward pass with ZeRO stage 3 when gradient_checkpointing is enabled; training runs fine when gradient_checkpointing is disabled.
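In case it helps, here is a rough sketch of the kind of setup that hits this (the model name, config values, and plain HF classification head are placeholders, not the actual DeepSpeed-Chat step-2 code):

# Hypothetical minimal setup sketch; model name, batch size, and lr are placeholders.
import deepspeed
from transformers import AutoModelForSequenceClassification

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                      # illustrative
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},   # illustrative
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},                        # ZeRO stage 3
}

model = AutoModelForSequenceClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=1)                    # stand-in for the reward model
model.gradient_checkpointing_enable()                         # disabling this makes backward succeed

engine, *_ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# In the training loop:
#   outputs = engine(**batch)
#   engine.backward(outputs.loss)   # AssertionError raised here with checkpointing enabled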

Log output

Traceback (most recent call last):
  File "run_bloom.py", line 49, in <module>
    main()
  File "run_bloom.py", line 45, in main
    trainer.train()
  File "trainer.py", line 177, in train
    model.backward(loss)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1929, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2094, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/miniconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1074, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1369, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1109, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1146, in __reduce_and_partition_ipg_grads
    assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError

I've added some prints before the assertion in stage3.py:

print(set(p.ds_id for p in self.params_in_ipg_bucket))
>>> {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292}

print([p.ds_id for p in self.params_in_ipg_bucket])
>>> [290, 289, 288, 287, 285, 286, 284, 283, 282, 281, 279, 280, 278, 277, 276, 275, 273, 274, 272, 271, 270, 269, 267, 268, 266, 265, 264, 263, 261, 262, 260, 259, 258, 257, 255, 256, 254, 253, 252, 251, 249, 250, 248, 247, 246, 245, 243, 244, 242, 241, 240, 239, 237, 238, 236, 235, 234, 233, 231, 232, 230, 229, 228, 227, 225, 226, 224, 223, 222, 221, 219, 220, 218, 217, 216, 215, 213, 214, 212, 211, 210, 209, 207, 208, 206, 205, 204, 203, 201, 202, 200, 199, 198, 197, 195, 196, 194, 193, 192, 191, 189, 190, 188, 187, 186, 185, 183, 184, 182, 181, 180, 179, 177, 178, 176, 175, 174, 173, 171, 172, 170, 169, 168, 167, 165, 166, 164, 163, 162, 161, 159, 160, 158, 157, 156, 155, 153, 154, 152, 151, 150, 149, 147, 148, 146, 145, 144, 143, 141, 142, 140, 139, 138, 137, 135, 136, 134, 133, 132, 131, 129, 130, 128, 127, 126, 125, 123, 124, 122, 121, 120, 119, 117, 118, 116, 115, 114, 113, 111, 112, 110, 109, 108, 107, 105, 106, 104, 103, 102, 101, 99, 100, 98, 97, 96, 95, 93, 94, 92, 91, 90, 89, 87, 88, 86, 85, 84, 83, 81, 82, 80, 79, 78, 77, 75, 76, 74, 73, 72, 71, 69, 70, 68, 67, 66, 65, 63, 64, 62, 61, 60, 59, 57, 58, 56, 55, 54, 53, 51, 52, 50, 49, 48, 47, 45, 46, 44, 43, 42, 41, 39, 40, 38, 37, 36, 35, 33, 34, 32, 31, 30, 29, 27, 28, 26, 25, 24, 23, 21, 22, 20, 19, 18, 17, 15, 16, 14, 13, 12, 11, 9, 10, 8, 7, 6, 5, 3, 4, 291, 292, 290, 289, 288, 287, 285, 286, 284, 283, 282, 281, 279, 280, 278, 277, 276, 275, 273, 274, 272, 271, 270, 269, 267, 268, 266, 265, 264, 263, 261, 262, 260, 259, 258, 257, 255, 256, 254, 253, 252, 251, 249, 250, 248, 247, 246, 245, 243, 244, 242, 241, 240, 239, 237, 238, 236, 235, 234, 233, 231, 232, 230, 229, 228, 227, 225, 226, 224, 223, 222, 221, 219, 220, 218, 217, 216, 215, 213, 214, 212, 211, 210, 209, 207, 208, 206, 205, 204, 203, 201, 202, 200, 199, 198, 197, 195, 196, 194, 193, 192, 191, 189, 190, 188, 187, 186, 185, 183, 184, 182, 181, 180, 179, 177, 178, 176, 175, 174, 173, 171, 172, 170, 169, 168, 167, 165, 166, 164, 163, 162, 161, 159, 160, 158, 157, 156, 155, 153, 154, 152, 151, 150, 149, 147, 148, 146, 145, 144, 143, 141, 142, 140, 139, 138, 137, 135, 136, 134, 133, 132, 131, 129, 130, 128, 127, 126, 125, 123, 124, 122, 121, 120, 119, 117, 118, 116, 115, 114, 113, 111, 112, 110, 109, 108, 107, 105, 106, 104]

print(len(self.params_in_ipg_bucket))
>>> 477

So the bucket holds 477 entries but only 290 unique ds_ids (3–292): the ds_id sequence wraps around and repeats, i.e. the same parameters get registered in the bucket a second time, which is what trips the assertion:

assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.0.1
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.87 GB
@bcol23 bcol23 added bug Something isn't working deepspeed-chat Related to DeepSpeed-Chat labels Oct 12, 2023
@bcol23 bcol23 changed the title [BUG] params_in_ipg_bucket AssertionError in backward for stage 3 [BUG] params_in_ipg_bucket AssertionError in backward if gradient_checkpointing is enabled Oct 17, 2023
@bcol23
Author

bcol23 commented Oct 17, 2023

Same issue as in trl: huggingface/trl#835

@bcol23
Author

bcol23 commented Oct 18, 2023

Fixed by adding {"reduce_bucket_size": 1e6} into the zero_opt_dict

@bcol23 bcol23 closed this as completed Oct 18, 2023
@twotwoiscute

> Fixed by adding {"reduce_bucket_size": 1e6} to the zero_opt_dict.

How did you find out that solving this issue involved adjusting reduce_bucket_size? Thanks!
