[BUG] params_in_ipg_bucket AssertionError in backward if gradient_checkpointing is enabled #4505

Closed
bcol23 opened this issue Oct 12, 2023 · 3 comments
Labels
bug Something isn't working deepspeed-chat Related to DeepSpeed-Chat

Comments

@bcol23

bcol23 commented Oct 12, 2023

Describe the bug
During Step 2 (Reward Model) of DeepSpeed-Chat, an AssertionError is raised during the backward pass with ZeRO stage 3 when gradient_checkpointing is enabled; training runs fine when gradient_checkpointing is disabled.
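In case it helps, here is a rough sketch of the kind of setup that hits this (the model name, config values, and plain HF classification head are placeholders, not the actual DeepSpeed-Chat step-2 code):

# Hypothetical minimal setup sketch; model name, batch size, and lr are placeholders.
import deepspeed
from transformers import AutoModelForSequenceClassification

ds_config = {
    "train_micro_batch_size_per_gpu": 8,                      # illustrative
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},   # illustrative
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},                        # ZeRO stage 3
}

model = AutoModelForSequenceClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=1)                    # stand-in for the reward model
model.gradient_checkpointing_enable()                         # disabling this makes backward succeed

engine, *_ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# In the training loop:
#   outputs = engine(**batch)
#   engine.backward(outputs.loss)   # AssertionError raised here with checkpointing enabled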

Log output

Traceback (most recent call last):
  File "run_bloom.py", line 49, in <module>
    main()
  File "run_bloom.py", line 45, in main
    trainer.train()
  File "trainer.py", line 177, in train
    model.backward(loss)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1929, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2094, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/miniconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1074, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1369, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1109, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/miniconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1146, in __reduce_and_partition_ipg_grads
    assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError

I've added some prints before the assertion in stage3.py:

print(set(p.ds_id for p in self.params_in_ipg_bucket))
>>> {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292}

print([p.ds_id for p in self.params_in_ipg_bucket])
>>> [290, 289, 288, 287, 285, 286, 284, 283, 282, 281, 279, 280, 278, 277, 276, 275, 273, 274, 272, 271, 270, 269, 267, 268, 266, 265, 264, 263, 261, 262, 260, 259, 258, 257, 255, 256, 254, 253, 252, 251, 249, 250, 248, 247, 246, 245, 243, 244, 242, 241, 240, 239, 237, 238, 236, 235, 234, 233, 231, 232, 230, 229, 228, 227, 225, 226, 224, 223, 222, 221, 219, 220, 218, 217, 216, 215, 213, 214, 212, 211, 210, 209, 207, 208, 206, 205, 204, 203, 201, 202, 200, 199, 198, 197, 195, 196, 194, 193, 192, 191, 189, 190, 188, 187, 186, 185, 183, 184, 182, 181, 180, 179, 177, 178, 176, 175, 174, 173, 171, 172, 170, 169, 168, 167, 165, 166, 164, 163, 162, 161, 159, 160, 158, 157, 156, 155, 153, 154, 152, 151, 150, 149, 147, 148, 146, 145, 144, 143, 141, 142, 140, 139, 138, 137, 135, 136, 134, 133, 132, 131, 129, 130, 128, 127, 126, 125, 123, 124, 122, 121, 120, 119, 117, 118, 116, 115, 114, 113, 111, 112, 110, 109, 108, 107, 105, 106, 104, 103, 102, 101, 99, 100, 98, 97, 96, 95, 93, 94, 92, 91, 90, 89, 87, 88, 86, 85, 84, 83, 81, 82, 80, 79, 78, 77, 75, 76, 74, 73, 72, 71, 69, 70, 68, 67, 66, 65, 63, 64, 62, 61, 60, 59, 57, 58, 56, 55, 54, 53, 51, 52, 50, 49, 48, 47, 45, 46, 44, 43, 42, 41, 39, 40, 38, 37, 36, 35, 33, 34, 32, 31, 30, 29, 27, 28, 26, 25, 24, 23, 21, 22, 20, 19, 18, 17, 15, 16, 14, 13, 12, 11, 9, 10, 8, 7, 6, 5, 3, 4, 291, 292, 290, 289, 288, 287, 285, 286, 284, 283, 282, 281, 279, 280, 278, 277, 276, 275, 273, 274, 272, 271, 270, 269, 267, 268, 266, 265, 264, 263, 261, 262, 260, 259, 258, 257, 255, 256, 254, 253, 252, 251, 249, 250, 248, 247, 246, 245, 243, 244, 242, 241, 240, 239, 237, 238, 236, 235, 234, 233, 231, 232, 230, 229, 228, 227, 225, 226, 224, 223, 222, 221, 219, 220, 218, 217, 216, 215, 213, 214, 212, 211, 210, 209, 207, 208, 206, 205, 204, 203, 201, 202, 200, 199, 198, 197, 195, 196, 194, 193, 192, 191, 189, 190, 188, 187, 186, 185, 183, 184, 182, 181, 180, 179, 177, 178, 176, 175, 174, 173, 171, 172, 170, 169, 168, 167, 165, 166, 164, 163, 162, 161, 159, 160, 158, 157, 156, 155, 153, 154, 152, 151, 150, 149, 147, 148, 146, 145, 144, 143, 141, 142, 140, 139, 138, 137, 135, 136, 134, 133, 132, 131, 129, 130, 128, 127, 126, 125, 123, 124, 122, 121, 120, 119, 117, 118, 116, 115, 114, 113, 111, 112, 110, 109, 108, 107, 105, 106, 104]

print(len(self.params_in_ipg_bucket))
>>> 477

So the bucket holds 477 entries but only 290 unique ds_ids (3–292): the ds_id sequence wraps around and repeats, i.e. the same parameters get registered in the bucket a second time, which is what trips the assertion:

assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.0.1
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.87 GB
@bcol23 bcol23 added bug Something isn't working deepspeed-chat Related to DeepSpeed-Chat labels Oct 12, 2023
@bcol23 bcol23 changed the title [BUG] params_in_ipg_bucket AssertionError in backward for stage 3 [BUG] params_in_ipg_bucket AssertionError in backward if gradient_checkpointing is enabled Oct 17, 2023
@bcol23
Author

bcol23 commented Oct 17, 2023

Same issue as in trl: huggingface/trl#835

@bcol23
Author

bcol23 commented Oct 18, 2023

Fixed by adding {"reduce_bucket_size": 1e6} into the zero_opt_dict

@bcol23 bcol23 closed this as completed Oct 18, 2023
@twotwoiscute

> Fixed by adding {"reduce_bucket_size": 1e6} to the zero_opt_dict.

How did you find out that solving this issue involved adjusting reduce_bucket_size? Thanks!
