
Question About Gradient Synchronization During the Accelerate Process #3285

Open
Klein-Lan opened this issue Dec 10, 2024 · 4 comments

@Klein-Lan
Hello, I have some questions regarding gradient synchronization that I hope you can help clarify.

In distributed training, we use model = accelerator.prepare(model) to wrap the model.

According to the documentation, we should use the wrapped model returned by accelerator.prepare for forward propagation. However, due to some project constraints, I might not be able to use loss = model(inputs) directly during the forward pass, and would instead use loss = model.module(inputs).

I would like to know if this will affect gradient synchronization when using accelerator.backward(loss). Or, when updating parameters, is it essentially equivalent to using loss = model(inputs) even if I use loss = model.module(inputs)?
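
To make the question concrete, here is a minimal sketch of the two call patterns I mean (the surrounding setup is just a placeholder):

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for inputs in train_loader:
    optimizer.zero_grad()

    # Pattern A: call the wrapped model, i.e. DistributedDataParallel's forward
    loss = model(inputs)

    # Pattern B: what my constraints push me towards, calling the underlying
    # module directly and skipping the DDP forward
    # loss = model.module(inputs)

    accelerator.backward(loss)
    optimizer.step()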

Thank you for your help.

@muellerzr
Collaborator

From what I've read and seen, it should be equivalent/fine. However, for gradient synchronization, we're explicitly avoiding the DistributedDataParallel wrapper, which should still result in the same slowdown/synchronization, I believe.

@muellerzr
Collaborator

Asking around internally to get a solid answer on this, because it's never been something I've looked into for an exact answer before.

@muellerzr
Collaborator

The answer is that it's quite a bit more complex in the end. If we are not doing gradient accumulation, you need to do something like the following:

from accelerate.utils import gather

model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    # Call the underlying module directly, bypassing DistributedDataParallel's forward
    outputs = model.module.model(x)
    # Manually arm DDP's reducer so gradients still get all-reduced during backward
    model.reducer.prepare_for_backward([])
    model._clear_grad_buffer()

    loss = criterion(outputs, y.unsqueeze(1))
    loss.backward()
    optimizer.step()
    # Average the (detached) loss across processes for logging
    loss = gather(loss.detach()).mean()
    state.print(loss)

If we are, there are added checks we need to do, and it requires really digging into the hidden calls inside of DistributedDataParallel.
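
For reference, when gradient accumulation is in play and the wrapped forward is used as intended, the supported route is Accelerate's accumulate context manager, which handles DDP's no_sync bookkeeping for you. A minimal sketch, assuming the Accelerator was created with gradient_accumulation_steps set and reusing the names from the snippet above:

model.train()
for x, y in train_loader:
    # Inside accumulate(), Accelerate skips the gradient all-reduce on
    # non-sync steps (via DDP's no_sync) and only syncs on the final one
    with accelerator.accumulate(model):
        outputs = model(x)  # goes through the DDP forward, not model.module
        loss = criterion(outputs, y.unsqueeze(1))
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()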

So the answer is no, it is not equivalent at all, because you're explicitly avoiding what DDP does under the hood. Is this in conjunction with gradient accumulation or not?

@Klein-Lan
Author

Klein-Lan commented Dec 11, 2024


Thank you for your patient response!

I can simplify my current situation: the model I pass to accelerator.prepare is actually a wrapper composed of two smaller models chained together (input -> model1 -> model2 -> output).

However, during the training phase, I am trying to compute the loss using only the forward pass of one of the models. This forces me to use model.module.model1 for the forward inference. As a result, I am concerned about potential issues with gradient synchronization, especially since I am not currently using gradient accumulation (though I might in the future).

So, would a better solution be as follows:

# Prepare each sub-model separately so both get their own DDP wrapper
model1 = accelerator.prepare(model1)
model2 = accelerator.prepare(model2)

# Forward passes go through the DDP-wrapped models, no .module access needed
intermediate_results = model1(input)
output = model2(intermediate_results)
loss1 = loss_fn(output, ground_truth)

another_output = model1(input)
loss2 = loss_fn(another_output, another_result)

loss = loss1 + loss2
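
To make the intent concrete, the full training step I have in mind would look roughly like this (optimizer, train_loader, loss_fn, and the tensor names are just placeholders):

model1, model2, optimizer, train_loader = accelerator.prepare(
    model1, model2, optimizer, train_loader
)

model1.train()
model2.train()
for input, ground_truth, another_result in train_loader:
    optimizer.zero_grad()

    # Both forward passes go through the DDP wrappers, so the usual
    # synchronization hooks are registered
    intermediate_results = model1(input)
    output = model2(intermediate_results)
    loss1 = loss_fn(output, ground_truth)

    another_output = model1(input)
    loss2 = loss_fn(another_output, another_result)

    loss = loss1 + loss2
    accelerator.backward(loss)
    optimizer.step()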

Thank you again for your help.
