
training speed is very slow #30

Open
liuweie opened this issue Jun 25, 2024 · 5 comments

Comments

@liuweie

liuweie commented Jun 25, 2024

Hi, I use grad_cache to train my model, but training seems very slow. I want to know: is this normal?
Does using GradCache generally affect training speed?

@luyug
Owner

luyug commented Jun 25, 2024

It'd be hard to diagnose this based on qualitative descriptions. Maybe you can share some of your setup, observed & reference throughput/latency, etc.

@liuweie
Author

liuweie commented Jun 26, 2024

> It'd be hard to diagnose this based on qualitative descriptions. Maybe you can share some of your setup, observed & reference throughput/latency, etc.

Thanks. I am using the Hugging Face Trainer to train a Qwen-7B model; here is my setup and the corresponding code:

① The compute-loss function, which overrides the Trainer's original compute_loss:

def compute_loss(
    self,
    model: nn.Module,
    inputs: Dict[str, Union[torch.Tensor, Any]],
    return_outputs: bool = False,
) -> Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:

    features, labels = inputs
    query_input = features[0]
    pos_doc_input = features[1]

    # Build the contrastive loss and the GradCache wrapper on every call.
    loss_fn = DistributedContrastiveLoss(temperature=20, n_hard_negatives=0)
    gc = GradCache(
        models=[self.model, self.model],
        chunk_sizes=1,
        loss_fn=loss_fn,
        get_rep_fn=None,
        fp16=False,
    )

    # GradCache runs the chunked forward/backward itself and returns a detached loss;
    # requires_grad_() keeps the Trainer's own backward call from raising.
    detached_loss = gc(query_input, pos_doc_input).requires_grad_()
    return detached_loss

As you can see, I set chunk_sizes=1, and I also tried 4, 16, and 64; the batch size in the Trainer settings is 256.
The devices are 2× A800 (80 GB) GPUs.
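
As a point of comparison (this sketch is mine, not code from the issue): GradCache is often built once and reused across steps rather than re-created inside every compute_loss call, and with chunk_sizes=1 a 256-example per-device batch is split into 256 chunks per step. A rough construct-once variant, where MyGCTrainer, the chunk size of 32, and the assumption that self.model returns embedding tensors directly are all placeholders:

from transformers import Trainer
from grad_cache import GradCache

class MyGCTrainer(Trainer):
    # Sketch only: build the loss and the GradCache wrapper a single time,
    # not on every training step.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gc_loss_fn = DistributedContrastiveLoss(temperature=20, n_hard_negatives=0)
        self.gc = GradCache(
            models=[self.model, self.model],
            chunk_sizes=32,          # placeholder; chunk_sizes=1 means 256 chunks for a 256-example batch
            loss_fn=self.gc_loss_fn,
            get_rep_fn=None,         # assumes the model returns embedding tensors directly
            fp16=False,
        )

    def compute_loss(self, model, inputs, return_outputs=False):
        features, labels = inputs
        query_input, pos_doc_input = features[0], features[1]
        # GradCache performs the chunked forward/backward internally and returns a detached loss.
        return self.gc(query_input, pos_doc_input).requires_grad_()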

② The DistributedContrastiveLoss function, similar to the DistributedContrastiveLoss in loss.py of the grad_cache package; I only added a temperature parameter to scale the scores:

class DistributedContrastiveLoss(SimpleContrastiveLoss):
    def __init__(self, temperature, n_hard_negatives: int = 0):
        assert dist.is_initialized(), "Distributed training has not been properly initialized."

        super().__init__(temperature=temperature, n_hard_negatives=n_hard_negatives)
        self.world_size = dist.get_world_size()
        self.rank = dist.get_rank()

    def __call__(self, x: Tensor, y: Tensor, **kwargs):
        dist_x = self.gather_tensor(x)
        dist_y = self.gather_tensor(y)

        return super().__call__(dist_x, dist_y, **kwargs)

    def gather_tensor(self, t):
        gathered = [torch.empty_like(t) for _ in range(self.world_size)]
        dist.all_gather(gathered, t)
        # all_gather does not carry gradients, so the local (grad-tracking) tensor
        # is put back into this rank's slot before concatenating.
        gathered[self.rank] = t

        return torch.cat(gathered, dim=0)

where SimpleContrastiveLoss looks like this:

class SimpleContrastiveLoss:
    def __init__(self, temperature, n_hard_negatives: int = 0):
        self.target_per_qry = n_hard_negatives + 1
        self.temperature = temperature

    def __call__(self, x: Tensor, y: Tensor, target: Tensor = None, reduction: str = 'mean'):
        if target is None:
            assert x.size(0) * self.target_per_qry == y.size(0)
            target = torch.arange(0, x.size(0) * self.target_per_qry, self.target_per_qry, device=x.device)

        logits = torch.matmul(x, y.transpose(0, 1))
        logits = logits * self.temperature  # temperature here scales up the similarity scores
        return F.cross_entropy(logits, target, reduction=reduction)
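
For readers following along, here is a tiny illustrative check of the loss above (the shapes and values are made up, not from the issue): with 4 queries, one positive document each, and n_hard_negatives=0, the target is simply [0, 1, 2, 3] and every other document in the batch acts as an in-batch negative.

import torch

loss_fn = SimpleContrastiveLoss(temperature=20, n_hard_negatives=0)
q = torch.randn(4, 128)   # 4 query embeddings
d = torch.randn(4, 128)   # 4 positive document embeddings, aligned with the queries
loss = loss_fn(q, d)      # cross-entropy over a 4x4 similarity matrix scaled by 20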

Finally, the code runs successfully, but the gradient updates are extremely slow.

@luyug
Owner

luyug commented Jun 26, 2024

Can you share the observed runtime and a reference runtime?

Meanwhile, one thing to note is that the Hugging Face Trainer can trigger features like DeepSpeed ZeRO, which came after the GradCache release and therefore may not be smoothly supported.
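
Not from the thread, but one quick way to confirm whether a run is actually going through DeepSpeed/ZeRO (where trainer stands for your Trainer instance, and the attribute names assume a reasonably recent transformers release):

# DeepSpeed is only engaged when a config is passed via TrainingArguments(deepspeed=...).
print(trainer.args.deepspeed)                           # None if no DeepSpeed config was given
print(getattr(trainer, "is_deepspeed_enabled", False))  # True when the run is under DeepSpeed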

@liuweie
Author

liuweie commented Jul 1, 2024

@luyug I think I have figured this problem out, thanks.
But during my experiments I found that the loss is very difficult to converge; here is my log:

{'loss': 4.8621, 'grad_norm': 3.703125, 'learning_rate': 0.0001978902953586498, 'epoch': 0.08}
{'loss': 3.3299, 'grad_norm': 1.140625, 'learning_rate': 0.00019261603375527427, 'epoch': 0.16}
{'loss': 3.0912, 'grad_norm': 3.578125, 'learning_rate': 0.00018734177215189873, 'epoch': 0.23}
{'loss': 2.6461, 'grad_norm': 1.0234375, 'learning_rate': 0.00018206751054852322, 'epoch': 0.31}
{'loss': 1.8239, 'grad_norm': 0.7578125, 'learning_rate': 0.00017679324894514769, 'epoch': 0.39}
{'loss': 2.4623, 'grad_norm': 4.4375, 'learning_rate': 0.00017151898734177218, 'epoch': 0.47}
{'loss': 2.1719, 'grad_norm': 1.1640625, 'learning_rate': 0.00016624472573839661, 'epoch': 0.55}
{'loss': 2.6063, 'grad_norm': 16.125, 'learning_rate': 0.0001609704641350211, 'epoch': 0.62}
{'loss': 2.2289, 'grad_norm': 6.40625, 'learning_rate': 0.0001556962025316456, 'epoch': 0.7}
{'loss': 2.1505, 'grad_norm': 1.5078125, 'learning_rate': 0.00015042194092827003, 'epoch': 0.78}
{'loss': 2.2342, 'grad_norm': 3.375, 'learning_rate': 0.00014514767932489453, 'epoch': 0.86}
{'loss': 1.7903, 'grad_norm': 2.125, 'learning_rate': 0.000139873417721519, 'epoch': 0.93}
{'loss': 2.5553, 'grad_norm': 2.21875, 'learning_rate': 0.00013459915611814345, 'epoch': 1.01}
{'loss': 1.8375, 'grad_norm': 0.95703125, 'learning_rate': 0.00012932489451476795, 'epoch': 1.09}
{'loss': 2.379, 'grad_norm': 0.71875, 'learning_rate': 0.0001240506329113924, 'epoch': 1.17}
{'loss': 2.5603, 'grad_norm': 3.203125, 'learning_rate': 0.00011877637130801689, 'epoch': 1.25}
....
....
...
{'loss': 1.9158, 'grad_norm': 2.84375, 'learning_rate': 3.966244725738397e-05, 'epoch': 2.41}
{'loss': 2.2063, 'grad_norm': 3.625, 'learning_rate': 3.438818565400844e-05, 'epoch': 2.49}
{'loss': 2.1187, 'grad_norm': 1.46875, 'learning_rate': 2.9113924050632914e-05, 'epoch': 2.57}
{'loss': 2.055, 'grad_norm': 0.94140625, 'learning_rate': 2.3839662447257385e-05, 'epoch': 2.65}

As you can see, the loss stays around 2, whereas if I don't use GradCache the loss converges to 0.2.
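
Not part of the original thread, but one way to narrow down this kind of discrepancy is a gradient-equivalence check on a small batch: GradCache should reproduce the gradients of a plain full-batch forward/backward. A rough sketch, assuming encoder is a model that returns embedding tensors and query_batch / doc_batch are small tokenized input dicts (all placeholders):

import copy
import torch
from grad_cache import GradCache

loss_fn = SimpleContrastiveLoss(temperature=20, n_hard_negatives=0)

# Reference: plain forward/backward over the full (small) batch.
ref_model = copy.deepcopy(encoder)
ref_loss = loss_fn(ref_model(**query_batch), ref_model(**doc_batch))
ref_loss.backward()

# GradCache: same batch, chunked forward/backward.
gc_model = copy.deepcopy(encoder)
gc = GradCache(models=[gc_model, gc_model], chunk_sizes=2, loss_fn=loss_fn)
gc_loss = gc(query_batch, doc_batch)

# If caching is wired up correctly, the two gradient sets should match closely.
for (name, p_ref), (_, p_gc) in zip(ref_model.named_parameters(), gc_model.named_parameters()):
    assert torch.allclose(p_ref.grad, p_gc.grad, atol=1e-5), name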

@lfb-1

lfb-1 commented Sep 25, 2024

> @luyug I think I have figured this problem out, thanks. But during my experiments I found that the loss is very difficult to converge; here is my log: […training log quoted above…] As you can see, the loss stays around 2, whereas if I don't use GradCache the loss converges to 0.2.

Hi, I am having a similar issue: the loss does not converge when using GradCache. Did you solve this?
