Hi, thanks for this amazing library.
I noticed one tiny issue: the final weights of the model are different when training with multiple sub_batches per step versus one big batch per step. I'm not sure whether such numerical differences are expected when using this library.
I'm using CLIP with a contrastive loss. Here's my quick experimental code, which I made sure to run multiple times; it produces exactly the same output each time:
(Note: I'm using CLIP with 151 million parameters and a dataset of only 32 samples for experimental purposes.)
Above we see that training with two sub_batches of 8 versus one batch of 16 gives a tiny difference in the norm of the weights of the two models.
Above we see that the models are equivalent when GradCache performs a backward pass after every sub_batch.
Above we see that the difference still exists for two different GradCache sub_batch sizes.
However, this library still works amazingly well: if I compare it with normal training at whatever maximum batch size fits on my GPU, I get a huge difference (which is expected, and exactly why I need this library), as seen below.
Below is my code in case the problem is with it:
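(What follows is not the actual code from this report, just a minimal sketch of the comparison being described. It assumes the `GradCache(models, chunk_sizes, loss_fn)` constructor and call pattern from the GradCache README; the two tiny `nn.Sequential` encoders and the random tensors are placeholders standing in for the CLIP towers and the 32-sample dataset.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from grad_cache import GradCache  # https://github.com/luyug/GradCache

# Toy stand-ins for the CLIP image / text towers.
image_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
text_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def contrastive_loss(img_reps, txt_reps, temperature=0.07):
    # CLIP-style in-batch contrastive loss computed over the full batch of reps.
    img_reps = F.normalize(img_reps, dim=-1)
    txt_reps = F.normalize(txt_reps, dim=-1)
    logits = img_reps @ txt_reps.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

optimizer = torch.optim.AdamW(
    list(image_encoder.parameters()) + list(text_encoder.parameters()), lr=1e-4)

# GradCache splits each input into chunks of size `chunk_sizes`, caches the
# representation gradients, then re-runs the sub-batches, so the loss is still
# computed over the full batch of 16 even though only 8 samples go through a
# forward/backward pass at a time.
gc = GradCache(models=[image_encoder, text_encoder],
               chunk_sizes=8,
               loss_fn=contrastive_loss)

images = torch.randn(16, 512)  # dummy "image" features, full batch of 16
texts = torch.randn(16, 512)   # dummy "text" features, full batch of 16

optimizer.zero_grad()
loss = gc(images, texts)       # forward + backward, internally in sub-batches of 8
optimizer.step()

# Compare against a plain full-batch step on identically initialized copies:
#   loss = contrastive_loss(image_encoder(images), text_encoder(texts))
#   loss.backward(); optimizer.step()
# then inspect e.g. sum(p.norm() for p in image_encoder.parameters()).
```

The comparison in the issue amounts to running a step like this once with `chunk_sizes=8` and once with one full-size batch on identically initialized copies, then comparing the final parameter norms.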
Hi! I also train CLIP this way.
The reason for the inconsistency, I think, is that BatchNorm (BN) layers are not synchronized across sub-batches. You should replace BN with GroupNorm (GN).
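For what it's worth, a generic version of that swap could look roughly like the sketch below. It assumes the image tower actually contains `nn.BatchNorm2d` layers (e.g. a ResNet backbone), and `num_groups=32` is an arbitrary choice, not something prescribed by GradCache or CLIP.

```python
import torch.nn as nn

def replace_bn_with_gn(module: nn.Module, num_groups: int = 32) -> None:
    """Recursively swap BatchNorm2d for GroupNorm so normalization statistics
    no longer depend on how the batch is split into sub-batches."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            channels = child.num_features
            # GroupNorm requires num_groups to divide the channel count.
            groups = num_groups if channels % num_groups == 0 else 1
            setattr(module, name,
                    nn.GroupNorm(groups, channels, eps=child.eps, affine=child.affine))
        else:
            replace_bn_with_gn(child, num_groups)
```

Unlike BatchNorm, GroupNorm statistics are computed per sample, so they do not change when a batch is split into sub-batches.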
Besides, I think you should sample without replacement, which means you need to design a custom sampler. This is important if you only use CLIP's loss.
If you don't, the same ID can appear more than once in one iteration, which gives CLIP a wrong optimization target.
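A sketch of one possible sampler, assuming each dataset index can be mapped to an identity via a hypothetical `ids` list (none of these names come from the library):

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class UniqueIdBatchSampler(Sampler):
    """Batch sampler that never puts two samples with the same ID into one
    batch, so every in-batch negative really is a different identity."""

    def __init__(self, ids, batch_size):
        self.batch_size = batch_size
        self.by_id = defaultdict(list)  # id -> list of dataset indices
        for idx, sample_id in enumerate(ids):
            self.by_id[sample_id].append(idx)

    def __iter__(self):
        # Draw one index per ID, shuffle, and cut into fixed-size batches
        # (the remainder that doesn't fill a batch is dropped).
        picks = [random.choice(indices) for indices in self.by_id.values()]
        random.shuffle(picks)
        for i in range(0, len(picks) - self.batch_size + 1, self.batch_size):
            yield picks[i:i + self.batch_size]

    def __len__(self):
        return len(self.by_id) // self.batch_size
```

It would be used as `DataLoader(dataset, batch_sampler=UniqueIdBatchSampler(ids, batch_size))`; note that this particular design draws only one example per ID per epoch.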