Hi everyone,

After examining `clip.py` and `modules.py`, I noticed a few issues. Starting from the end, the symmetric cross-entropy looks partially incorrect. Specifically:
While I appreciate the introduction of "soft targets" over the original one-hot encoding, I believe the `softmax` should be applied directly within the `cross_entropy` function so that it is computed correctly over both rows and columns. Transposing the targets is not equivalent to applying `softmax` along the first dimension (`dim=0`), which could cause convergence issues, especially at low temperature early in training. Only as the temperature increases do the targets start to resemble an identity matrix, making `targets` close to `targets.T`. Additionally, I suggest clamping the temperature so the logits are never scaled by more than 100, as recommended in the original paper to avoid instability.
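For concreteness, a minimal sketch of what that adjustment could look like (the names `clip_loss` and `logit_scale`, and the exact construction of the soft targets, are my assumptions and may differ from the actual code in `clip.py`):

```python
import torch
import torch.nn.functional as F

def cross_entropy(preds, targets, dim):
    # Softmax is applied here, along the requested dimension, for both the
    # predictions and the soft targets -- so dim=0 (columns) and dim=1 (rows)
    # are handled symmetrically instead of approximated via a transpose.
    log_probs = F.log_softmax(preds, dim=dim)
    soft_targets = F.softmax(targets, dim=dim)
    return (-soft_targets * log_probs).sum(dim=dim).mean()

def clip_loss(image_embeddings, text_embeddings, logit_scale):
    # Clamp so logits are never scaled by more than 100, as recommended in
    # the CLIP paper to avoid training instability.
    logit_scale = logit_scale.clamp(max=100.0)
    logits = logit_scale * image_embeddings @ text_embeddings.T
    # Soft targets from the average of the intra-modal similarities.
    targets = logit_scale * (image_embeddings @ image_embeddings.T +
                             text_embeddings @ text_embeddings.T) / 2
    images_loss = cross_entropy(logits, targets, dim=1)  # over text candidates
    texts_loss = cross_entropy(logits, targets, dim=0)   # over image candidates
    return (images_loss + texts_loss) / 2
```

Computing the softmax inside `cross_entropy` along the requested dimension makes the row-wise and column-wise losses genuinely symmetric, rather than approximating the column case by transposing already-normalized targets.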
To ensure that the model focuses on the directional similarity between image and text embeddings rather than their magnitudes, I would also remove the final layer norm from the `ProjectionHead` module. A layer norm adjusts the input's mean and standard deviation but does not produce unit-norm vectors, so replacing it with L2 normalization of `image_embeddings` and `text_embeddings` would better align with the original loss formulation.
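As a sketch, the projection head might then look something like this (the real `ProjectionHead` in `modules.py` likely has additional layers such as a GELU and dropout; this only illustrates swapping the trailing layer norm for L2 normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Projection head without a trailing LayerNorm. The output is
    L2-normalized instead, so only its direction carries information --
    matching the cosine-similarity logits in the original CLIP loss."""

    def __init__(self, embedding_dim, projection_dim):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)

    def forward(self, x):
        # F.normalize divides by the L2 norm along dim=-1, giving unit-norm
        # embeddings; a LayerNorm here would only standardize mean/std.
        return F.normalize(self.projection(x), dim=-1)
```

With unit-norm embeddings, the dot products in the loss are exactly cosine similarities, which is what the temperature-scaled logits in the original formulation assume.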
These changes should make for a smoother integration with the original CLIP formulation. Let me know if anything is unclear or if I've missed anything. Thanks!