🚀 Feature

Most of us know that CLIPScore doesn't work well for long captions; torchmetrics therefore truncates the text to 77 tokens so the results stay valid. However, it would be nice to have a metric that can score long captions properly, so I would like to propose using Jina-CLIP-v2 for long-caption CLIPScore calculation. We could either add it on top of the existing CLIPScore metric (it's very similar; just accept a Jina string in model_name_or_path) or set it up as its own metric.
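For the first option, the user-facing API might look roughly like this (a hypothetical sketch; CLIPScore does not accept this checkpoint string today):

```python
# hypothetical API: CLIPScore does not currently accept a Jina checkpoint string
from torchmetrics.multimodal import CLIPScore

metric = CLIPScore(model_name_or_path="jinaai/jina-clip-v2")
```

The standalone version could look like the code below.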
```python
import torch
from torchmetrics import Metric
from torchvision import transforms


class CLIPJinaScore(Metric):
    """Implements the CLIPScore using the Jina-CLIP-v2 model."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model, self.processor = self._get_jina_model_and_processor()
        self.add_state("score", torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state(
            "n_samples", torch.tensor(0, dtype=torch.long), dist_reduce_fx="sum"
        )

    def update(self, images, text):
        """Update score on a batch of images and text."""
        score, n_samples = self._score_update(images, text, self.model, self.processor)
        self.score += score.sum(0)
        self.n_samples += n_samples

    def compute(self):
        """Compute the accumulated score, clamped at zero as in CLIPScore."""
        return torch.max(self.score / self.n_samples, torch.zeros_like(self.score))

    def _get_jina_model_and_processor(self):
        """Return the Jina-CLIP-v2 model and processor."""
        from transformers import AutoModel, AutoProcessor

        model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)
        processor = AutoProcessor.from_pretrained(
            "jinaai/jina-clip-v2", trust_remote_code=True
        )
        return model, processor

    def _score_update(self, images, text, model, processor):
        """Compute per-sample scores for a batch of images and text."""
        device = images[0].device
        processed_input = processor(
            text=text,
            # the preprocessor takes PIL images only
            images=[transforms.functional.to_pil_image(i.cpu()) for i in images],
            return_tensors="pt",
            padding=True,
        )
        # L2-normalize both embeddings so their dot product is a cosine similarity
        img_features = model.get_image_features(
            processed_input["pixel_values"].to(device)
        )
        img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)
        txt_features = model.get_text_features(
            processed_input["input_ids"].to(device),
            processed_input["attention_mask"].to(device),
        )
        txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)
        # cosine similarity between feature vectors, scaled by 100 as in CLIPScore
        score = 100 * (img_features * txt_features).sum(axis=-1)
        return score, len(text)
```
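For reference, a minimal usage sketch for the class above (the batch size, image size, and captions here are made up for illustration; images are uint8 CHW tensors):

```python
import torch

metric = CLIPJinaScore()
images = torch.randint(0, 256, (2, 3, 224, 224), dtype=torch.uint8)  # dummy batch
captions = [
    "a long, detailed caption that would normally be truncated at 77 tokens by CLIP",
    "another verbose, multi-sentence description of the second image in the batch",
]
metric.update(images, captions)
print(metric.compute())  # scalar score, clamped to [0, 100]
```

Since Jina-CLIP-v2 supports long text contexts, the captions here would not need to fit within CLIP's 77-token limit.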
Let me know what you think.
cc @Borda @lantiga @awaelchli