CLIPScore with different multimodal models for longer captions #2906

Open
arijit-hub opened this issue Jan 15, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@arijit-hub

🚀 Feature

Most of us know that CLIPScore doesn't work well for long captions: the original CLIP text encoder has a fixed context window, so torchmetrics truncates the text tokens to 77 to produce valid results. However, it would be nice to have a metric that can score long captions properly. I would therefore like to propose using Jina CLIP v2, which accepts much longer text inputs, for long-caption CLIPScore calculation. We could add it on top of the existing CLIPScore metric (it's very similar; we would just accept a Jina identifier in model_name_or_path, as in the sketch right after this paragraph), or set it up as its own metric (the CLIPJinaScore class further below).
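If we went the model_name_or_path route, usage could look like this (hypothetical; CLIPScore does not currently accept this checkpoint, so this only illustrates the proposed API):

from torchmetrics.multimodal.clip_score import CLIPScore

# hypothetical: pass the Jina checkpoint instead of an OpenAI CLIP one
metric = CLIPScore(model_name_or_path="jinaai/jina-clip-v2")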

import torch
from torchmetrics import Metric
from torchvision import transforms


class CLIPJinaScore(Metric):
    """Implements the CLIPScore using the Jina-CLIP-v2 model."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model, self.processor = self._get_jina_model_and_processor()
        self.add_state("score", torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state(
            "n_samples", torch.tensor(0, dtype=torch.long), dist_reduce_fx="sum"
        )

    def update(self, images, text):
        """Update score on a batch of images and text."""
        score, n_samples = self._score_update(images, text, self.model, self.processor)
        self.score += score.sum(0)
        self.n_samples += n_samples

    def compute(self):
        """Compute accumulated score."""
        return torch.max(self.score / self.n_samples, torch.zeros_like(self.score))

    def _get_jina_model_and_processor(self):
        """Returns the Jina-CLIP-v2 model and processor."""
        from transformers import AutoModel, AutoProcessor

        model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

        processor = AutoProcessor.from_pretrained(
            "jinaai/jina-clip-v2", trust_remote_code=True
        )

        return model, processor

    def _score_update(self, images, text, model, processor):
        """Update score on a batch of images and text."""

        device = images[0].device

        processed_input = processor(
            text=text,
            images=[transforms.functional.to_pil_image(i.cpu()) for i in images],  # the Jina processor accepts PIL images only
            return_tensors="pt",
            padding=True,
        )

        img_features = model.get_image_features(
            processed_input["pixel_values"].to(device)
        )
        img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

        txt_features = model.get_text_features(
            processed_input["input_ids"].to(device),
            processed_input["attention_mask"].to(device),
        )
        txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

        # cosine similarity between feature vectors
        score = 100 * (img_features * txt_features).sum(dim=-1)
        return score, len(text)
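For completeness, a minimal usage sketch (hypothetical; it assumes the CLIPJinaScore class above and uses random uint8 image tensors purely for demonstration):

import torch

metric = CLIPJinaScore()
# dummy CHW uint8 images; in practice these would be real image tensors
images = [torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8) for _ in range(2)]
captions = [
    "A long, multi-sentence caption that the standard CLIP tokenizer would truncate at 77 tokens ...",
    "Another detailed description of the second image ...",
]
metric.update(images, captions)
print(metric.compute())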

Let me know what you think.

cc @Borda @lantiga @awaelchli

@arijit-hub arijit-hub added the enhancement New feature or request label Jan 15, 2025

Hi! Thanks for your contribution, great first issue!
