CLIPScore with different multimodal models for longer captions #2906

Open
arijit-hub opened this issue Jan 15, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@arijit-hub

🚀 Feature

Most of us know that CLIPScore doesn't work well for long captions: the original CLIP text encoder has a fixed context window, so torchmetrics truncates the text tokens to 77 to produce valid results. However, it would be nice to have a metric that can score long captions properly. I would therefore like to propose using Jina CLIP v2, which accepts much longer text inputs, for long-caption CLIPScore calculation. We could add it on top of the existing CLIPScore metric (it's very similar; we would just accept a Jina identifier in model_name_or_path, as in the sketch right after this paragraph), or set it up as its own metric (the CLIPJinaScore class further below).
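If we went the model_name_or_path route, usage could look like this (hypothetical; CLIPScore does not currently accept this checkpoint, so this only illustrates the proposed API):

from torchmetrics.multimodal.clip_score import CLIPScore

# hypothetical: pass the Jina checkpoint instead of an OpenAI CLIP one
metric = CLIPScore(model_name_or_path="jinaai/jina-clip-v2")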

import torch
from torchmetrics import Metric
from torchvision import transforms


class CLIPJinaScore(Metric):
    """Implements the CLIPScore using the Jina-CLIP-v2 model."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model, self.processor = self._get_jina_model_and_processor()
        self.add_state("score", torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state(
            "n_samples", torch.tensor(0, dtype=torch.long), dist_reduce_fx="sum"
        )

    def update(self, images, text):
        """Update score on a batch of images and text."""
        score, n_samples = self._score_update(images, text, self.model, self.processor)
        self.score += score.sum(0)
        self.n_samples += n_samples

    def compute(self):
        """Compute accumulated score."""
        return torch.max(self.score / self.n_samples, torch.zeros_like(self.score))

    def _get_jina_model_and_processor(self):
        """Returns the Jina-CLIP-v2 model and processor."""
        from transformers import AutoModel, AutoProcessor

        model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

        processor = AutoProcessor.from_pretrained(
            "jinaai/jina-clip-v2", trust_remote_code=True
        )

        return model, processor

    def _score_update(self, images, text, model, processor):
        """Update score on a batch of images and text."""

        device = images[0].device

        processed_input = processor(
            text=text,
            images=[transforms.functional.to_pil_image(i.cpu()) for i in images],  # the Jina processor accepts PIL images only
            return_tensors="pt",
            padding=True,
        )

        img_features = model.get_image_features(
            processed_input["pixel_values"].to(device)
        )
        img_features = img_features / img_features.norm(p=2, dim=-1, keepdim=True)

        txt_features = model.get_text_features(
            processed_input["input_ids"].to(device),
            processed_input["attention_mask"].to(device),
        )
        txt_features = txt_features / txt_features.norm(p=2, dim=-1, keepdim=True)

        # cosine similarity between feature vectors
        score = 100 * (img_features * txt_features).sum(dim=-1)
        return score, len(text)
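For completeness, a minimal usage sketch (hypothetical; it assumes the CLIPJinaScore class above and uses random uint8 image tensors purely for demonstration):

import torch

metric = CLIPJinaScore()
# dummy CHW uint8 images; in practice these would be real image tensors
images = [torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8) for _ in range(2)]
captions = [
    "A long, multi-sentence caption that the standard CLIP tokenizer would truncate at 77 tokens ...",
    "Another detailed description of the second image ...",
]
metric.update(images, captions)
print(metric.compute())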

Let me know what you think.

cc @Borda @lantiga @awaelchli

@arijit-hub arijit-hub added the enhancement New feature or request label Jan 15, 2025

Hi! Thanks for your contribution, great first issue!
