CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing 👁️📜🖋️

Image caption generation resides at the intersection of computer vision and natural language processing, with its primary goal being the creation of descriptive and coherent textual narratives that faithfully depict the content of an image.

This paper presents two models that leverage CLIP as the image encoder and fine-tune GPT-2 for caption generation on the Flickr30k and Flickr8k datasets. The first model utilizes a straightforward mapping network and outperforms the original architecture with a BLEU-1 score of 0.700, BLEU-4 score of 0.257, and ROUGE score of 0.569 on the Flickr8k dataset. The second model constitutes a new architecture exploring the boundaries of minimal visual information required for captioning. It incorporates CLIP's text encoder to produce input for the generator, while the image embedding serves solely as a validation mechanism. Despite its relatively lower performance, with a BLEU-1 score of 0.546, BLEU-4 score of 0.108, and ROUGE score of 0.444 on the Flickr8k dataset, this model demonstrates the decoder's ability to create captions based on keyword descriptions alone, without direct access to the context vector.
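The second model's use of the image embedding "solely as a validation mechanism" can be illustrated with a small ranking sketch. This is one possible reading, not the authors' implementation: it assumes candidate texts are scored against the image by cosine similarity in CLIP's shared embedding space and that the top-k candidates are kept (matching the "One caption", "Top-2" and "Top-5" rows in the tables). The function names and the toy 3-dimensional vectors are hypothetical; real CLIP embeddings would come from the image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_captions(image_embedding, candidates, k):
    """Return the k candidate texts whose embeddings best match the image.

    candidates is a list of (text, text_embedding) pairs. The image
    embedding is used only for scoring; it is never passed to a decoder.
    """
    scored = [
        (cosine_similarity(image_embedding, embedding), text)
        for text, embedding in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy stand-ins for CLIP embeddings (assumption: real ones are much larger).
image_embedding = [1.0, 0.0, 0.0]
candidates = [
    ("a dog on grass", [0.9, 0.1, 0.0]),
    ("a city street", [0.0, 1.0, 0.0]),
    ("a puppy playing", [0.7, 0.0, 0.7]),
]
print(top_k_captions(image_embedding, candidates, k=2))
```

In this reading, the generator would only ever see the selected texts, which is what makes the setup a "blind guessing" experiment.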

Dataset

We use the Flickr8k and Flickr30k datasets.

Evaluation

We use six metrics: BLEU-1 through BLEU-4, METEOR, and ROUGE.
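For readers unfamiliar with BLEU-n, the following is a minimal sentence-level sketch of the metric: clipped n-gram precision for orders 1 to n, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only. The scores in the tables were presumably computed at corpus level with a standard evaluation toolkit, and the example sentences below are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped precision: each candidate n-gram count is capped by the
    largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter one.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def sentence_bleu(candidate, references, max_n):
    """BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


candidate = "a dog runs on the grass".split()
references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on grass".split(),
]
for n in range(1, 5):
    print(f"BLEU-{n}: {sentence_bleu(candidate, references, n):.3f}")
```

Without smoothing, a single missing n-gram order drives the sentence-level score to zero, which is one reason reported results are normally aggregated over the whole test set.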

Table 1: Result comparison of models trained on Flickr8k

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE
CLIP-prefix (Original)	0.698	0.508	0.363	0.259	0.257	0.565
Ours: CLIP-prefix -- Gradient clipping	0.700	0.513	0.372	0.262	0.257	0.569
Ours: CLIP-prefix -- Custom tokenizer	0.682	0.497	0.351	0.246	0.249	0.557
Ours: SBG -- One caption	0.499	0.276	0.153	0.087	0.167	0.410
Ours: SBG -- Top-2 caption	0.520	0.293	0.161	0.089	0.179	0.420
Ours: SBG -- Top-5 caption	0.546	0.319	0.186	0.108	0.192	0.444
Merge-RNN	0.601	0.411	0.272	0.179	0.191	0.439

Table 2: Result comparison of models trained on Flickr30k

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE
CLIP-prefix (Original)	0.715	0.506	0.351	0.243	0.232	0.529
Ours: CLIP-prefix -- Gradient clipping	0.715	0.503	0.349	0.235	0.233	0.528
Ours: CLIP-prefix -- Custom tokenizer	0.733	0.525	0.366	0.254	0.232	0.536
Ours: SBG -- One caption	0.495	0.261	0.139	0.076	0.154	0.378
Ours: SBG -- Top-2 caption	0.510	0.279	0.150	0.082	0.164	0.391
Ours: SBG -- Top-5 caption	0.543	0.304	0.170	0.095	0.175	0.411
Merge-RNN	0.596	0.404	0.270	0.181	0.175	0.416
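Both tables include a "Gradient clipping" variant of CLIP-prefix. The README does not state which form of clipping or which threshold was used, so the sketch below simply illustrates the common global-norm variant on a flat list of gradient values. The function name and the threshold of 1.0 are assumptions for illustration; in PyTorch the equivalent operation is provided by torch.nn.utils.clip_grad_norm_.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Hypothetical gradient values and threshold.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Rescaling preserves the gradient direction and only limits the step size, which tends to stabilize fine-tuning when occasional batches produce very large gradients.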

Inference

Run our demo on Colab:

CLIP-prefix
SBG

We also provide a Stable Diffusion WebUI extension for running our models locally (CLIP-prefix with gradient clipping and SBG, both trained on Flickr8k). Load it from this repo.

Models

Our model weights are published on Huggingface:

CLIP-prefix
SBG (flickr8k)

The CLIP model used is ViT-L-14.

Contact

Team members: Triet Minh Huynh, Duy Linh Nguyen, Thanh Tri Nguyen

Citation

@InProceedings{10.1007/978-3-031-67357-3_14,
author="Huynh, Triet Minh
and Nguyen, Duy Linh
and Nguyen, Thanh Tri
and Vu, Thuy-Duong Thi
and Dang-Ngoc, Hanh
and Dang, Duc Ngoc Minh",
editor="Vo, Nguyen-Son
and Ha, Dac-Binh
and Jung, Haejoon",
title="CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing",
booktitle="Industrial Networks and Intelligent Systems",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="189--203",
abstract="Image caption generation resides at the intersection of computer vision and natural language processing, with its primary goal being the creation of descriptive and coherent textual narratives that faithfully depict the content of an image. This paper presents two models that leverage CLIP as the image encoder and fine-tune GPT-2 for caption generation on the Flickr30k and Flickr8k datasets. The first model utilizes a straightforward mapping network and outperforms the original architecture with a BLEU-1 score of 0.700, BLEU-4 score of 0.257, and ROUGE score of 0.569 on the Flickr8k dataset. The second model constitutes a new architecture exploring the boundaries of minimal visual information required for captioning. It incorporates CLIP's text encoder to produce input for the generator, while the image embedding serves solely as a validation mechanism. Despite its relatively lower performance, with a BLEU-1 score of 0.546, BLEU-4 score of 0.108, and ROUGE score of 0.444 on the Flickr8k dataset, this model demonstrates the decoder's ability to create captions based on keyword descriptions alone, without direct access to the context vector.",
isbn="978-3-031-67357-3"
}

Acknowledgments

This project was inspired by CLIP_prefix_caption.
