
Detail-Oriented CLIP for Fine-Grained Tasks (Amin Karimi Monsefi, Kishore Prakash Sailaja, Ali Alilooee, Ser-Nam Lim, Rajiv Ramnath)

DetailCLIP enhances CLIP-based models for fine-grained tasks such as segmentation by combining patch-level comparison and pixel-level reconstruction with an attention-based token-removal mechanism that keeps only the semantically relevant patches. This yields superior segmentation accuracy and better generalization across diverse datasets.
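As a rough, repository-agnostic sketch of the token-removal idea (the function and tensor names below are illustrative, not the repository's implementation), the most [CLS]-attended patch tokens are kept and the rest are discarded:

```python
import torch

def keep_salient_patches(patch_tokens, cls_attention, keep_ratio=0.5):
    """patch_tokens: (B, N, D) ViT patch embeddings; cls_attention: (B, N) attention
    from the [CLS] token to each patch. Keeps the top keep_ratio fraction of patches."""
    num_keep = max(1, int(patch_tokens.shape[1] * keep_ratio))
    top_idx = cls_attention.topk(num_keep, dim=1).indices                  # (B, num_keep)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
    return patch_tokens.gather(1, gather_idx)                              # (B, num_keep, D)
```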

DetailCLIP Architecture

Performance comparison of various models on detail-oriented visual tasks, including segmentation and object detection. All models use a ViT-B (Vision Transformer Base) vision backbone.

Segmentation & Object Detection

| Methods | Dataset | Epochs | Effective Views | ADE20K (UperNet) | ADE20K (Linear) | COCO (AP^b) | COCO (AP^m) |
|---|---|---|---|---|---|---|---|
| **Self-Supervised** | | | | | | | |
| DeiT | IN-1K | 300 | 720M | 47.4 | - | 44.1 | 39.8 |
| MAE | IN-1K | 800 | 960M | 46.5 | 34.3 | 46.2 | 39.1 |
| DINO | IN-1K | 800 | 960M | 46.8 | 34.5 | 47.4 | 40.1 |
| iBOT | IN-1K | 300 | 720M | 47.3 | 34.7 | 48.4 | 42.1 |
| AttMask | IN-1K | 300 | 432M | 47.5 | 35.2 | 48.9 | 42.2 |
| **CLIP-Based Models** | | | | | | | |
| CLIP | 400M | - | - | 46.4 | 34.2 | 43.6 | 39.5 |
| SLIP | YFCC-15M | 25 | 750M | 46.6 | 36.1 | 44.0 | 40.3 |
| MaskCLIP | YFCC-15M | 25 | 750M | 47.5 | 36.3 | 45.8 | 40.9 |
| A-CLIP | YFCC-15M | 25 | 750M | 47.0 | 34.7 | 45.8 | 41.7 |
| DetailCLIP | YFCC-15M | 25 | 750M | 48.1 | 37.3 | 48.9 | 42.5 |
| DetailCLIP | YFCC-15M | 50 | 1500M | 48.8 | 39.3 | 50.1 | 43.3 |

The table below reports DetailCLIP's image-to-text (I2T) and text-to-image (T2I) retrieval performance on Flickr30K and COCO, along with ImageNet-1K zero-shot classification accuracy.

Image-to-Text and Text-to-Image Retrieval

| Methods | Epochs | Flickr30K (I2T) | Flickr30K (T2I) | COCO (I2T) | COCO (T2I) | IN-1K (0-Shot) |
|---|---|---|---|---|---|---|
| CLIP | 25 | 51.4 | 32.6 | 27.9 | 17.6 | 37.6 |
| SLIP | 25 | 57.2 | 41.2 | 33.6 | 21.9 | 42.8 |
| MaskCLIP | 25 | 60.0 | 38.8 | 34.1 | 21.2 | 42.7 |
| A-CLIP | 25 | 62.7 | 42.1 | 38.0 | 23.2 | 43.9 |
| DetailCLIP | 25 | 62.8 | 42.2 | 38.3 | 22.9 | 43.9 |
| CLIP | 50 | 53.9 | 35.8 | 30.2 | 19.2 | 39.4 |
| SLIP | 50 | 60.6 | 41.1 | 33.2 | 22.3 | 44.1 |
| A-CLIP | 50 | 66.7 | 43.2 | 39.8 | 24.4 | 46.3 |
| DetailCLIP | 50 | 65.9 | 44.7 | 39.8 | 24.9 | 46.2 |

Installation

The code has been tested with PyTorch 2.0.0, torchvision 0.15.0, CUDA 11.7, and timm 0.5.4.

YFCC15M Setup: Please refer to SLIP's repo.

Usage

The code has been tested with SLURM distributed training. A sample SLURM job script is provided in sample_jobscript.sh.

Checkpoint

50 epoch checkpoint

Train DetailCLIP ViT-B/16:

```bash
python -m torch.distributed.launch --nproc_per_node=$NUM_PROC --nnodes=$NUM_NODES main.py \
    --model DetailCLIP_VITB16 --dataset yfcc15m --metadata yfcc15m.pkl \
    --output-dir output/$JOB_NAME --mask-ratio 0.5 --epochs 50 \
    --batch-size 256 --lr 5e-4 --wd 0.5 \
    --workers $NUM_WORKERS --clip_loss_weight 1 --ibot_patch_loss_weight 1 \
    --ibot_cls_loss_weight 1 --reconst_loss_weight 1 --print-freq 1000
```
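On newer PyTorch releases where torch.distributed.launch is deprecated, the equivalent torchrun invocation below should also work, assuming main.py reads the local rank from the standard LOCAL_RANK environment variable (unverified against this repository):

```bash
# Same arguments as above; torchrun supplies LOCAL_RANK via the environment instead of --local_rank
torchrun --nproc_per_node=$NUM_PROC --nnodes=$NUM_NODES main.py \
    --model DetailCLIP_VITB16 --dataset yfcc15m --metadata yfcc15m.pkl \
    --output-dir output/$JOB_NAME --mask-ratio 0.5 --epochs 50 \
    --batch-size 256 --lr 5e-4 --wd 0.5 \
    --workers $NUM_WORKERS --clip_loss_weight 1 --ibot_patch_loss_weight 1 \
    --ibot_cls_loss_weight 1 --reconst_loss_weight 1 --print-freq 1000
```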

Zero-shot evaluation

```bash
python eval_zeroshot.py --resume /path/to/checkpoint.pt
```
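For reference, CLIP-style zero-shot classification reduces to cosine similarity between image embeddings and text embeddings of class-name prompts; the minimal sketch below illustrates the idea (the tensor layout is a simplified assumption, and eval_zeroshot.py implements the actual protocol):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_features, class_text_features):
    """image_features: (B, D); class_text_features: (C, D) text embeddings of class prompts.
    Returns the index of the best-matching class for each image."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(class_text_features, dim=-1)
    logits = img @ txt.t()      # cosine similarity between each image and each class prompt
    return logits.argmax(dim=-1)
```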

For semantic segmentation and object detection evaluation, refer to iBOT's repo. We use the mmseg framework to run the evaluations.

extract_backbone_weights.py can be used to extract the backbone weights from a trained checkpoint, which is useful for evaluation and other downstream tasks.
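As a rough illustration of what that script does, backbone extraction typically amounts to filtering the vision-encoder keys out of the full checkpoint; the key prefix used below is an assumption, so rely on extract_backbone_weights.py for the correct mapping:

```python
import torch

# Hypothetical key prefix -- the real prefix depends on how the checkpoint was saved;
# extract_backbone_weights.py handles the correct mapping.
PREFIX = "module.visual."

ckpt = torch.load("checkpoint.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
backbone = {k[len(PREFIX):]: v for k, v in state.items() if k.startswith(PREFIX)}
torch.save(backbone, "detailclip_vitb16_backbone.pth")
```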

Visualization

Attention Maps and Token Removal

Segmentation

Object Detection

Citation

If the code or paper helped your work, please cite:

```bibtex
@misc{monsefi2024detailclipdetailorientedclipfinegrained,
      title={DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks}, 
      author={Amin Karimi Monsefi and Kishore Prakash Sailaja and Ali Alilooee and Ser-Nam Lim and Rajiv Ramnath},
      year={2024},
      eprint={2409.06809},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.06809}, 
}
```
