[NeurIPS 2024] TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
This repository provides access to the dataset, pretrained checkpoints, inference code, and training code for our paper, TripletCLIP. We provide our training scripts, written from scratch, to train the models reported in the paper, along with an OpenCLIP variant for easy reproducibility.
- Release High-Quality Subset of TripletData.
- Release all pre-trained and finetuned checkpoints.
- Release TripletCLIP adaptation on OpenCLIP (`./src/openclip`).
- Release data generation scripts.
- Release full TripletData.
- Release original TripletCLIP training scripts for reproducibility.
Below are the checkpoints for the models trained on the CC3M and CC12M datasets. Fine-tuning checkpoints are also provided for further customization.
| Methods | CC3M | CC12M |
|---|---|---|
| LaCLIP | Link | Link |
| LaCLIP+HN | Link | - |
| NegCLIP | Link | Link |
| NegCLIP++ | Link | Link |
| TripletCLIP (ours) | Link | Link |
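
As a rough reference, the snippet below sketches how one of these checkpoints could be loaded for zero-shot inference with the `open_clip` library. The architecture name (`ViT-B-32`), the checkpoint path, and the example image and prompts are assumptions for illustration; match them to the checkpoint and config you actually download.

```python
import torch
import open_clip
from PIL import Image

# Assumed architecture and local checkpoint path; adjust to the released files.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="checkpoints/tripletclip_cc3m.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(["a photo of a dog", "a photo of a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and compute image-text similarity probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```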
For the fine-tuned model checkpoints, please refer to the following link:
If you find TripletCLIP useful, please consider citing:
@article{patel2024tripletclip,
author = {Patel, Maitreya and Kusumba, Abhiram and Cheng, Sheng and Kim, Changhoon and Gokhale, Tejas and Baral, Chitta and Yang, Yezhou},
title = {TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives},
journal = {Advances in Neural Information Processing Systems},
year = {2024},
}
We would like to acknowledge the excellent open-source community, including OpenCLIP, Hugging Face, LAION-AI, and OpenAI, for their efforts in making CLIP inference/finetuning and benchmarking easily accessible to all.