README_multimodal.md
(back to README.md and README_2.md for other categories)

Overview


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Multi-Modality

Visual Captioning

  • General:
    • SAT: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015. [Paper]
    • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
    • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
    • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
    • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
    • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
    • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
    • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
    • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
    • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
    • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
    • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
    • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning", CVPR, 2022 (Georgia Tech). [Paper][PyTorch]
    • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
    • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
    • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
    • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
    • GRIT: "GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
    • ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
    • UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
    • TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
    • DML: "Learning Distinct and Representative Modes for Image Captioning", NeurIPS, 2022 (University of Adelaide, Australia). [Paper]
    • P2C: "Paraphrasing Is All You Need for Novel Object Captioning", NeurIPS, 2022 (NTU + CMU). [Paper]
    • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", NeurIPS, 2022 (Microsoft). [Paper]
    • CapDec: "Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP, 2022 (Tel Aviv). [Paper][PyTorch]
    • ?: "Focus! Relevant and Sufficient Context Selection for News Image Captioning", EMNLP Findings, 2022 (UC Davis). [Paper]
    • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
    • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
    • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
    • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
    • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]
    • CLM: "Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment", arXiv, 2022 (CAS). [Paper]
    • PTSN: "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • DDCap: "Exploring Discrete Diffusion Models for Image Captioning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • ARIC: "Aesthetically Relevant Image Captioning", AAAI, 2023 (Shenzhen University). [Paper][Code (in construction)]
    • UAIC: "Uncertainty-Aware Image Captioning", AAAI, 2023 (Meituan). [Paper]
    • LiMBeR: "Linearly Mapping from Image to Text Space", ICLR, 2023 (Brown University). [Paper]
    • DiscriTune: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
    • LIBRA: "Model-Agnostic Gender Debiased Image Captioning", CVPR, 2023 (Osaka University). [Paper]
    • A-CAP: "A-CAP: Anticipation Captioning with Commonsense Knowledge", CVPR, 2023 (The University of Tokyo). [Paper]
    • HAAV: "HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning", CVPR, 2023 (Georgia Tech). [Paper][Website]
    • ?: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
    • PAC-S: "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation", CVPR, 2023 (UniMoRE, Italy). [Paper][PyTorch]
    • SCD-Net: "Semantic-Conditional Diffusion Networks for Image Captioning", CVPR, 2023 (JD). [Paper][PyTorch]
    • ConZIC: "ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing", CVPR, 2023 (Xidian University). [Paper][PyTorch]
    • SmallCap: "SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation", CVPR, 2023 (University of Lisbon, Portugal). [Paper][PyTorch]
    • LSML: "Crossing the Gap: Domain Generalization for Image Captioning", CVPR, 2023 (USTC). [Paper]
    • MuE: "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model", CVPR, 2023 (NC State). [Paper]
    • OxfordTVG-HIC: "OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?", ICCV, 2023 (Oxford). [Paper][Website]
    • ?: "Guiding Image Captioning Models Toward More Specific Captions", ICCV, 2023 (Google). [Paper]
    • ViECap: "Transferable Decoding with Visual Entities for Zero-Shot Image Captioning", ICCV, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
    • PMA-Net: "With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning", ICCV, 2023 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper][Code (in construction)]
    • SCORER: "Self-supervised Cross-view Representation Reconstruction for Change Captioning", ICCV, 2023 (CAS). [Paper][Code (in construction)]
    • PromptCap: "PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3", ICCV, 2023 (UW). [Paper][PyTorch][Website]
    • NoC: "Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning", ICCV, 2023 (Kakao). [Paper][PyTorch]
    • TSG: "Transforming Visual Scene Graphs to Image Captions", ACL, 2023 (Southeast University, China). [Paper][PyTorch]
    • InfoMetIC: "InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • MultiCapCLIP: "MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning", ACL, 2023 (Peking). [Paper][PyTorch (in construction)]
    • Cur-VL: "Learning from Children: Improving Image-Caption Pretraining via Curriculum", ACL Findings, 2023 (Columbia). [Paper][Code (in construction)]
    • ?: "Text-Only Training for Visual Storytelling", ACMMM, 2023 (USTC). [Paper]
    • CgT-GAN: "CgT-GAN: CLIP-guided Text GAN for Image Captioning", ACMMM, 2023 (USTC). [Paper][PyTorch]
    • CLAIR: "CLAIR: Evaluating Image Captions with Large Language Models", EMNLP, 2023 (Berkeley). [Paper][Code][Website]
    • SCP-WGCN: "Improving Image Captioning via Predicting Structured Concepts", EMNLP, 2023 (USTC). [Paper][Code (in construction)]
    • ExploreCfg: "Exploring Diverse In-Context Configurations for Image Captioning", NeurIPS, 2023 (Southeast University, China). [Paper][PyTorch]
    • COLA: "COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?", NeurIPS, 2023 (Boston). [Paper][Website]
    • Re-ViLM: "Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning", arXiv, 2023 (NVIDIA). [Paper]
    • Knight: "From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • VTT: "Visual Transformation Telling", arXiv, 2023 (CAS). [Paper]
    • Caption-Anything: "Caption Anything: Interactive Image Description with Diverse Multimodal Controls", arXiv, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
    • ?: "Data Curation for Image Captioning with Text-to-Image Generative Models", arXiv, 2023 (University of Copenhagen, Denmark). [Paper]
    • TLC: "Simple Token-Level Confidence Improves Caption Correctness", arXiv, 2023 (Meta). [Paper]
    • VIVID: "Album Storytelling with Iterative Story-aware Captioning and Large Language Models", arXiv, 2023 (Peking). [Paper]
    • MCDG: "Text-Only Image Captioning with Multi-Context Data Generation", arXiv, 2023 (USTC). [Paper]
    • FuseCap: "FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions", arXiv, 2023 (Israel Institute of Technology). [Paper]
    • StoryGen: "Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)][Website]
    • ?: "Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion", arXiv, 2023 (University of Milano-Bicocca, Italy). [Paper]
    • SITTA: "SITTA: A Semantic Image-Text Alignment for Image Captioning", arXiv, 2023 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • MMNS: "Multimodal Neurons in Pretrained Text-Only Transformers", arXiv, 2023 (MIT). [Paper]
    • RegionBLIP: "RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • ?: "Visually-Aware Context Modeling for News Image Captioning", arXiv, 2023 (KU Leuven). [Paper]
    • EVCap: "EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension", arXiv, 2023 (The University of Tokyo). [Paper][Website]
    • SCA: "Segment and Caption Anything", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • sDCI: "A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions", arXiv, 2023 (Meta). [Paper][PyTorch]
    • DisCLIP: "DisCLIP: Open-Vocabulary Referring Expression Generation", arXiv, 2024 (Bar-Ilan University, Israel). [Paper]
    • MacCap: "Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training", AAAI, 2024 (ShanghaiTech). [Paper][Code (in construction)]
    • RegionGPT: "RegionGPT: Towards Region Understanding Vision Language Model", CVPR, 2024 (NVIDIA). [Paper][Website]
    • MeaCap: "MeaCap: Memory-Augmented Zero-shot Image Captioning", CVPR, 2024 (Xidian University). [Paper][Code (in construction)]
    • FlexCap: "FlexCap: Generating Rich, Localized, and Flexible Captions in Images", arXiv, 2024 (DeepMind). [Paper][Website]
  • Video:
    • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
    • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
    • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
    • PDVC: "End-to-End Dense Video Captioning with Parallel Decoding", ICCV, 2021 (HKU + Southern University of Science and Technology). [Paper][PyTorch]
    • MV-GPT: "End-to-end Generative Pretraining for Multimodal Video Captioning", CVPR, 2022 (Google). [Paper]
    • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
    • UVC-VI: "Aligning Source Visual and Target Language Domains for Unpaired Video Captioning", TPAMI, 2022 (Peking University). [Paper]
    • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
    • VCRN: "Visual Commonsense-aware Representation Network for Video Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • RSFD: "Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning", arXiv, 2022 (Wuhan University of Technology). [Paper][Code (in construction)]
    • VLTinT: "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning", AAAI, 2023 (University of Arkansas). [Paper]
    • Vid2Seq: "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning", CVPR, 2023 (Google). [Paper][Website]
    • TextKG: "Text with Knowledge Graph Augmented Transformer for Video Captioning", CVPR, 2023 (ByteDance). [Paper]
    • G2L: "G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory", ICCV, 2023 (Peking). [Paper]
    • CoCap: "Accurate and Fast Compressed Video Captioning", ICCV, 2023 (CAS). [Paper][PyTorch]
    • Movie101: "Movie101: A New Movie Understanding Benchmark", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • VidChapters-7M: "VidChapters-7M: Video Chapters at Scale", NeurIPS (Datasets and Benchmarks), 2023 (INRIA). [Paper][PyTorch][Website]
    • ?: "Implicit and Explicit Commonsense for Multi-sentence Video Captioning", arXiv, 2023 (UBC). [Paper]
    • Video-Verbalization: "A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot", arXiv, 2023 (Adobe). [Paper]
    • Dense-VOC: "Dense Video Object Captioning from Disjoint Supervision", arXiv, 2023 (Google). [Paper]
    • ?: "Exploring the Role of Audio in Video Captioning", arXiv, 2023 (ByteDance). [Paper]
    • ZeroTA: "Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment", arXiv, 2023 (KAIST). [Paper]
    • Video-CSR: "Video-CSR: Complex Video Digest Creation for Visual-Language Models", arXiv, 2023 (ByteDance). [Paper]
    • SCG-SP: "Set Prediction Guided by Semantic Concepts for Diverse Video Captioning", AAAI, 2024 (CAS). [Paper]
    • DIBS: "DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement", CVPR, 2024 (Shanghai AI Lab). [Paper]
    • CM2: "Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval", CVPR, 2024 (Kyung Hee University, Korea). [Paper][PyTorch]
    • NAE: "Narrative Action Evaluation with Prompt-Guided Multimodal Interaction", CVPR, 2024 (Tsinghua). [Paper][PyTorch (in construction)]
    • MICap: "MICap: A Unified Model for Identity-aware Movie Descriptions", CVPR, 2024 (IIIT Hyderabad, India). [Paper][Website]
    • EgoExoNCE: "Retrieval-Augmented Egocentric Video Captioning", arXiv, 2024 (Shanghai AI Lab). [Paper]
    • Video-ReCap: "Video ReCap: Recursive Captioning of Hour-Long Videos", arXiv, 2024 (UNC). [Paper][PyTorch][Website]
  • 3D:
    • Vote2Cap-DETR: "End-to-End 3D Dense Captioning with Vote2Cap-DETR", CVPR, 2023 (Fudan). [Paper][PyTorch]
    • Cap3D: "Scalable 3D Captioning with Pretrained Models", NeurIPS, 2023 (UMich). [Paper][Dataset]
    • Vote2Cap-DETR++: "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning", arXiv, 2023 (Fudan). [Paper][PyTorch]
    • DiffuRank: "View Selection for 3D Captioning via Diffusion Ranking", arXiv, 2024 (UMich). [Paper][Dataset]
  • Others:
    • ET-Cap: "Explore and Tell: Embodied Visual Captioning in 3D Environments", ICCV, 2023 (Renmin University of China). [Paper][PyTorch][Website]
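
For readers who want to try one of the open-source captioning models hands-on, the sketch below shows a minimal inference loop. It is an illustrative aside rather than code from any specific paper above; it assumes the HuggingFace transformers, Pillow, and requests packages and the publicly released Salesforce/blip-image-captioning-base checkpoint (BLIP is listed under Multi-Modal Representation Learning below).

# Minimal image-captioning inference sketch (illustrative; not from any paper above).
# Assumes: pip install transformers pillow requests, plus the public
# Salesforce/blip-image-captioning-base checkpoint on the HuggingFace Hub.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # greedy decoding by default
print(processor.decode(out[0], skip_special_tokens=True))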

[Back to Overview]

Visual Question Answering

  • General:
    • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
    • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
    • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
    • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
    • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
    • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
    • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
    • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
    • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
    • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
    • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper] [PyTorch]
    • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
    • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
    • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
    • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
    • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
    • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
    • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
    • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
    • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
    • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
    • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
    • ?: "Training Vision-Language Models with Less Bimodal Supervision", Automated Knowledge Base Construction (AKBC), 2022 (Tel Aviv). [Paper]
    • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", NeurIPS, 2022 (Microsoft). [Paper]
    • ScienceQA: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering", NeurIPS, 2022 (AI2). [Paper][PyTorch][Website]
    • FrozenBiLM: "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models", NeurIPS, 2022 (INRIA). [Paper][PyTorch]
    • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
    • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
    • CRIPP-VQA: "CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering", EMNLP, 2022 (Arizona State University). [Paper][PyTorch][Website]
    • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
    • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
    • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
    • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
    • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
    • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
    • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
    • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
    • ?: "Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering", arXiv, 2022 (CAS). [Paper]
    • VLR: "Visually Grounded VQA by Lattice-based Retrieval", arXiv, 2022 (University of Bremen, Germany). [Paper]
    • CMCL: "Cross-Modal Contrastive Learning for Robust Reasoning in VQA", arXiv, 2022 (University of Sydney). [Paper][PyTorch]
    • CL-CrossVQA: "CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering", arXiv, 2022 (LMU Munich). [Paper]
    • OFA-X: "Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations", arXiv, 2022 (University of Hamburg, Germany). [Paper][Code (in construction)]
    • VLC-BERT: "VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge", WACV, 2023 (UBC, Canada). [Paper][PyTorch]
    • LTG: "Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA", AAAI, 2023 (USTC). [Paper]
    • SelTDA: "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!", CVPR, 2023 (NEC). [Paper][PyTorch]
    • Prophet: "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
    • GenB: "Generative Bias for Robust Visual Question Answering", CVPR, 2023 (KAIST). [Paper]
    • MixPHM: "MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering", CVPR, 2023 (Xi'an Jiaotong University). [Paper]
    • POEM: "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", CVPR, 2023 (University of Minnesota (UMN)). [Paper][PyTorch]
    • LYP: "Improving Selective Visual Question Answering by Learning From Your Peers", CVPR, 2023 (Meta). [Paper]
    • VQACL: "VQACL: A Novel Visual Question Answering Continual Learning Setting", CVPR, 2023 (CAS). [Paper][PyTorch]
    • Img2LLM: "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", CVPR, 2023 (Salesforce). [Paper][PyTorch]
    • Imp-VQA: "Logical Implications for Visual Question Answering Consistency", CVPR, 2023 (University of Bern, Switzerland). [Paper][PyTorch][Website]
    • RMLVQA: "RMLVQA: A Margin Loss Approach For Visual Question Answering with Language Biases", CVPR, 2023 (Indian Institute of Science). [Paper][PyTorch]
    • S3C: "S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
    • ?: "Diversifying Joint Vision-Language Tokenization Learning", CVPRW, 2023 (DeepMind). [Paper]
    • VQAAnswerTherapy: "VQA Therapy: Exploring Answer Differences by Visually Grounding Answers", ICCV, 2023 (UT Austin). [Paper][Website]
    • WHOOPS: "Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images", ICCV, 2023 (Ben Gurion University of the Negev, Israel). [Paper][Website]
    • Encyclopedic-VQA: "Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories", ICCV, 2023 (Google). [Paper][Tensorflow]
    • RVQA: "Toward Unsupervised Realistic Visual Question Answering", ICCV, 2023 (UCSD). [Paper]
    • VQA-GNN: "VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering", ICCV, 2023 (Stanford). [Paper]
    • ViTiS: "Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts", ICCVW, 2023 (INRIA). [Paper][Website]
    • TwO: "Combo of Thinking and Observing for Outside-Knowledge VQA", ACL, 2023 (ByteDance). [Paper][Code (in construction)]
    • Mod-Zero-VQA: "Modularized Zero-shot VQA with Pre-trained Models", ACL Findings, 2023 (Singapore Management University). [Paper]
    • SaL: "Separate and Locate: Rethink the Text in Text-based Visual Question Answering", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
    • ReVisE: "From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation", EMNLP, 2023 (Berkeley). [Paper][Code (in construction)]
    • Cola: "Large Language Models are Visual Reasoning Coordinators", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
    • AVIS: "AVIS: Autonomous Visual Information Seeking with Large Language Model Agent", NeurIPS, 2023 (Google). [Paper]
    • ?: "Exploring Question Decomposition for Zero-Shot VQA", NeurIPS, 2023 (Northeastern). [Paper][Code (in construction)][Website]
    • SeeTRUE: "What You See is What You Read? Improving Text-Image Alignment Evaluation", NeurIPS, 2023 (Google). [Paper][PyTorch][Website]
    • InfoSeek: "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?", arXiv, 2023 (Google). [Paper][Website]
    • CoVGT: "Contrastive Video Question Answering via Video Graph Transformer", arXiv, 2023 (NUS). [Paper]
    • IVLT: "Causality-aware Visual Scene Discovery for Cross-Modal Question Reasoning", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • MGT: "Multimodal Graph Transformer for Multimodal Question Answering", arXiv, 2023 (UC Santa Cruz). [Paper]
    • VCSR: "Visual Causal Scene Refinement for Video Question Answering", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • JADE: "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner", arXiv, 2023 (CAS). [Paper]
    • NuScenes-QA: "NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario", arXiv, 2023 (Fudan). [Paper][Code (in construction)]
    • LAMOC: "Zero-shot Visual Question Answering with Language Model Feedback", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
    • PW-VQA: "Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA", arXiv, 2023 (University of Rochester). [Paper]
    • ?: "Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering", arXiv, 2023 (Mila). [Paper]
    • R2A: "Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models", arXiv, 2023 (CUHK). [Paper]
    • WikiTiLo: "Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning", arXiv, 2023 (LMU Munich). [Paper]
    • GenVQA: "Generative Visual Question Answering", arXiv, 2023 (UW). [Paper]
    • Context-VQA: "Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering", arXiv, 2023 (Stanford). [Paper]
    • BLIVA: "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions", arXiv, 2023 (UCSD). [Paper]
    • NExT-GQA: "Can I Trust Your Answer? Visually Grounded Video Question Answering", arXiv, 2023 (NUS). [Paper]
    • CURE: "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", arXiv, 2023 (SRI). [Paper][Code (in construction)]
    • RepARe: "Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models", arXiv, 2023 (UNC). [Paper][PyTorch]
    • RVP: "Recursive Visual Programming", arXiv, 2023 (Berkeley). [Paper]
    • SAB: "Sentence Attention Blocks for Answer Grounding", arXiv, 2023 (University of Delaware, Delaware). [Paper]
    • DIS: "Detection-based Intermediate Supervision for Visual Question Answering", AAAI, 2024 (Huazhong University of Science and Technology (HUST)). [Paper]
    • OAM-VQA: "Object Attribute Matters in Visual Question Answering", AAAI, 2024 (Jilin University). [Paper]
    • oVQA: "Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy", ICLR, 2024 (University of Freiburg, Germany). [Paper][PyTorch]
    • ?: "Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering", CVPR, 2024 (Northeastern University). [Paper]
    • ?: "Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation", arXiv, 2024 (University of Tokyo). [Paper]
    • MultipanelVQA: "Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA", arXiv, 2024 (eBay). [Paper][Website]
    • Proximity-QA: "Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis", arXiv, 2024 (Peking). [Paper][Code (in construction)]
    • SnapNTell: "SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM", arXiv, 2024 (Meta). [Paper]
  • Video:
    • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
    • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
    • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (UMich). [Paper][Website]
    • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
    • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
    • ViteVQA: "Towards Video Text Visual Question Answering: Benchmark and Baseline", NeurIPS, 2022 (ByteDance). [Paper][GitHub]
    • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
    • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]
    • NewsVideoQA: "Watching the News: Towards VideoQA Models that can Read", arXiv, 2022 (IIIT Hyderabad, India). [Paper]
    • SHG-VQA: "Learning Situation Hyper-Graphs for Video Question Answering", CVPR, 2023 (UCF). [Paper][PyTorch]
    • ANetQA: "ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos", CVPR, 2023 (Hangzhou Dianzi University). [Paper][Website]
    • MCR: "Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering", CVPR, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
    • MIST: "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", CVPR, 2023 (NUS). [Paper][PyTorch]
    • CaKE-LM: "Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering", CVPRW, 2023 (NTU + Columbia). [Paper]
    • TransSTR: "Discovering Spatio-Temporal Rationales for Video Question Answering", ICCV, 2023 (NUS). [Paper]
    • Tem-adapter: "Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer", ICCV, 2023 (CMU). [Paper][PyTorch]
    • OVQA: "Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models", ICCV, 2023 (Korea University). [Paper][PyTorch]
    • RaFormer: "Redundancy-aware Transformer for Video Question Answering", ACMMM, 2023 (NUS). [Paper]
    • LSS: "Long Story Short: a Summarize-then-Search Method for Long Video Question Answering", BMVC, 2023 (Yonsei University). [Paper]
    • Flipped-VQA: "Large Language Models are Temporal and Causal Reasoners for Video Question Answering", EMNLP, 2023 (Korea University). [Paper][Code (in construction)]
    • SeViLA: "Self-Chained Image-Language Model for Video Localization and Question Answering", NeurIPS, 2023 (UNC). [Paper][PyTorch]
    • Glance-Focus: "Glance and Focus: Memory Prompting for Multi-Event Video Question Answering", NeurIPS, 2023 (CAS). [Paper][PyTorch]
    • FunQA: "FunQA: Towards Surprising Video Comprehension", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)][Website]
    • ProViQ: "Zero-Shot Video Question Answering with Procedural Programs", arXiv, 2023 (CMU). [Paper][Website]
    • R-VLM: "Retrieval-based Video Language Model for Efficient Long Video Question Answering", arXiv, 2023 (Microsoft). [Paper]
    • MoVQA: "MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding", arXiv, 2023 (Shanghai AI Lab). [Paper][Website]
    • GroundVQA: "Grounded Question-Answering in Long Egocentric Videos", arXiv, 2023 (SJTU). [Paper]
    • VLAP: "VLAP: Efficient Video-Language Alignment via Frame Prompting and Distilling for Video Question Answering", arXiv, 2023 (Amazon). [Paper]
    • Vista-LLaMA: "Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens", arXiv, 2023 (Zhejiang). [Paper][Website]
    • LLoVi: "A Simple LLM Framework for Long-Range Video Question-Answering", arXiv, 2023 (UNC). [Paper][Code (in construction)]
    • STAIR: "STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering", AAAI, 2024 (Peking). [Paper]
    • YTCommentQA: "YTCommentQA: Video Question Answerability in Instructional Videos", AAAI, 2024 (LG). [Paper][Code (in construction)]
    • RADI: "Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels", CVPR, 2024 (Sun Yat-sen University). [Paper]
    • MoReVQA: "MoReVQA: Exploring Modular Reasoning Models for Video Question Answering", CVPR, 2024 (Google). [Paper][Website]
    • Sports-QA: "Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports", arXiv, 2024 (University of Melbourne). [Paper]
    • DoraemonGPT: "DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models", arXiv, 2024 (Zhejiang). [Paper][Code (in construction)]
    • Q-ViD: "Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering", arXiv, 2024 (MBZUAI). [Paper]
    • LSTP: "LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding", arXiv, 2024 (BIGAI). [Paper][PyTorch]
    • DAM: "DAM: Dynamic Adapter Merging for Continual Video QA Learning", arXiv, 2024 (UNC). [Paper][Code (in construction)]
    • LangRepo: "Language Repository for Long Video Understanding", arXiv, 2024 (Stony Brook, NY). [Paper][PyTorch]
    • LongVLM: "LongVLM: Efficient Long Video Understanding via Large Language Models", arXiv, 2024 (Monash). [Paper][Code (in construction)]
    • TraveLER: "TraveLER: A Multi-LMM Agent Framework for Video Question-Answering", arXiv, 2024 (Berkeley). [Paper]
    • CinePile: "CinePile: A Long Video Question Answering Dataset and Benchmark", arXiv, 2024 (Maryland). [Paper][Website]
  • 3D:
    • 3D-VQA: "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes", CVPRW, 2023 (ETHZ). [Paper][Code (in construction)]
    • PO3D-VQA: "3D-Aware Visual Question Answering about Parts, Poses and Occlusions", NeurIPS, 2023 (JHU). [Paper][Code (in construction)]
    • Multi-CLIP: "Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes", arXiv, 2023 (ETHZ). [Paper]
    • Gen3DQA: "Generating Context-Aware Natural Answers for Questions in 3D Scenes", arXiv, 2023 (TUM). [Paper][Code (in construction)]
    • BridgeQA: "Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA", AAAI, 2024 (Peking). [Paper][PyTorch]
  • Audio-Visual:
    • PSTP-Net: "Progressive Spatio-temporal Perception for Audio-Visual Question Answering", ACMMM, 2023 (Renmin University of China). [Paper][PyTorch]
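
As a companion to the list above, here is a minimal VQA inference sketch. It is an illustrative aside rather than the method of any particular paper; it assumes the HuggingFace transformers, Pillow, and requests packages and the public dandelin/vilt-b32-finetuned-vqa checkpoint (ViLT, listed under Multi-Modal Representation Learning below, fine-tuned on VQAv2).

# Minimal VQA inference sketch (illustrative; not tied to any specific paper above).
# Assumes: pip install transformers pillow requests, plus the public
# dandelin/vilt-b32-finetuned-vqa checkpoint (ViLT fine-tuned on VQAv2).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "How many cats are in the picture?"

encoding = processor(image, question, return_tensors="pt")
logits = model(**encoding).logits  # classification over the VQAv2 answer vocabulary
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)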

[Back to Overview]

Visual Grounding

  • General:
    • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
    • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
    • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
    • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
    • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
    • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
    • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
    • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
    • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
    • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
    • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
    • UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
    • TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
    • YORO: "YORO - Lightweight End to End Visual Grounding", ECCVW, 2022 (Amazon). [Paper]
    • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
    • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
    • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
    • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
    • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
    • ClipCrop: "ClipCrop: Conditioned Cropping Driven by Vision-Language Model", arXiv, 2022 (The University of Tokyo). [Paper]
    • VL-MPAG-Net: "Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing", WACV, 2023 (Indian Institute of Science). [Paper][PyTorch][Website]
    • CLEVER: "Visually Grounded Commonsense Knowledge Acquisition", AAAI, 2023 (Tsinghua University). [Paper][PyTorch]
    • LADS: "Referring Expression Comprehension Using Language Adaptive Inference", AAAI, 2023 (Zhejiang University). [Paper]
    • ?: "Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models", ICLR, 2023 (Samsung). [Paper]
    • AMC: "Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CounTEX: "Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space", CVPR, 2023 (Amazon). [Paper]
    • SK-VG: "Advancing Visual Grounding with Scene Knowledge: Benchmark and Method", CVPR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • D-ViTMDETR: "Dynamic Inference with Grounding Based Vision and Language Models", CVPR, 2023 (Amazon). [Paper]
    • ?: "Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding", CVPR, 2023 (Tel Aviv). [Paper][Code (in construction)]
    • RefCLIP: "RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension", CVPR, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • FROMAGe: "Grounding Language Models to Images for Multimodal Inputs and Outputs", ICML, 2023 (CMU). [Paper][PyTorch][Website]
    • IR-VG: "Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision", ICCV, 2023 (Beihang). [Paper][Code (in construction)]
    • RefEgo: "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D", ICCV, 2023 (RIKEN). [Paper]
    • SLAN: "SLAN: Self-Locator Aided Network for Vision-Language Understanding", ICCV, 2023 (Tencent). [Paper][Code (in construction)]
    • GITM-MR: "Grounded Image Text Matching with Mismatched Relation Reasoning", ICCV, 2023 (ShanghaiTech). [Paper]
    • DOD: "Described Object Detection: Liberating Object Detection with Flexible Expressions", NeurIPS, 2023 (Tongji University). [Paper][PyTorch]
    • CLIP-VG: "CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • TreePrompt: "TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding", arXiv, 2023 (HKUST). [Paper]
    • OctoBERT: "World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models", arXiv, 2023 (UMich). [Paper]
    • BuboGPT: "BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
    • LG-DVG: "Language-Guided Diffusion Model for Visual Grounding", arXiv, 2023 (University of Toronto). [Paper]
    • VGDiffZero: "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders", arXiv, 2023 (Westlake University, China). [Paper]
    • GREC: "GREC: Generalized Referring Expression Comprehension", arXiv, 2023 (NTU, Singapore). [Paper][Website]
    • SoM: "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • GLaMM: "GLaMM: Pixel Grounding Large Multimodal Model", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)][Website]
    • Griffon: "Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models", arXiv, 2023 (CAS). [Paper][PyTorch]
    • RelVLA: "Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions", arXiv, 2023 (Northeastern). [Paper]
    • Lenna: "Lenna: Language Enhanced Reasoning Detection Assistant", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
    • LLaVA-Grounding: "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • SelfEQ: "Improved Visual Grounding through Self-Consistent Explanations", arXiv, 2023 (Rice). [Paper][Website]
    • OV-VG: "OV-VG: A Benchmark for Open-Vocabulary Visual Grounding", arXiv, 2023 (Beihang). [Paper][Code (in construction)]
    • ?: "Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models", WACV, 2024 (Amazon). [Paper][PyTorch]
    • CyCo: "Cycle-Consistency Learning for Captioning and Grounding", AAAI, 2024 (Huawei). [Paper]
    • ChatterBox: "ChatterBox: Multi-round Multimodal Referring and Grounding", arXiv, 2024 (CAS). [Paper][PyTorch]
    • PIN: "PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs", arXiv, 2024 (UvA). [Paper][Code (in construction)][Website]
    • ViGoR: "ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling", arXiv, 2024 (Amazon). [Paper]
    • CRG: "Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training", arXiv, 2024 (UNC). [Paper][PyTorch][Website]
    • Griffon-v2: "Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring", arXiv, 2024 (CAS). [Paper][PyTorch]
    • RelationVLM: "RelationVLM: Making Large Vision-Language Models Understand Visual Relations", arXiv, 2024 (Microsoft). [Paper]
    • SynGround: "Learning from Models and Data for Visual Grounding", arXiv, 2024 (Rice). [Paper][Website]
    • Groma: "Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models", arXiv, 2024 (ByteDance). [Paper][PyTorch][Website]
  • Video:
    • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
    • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
    • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
    • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
    • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
    • STCAT: "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • VideoWhisperer: "Grounded Video Situation Recognition", NeurIPS, 2022 (IIIT Hyderabad, India). [Paper][Website]
    • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
    • ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]
    • VG-LAW: "Language Adaptive Weight Generation for Multi-task Visual Grounding", CVPR, 2023 (Zhejiang University). [Paper]
    • TCSF: "You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
    • ?: "Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training", CVPR, 2023 (The University of Tokyo). [Paper]
    • DeCo: "DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking", CVPR, 2023 (Toyota). [Paper]
    • HSCNet: "Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • WINNER: "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding", CVPR, 2023 (Zhejiang University). [Paper]
    • IRON: "Iterative Proposal Refinement for Weakly-Supervised Video Grounding", CVPR, 2023 (Microsoft). [Paper]
    • ?: "Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • ProTeGe: "ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding", CVPR, 2023 (Microsoft). [Paper]
    • VidLN: "Connecting Vision and Language with Video Localized Narratives", CVPR, 2023 (Google). [Paper][Website]
    • VDI: "Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training", CVPR, 2023 (Queen Mary University of London). [Paper]
    • UniVTG: "UniVTG: Towards Unified Video-Language Temporal Grounding", ICCV, 2023 (NUS). [Paper][PyTorch]
    • EaTR: "Knowing Where to Focus: Event-aware Transformer for Video Grounding", ICCV, 2023 (Yonsei). [Paper][PyTorch]
    • SOONet: "Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • TSGSV: "Temporal Sentence Grounding in Streaming Videos", ACMMM, 2023 (Shandong University). [Paper]
    • ConFormer: "Video Referring Expression Comprehension via Transformer with Content-conditioned Query", ACMMM, 2023 (Peking). [Paper]
    • CliMer: "Learning Temporal Sentence Grounding From Narrated EgoVideos", BMVC, 2023 (University of Bristol, UK). [Paper][PyTorch]
    • MomentDiff: "MomentDiff: Generative Video Moment Retrieval from Random to Real", NeurIPS, 2023 (Alibaba). [Paper][PyTorch]
    • ?: "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos", arXiv, 2023 (Southern University of Science and Technology, China). [Paper]
    • BM-DETR: "Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval", arXiv, 2023 (Seoul National University (SNU)). [Paper][PyTorch (in construction)]
    • DiffusionVG: "Exploring Iterative Refinement with Diffusion Models for Video Grounding", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
    • CG-DETR: "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding", arXiv, 2023 (Sungkyunkwan University, Korea). [Paper][Code (in construction)]
    • LLM4VG: "LLM4VG: Large Language Models Evaluation for Video Grounding", arXiv, 2023 (Tsinghua). [Paper]
    • Grounding-Prompter: "Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos", arXiv, 2023 (Tsinghua). [Paper]
    • ?: "Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models", WACV, 2024 (Queen Mary University of London). [Paper]
    • SiamGTR: "Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding", CVPR, 2024 (Sun Yat-sen University). [Paper]
    • SnAG: "SnAG: Scalable and Accurate Video Grounding", CVPR, 2024 (UW-Madison). [Paper]
    • Video-GroundingDINO: "Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
    • CG-STVG: "Context-Guided Spatio-Temporal Video Grounding", arXiv, 2024 (CAS). [Paper][Code (in construction)]
    • LITA: "LITA: Language Instructed Temporal-Localization Assistant", arXiv, 2024 (NVIDIA). [Paper][PyTorch]
  • 3D:
    • ViL3DRel: "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding", NeurIPS, 2022 (INRIA). [Paper][Website]
    • LAR: "Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding", NeurIPS, 2022 (KAUST). [Paper][Website]
    • 3D-CG: "3D Concept Grounding on Neural Fields", NeurIPS, 2022 (MIT). [Paper][Website]
    • NS3D: "NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations", CVPR, 2023 (Stanford). [Paper]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding", CVPR, 2023 (Peking University). [Paper]
    • ?: "Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding", ICCV, 2023 (Zhejiang University). [Paper]
    • Multi3DRefer: "Multi3DRefer: Grounding Text Description to Multiple 3D Objects", ICCV, 2023 (Simon Fraser). [Paper][PyTorch][Website]
    • UniT3D: "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding", ICCV, 2023 (TUM). [Paper]
    • ViewRefer: "ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • 3DOGSFormer: "Dense Object Grounding in 3D Scenes", ACMMM, 2023 (Peking). [Paper]
    • CityRefer: "CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data", NeurIPS (Datasets and Benchmarks), 2023 (Advanced Telecommunications Research (ATR), Japan). [Paper][PyTorch]
    • ?: "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions", arXiv, 2023 (Columbia). [Paper]
    • 3DRP-Net: "3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding", arXiv, 2023 (Zhejiang University). [Paper]
    • 3DRefTR: "A Unified Framework for 3D Point Cloud Visual Grounding", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
    • CoT3DRef: "CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding", arXiv, 2023 (KAUST). [Paper]
    • ZSVG3D: "Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • LARC: "Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners", CVPR, 2024 (Stanford). [Paper]
    • SceneVerse: "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding", arXiv, 2024 (BIGAI). [Paper][Code (in construction)][Website]
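
Several of the grounding methods above (e.g., RefCLIP, CLIP-VG, VGDiffZero, DisCLIP) build on CLIP's image-text similarity score, typically by ranking region crops or proposals against a referring expression. The sketch below shows only that underlying scoring primitive; it is an illustrative aside, assuming the HuggingFace transformers, Pillow, and requests packages and the public openai/clip-vit-base-patch32 checkpoint.

# CLIP image-text similarity scoring (illustrative; the primitive behind several
# CLIP-based grounders above, not any paper's full pipeline).
# Assumes: pip install transformers pillow requests, plus the public
# openai/clip-vit-base-patch32 checkpoint on the HuggingFace Hub.
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["the cat on the left", "the cat on the right", "a remote control"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-to-text match probabilities
print(dict(zip(texts, probs[0].tolist())))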

[Back to Overview]

Multi-Modal Representation Learning

  • General:
    • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
    • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
    • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
    • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
    • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
    • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
    • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
    • MERLOT: "MERLOT: Multimodal Neural Script Knowledge Models", NeurIPS, 2021 (UW + AI2). [Paper][Tensorflow][Website]
    • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
    • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
    • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
    • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
    • SimVLM: "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision", ICLR, 2022 (Google). [Paper]
    • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
    • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
    • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • Uni-Perceiver: "Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks", CVPR, 2022 (SenseTime). [Paper][PyTorch]
    • MERLOT-Reserve: "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", CVPR, 2022 (UW + AI2). [Paper][JAX][Website]
    • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
    • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
    • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
    • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
    • OFA: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML, 2022 (Alibaba). [Paper][PyTorch]
    • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
    • Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
    • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
    • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
    • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", NeurIPS, 2022 (SenseTime). [Paper][PyTorch]
    • CLOOB: "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP", NeurIPS, 2022 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", NeurIPS, 2022 (UCLA). [Paper]
    • ?: "Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP", NeurIPS, 2022 (UW). [Paper][Pytorch]
    • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", NeurIPS, 2022 (Tencent). [Paper]
    • ?: "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning", NeurIPS, 2022 (Stanford). [Paper][Website]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", NeurIPS, 2022 (Google). [Paper]
    • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS, 2022 (Microsoft). [Paper][PyTorch (in construction)]
    • Knowledge-CLIP: "Contrastive Language-Image Pre-Training with Knowledge Graphs", NeurIPS, 2022 (Tsinghua). [Paper]
    • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", NeurIPS, 2022 (DeepMind). [Paper]
    • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", NeurIPS, 2022 (Huawei). [Paper][Code (in construction)]
    • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", NeurIPS, 2022 (Google). [Paper]
    • LAION-5B: "LAION-5B: An open large-scale dataset for training next generation image-text models", NeurIPS (Datasets and Benchmarks), 2022 (LAION). [Paper][Website]
    • Wukong: "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark", NeurIPS (Datasets and Benchmarks), 2022 (Huawei). [Paper][Website]
    • TaiSu: "TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training", NeurIPS (Datasets and Benchmarks), 2022 (CAS). [Paper][PyTorch]
    • WinoGAViL: "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models", NeurIPS (Datasets and Benchmarks), 2022 (The Hebrew University of Jerusalem, Israel). [Paper][Website]
    • ELEVATER: "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models", NeurIPS (Datasets and Benchmarks), 2022 (Microsoft). [Paper][Website]
    • ?: "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations", NeurIPS (Datasets and Benchmarks), 2022 (UCF). [Paper][Website]
    • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", TMLR, 2022 (Microsoft). [Paper]
    • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", TMLR, 2022 (Google). [Paper][PyTorch (lucidrains)]
    • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
    • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
    • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
    • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
    • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
    • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
    • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
    • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
    • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
    • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
    • ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
    • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
    • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
    • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (Bytedance). [Paper][PyTorch (in construction)]
    • CN-CLIP: "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese", arXiv, 2022 (Alibaba). [Paper]
    • X2-VLM: "X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks", arXiv, 2022 (ByteDance). [Paper][Code (in construction)]
    • SkillNet: "One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code", arXiv, 2022 (Tencent). [Paper]
    • Compound-Tokens: "Compound Tokens: Channel Fusion for Vision-Language Representation Learning", arXiv, 2022 (Google). [Paper]
    • WFH: "Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision", WACV, 2023 (Aalto University, Finland). [Paper]
    • Perceiver-VL: "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention", WACV, 2023 (UNC). [Paper][PyTorch]
    • MixGen: "MixGen: A New Multi-Modal Data Augmentation", WACVW, 2023 (Amazon). [Paper]
    • ?: "Unifying Vision-Language Representation Space with Single-tower Transformer", AAAI, 2023 (NAVER). [Paper]
    • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ICLR, 2023 (Google). [Paper]
    • LilT: "Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning", ICLR, 2023 (Northeastern University). [Paper][PyTorch]
    • CLIPs: "Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", ICLR, 2023 (Stanford). [Paper]
    • HiCLIP: "HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention", ICLR, 2023 (Rutgers University). [Paper]
    • DeCap: "DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training", ICLR, 2023 (Zhejiang University). [Paper][PyTorch]
    • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", ICLR, 2023 (Amazon). [Paper]
    • DaVinci: "Write and Paint: Generative Vision-Language Models are Unified Modal Learners", ICLR, 2023 (ByteDance). [Paper][Code (in construction)]
    • EVA: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", CVPR, 2023 (Beijing Academy of Artificial Intelligence (BAAI)). [Paper][PyTorch]
    • FLM: "Accelerating Vision-Language Pretraining with Free Language Modeling", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • VILA: "VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining", CVPR, 2023 (Google). [Paper][JAX]
    • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • ReVeaL: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", CVPR, 2023 (Google). [Paper][Website]
    • SCL: "Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning", CVPR, 2023 (Tencent). [Paper]
    • EPIC: "Leveraging per Image-Token Consistency for Vision-Language Pre-training", CVPR, 2023 (ByteDance). [Paper]
    • PTP: "Position-guided Text Prompt for Vision-Language Pre-training", CVPR, 2023 (Sea AI Lab). [Paper][PyTorch]
    • PHASE: "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias", CVPR, 2023 (Osaka University). [Paper][GitHub]
    • Uni-Perceiver-v2: "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ?: "Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language", CVPR, 2023 (Beijing Institute of Technology). [Paper]
    • GIVL: "GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods", CVPR, 2023 (Amazon). [Paper]
    • FLIP: "Scaling Language-Image Pre-training via Masking", CVPR, 2023 (Meta). [Paper][PyTorch]
    • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", CVPR, 2023 (Tsinghua + Waseda). [Paper][PyTorch
    • DANCE: "Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles", CVPR, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
    • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", CVPR, 2023 (Microsoft). [Paper]
    • SVLC: "Teaching Structured Vision & Language Concepts to Vision&Language Models", CVPR, 2023 (IBM). [Paper]
    • DeAR: "DeAR: Debiasing Vision-Language Models with Additive Residuals", CVPR, 2023 (Adobe). [Paper][GitHub]
    • ?: "Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning", CVPR, 2023 (Amazon). [Paper]
    • UniHCP: "UniHCP: A Unified Model for Human-Centric Perceptions", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • HumanBench: "HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining", CVPR, 2023 (SenseTime). [Paper][PyTorch]
    • ?: "Joint Adaptive Representations for Image-Language Learning", CVPRW, 2023 (DeepMind). [Paper]
    • BLIP-2: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", ICML, 2023 (Salesforce). [Paper][PyTorch]
    • RLEG: "RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation", ICML, 2023 (Alibaba). [Paper]
    • Mod-X: "Continual Vision-Language Representation Learning with Off-Diagonal Information", ICML, 2023 (Huawei). [Paper]
    • ILLUME: "ILLUME: Rationalizing Vision-Language Models through Human Interactions", ICML, 2023 (German Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
    • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", ICML, 2023 (Google). [Paper]
    • MERU: "Hyperbolic Image-Text Representations", ICML, 2023 (Meta). [Paper]
    • ?: "Measuring Progress in Fine-grained Vision-and-Language Understanding", ACL, 2023 (DeepMind). [Paper]
    • RELIT: "Weakly Supervised Vision-and-Language Pre-training with Relative Representations", ACL, 2023 (Tsinghua). [Paper]
    • PuMer: "PuMer: Pruning and Merging Tokens for Efficient Vision Language Models", ACL, 2023 (UW). [Paper]
    • SINC: "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks", ICCV, 2023 (Microsoft). [Paper]
    • ALIP: "ALIP: Adaptive Language-Image Pre-training with Synthetic Caption", ICCV, 2023 (DeepGlint, China). [Paper][PyTorch]
    • SigLiP: "Sigmoid Loss for Language Image Pre-Training", ICCV, 2023 (Google). [Paper][JAX]
    • VL-PET: "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control", ICCV, 2023 (CUHK). [Paper][PyTorch][Website]
    • GrowCLIP: "GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • ViLLA: "ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data", ICCV, 2023 (Stanford). [Paper][PyTorch]
    • CFM-ViT: "Contrastive Feature Masking Open-Vocabulary Vision Transformer", ICCV, 2023 (DeepMind). [Paper]
    • EqSim: "Equivariant Similarity for Vision-Language Foundation Models", ICCV, 2023 (Microsoft). [Paper][PyTorch]
    • A-CLIP: "Attentive Mask CLIP", ICCV, 2023 (Microsoft). [Paper]
    • CLOSE: "I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data", ICCV, 2023 (AI2). [Paper][PyTorch][Website]
    • SyViC: "Going Beyond Nouns With Vision & Language Models Using Synthetic Data", ICCV, 2023 (IBM). [Paper][PyTorch][Website]
    • ViLTA: "ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation", ICCV, 2023 (Tsinghua). [Paper]
    • MCD: "Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining", ICCV, 2023 (LG). [Paper]
    • TL;DR: "Too Large; Data Reduction for Vision-Language Pre-Training", ICCV, 2023 (NUS). [Paper][PyTorch]
    • DiffusionITM: "Are Diffusion Models Vision-And-Language Reasoners?", NeurIPS, 2023 (Mila). [Paper][PyTorch]
    • OPTIMA: "Module-wise Adaptive Distillation for Multimodality Foundation Models", NeurIPS, 2023 (Google). [Paper]
    • 4M: "4M: Massively Multimodal Masked Modeling", NeurIPS, 2023 (EPFL). [Paper][Website]
    • P-Former: "Bootstrapping Vision-Language Learning with Decoupled Language Pre-training", NeurIPS, 2023 (Dartmouth College). [Paper][PyTorch]
    • LQAE: "Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment", NeurIPS, 2023 (Berkeley). [Paper]
    • OBELISC: "OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents", NeurIPS, 2023 (Hugging Face). [Paper][GitHub]
    • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", TMLR, 2023 (Meta). [Paper][PyTorch][Website]
    • KOSMOS-1: "Language Is Not All You Need: Aligning Perception with Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
    • Prismer: "Prismer: A Vision-Language Model with An Ensemble of Experts", arXiv, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • RVLM: "Replacement as a Self-supervision for Fine-grained Vision-language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • MuLTI: "MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling", arXiv, 2023 (Alibaba). [Paper]
    • VL-MoE: "Scaling Vision-Language Models with Sparse Mixture of Experts", arXiv, 2023 (Berkeley + Microsoft). [Paper]
    • EVA-02: "EVA-02: A Visual Representation for Neon Genesis", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • CoBIT: "CoBIT: A Contrastive Bi-directional Image-Text Generation Model", arXiv, 2023 (Google). [Paper]
    • EVA-CLIP: "EVA-CLIP: Improved Training Techniques for CLIP at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • Sig: "Sigmoid Loss for Language Image Pre-Training", arXiv, 2023 (Google). [Paper]
    • MaMMUT: "MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks", arXiv, 2023 (Google). [Paper]
    • CAVL: "CAVL: Learning Contrastive and Adaptive Representations of Vision and Language", arXiv, 2023 (CMU). [Paper]
    • MoMo: "MoMo: A shared encoder Model for text, image and multi-Modal representations", arXiv, 2023 (Amazon). [Paper]
    • REAVL: "Retrieval-based Knowledge Augmented Vision Language Pre-training", arXiv, 2023 (Tencent). [Paper]
    • ALBEF-MI: "Vision Lanauge Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation", arXiv, 2023 (Alibaba). [Paper]
    • Helip: "Boosting Visual-Language Models by Exploiting Hard Samples", arXiv, 2023 (Huawei). [Paper]
    • IMP: "Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception", arXiv, 2023 (Google). [Paper]
    • Musketeer: "Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts", arXiv, 2023 (Amazon). [Paper]
    • GVT: "What Makes for Good Visual Tokenizers for Large Language Models?", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • S-CLIP: "S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions", NeurIPS, 2023 (KAIST). [Paper]
    • VisorGPT: "VisorGPT: Learning Visual Prior via Generative Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • IdealGPT: "IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models", arXiv, 2023 (Columbia University). [Paper][PyTorch]
    • PaLI-X: "PaLI-X: On Scaling up a Multilingual Vision and Language Model", arXiv, 2023 (Google). [Paper]
    • CrossGET: "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
    • COSA: "COSA: Concatenated Sample Pretrained Vision-Language Foundation Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
    • Babel-ImageNet: "Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
    • Kosmos-2: "Kosmos-2: Grounding Multimodal Large Language Models to the World", arXiv, 2023 (Microsoft). [Paper][PyTorch][Demo]
    • LENS: "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language", arXiv, 2023 (Contextual AI + Stanford). [Paper][PyTorch][Demo]
    • Emu: "Generative Pretraining in Multimodality", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • mBLIP: "mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs", arXiv, 2023 (University of Wurzburg, Germany). [Paper][PyTorch]
    • SEED-OPT: "Planting a SEED of Vision in Large Language Model", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • OpenFlamingo: "OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models", arXiv, 2023 (UW). [Paper][PyTorch]
    • Free-ATM: "Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks", arXiv, 2023 (ByteDance). [Paper]
    • LCL: "Link-Context Learning for Multimodal LLMs", arXiv, 2023 (SenseTime). [Paper]
    • DLIP: "DLIP: Distilling Language-Image Pre-training", arXiv, 2023 (ByteDance). [Paper]
    • LaVIT: "Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization", arXiv, 2023 (Kuaishou). [Paper][Code (in construction)]
    • MMICL: "MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning", arXiv, 2023 (Peking). [Paper][PyTorch]
    • ELIP: "ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens", arXiv, 2023 (NUS). [Paper]
    • SEED-LLaMA: "Making LLaMA SEE and Draw with SEED Tokenizer", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • ITIT: "Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency", arXiv, 2023 (Google). [Paper]
    • SimVLG: "SimVLG: Simple and Efficient Pretraining of Visual Language Generative Models", arXiv, 2023 (ByteDance). [Paper]
    • VeCLIP: "From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions", arXiv, 2023 (Apple). [Paper]
    • PaLI-3: "PaLI-3 Vision Language Models: Smaller, Faster, Stronger", arXiv, 2023 (Google). [Paper]
    • COMM: "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models", arXiv, 2023 (Huawei). [Paper][PyTorch (in construction)]
    • CogVLM: "CogVLM: Visual Expert for Pretrained Language Models", arXiv, 2023 (Zhipu AI, China). [Paper][PyTorch]
    • OtterHD: "OtterHD: A High-Resolution Multi-modality Model", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • Florence-2: "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks", arXiv, 2023 (Microsoft). [Paper]
    • MLA: "Multimodal Representation Learning by Alternating Unimodal Adaptation", arXiv, 2023 (UNC). [Paper]
    • MobileCLIP: "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training", arXiv, 2023 (Apple). [Paper]
    • LLaMA-VID: "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • ?: "MLLMs-Augmented Visual-Language Representation Learning", arXiv, 2023 (NUS). [Paper][Code (in construction)]
    • Hulk: "Hulk: A Universal Knowledge Translator for Human-Centric Tasks", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • D-iGPT: "Rejuvenating image-GPT as Strong Visual Representation Learners", arXiv, 2023 (JHU + UC Santa Cruz). [Paper][PyTorch]
    • Vary: "Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models", arXiv, 2023 (Megvii). [Paper][PyTorch][Website]
    • Emu2: "Generative Multimodal Models are In-Context Learners", arXiv, 2023 (BAAI). [Paper][PyTorch][Website]
    • InternVL: "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • TiMix: "TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training", AAAI, 2024 (Peking). [Paper]
    • ECLIPSE: "Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders", AAAI, 2024 (LG). [Paper]
    • ASM: "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World", ICLR, 2024 (Shanghai AI Lab). [Paper][PyTorch][Demo]
    • VLAP: "Bridging Vision and Language Spaces with Assignment Prediction", ICLR, 2024 (Yonsei). [Paper]
    • MetaCLIP: "Demystifying CLIP Data", ICLR, 2024 (Meta). [Paper][PyTorch]
    • NARVL: "Non-autoregressive Sequence-to-Sequence Vision-Language Models", CVPR, 2024 (Amazon). [Paper]
    • S4: "Enhancing Vision-Language Pre-training with Rich Supervisions", CVPR, 2024 (Amazon). [Paper]
    • IL-CLIP: "Iterated Learning Improves Compositionality in Large Vision-Language Models", CVPR, 2024 (UW). [Paper]
    • MoDE: "MoDE: CLIP Data Experts via Clustering", CVPR, 2024 (Meta). [Paper][PyTorch]
    • Multi-MaP: "Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering", CVPR, 2024 (UW Tacoma). [Paper][PyTorch]
    • Cluster-Masking: "Efficient Vision-Language Pre-training by Cluster Masking", CVPR, 2024 (UMich). [Paper][PyTorch][Website]
    • FFF: "FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models", CVPR, 2024 (Samsung). [Paper]
    • VILA: "VILA: On Pre-training for Visual Language Models", CVPR, 2024 (NVIDIA). [Paper][PyTorch]
    • Morph-Tokens: "Auto-Encoding Morph-Tokens for Multimodal LLM", ICML, 2024 (Zhejiang). [Paper][PyTorch]
    • Libra: "Libra: Building Decoupled Vision System on Large Language Models", ICML, 2024 (CAS). [Paper][PyTorch]
    • COSMO: "COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training", arXiv, 2024 (Microsoft). [Paper][PyTorch][Website]
    • ?: "Low-Resource Vision Challenges for Foundation Models", arXiv, 2024 (UvA). [Paper][Code (in construction)][Website]
    • UMG-CLIP: "UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding", arXiv, 2024 (Huawei). [Paper]
    • MM-Interleaved: "MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • SPARC: "Improving fine-grained understanding in image-text pre-training", arXiv, 2024 (DeepMind). [Paper]
    • MouSi: "MouSi: Poly-Visual-Expert Vision-Language Models", arXiv, 2024 (Fudan). [Paper][Code (in construction)]
    • ?: "Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study", arXiv, 2024 (Alibaba). [Paper]
    • QA-ViT: "Question Aware Vision Transformer for Multimodal Reasoning", arXiv, 2024 (Amazon). [Paper]
    • PaLM2-VAdapter: "PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter", arXiv, 2024 (Google). [Paper]
    • PALO: "PALO: A Polyglot Large Multimodal Model for 5B People", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
    • CogCoM: "CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations", arXiv, 2024 (Zhipu AI, China). [Paper][PyTorch]
    • EVA-CLIP-18B: "EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters", arXiv, 2024 (BAAI). [Paper][PyTorch]
    • SynthCLIP: "SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?", arXiv, 2024 (KAUST). [Paper][PyTorch]
    • CloVe: "CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models", arXiv, 2024 (Netflix). [Paper]
    • ASMv2: "The All-Seeing Project V2: Towards General Relation Comprehension of the Open World", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • Multimodal-ArXiv: "Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models", arXiv, 2024 (HKU). [Paper][Website]
    • Synth2: "Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings", arXiv, 2024 (DeepMind). [Paper]
    • ?: "Towards Multimodal In-Context Learning for Vision & Language Models", arXiv, 2024 (IBM). [Paper]
    • LocCa: "LocCa: Visual Pretraining with Location-aware Captioners", arXiv, 2024 (DeepMind). [Paper]
    • SPHINX-V: "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • MyVLM: "MyVLM: Personalizing VLMs for User-Specific Queries", arXiv, 2024 (Snap). [Paper][Code (in construction)][Website]
    • BRAVE: "BRAVE: Broadening the visual encoding of vision-language models", arXiv, 2024 (Google). [Paper][Website]
    • SEED-X: "SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation", arXiv, 2024 (Tencent). [Paper][Code (in construction)]
    • CatLIP: "CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data", arXiv, 2024 (Apple). [Paper][PyTorch]
    • Llip: "Modeling Caption Diversity in Contrastive Vision-Language Pretraining", arXiv, 2024 (Meta). [Paper]
    • Idefics2: "What matters when building vision-language models?", arXiv, 2024 (Hugging Face). [Paper]
    • VILA2: "VILA2: VILA Augmented VILA", arXiv, 2024 (NVIDIA). [Paper]
  • Video:
    • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
    • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper][PyTorch]
    • VideoCLIP: "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP, 2021 (Facebook). [Paper][PyTorch]
    • VALUE: "VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation", NeurIPS (Datasets and Benchmarks), 2021 (Microsoft). [Paper][Website]
    • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
    • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
    • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
    • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
    • CLOP: "CLOP: Video-and-Language Pre-Training with Knowledge Regularizations", ACMMM, 2022 (Baidu). [Paper]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
    • EMCL: "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • LF-VILA: "Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning", NeurIPS, 2022 (Microsoft). [Paper][GitHub]
    • VATT-GR-CL: "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization", NeurIPS, 2022 (Google). [Paper]
    • LGDN: "LGDN: Language-Guided Denoising Network for Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
    • EgoVLP: "Egocentric Video-Language Pretraining", NeurIPS, 2022 (NUS). [Paper][PyTorch][Website]
    • LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
    • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
    • VIOLET: "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • SimVTP: "SimVTP: Simple Video Text Pre-training with Masked Autoencoders", arXiv, 2022 (Tencent). [Paper][PyTorch (in construction)]
    • VideoCoCa: "Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners", arXiv, 2022 (Google). [Paper]
    • i-Code: "i-Code: An Integrative and Composable Multimodal Learning Framework", AAAI, 2023 (Microsoft). [Paper][Code (in construction)]
    • TempCLR: "TempCLR: Temporal Alignment Representation with Contrastive Learning", ICLR, 2023 (Columbia). [Paper]
    • MELTR: "MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models", CVPR, 2023 (Korea University). [Paper][PyTorch]
    • VIOLETv2: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • SViTT: "SViTT: Temporal Learning of Sparse Video-Text Transformers", CVPR, 2023 (Intel). [Paper][Website]
    • TVTS: "Learning Transferable Spatiotemporal Representations from Natural Script Knowledge", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • HBI: "Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning", CVPR, 2023 (Peking University). [Paper][Code (in construction)][Website]
    • All-in-One: "All in One: Exploring Unified Video-Language Pre-training", CVPR, 2023 (NUS). [Paper][PyTorch]
    • VindLU: "VindLU: A Recipe for Effective Video-and-Language Pretraining", CVPR, 2023 (UNC). [Paper][PyTorch]
    • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", CVPR, 2023 (ByteDance). [Paper][PyTorch (in construction)]
    • mPLUG-2: "mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video", ICML, 2023 (Alibaba). [Paper][Code (in construction)]
    • BUS: "BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization", ICCV, 2023 (Alibaba). [Paper]
    • UMT: "Unmasked Teacher: Towards Training-Efficient Video Foundation Models", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ?: "Long-range Multimodal Pretraining for Movie Understanding", ICCV, 2023 (Adobe). [Paper]
    • EgoVLPv2: "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
    • SMAUG: "SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training", ICCV, 2023 (UW). [Paper]
    • VFC: "Verbs in Action: Improving verb understanding in video-language models", ICCV, 2023 (Google). [Paper]
    • HiTeA: "HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training", ICCV, 2023 (Alibaba). [Paper]
    • TW-BERT: "Learning Trajectory-Word Alignments for Video-Language Tasks", ICCV, 2023 (Southeast University, China). [Paper]
    • TESTA: "TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding", EMNLP Findings, 2023 (Peking). [Paper][PyTorch]
    • STOA-VLP: "STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • VLAB: "VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending", arXiv, 2023 (ByteDance). [Paper]
    • TVTSv2: "TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • Youku-mPLUG: "Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks", arXiv, 2023 (Alibaba). [Paper]
    • VideoGLUE: "VideoGLUE: Video General Understanding Evaluation of Foundation Models", arXiv, 2023 (Google). [Paper]
    • InternVid: "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • EVE: "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • Qwen-VL: "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • BT-Adapter: "One For All: Video Conversation is Feasible Without Video Instruction Tuning", arXiv, 2023 (Tencent). [Paper]
    • ?: "Harvest Video Foundation Models via Efficient Post-Pretraining", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Owl-Con: "VideoCon: Robust Video-Language Alignment via Contrast Captions", arXiv, 2023 (UCLA). [Paper][PyTorch]
    • ShareGPT4V: "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • Vamos: "Vamos: Versatile Action Models for Video Understanding", arXiv, 2023 (Brown). [Paper][Website]
    • EILEV: "Efficient In-Context Learning in Vision-Language Models for Egocentric Videos", arXiv, 2023 (UMich). [Paper]
    • E-ViLM: "E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer", arXiv, 2023 (Amazon). [Paper]
    • ?: "A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames", arXiv, 2023 (DeepMind). [Paper]
    • NSVA: "A Strong Baseline for Temporal Video-Text Alignment", arXiv, 2023 (SJTU). [Paper][Website]
    • debias-VL: "Debiasing Vision-Language Models via Biased Prompts", arXiv, 2023 (MIT). [Paper][PyTorch]
    • MobileVLM: "MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
    • READ-PVLA: "READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling", AAAI, 2024 (NUS). [Paper]
    • S-ViLM: "Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding", ICLR, 2024 (Google). [Paper]
    • Panda-70M: "Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers", CVPR, 2024 (Snap). [Paper][PyTorch][Website]
    • vid-TLDR: "vid-TLDR: Training Free Token merging for Light-weight Video Transformer", CVPR, 2024 (Korea University). [Paper][Code (in construction)]
    • VidLA: "VidLA: Video-Language Alignment at Scale", CVPR, 2024 (Amazon). [Paper]
    • OmniVid: "OmniVid: A Generative Framework for Universal Video Understanding", CVPR, 2024 (Fudan). [Paper][Code (in construction)]
    • MA-LMM: "MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding", CVPR, 2024 (Meta). [Paper][Website][PyTorch]
    • VIIT: "Distilling Vision-Language Models on Millions of Videos", arXiv, 2024 (Google). [Paper][Website]
    • FiGCLIP: "FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos", arXiv, 2024 (IIIT Hyderabad). [Paper][Code (in construction)]
    • LWM: "World Model on Million-Length Video And Language With RingAttention", arXiv, 2024 (Berkeley). [Paper][JAX][Website]
    • VideoPrism: "VideoPrism: A Foundational Visual Encoder for Video Understanding", arXiv, 2024 (Google). [Paper]
    • Slot-VLM: "Slot-VLM: SlowFast Slots for Video-Language Modeling", arXiv, 2024 (Microsoft). [Paper]
    • MobileVLM-V2: "MobileVLM V2: Faster and Stronger Baseline for Vision Language Model", arXiv, 2024 (Meituan). [Paper][PyTorch]
    • VL-Mamba: "VL-Mamba: Exploring State Space Models for Multimodal Learning", arXiv, 2024 (University of Adelaide). [Paper][Code (in construction)][Website]
    • InternVideo2: "InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)]
  • 3D:
    • CLIP2: "CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data", CVPR, 2023 (Huawei). [Paper]
    • 3D-VLP: "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training", CVPR, 2023 (Sichuan University). [Paper][PyTorch]
    • SDFusion: "SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation", CVPR, 2023 (Snap). [Paper][PyTorch][Website]
    • 3D-VisTA: "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment", ICCV, 2023 (Beijing Institute for General Artificial Intelligence (BIGAI)). [Paper][PyTorch][Website]
    • RegionPLC: "RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding", arXiv, 2023 (HKU). [Paper][Website]
    • 3DVLP: "Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding", arXiv, 2023 (Tsinghua). [Paper]
    • CLIPXPlore: "CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration", arXiv, 2023 (CUHK). [Paper]
    • Point-PEFT: "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SUGAR: "SUGAR: Pre-training 3D Visual Representations for Robotics", CVPR, 2024 (INRIA). [Paper][Website]
    • Any2Point: "Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • PQ3D: "Unifying 3D Vision-Language Understanding via Promptable Queries", arXiv, 2024 (Beijing Institute for General Artificial Intelligence (BIGAI)). [Paper]
  • Vision-Audio-Text:
    • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
    • VideoCC: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper][Website]
    • MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
    • VATLM: "VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • CLIP4VLA: "Accommodating Audio Modality in CLIP for Multimodal Processing", AAAI, 2023 (Renmin University of China). [Paper]
    • data2vec-2.0: "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language", ICML, 2023 (Meta). [Paper][PyTorch]
    • VAST: "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset", NeurIPS, 2023 (CAS). [Paper][Code (in construction)]
    • VALOR: "VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset", arXiv, 2023 (CAS). [Paper][PyTorch][Website]
    • MC3: "SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos", CVPR, 2024 (Meta). [Paper][Website]
  • More than 3 modalities:
    • Meta-Transformer: "Meta-Transformer: A Unified Framework for Multimodal Learning", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • UnIVAL: "Unified Model for Image, Video, Audio and Language Tasks", arXiv, 2023 (Sorbonne University, France). [Paper][PyTorch][Website]
    • ViT-Lens: "ViT-Lens: Towards Omni-modal Representations", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • ViT-Lens-2: "ViT-Lens-2: Gateway to Omni-modal Intelligence", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • ModaVerse: "ModaVerse: Efficiently Transforming Modalities with LLMs", arXiv, 2024 (University of Adelaide). [Paper]
    • M2PT: "Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities", arXiv, 2024 (CUHK). [Paper][PyTorch][Website]
    • DAMC: "Model Composition for Multimodal Large Language Models", arXiv, 2024 (Tsinghua). [Paper]
  • Others:

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
    • ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
    • MACK: "MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching", NeurIPS, 2022 (CAS). [Paper]
    • MLA: "Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval", NeurIPS, 2022 (Renmin University of China). [Paper]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
    • VLPCook: "Structured Vision-Language Pretraining for Computational Cooking", arXiv, 2022 (Sorbonne University, France). [Paper]
    • UniVL-DR: "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval", ICLR, 2023 (Northeastern University, China). [Paper]
    • HREM: "Learning Semantic Relationship Among Instances for Image-Text Matching", CVPR, 2023 (USTC). [Paper]
    • CHAN: "Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • ViLEM: "ViLEM: Visual-Language Error Modeling for Image-Text Retrieval", CVPR, 2023 (CAS). [Paper]
    • SoftMask: "Multi-Modal Representation Learning with Text-Driven Soft Masks", CVPR, 2023 (SNU). [Paper]
    • MetaPer: "Meta-Personalizing Vision-Language Models To Find Named Instances in Video", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • DivE: "Improving Cross-Modal Retrieval with Set of Diverse Embeddings", CVPR, 2023 (POSTECH). [Paper][Website]
    • Pic2Word: "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval", CVPR, 2023 (Google). [Paper][PyTorch]
    • LexLIP: "LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval", ICCV, 2023 (Microsoft). [Paper]
    • SEARLE: "Zero-Shot Composed Image Retrieval with Textual Inversion", ICCV, 2023 (University of Florence, Italy). [Paper][PyTorch (in construction)]
    • VLSlice: "VLSlice: Interactive Vision-and-Language Slice Discovery", ICCV, 2023 (OSU). [Paper][PyTorch][Website]
    • ConaCLIP: "ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval", ACL Industry Track, 2023 (Alibaba). [Paper][PyTorch]
    • FNE: "Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • HAT: "Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • STAIR: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", arXiv, 2023 (Apple). [Paper]
    • ChatIR: "Chatting Makes Perfect - Chat-based Image Retrieval", arXiv, 2023 (The Hebrew University of Jerusalem, Israel). [Paper]
    • TransAgg: "Zero-shot Composed Text-Image Retrieval", arXiv, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
    • CUSA: "Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval", AAAI, 2024 (Beihang University). [Paper][PyTorch]
    • L2RM: "Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval", CVPR, 2024 (Xi'an Jiaotong). [Paper][Code (in construction)]
    • ?: "Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models", ICML, 2024 (UW Madison). [Paper]
    • Long-CLIP: "Long-CLIP: Unlocking the Long-Text Capability of CLIP", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
    • ?: "Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval", arXiv, 2024 (Adobe). [Paper]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • Frozen: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper][Pytorch][Website][Dataset]
    • CLIP4Clip: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • OA-Trans: "Object-aware Video-language Pre-training for Retrieval", CVPR, 2022 (NUS). [Paper][PyTorch]
    • BridgeFormer: "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper][PyTorch]
    • VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
    • LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
    • ?: "A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]
    • Cross-Modal-Adapter: "Cross-Modal Adapter for Text-Video Retrieval", arXiv, 2022 (Tsinghua University). [Paper][PyTorch (in construction)]
    • MAC: "Masked Contrastive Pre-Training for Efficient Video-Text Retrieval", arXiv, 2022 (Alibaba). [Paper]
    • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", ICLR, 2023 (Microsoft). [Paper][Code (in construction)]
    • HiREST: "Hierarchical Video-Moment Retrieval and Step-Captioning", CVPR, 2023 (UNC + Meta). [Paper][PyTorch][Website]
    • Cap4Video: "Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • CLIPPING: "CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval", CVPR, 2023 (Huawei). [Paper]
    • CNVid-3.5M: "CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
    • CelebV-Text: "CelebV-Text: A Large-Scale Facial Text-Video Dataset", CVPR, 2023 (University of Sydney). [Paper][GitHub][Website]
    • ReST: "Relational Space-Time Query in Long-Form Videos", CVPR, 2023 (Meta). [Paper]
    • NaQ: "NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory", CVPR, 2023 (UT Austin). [Paper][PyTorch][Website]
    • ?: "Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval", CVPR, 2023 (Columbia). [Paper][Code (in contruction)]
    • VoP: "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval", CVPR, 2023 (Alibaba). [Paper][Code (in construction)][Website]
    • SpotEM: "SpotEM: Efficient Video Search for Episodic Memory", ICML, 2023 (UT Austin). [Paper][Website]
    • PromptSwitch: "Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval", ICCV, 2023 (University of Adelaide). [Paper][PyTorch]
    • ?: "Simple Baselines for Interactive Video Retrieval with Questions and Answers", ICCV, 2023 (Princeton). [Paper][PyTorch]
    • MeVTR: "Multi-event Video-Text Retrieval", ICCV, 2023 (LMU Munich). [Paper][PyTorch]
    • In-Style: "In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval", ICCV, 2023 (MPI). [Paper][Code (in construction)]
    • UCoFiA: "Unified Coarse-to-Fine Alignment for Video-Text Retrieval", ICCV, 2023 (UNC). [Paper][PyTorch]
    • TEFAL: "Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment", ICCV, 2023 (Amazon). [Paper]
    • DiffusionRet: "DiffusionRet: Generative Text-Video Retrieval with Diffusion Model", ICCV, 2023 (Peking University). [Paper][PyTorch]
    • UATVR: "UATVR: Uncertainty-Adaptive Text-Video Retrieval", ICCV, 2023 (Baidu). [Paper][PyTorch]
    • In-Style: "In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval", ICCV, 2023 (Goethe University Frankfurt, Germany). [Paper][Code (in construction)]
    • ReGaDa: "Video-adverb retrieval with compositional adverb-action embeddings", BMVC, 2023 (University of Tübingen, Germany). [Paper][Code (in construction)][Website]
    • TextVR: "A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • MASCOT: "Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval", arXiv, 2023 (?). [Paper]
    • CrossTVR: "Fine-grained Text-Video Retrieval with Frozen Image Encoders", arXiv, 2023 (Alibaba). [Paper]
    • TeachCLIP: "TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval", arXiv, 2023 (Renmin University of China). [Paper]
    • CoVR: "CoVR: Learning Composed Video Retrieval from Web Video Captions", arXiv, 2023 (Ecole des Ponts ParisTech (ENPC), France). [Paper][PyTorch][Website]
    • LanguageBind: "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment", arXiv, 2023 (Peking). [Paper][PyTorch]
    • 10k-Words: "A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval", arXiv, 2023 (SRI). [Paper][Website]
    • DGL: "DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval", AAAI, 2024 (University of Technology Sydney). [Paper][PyTorch]
    • T-MASS: "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval", CVPR, 2024 (Rochester Institute of Technology (RIT), NY). [Paper][PyTorch]
    • ?: "Detours for Navigating Instructional Videos", arXiv, 2024 (Meta). [Paper]
    • MultiCaps: "Learning text-to-video retrieval from image captioning", arXiv, 2024 (INRIA). [Paper][Website]
  • Vision-Audio-Text:
  • 3D:
    • Text2Loc: "Text2Loc: 3D Point Cloud Localization from Natural Language", CVPR, 2024 (TUM). [Paper][Code (in construction)][Website]
    • Text2SGM: ""Where am I?" Scene Retrieval with Language", arXiv, 2024 (ETHZ). [Paper]
  • Others:
    • IRRA: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
    • ZS-SBIR: "CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not", CVPR, 2023 (University of Surrey, UK). [Paper][PyTorch]
    • ViML: "Language-Guided Music Recommendation for Video via Prompt Analogies", CVPR, 2023 (Adobe). [Paper][Website]
    • eP-ALM: "eP-ALM: Efficient Perceptual Augmentation of Language Models", ICCV, 2023 (Sorbonne University, France). [Paper][PyTorch]
    • Auto-ACD: "A Large-scale Dataset for Audio-Language Representation Learning", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)][Website]
    • Motion-Patches: "Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches", CVPR, 2024 (LY, Japan). [Paper][Website]

[Back to Overview]

Multi-Modal Generation

  • General:
    • AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
    • ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • LDM: "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR, 2022 (LMU Munich). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
    • TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • CLIPDraw: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders", NeurIPS, 2022 (Cross Compass, Japan). [Paper][PyTorch][Blog]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", NeurIPS, 2022 (Google). [Paper][Website]
    • ?: "Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark", NeurIPSW, 2022 (Boston + MIT + Columbia). [Paper]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
    • Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
    • UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
    • VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
    • Lafite2: "Lafite2: Few-shot Text-to-Image Generation", arXiv, 2022 (SUNY, Buffalo). [Paper]
    • eDiffi: "eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers", arXiv, 2022 (NVIDIA). [Paper][Website]
    • SpaText: "SpaText: Spatio-Textual Representation for Controllable Image Generation", arXiv, 2022 (Meta). [Paper][Website]
    • Story-LDM: "Make-A-Story: Visual Memory Conditioned Consistent Story Generation", arXiv, 2022 (UBC + Snap). [Paper]
    • Structure-Diffusion: "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis", arXiv, 2022 (UCSB + UC Santa Cruz). [Paper][PyTorch][Website]
    • Re-Imagen: "Re-Imagen: Retrieval-Augmented Text-to-Image Generator", ICLR, 2023 (Google). [Paper]
    • Prompt-to-Prompt: "Prompt-to-Prompt Image Editing with Cross Attention Control", ICLR, 2023 (Google). [Paper][PyTorch][Website]
    • UniD3: "Unified Discrete Diffusion for Simultaneous Vision-Language Generation", ICLR, 2023 (NTU, Singapore). [Paper]
    • T2P: "Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation", CVPR, 2023 (Fuxi AI Lab). [Paper]
    • GLIGEN: "GLIGEN: Open-Set Grounded Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • MAGVLT: "MAGVLT: Masked Generative Vision-and-Language Transformer", CVPR, 2023 (Kakao). [Paper]
    • ReCo: "ReCo: Region-Controlled Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • GALIP: "GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis", CVPR, 2023 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch]
    • DreamBooth: "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR, 2023 (Google). [Paper][GitHub][Website]
    • RIATIG: "RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts", CVPR, 2023 (Washington University in St. Louis). [Paper]
    • ERNIE-ViLG-2.0: "ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts", CVPR, 2023 (Baidu). [Paper][Website]
    • GigaGAN: "Scaling up GANs for Text-to-Image Synthesis", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • Shifted-Diffusion: "Shifted Diffusion for Text-to-image Generation", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • Specialist-Diffusion: "Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style", CVPR, 2023 (Picsart). [Paper][Website]
    • ?: "Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation", CVPR, 2023 (CyberAgent, Japan). [Paper]
    • Custom-Diffusion: "Multi-Concept Customization of Text-to-Image Diffusion", CVPR, 2023 (Adobe). [Paper]
    • UniDiffuser: "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale", ICML, 2023 (Tsinghua University). [Paper][PyTorch]
    • Muse: "Muse: Text-To-Image Generation via Masked Generative Transformers", ICML, 2023 (Google). [Paper][Website]
    • RA-CM3: "Retrieval-Augmented Multimodal Language Modeling", ICML, 2023 (Meta). [Paper]
    • StyleGAN-T: "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • VD: "Versatile Diffusion: Text, Images and Variations All in One Diffusion Model", ICCV, 2023 (Oregon). [Paper][PyTorch]
    • DiT: "Scalable Diffusion Models with Transformers", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
    • TCTS-FAS: "Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models", ICCV, 2023 (KAIST). [Paper]
    • ?: "Discriminative Class Tokens for Text-to-Image Diffusion Models", ICCV, 2023 (Tel Aviv). [Paper][PyTorch]
    • TIFA: "TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering", ICCV, 2023 (UW). [Paper][PyTorch][Website]
    • LSDM: "Language-driven Scene Synthesis using Multi-conditional Diffusion Model", NeurIPS, 2023 (FSOFT AI Center, Vietnam). [Paper][PyTorch][Website]
    • LLMScore: "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation", NeurIPS, 2023 (UCSB). [Paper][PyTorch]
    • PoS-subspaces: "Parts of Speech-Grounded Subspaces in Vision-Language Models", NeurIPS, 2023 (Queen Mary University of London). [Paper][PyTorch][Website]
    • LANCE: "LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images", NeurIPS, 2023 (Georgia Tech). [Paper][PyTorch][Website]
    • ?: "The CLIP Model is Secretly an Image-to-Prompt Converter", NeurIPS, 2023 (Xidian University). [Paper]
    • BLIP-Diffusion: "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing", NeurIPS, 2023 (Salesforce). [Paper][Code (in construction)][Website]
    • CoDi: "Any-to-Any Generation via Composable Diffusion", NeurIPS, 2023 (Microsoft). [Paper][PyTorch][Website]
    • UniControl: "UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild", NeurIPS, 2023 (Salesforce). [Paper][PyTorch]
    • E4T: "Designing an Encoder for Fast Personalization of Text-to-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • ?: "Controlled and Conditional Text to Image Generation with Diffusion Prior", arXiv, 2023 (Adobe). [Paper]
    • Lformer: "Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding", arXiv, 2023 (Zhejiang University). [Paper]
    • UMM-Diffusion: "Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation", arXiv, 2023 (Peking University). [Paper]
    • ToMESD: "Token Merging for Fast Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
    • layout-guidance: "Training-Free Layout Control with Cross-Attention Guidance", arXiv, 2023 (Oxford). [Paper][PyTorch][Website]
    • HRS-Bench: "HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models", arXiv, 2023 (KAUST). [Paper][GitHub][Website]
    • SeedSelect: "It is all about where you start: Text-to-image generation with seed selection", arXiv, 2023 (Bar-Ilan University, Israel). [Paper]
    • DisenBooth: "DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation", arXiv, 2023 (Tsinghua). [Paper]
    • VideoOFA: "VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation", arXiv, 2023 (Meta). [Paper]
    • FastComposer: "FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
    • VPGen: "Visual Programming for Text-to-Image Generation and Evaluation", arXiv, 2023 (UNC). [Paper][PyTorch][Website]
    • SeeCoder: "Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models", arXiv, 2023 (Picsart). [Paper][PyTorch]
    • GILL: "Generating Images with Multimodal Language Models", NeurIPS, 2023 (CMU). [Paper][PyTorch][Website]
    • DA-Score: "Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback", NeurIPS, 2023 (ANU). [Paper][PyTorch][Website]
    • GORS: "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation", NeurIPS, 2023 (HKU). [Paper][PyTorch][Website]
    • TextDiffuser: "TextDiffuser: Diffusion Models as Text Painters", NeurIPS, 2023 (Microsoft). [Paper][PyTorch]
    • CAC: "Localized Text-to-Image Generation for Free via Cross Attention Control", arXiv, 2023 (CMU). [Paper]
    • CLIPAG: "CLIPAG: Towards Generator-Free Text-to-Image Generation", arXiv, 2023 (Technion, Israel). [Paper]
    • PACGen: "Generate Anything Anywhere in Any Scene", arXiv, 2023 (UW Madison). [Paper][Code (in construction)][Website]
    • SPAE: "SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs", arXiv, 2023 (Google). [Paper]
    • HyperDreamBooth: "HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models", arXiv, 2023 (Tencent). [Paper][Website]
    • ORES: "ORES: Open-vocabulary Responsible Visual Synthesis", arXiv, 2023 (Microsoft). [Paper]
    • CM3Leon: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", arXiv, 2023 (Meta). [Paper]
    • DreamLLM: "DreamLLM: Synergistic Multimodal Comprehension and Creation", arXiv, 2023 (Megvii). [Paper][Code (in construction)][Website]
    • FreeU: "FreeU: Free Lunch in Diffusion U-Net", arXiv, 2023 (NTU, Singapore). [Paper][Website][Code (in construction)]
    • Emu: "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack", arXiv, 2023 (Meta). [Paper]
    • Kosmos-G: "Kosmos-G: Generating Images in Context with Multimodal Large Language Models", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • AlignProp: "Aligning Text-to-Image Diffusion Models with Reward Backpropagation", arXiv, 2023 (CMU). [Paper][PyTorch][Website]
    • Idea2Img: "Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation", arXiv, 2023 (Microsoft). [Paper][Website]
    • EasyGen: "Making Multimodal Generation Easier: When Diffusion Models Meet LLMs", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • LLM-Blueprint: "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)]
    • DiagrammerGPT: "DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning", arXiv, 2023 (UNC). [Paper][Python][Website]
    • Emu-Edit: "Emu Edit: Precise Image Editing via Recognition and Generation Tasks", arXiv, 2023 (Meta). [Paper]
    • CoDi-2: "CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • UniGS: "UniGS: Unified Representation for Image Generation and Segmentation", arXiv, 2023 (UC Merced). [Paper][PyTorch (in construction)]
    • StoryGPT-V: "Large Language Models as Consistent Story Visualizers", arXiv, 2023 (KAUST). [Paper][PyTorch][Website]
    • StackedDiffusion: "Generating Illustrated Instructions", arXiv, 2023 (Meta). [Paper][Website]
    • VL-GPT: "VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • PixArt-α: "PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis", ICLR, 2024 (Huawei). [Paper][PyTorch][Website]
    • DistriFusion: "DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models", CVPR, 2024 (MIT). [Paper][PyTorch][Website]
    • DPT: "Discriminative Probing and Tuning for Text-to-Image Generation", CVPR, 2024 (NUS). [Paper][Code (in construction)][Website]
    • HcP: "Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation", CVPR, 2024 (University of New South Wales (UNSW), Australia). [Paper][Code (in construction)][Website]
    • aMUSEd: "aMUSEd: An Open MUSE Reproduction", arXiv, 2024 (Hugging Face). [Paper]
    • Instruct-Imagen: "Instruct-Imagen: Image Generation with Multi-modal Instruction", arXiv, 2024 (Google). [Paper]
    • DiffusionGPT: "DiffusionGPT: LLM-Driven Text-to-Image Generation System", arXiv, 2024 (ByteDance). [Paper][PyTorch][Website]
    • SiT: "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers", arXiv, 2024 (NYU). [Paper][PyTorch][Website]
    • λ-ECLIPSE: "λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space", arXiv, 2024 (Arizona State University). [Paper][PyTorch][Website]
    • FiT: "FiT: Flexible Vision Transformer for Diffusion Model", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
    • SDXL-Lightning: "SDXL-Lightning: Progressive Adversarial Diffusion Distillation", arXiv, 2024 (ByteDance). [Paper]
    • CG: "Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models", arXiv, 2024 (CMU). [Paper]
    • Gen4Gen: "Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition", arXiv, 2024 (Berkeley). [Paper][Code (in construction)][Website]
    • PixArt-Σ: "PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation", arXiv, 2024 (Huawei). [Paper][Website]
    • CogView3: "CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion", arXiv, 2024 (Zhipu AI, China). [Paper]
    • SELMA: "SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data", arXiv, 2024 (UNC). [Paper][PyTorch][Website]
    • Imagine-Flash: "Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation", arXiv, 2024 (Meta). [Paper]
    • Hunyuan-DiT: "Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding", arXiv, 2024 (Tencent). [Paper][PyTorch][Website]
  • Video:
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
    • ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
    • MagicVideo: "MagicVideo: Efficient Video Generation With Latent Diffusion Models", arXiv, 2022 (ByteDance). [Paper][Website]
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", ICLR, 2023 (Tsinghua University). [Paper][GitHub (in construction)]
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", ICLR, 2023 (Meta). [Paper]
    • VideoLDM: "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][Website]
    • MMVG: "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation", CVPR, 2023 (Meta). [Paper]
    • MM-Diffusion: "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • PYoCo: "Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models", ICCV, 2023 (NVIDIA). [Paper][Website]
    • Text2Video-Zero: "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators", ICCV, 2023 (Picsart). [Paper][Code (in construction)]
    • Text2Performer: "Text2Performer: Text-Driven Human Video Generation", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • GlueGen: "GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation", ICCV, 2023 (Salesforce). [Paper][PyTorch]
    • VideoFactory: "VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation", arXiv, 2023 (Microsoft). [Paper]
    • Video-Adapter: "Probabilistic Adaptation of Text-to-Video Models", arXiv, 2023 (DeepMind). [Paper][Website]
    • SimDA: "SimDA: Simple Diffusion Adapter for Efficient Video Generation", arXiv, 2023 (Fudan). [Paper][Website]
    • LVD: "LLM-grounded Video Diffusion Models", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
    • VideoCrafter1: "VideoCrafter1: Open Diffusion Models for High-Quality Video Generation", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Emu-Video: "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning", arXiv, 2023 (Meta). [Paper][Website]
    • PixelDance: "Make Pixels Dance: High-Dynamic Video Generation", arXiv, 2023 (ByteDance). [Paper][Website]
    • VideoBooth: "VideoBooth: Diffusion-based Video Generation with Image Prompts", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
    • VideoSwap: "VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence", arXiv, 2023 (Meta). [Paper][Code (in construction)][Website]
    • LEGO: "LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning", arXiv, 2023 (Meta). [Paper]
    • GenHowTo: "GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos", arXiv, 2023 (Czech Technical University). [Paper][Website]
    • VideoLCM: "VideoLCM: Video Latent Consistency Model", arXiv, 2023 (Alibaba). [Paper]
    • VideoPoet: "VideoPoet: A Large Language Model for Zero-Shot Video Generation", arXiv, 2023 (Google). [Paper][Website]
    • CMD: "Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition", ICLR, 2024 (NVIDIA). [Paper][Website]
    • MagicVideo-V2: "MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation", arXiv, 2024 (ByteDance). [Paper][Website]
    • WorldDreamer: "WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens", arXiv, 2024 (Tsinghua). [Paper][Code (in construction)][Website]
    • Vlogger: "Vlogger: Make Your Dream A Vlog", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • VideoCrafter2: "VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models", arXiv, 2024 (Tencent). [Paper][PyTorch][Website]
    • ActAnywhere: "ActAnywhere: Subject-Aware Video Background Generation", arXiv, 2024 (Adobe). [Paper][Website]
    • Lumiere: "Lumiere: A Space-Time Diffusion Model for Video Generation", arXiv, 2024 (Google). [Paper][Website]
    • Snap-Video: "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis", arXiv, 2024 (Snap). [Paper][Website]
    • Pix2Gif: "Pix2Gif: Motion-Guided Diffusion for GIF Generation", arXiv, 2024 (Microsoft). [Paper][PyTorch][Website]
    • WorldGPT: "WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs", arXiv, 2024 (Seeking AI, China). [Paper]
    • AnyV2V: "AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks", arXiv, 2024 (University of Waterloo, Canada). [Paper][PyTorch][Website]
  • 3D:
    • Magic3D: "Magic3D: High-Resolution Text-to-3D Content Creation", CVPR, 2023 (NVIDIA). [Paper][Website]
    • CLIP-Sculptor: "CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language", CVPR, 2023 (Autodesk). [Paper][Website]
    • Diffusion-SDF: "Diffusion-SDF: Text-to-Shape via Voxelized Diffusion", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • TAPS3D: "TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • Dream3D: "Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models", CVPR, 2023 (Tencent). [Paper][Website]
    • ATT3D: "ATT3D: Amortized Text-to-3D Object Synthesis", ICCV, 2023 (NVIDIA). [Paper][Website]
    • InstructP2P: "InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions", arXiv, 2023 (Tencent). [Paper]
    • SDS-Complete: "Point-Cloud Completion with Pretrained Text-to-image Diffusion Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • Michelangelo: "Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation", arXiv, 2023 (Tencent). [Paper][Code (in construction)][Website]
    • 3D-GPT: "3D-GPT: Procedural 3D Modeling with Large Language Models", arXiv, 2023 (ANU). [Paper][Website]
    • DiffTF: "Large-Vocabulary 3D Diffusion Model with Transformer", ICLR, 2024 (NTU, Singapore). [Paper][PyTorch][Website]
    • DiffTF++: "DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation", arXiv, 2024 (NTU, Singapore). [Paper]
  • Others:
    • DiffGesture: "Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation", CVPR, 2023 (HKU). [Paper][PyTorch]
    • CondFoleyGen: "Conditional Generation of Audio from Video via Foley Analogies", CVPR, 2023 (UMich). [Paper][PyTorch (in construction)][Website]
    • Physics-Diffusion: "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos", CVPR, 2023 (IBM). [Paper][PyTorch][Website]
    • RACER: "Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards", CVPR, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
    • ReVISE: "ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • MAV3D: "Text-To-4D Dynamic Scene Generation", ICML, 2023 (Meta). [Paper][Website]
    • LORIS: "Long-Term Rhythmic Video Soundtracker", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • NExT-GPT: "NExT-GPT: Any-to-Any Multimodal LLM", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • Lumina-T2X: "Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • X-VILA: "X-VILA: Cross-Modality Alignment for Large Language Model", arXiv, 2024 (NVIDIA). [Paper]

[Back to Overview]

Prompt Learning/Tuning

  • CLIP-Adapter: "CLIP-Adapter: Better Vision-Language Models with Feature Adapters", arXiv, 2021 (Shanghai AI Lab). [Paper][PyTorch]
  • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
  • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
  • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
  • PerVL: ""This is my unicorn, Fluffy": Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
  • OrdinalCLIP: "OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
  • BeamCLIP: "Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching", NeurIPS, 2022 (LG). [Paper]
  • TPT: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch][Website]
  • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
  • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
  • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
  • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
  • CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
  • PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
  • MVLPT: "Multitask Vision-Language Prompt Tuning", arXiv, 2022 (Berkeley). [Paper][PyTorch]
  • ?: "Task Bias in Vision-Language Models", arXiv, 2022 (Columbia). [Paper]
  • UPL: "Unsupervised Prompt Learning for Vision-Language Models", arXiv, 2022 (Peking). [Paper][PyTorch]
  • DeFo: "Learning to Decompose Visual Features with Latent Textual Prompts", ICLR, 2023 (UIUC). [Paper]
  • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", ICLR, 2023 (CMU). [Paper]
  • ?: "Visual Classification via Description from Large Language Models", ICLR, 2023 (Columbia). [Paper]
  • CSP: "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning", ICLR, 2023 (Brown University). [Paper][PyTorch]
  • CaFo: "Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • ?: "Multimodal Prompting with Missing Modalities for Visual Recognition", CVPR, 2023 (NYCU). [Paper][PyTorch][Website]
  • DAM-VP: "Diversity-Aware Meta Visual Prompting", CVPR, 2023 (USTC). [Paper][PyTorch]
  • ILM-VP: "Understanding and Improving Visual Prompting: A Label-Mapping Perspective", CVPR, 2023 (Michigan State). [Paper][PyTorch]
  • KgCoOp: "Visual-Language Prompt Tuning with Knowledge-guided Context Optimization", CVPR, 2023 (CAS). [Paper][PyTorch]
  • BlackVIP: "BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning", CVPR, 2023 (University of Seoul). [Paper][PyTorch]
  • EXPRES: "Learning Expressive Prompting With Residuals for Vision Transformers", CVPR, 2023 (Amazon). [Paper]
  • ?: "Learning to Name Classes for Vision and Language Models", CVPR, 2023 (Huawei). [Paper]
  • PMF: "Efficient Multimodal Fusion via Interactive Prompting", CVPR, 2023 (Zhejiang University). [Paper]
  • MaPLe: "MaPLe: Multi-modal Prompt Learning", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
  • HiPro: "Hierarchical Prompt Learning for Multi-Task Learning", CVPR, 2023 (JD). [Paper]
  • DFSP: "Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TaI-DP: "Texts as Images in Prompt Tuning for Multi-Label Image Recognition", CVPR, 2023 (Tomorrow Advancing Life (TAL)). [Paper][PyTorch]
  • ESPER: "Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning", CVPR, 2023 (Yonsei). [Paper][PyTorch]
  • APT: "A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting", CVPR, 2023 (Amazon). [Paper]
  • VQT: "Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning", CVPR, 2023 (The Ohio State University (OSU)). [Paper]
  • LaBo: "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification", CVPR, 2023 (University of Pennsylvania). [Paper][PyTorch]
  • TaskRes: "Task Residual for Tuning Vision-Language Models", CVPR, 2023 (NUS). [Paper][PyTorch]
  • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", CVPR, 2023 (Samsung). [Paper][Website]
  • POUF: "POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models", ICML, 2023 (UT Austin). [Paper][PyTorch]
  • GaPT: "Improving Visual Prompt Tuning for Self-supervised Vision Transformers", ICML, 2023 (SNU). [Paper][PyTorch]
  • ZPE: "A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models", ICML, 2023 (Google). [Paper]
  • CMPA: "Deeply Coupled Cross-Modal Prompt Learning", ACL Findings, 2023 (SenseTime). [Paper]
  • PromptSRC: "Self-regulating Prompts: Foundational Model Adaptation without Forgetting", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
  • SHIP: "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts", ICCV, 2023 (CAS). [Paper]
  • PTNL: "Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?", ICCV, 2023 (ByteDance). [Paper][PyTorch]
  • E2VPT: "E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning", ICCV, 2023 (Rochester Institute of Technology, NY). [Paper][PyTorch]
  • R-AMT: "Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models", ICCV, 2023 (Zhejiang University). [Paper][PyTorch][Website]
  • DiffTPT: "Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning", ICCV, 2023 (A*STAR). [Paper][PyTorch]
  • KAPT: "Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models", ICCV, 2023 (Southern University of Science and Technology (SUSTech)). [Paper]
  • RPO: "Read-only Prompt Optimization for Vision-Language Few-shot Learning", ICCV, 2023 (Korea University). [Paper][PyTorch]
  • LoGoPrompt: "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models", ICCV, 2023 (ShanghaiTech). [Paper][Website]
  • DAPT: "Distribution-Aware Prompt Tuning for Vision-Language Models", ICCV, 2023 (Korea University). [Paper][PyTorch]
  • ?: "What does CLIP know about a red circle? Visual prompt engineering for VLMs", ICCV, 2023 (Oxford). [Paper]
  • GRAM: "Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models", ICCV, 2023 (Huawei). [Paper]
  • VPT: "Variational prompt tuning improves generalization of vision-language models", ICCV, 2023 (Samsung). [Paper][PyTorch]
  • ProGrad: "Prompt-aligned Gradient for Prompt Tuning", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
  • CTP-TFT: "Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models", ICCV, 2023 (Baidu). [Paper]
  • GOPro: "GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning", BMVC, 2023 (IIT Bombay). [Paper][Code (in construction)]
  • APoLLo: "APoLLo: Unified Adapter and Prompt Learning for Vision Language Models", EMNLP, 2023 (Maryland). [Paper][Website]
  • ALIGN: "Tuning Multi-mode Token-level Prompt Alignment across Modalities", NeurIPS, 2023 (Xidian University). [Paper][PyTorch]
  • GraphAdapter: "GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph", NeurIPS, 2023 (NUS). [Paper][PyTorch (in construction)]
  • OpenVik: "Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting", NeurIPS, 2023 (Emory). [Paper]
  • PromptAlign: "Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization", NeurIPS, 2023 (MBZUAI). [Paper][PyTorch][Website]
  • VPGTrans: "Transfer Visual Prompt Generator across LLMs", NeurIPS, 2023 (NUS). [Paper][PyTorch][Website]
  • TransHP: "TransHP: Image Classification with Hierarchical Prompting", NeurIPS, 2023 (Baidu). [Paper][PyTorch]
  • UP-DP: "UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models", NeurIPS, 2023 (Bosch). [Paper]
  • LaFTer: "LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections", NeurIPS, 2023 (TU Graz, Austria). [Paper][PyTorch][Website]
  • DDCoT: "DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models", NeurIPS, 2023 (ShanghaiTech). [Paper][PyTorch][Website]
  • ?: "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning", NeurIPS, 2023 (Brown). [Paper][PyTorch]
  • FGVP: "Fine-Grained Visual Prompting", NeurIPS, 2023 (BAAI). [Paper][PyTorch]
  • POMP: "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition", NeurIPS, 2023 (Amazon). [Paper][PyTorch]
  • SeMap: "From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need", arXiv, 2023 (CISPA, Germany). [Paper]
  • R-Tuning: "R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • VPTM: "Rethinking Visual Prompt Learning as Masked Visual Token Modeling", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • PBPrompt: "Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models", arXiv, 2023 (Xidian University). [Paper]
  • Robust-ProL: "Towards Robust Prompts on Vision-Language Models", arXiv, 2023 (Google). [Paper]
  • ProVP: "Progressive Visual Prompt Learning with Contrastive Feature Re-formation", arXiv, 2023 (vivo, China). [Paper]
  • ?: "Chain of Thought Prompt Tuning in Vision Language Models", arXiv, 2023 (Peking University). [Paper]
  • Instruction-ViT: "Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper]
  • DRPT: "DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • VCoT: "Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings", arXiv, 2023 (UCSB). [Paper]
  • PMPO: "Multi-Prompt with Depth Partitioned Cross-Modal Learning", arXiv, 2023 (CAS). [Paper]
  • DSD: "Discriminative Diffusion Models as Few-shot Vision and Language Learners", arXiv, 2023 (Google). [Paper]
  • PLID: "Prompting Language-Informed Distribution for Compositional Zero-Shot Learning", arXiv, 2023 (Michigan State). [Paper]
  • ConES: "ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models", arXiv, 2023 (Sichuan University). [Paper]
  • CoPrompt: "Consistency-guided Prompt Learning for Vision-Language Models", arXiv, 2023 (Queen’s University, Canada). [Paper]
  • ProTeCt: "ProTeCt: Prompt Tuning for Hierarchical Consistency", arXiv, 2023 (UCSD). [Paper]
  • POP: "POP: Prompt Of Prompts for Continual Learning", arXiv, 2023 (Qualcomm). [Paper]
  • GAVIE: "Aligning Large Multi-Modal Model with Robust Instruction Tuning", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • NPT: "Bridging the Gap: Neural Collapse Inspired Prompt Tuning for Generalization under Class Imbalance", arXiv, 2023 (Zhejiang University). [Paper]
  • APT: "Approximated Prompt Tuning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
  • CoPL: "Contextual Prompt Learning for Vision-Language Understanding", arXiv, 2023 (Adobe). [Paper]
  • CiP: "Image Captions are Natural Prompts for Text-to-Image Models", arXiv, 2023 (The University of Sydney). [Paper]
  • DPL: "DPL: Decoupled Prompt Learning for Vision-Language Models", arXiv, 2023 (vivo). [Paper]
  • DuAl-PT: "Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment", arXiv, 2023 (ByteDance). [Paper]
  • DePT: "DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning", arXiv, 2023 (UCL). [Paper][PyTorch]
  • Prompting4Debugging: "Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts", arXiv, 2023 (NYCU). [Paper]
  • ?: "Language Models as Black-Box Optimizers for Vision-Language Models", arXiv, 2023 (CMU). [Paper]
  • DePT: "DePT: Decoupled Prompt Tuning", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper][PyTorch]
  • DEsignBench: "DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design", arXiv, 2023 (Microsoft). [Paper][Website]
  • ArGue: "ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models", arXiv, 2023 (ANU). [Paper]
  • SWIG: "Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models", arXiv, 2023 (NUS). [Paper][Code (in construction)]
  • IMProv: "IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • CLAMP: "CLAMP: Contrastive LAnguage Model Prompt-tuning", arXiv, 2023 (Boston). [Paper]
  • RLP: "Re-parameterized Low-rank Prompt: Generalize a Vision-Language Model within 0.5K Parameters", arXiv, 2023 (Tsinghua). [Paper]
  • HPT: "Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models", AAAI, 2024 (Tongji University). [Paper]
  • LAMM: "LAMM: Label Alignment for Multi-Modal Prompt Learning", AAAI, 2024 (SJTU). [Paper][Code (in construction)]
  • LaViP: "LaViP: Language-Grounded Visual Prompts", AAAI, 2024 (Monash University). [Paper]
  • SA2VP: "SA2VP: Spatially Aligned-and-Adapted Visual Prompt", AAAI, 2024 (Harbin Institute of Technology). [Paper][PyTorch]
  • CPL: "Concept-Guided Prompt Learning for Generalization in Vision-Language Models", AAAI, 2024 (Harbin Institute of Technology). [Paper]
  • ?: "Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?", ICLR, 2024 (Rochester Institute of Technology). [Paper]
  • PromptKD: "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models", CVPR, 2024 (Nankai University). [Paper][PyTorch][Website]
  • TDA: "Efficient Test-Time Adaptation of Vision-Language Models", CVPR, 2024 (MBZUAI). [Paper][PyTorch][Website]
  • TVP: "Exploring the Transferability of Visual Prompting for Multimodal Large Language Models", CVPR, 2024 (Tsinghua). [Paper][PyTorch]
  • MTA: "On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?", CVPR, 2024 (UCLouvain, Belgium). [Paper][PyTorch]
  • MemVP: "Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning", ICML, 2024 (Huawei). [Paper][Code (in construction)]
  • ProText: "Learning to Prompt with Text Only Supervision for Vision-Language Models", arXiv, 2024 (MBZUAI). [Paper][PyTorch][Website]
  • Any-shift: "Any-Shift Prompting for Generalization over Distributions", arXiv, 2024 (UvA). [Paper]
  • SPT: "Revisiting the Power of Prompt for Visual Tuning", arXiv, 2024 (Hefei University of Technology). [Paper][PyTorch]
  • LSPT: "LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning", arXiv, 2024 (Microsoft). [Paper]
  • MPVR: "Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs", arXiv, 2024 (TU Graz, Austria). [Paper][PyTorch][Website]
  • CDL: "Pre-trained Vision-Language Models Learn Discoverable Visual Concepts", arXiv, 2024 (Brown). [Paper][PyTorch][Website]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", ACMMM, 2021 (Baidu). [Paper][Paddle]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • MGDoc: "MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding", EMNLP, 2022 (Adobe). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
  • Hi-VT5: "Hierarchical multimodal transformers for Multi-Page DocVQA", arXiv, 2022 (UAB, Spain). [Paper]
  • OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]
  • PIXEL: "Language Modelling with Pixels", ICLR, 2023 (University of Copenhagen, Denmark). [Paper]
  • Spotlight: "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus", ICLR, 2023 (Google). [Paper]
  • MaskDoc: "Masked Visual-Textual Prediction for Document Image Representation Pretraining", ICLR, 2023 (Baidu). [Paper]
  • StrucTexTv2: "StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training", ICLR, 2023 (Baidu). [Paper][Paddle]
  • FlexDM: "Towards Flexible Multi-modal Document Models", CVPR, 2023 (CyberAgent, Japan). [Paper][Tensorflow][Website]
  • MUI: "Mobile User Interface Element Detection Via Adaptively Prompt Tuning", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
  • UDOP: "Unifying Vision, Text, and Layout for Universal Document Processing", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • M6Doc: "M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis", CVPR, 2023 (South China University of Technology). [Paper][GitHub]
  • VGT: "Vision Grid Transformer for Document Layout Analysis", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • SeRum: "Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration", ICCV, 2023 (Tencent). [Paper]
  • DocTr: "DocTr: Document Transformer for Structured Information Extraction in Documents", ICCV, 2023 (Amazon). [Paper]
  • FormNetV2: "FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction", ACL, 2023 (Google). [Paper]
  • UReader: "UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model", EMNLP, 2023 (Alibaba). [Paper]
  • mmc4: "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text", NeurIPS (Datasets and Benchmarks), 2023 (AI2). [Paper][GitHub]
  • DUBLIN: "DUBLIN - Document Understanding By Language-Image Network", arXiv, 2023 (Microsoft). [Paper]
  • DocFormerv2: "DocFormerv2: Local Features for Document Understanding", arXiv, 2023 (Amazon). [Paper]
  • DocumentCLIP: "DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents", arXiv, 2023 (Adobe). [Paper][PyTorch]
  • Kosmos-2.5: "Kosmos-2.5: A Multimodal Literate Model", arXiv, 2023 (Microsoft). [Paper]
  • mPLUG-DocOwl: "mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
  • RD: "Efficient End-to-End Visual Document Understanding with Rationale Distillation", arXiv, 2023 (DeepMind). [Paper]
  • RoDLA: "RoDLA: Benchmarking the Robustness of Document Layout Analysis Models", CVPR, 2024 (Karlsruhe Institute of Technology (KIT), Germany). [Paper][PyTorch][Website]
  • HRVDA: "HRVDA: High-Resolution Visual Document Assistant", CVPR, 2024 (Tencent). [Paper]
  • LayoutLLM: "LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding", CVPR, 2024 (Alibaba). [Paper][Code (in construction)]
  • ScreenAI: "ScreenAI: A Vision-Language Model for UI and Infographics Understanding", arXiv, 2024 (Google). [Paper]
  • DoCo: "Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models", arXiv, 2024 (Tencent). [Paper]
  • mPLUG-DocOwl-1.5: "mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding", arXiv, 2024 (Alibaba). [Paper][Code (in construction)]

[Back to Overview]

Other Multi-Modal Tasks

  • Transfer Learning/Adaptation/Distillation/PEFT/MoE:
    • FLYP: "Finetune like you pretrain: Improved finetuning of zero-shot vision models", CVPR, 2023 (CMU). [Paper][PyTorch]
    • Pi-Tuning: "Pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation", ICML, 2023 (HKU). [Paper][Code (in construction)]
    • ORCA: "Cross-Modal Fine-Tuning: Align then Refine", ICML, 2023 (CMU + HP). [Paper][PyTorch]
    • ProbVLM: "ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models", ICCV, 2023 (University of Tubingen, Germany). [Paper][PyTorch]
    • TeS: "Improved Visual Fine-tuning with Natural Language Supervision", ICCV, 2023 (Alibaba). [Paper][PyTorch]
    • Aurora: "Parameter-efficient Tuning of Large-scale Multimodal Foundation Model", NeurIPS, 2023 (Peking). [Paper][PyTorch]
    • DAS: "Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models", NeurIPS, 2023 (Xiamen University). [Paper][PyTorch]
    • Paxion: "Paxion: Patching Action Knowledge in Video-Language Foundation Models", NeurIPS, 2023 (UIUC). [Paper][PyTorch]
    • m2-Mix: "Geodesic Multi-Modal Mixup for Robust Fine-Tuning", NeurIPS, 2023 (University of Seoul). [Paper][PyTorch (in construction)]
    • RLCF: "Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • LMAT: "Can Large Pre-trained Models Help Vision Models on Perception Tasks?", arXiv, 2023 (Huawei). [Paper][Website (in construction)]
    • TaCA: "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • CLIP-KD: "CLIP-KD: An Empirical Study of Distilling CLIP Models", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • AdaLink: "Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling", arXiv, 2023 (Google). [Paper]
    • LM4Visual: "Frozen Transformers in Language Models Are Effective Visual Encoder Layers", arXiv, 2023 (UIUC). [Paper][PyTorch (in construction)]
    • Octavius: "Octavius: Mitigating Task Interference in MLLMs via MoE", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • GDA: "A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation", ICLR, 2024 (CAS). [Paper][PyTorch]
    • DMN: "Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models", CVPR, 2024 (HKPolyU). [Paper][PyTorch]
    • ARF: "Anchor-based Robust Finetuning of Vision-Language Models", CVPR, 2024 (Tencent). [Paper]
    • DoRA: "DoRA: Weight-Decomposed Low-Rank Adaptation", ICML, 2024 (NVIDIA). [Paper][PyTorch][Website]
    • LLaVA-MoLE: "LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs", arXiv, 2024 (Meituan). [Paper]
    • ?: "Routers in Vision Mixture of Experts: An Empirical Study", arXiv, 2024 (DeepMind). [Paper]
    • MoE-LLaVA: "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models", arXiv, 2024 (Peking). [Paper][PyTorch]
    • DeLVM: "Data-efficient Large Vision Models through Sequential Autoregression", arXiv, 2024 (University of Sydney). [Paper][PyTorch (in construction)]
    • POVID: "Aligning Modalities in Vision Large Language Models via Preference Fine-tuning", arXiv, 2024 (UNC). [Paper][PyTorch]
    • MoAI: "MoAI: Mixture of All Intelligence for Large Language and Vision Models", arXiv, 2024 (KAIST). [Paper][PyTorch]
  • Zero-Shot:
    • SMs: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 (Google). [Paper][GitHub][Website]
    • iCLIP: "iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition", CVPR, 2023 (Microsoft). [Paper]
    • DiffDis: "DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability", ICCV, 2023 (Huawei). [Paper]
    • CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", ICCV, 2023 (UW). [Paper][PyTorch]
    • InMaP: "Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP", NeurIPS, 2023 (Alibaba). [Paper][PyTorch (in construction)]
    • DN: "Test-Time Distribution Normalization for Contrastively Learned Vision-language Models", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
    • ?: "ChatGPT-Powered Hierarchical Comparisons for Image Classification", NeurIPS, 2023 (Michigan State). [Paper][PyTorch]
    • V-GLOSS: "Visually-Grounded Descriptions Improve Zero-Shot Image Classification", arXiv, 2023 (University of Alberta, Canada). [Paper]
    • ?: "Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness", arXiv, 2023 (Amazon). [Paper]
    • UniFine: "UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding", arXiv, 2023 (Columbia). [Paper][Code (in construction)]
    • Cheetah: "Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions", arXiv, 2023 (Zhejiang). [Paper]
    • ?: "LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions", arXiv, 2023 (Beihang). [Paper]
    • ZLaP: "Label Propagation for Zero-shot Classification with Vision-Language Models", CVPR, 2024 (Czech Technical University in Prague). [Paper][PyTorch]
    • REAL: "The Neglected Tails of Vision-Language Models", arXiv, 2024 (TAMU). [Paper][Code (in construction)][Website]
  • X-Shot:
    • Tip-Adapter: "Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • ComCLIP: "ComCLIP: Training-Free Compositional Image and Text Matching", arXiv, 2022 (UC Santa Cruz). [Paper]
    • TCT: "Efficient Zero-shot Visual Search via Target and Context-aware Transformer", arXiv, 2022 (Baylor College of Medicine, TX). [Paper]
    • ?: "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", ICLR, 2023 (University of Amsterdam). [Paper]
    • ?: "Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models", CVPR, 2023 (CMU). [Paper]
    • SADA: "Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment", CVPR, 2023 (Huawei). [Paper][PyTorch]
    • LFA: "Black Box Few-Shot Adaptation for Vision-Language models", ICCV, 2023 (Samsung). [Paper]
    • Meta-Adapter: "Meta-Adapter: An Online Few-shot Learner for Vision-Language Model", NeurIPS, 2023 (Xi'an JiaoTong). [Paper]
    • ?: "Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime", arXiv, 2023 (DeepMind). [Paper]
    • Proto-CLIP: "Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning", arXiv, 2023 (UT Dallas). [Paper]
    • NtUA: "Noise-Tolerant Unsupervised Adapter for Vision-Language Models", arXiv, 2023 (MBZUAI). [Paper]
    • SeCAt: "Small Visual Language Models can also be Open-Ended Few-Shot Learners", arXiv, 2023 (UvA). [Paper]
  • Referring Image Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • CRIS: "CRIS: CLIP-Driven Referring Image Segmentation", CVPR, 2022 (University of Sydney). [Paper]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
    • ReCLIP: "ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension", ACL, 2022 (AI2). [Paper]
    • TSEG: "Weakly-supervised segmentation of referring expressions", arXiv, 2022 (INRIA). [Paper]
    • ZS-RIS: "Zero-shot Referring Image Segmentation with Global-Local Context Features", CVPR, 2023 (Gwangju Institute of Science and Technology (GIST)). [Paper][PyTorch]
    • PolyFormer: "PolyFormer: Referring Image Segmentation as Sequential Polygon Generation", CVPR, 2023 (Amazon). [Paper][Website]
    • MCRES: "Meta Compositional Referring Expression Segmentation", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
    • ReLA: "GRES: Generalized Referring Expression Segmentation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CGFormer: "Contrastive Grouping With Transformer for Referring Image Segmentation", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
    • CCTF: "Learning To Segment Every Referring Object Point by Point", CVPR, 2023 (JD). [Paper][Code (in construction)]
    • ETRIS: "Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • DMMI: "Beyond One-to-One: Rethinking the Referring Image Segmentation", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • TRIS: "Referring Image Segmentation Using Text Supervision", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • SaG: "Shatter and Gather: Learning Referring Image Segmentation with Text Supervision", ICCV, 2023 (POSTECH). [Paper][PyTorch][Website]
    • GRSer: "Advancing Referring Expression Segmentation Beyond Single Image", ICCV, 2023 (SenseTime). [Paper][Code (in construction)]
    • APE: "Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • TAS: "Text Augmented Spatial-aware Zero-shot Referring Image Segmentation", EMNLP, 2023 (Zhejiang). [Paper]
    • RIO: "RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments", NeurIPS, 2023 (Beijing Jiaotong University). [Paper][PyTorch][Website]
    • VLT: "VLT: Vision-Language Transformer and Query Generation for Referring Segmentation", TPAMI, 2023 (NTU, Singapore). [Paper]
    • IREG: "Whether you can locate or not? Interactive Referring Expression Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
    • R-RIS: "Towards Robust Referring Image Segmentation", arXiv, 2023 (Peking). [Paper][Code (in construction)][Website]
    • PVD: "Parallel Vertex Diffusion for Unified Visual Grounding", arXiv, 2023 (Peking University). [Paper]
    • MMNet: "MMNet: Multi-Mask Network for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • LGFormer: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, 2023 (Alibaba). [Paper]
    • RISCLIP: "RISCLIP: Referring Image Segmentation Framework using CLIP", arXiv, 2023 (POSTECH). [Paper]
    • EAVL: "EAVL: Explicitly Align Vision and Language for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • Ref-Diff: "Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models", arXiv, 2023 (Harbin Institute of Technology). [Paper][Code (in construction)]
    • DuMoGa: "Towards Complex-query Referring Image Segmentation: A Novel Benchmark", arXiv, 2023 (NUS). [Paper]
    • SSC: "Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation", arXiv, 2023 (Five AI, UK). [Paper]
    • Omni-RES: "Towards Omni-supervised Referring Expression Segmentation", arXiv, 2023 (Xiamen University). [Paper]
    • BTMAE: "Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation", arXiv, 2023 (Yonsei). [Paper]
    • SESAME: "See, Say, and Segment: Teaching LMMs to Overcome False Premises", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
    • MRES: "Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • MagNet: "Mask Grounding for Referring Image Segmentation", arXiv, 2023 (Tsinghua). [Paper]
    • ReMamber: "ReMamber: Referring Image Segmentation with Mamba Twister", arXiv, 2024 (SJTU). [Paper]
  • Referring Video Segmentation:
    • ReferFormer: "Language as Queries for Referring Video Object Segmentation", CVPR, 2022 (HKU). [Paper][PyTorch]
    • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
    • LBDT: "Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation", CVPR, 2022 (Meituan). [Paper][PyTorch]
    • DSA-BAS: "Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation", CVPR, 2022 (IIAI, China). [Paper]
    • MANet: "Multi-Attention Network for Compressed Video Referring Object Segmentation", ACMMM, 2022 (CAS). [Paper][PyTorch]
    • R2VOS: "Robust Referring Video Object Segmentation with Cyclic Structural Consensus", ICCV, 2023 (Microsoft). [Paper][PyTorch][Website]
    • OnlineRefer: "OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation", ICCV, 2023 (Megvii). [Paper][PyTorch]
    • SgMg: "Spectrum-guided Multi-granularity Referring Video Object Segmentation", ICCV, 2023 (The University of Western Australia). [Paper][PyTorch]
    • MeViS: "MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CMA: "Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples", ICCV, 2023 (SUSTech). [Paper][PyTorch]
    • TempCD: "Temporal Collection and Distribution for Referring Video Object Segmentation", ICCV, 2023 (ShanghaiTech). [Paper][Website]
    • UniRef: "Segment Every Reference Object in Spatial and Temporal Spaces", ICCV, 2023 (HKU). [Paper][PyTorch]
    • HTML: "HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation", ICCV, 2023 (University of Technology Sydney, UTS). [Paper][Website]
    • ?: "1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation", ICCVW, 2023 (ByteDance). [Paper][PyTorch]
    • SOC: "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", TPAMI, 2023 (Zhejiang). [Paper][PyTorch]
    • LoSh: "LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation", arXiv, 2023 (King’s College London). [Paper]
    • RefSAM: "RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation", arXiv, 2023 (National University of Defense Technology, China). [Paper][Code (in construction)]
    • IFIRVOS: "Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation", arXiv, 2023 (Wuhan University). [Paper]
    • LGCFS: "Learning Referring Video Object Segmentation from Weak Annotation", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • EPCFormer: "EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation", arXiv, 2023 (Hunan University). [Paper][Code (in construction)]
    • FTEA: "Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation", arXiv, 2023 (Hangzhou Dianzi University). [Paper]
    • UniRef++: "UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces", arXiv, 2023 (HKU). [Paper][PyTorch]
    • MUTR: "Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation", AAAI, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • DsHmp: "Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation", CVPR, 2024 (NTU, Singapore). [Paper][Code (in construction)]
    • GroPrompt: "GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation", CVPRW, 2024 (NVIDIA). [Paper][Website]
    • VD-IT: "Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation", arXiv, 2024 (University at Buffalo, NY). [Paper][Code (in construction)]
    • HTR: "Towards Temporally Consistent Referring Video Object Segmentation", arXiv, 2024 (University of Western Australia). [Paper]
    • VLP-RVOS: "Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models", arXiv, 2024 (Harbin Institute of Technology). [Paper]
  • Referring 3D Segmentation:
    • 3D-STMN: "3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
  • Narrative Grounding:
    • GELLA: "Generalizable Entity Grounding via Assistance of Large Language Model", arXiv, 2024 (UC Merced). [Paper]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
    • TransRMOT: "Referring Multi-Object Tracking", CVPR, 2023 (Megvii). [Paper][PyTorch][Website]
    • ModaMixer: "Divert More Attention to Vision-Language Object Tracking", arXiv, 2023 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
    • ?: "Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding", CVPR, 2023 (Tel Aviv). [Paper][PyTorch][Website]
    • Why-Prompt: "Doubly Right Object Recognition: A Why Prompt for Visual Rationales", CVPR, 2023 (Columbia). [Paper]
    • CREPE: "CREPE: Can Vision-Language Foundation Models Reason Compositionally?", CVPR, 2023 (Stanford). [Paper]
    • ZOOM: "Zero-shot Model Diagnosis", CVPR, 2023 (CMU). [Paper]
    • ?: "On the Generalization of Multi-modal Contrastive Learning", ICML, 2023 (Peking). [Paper][PyTorch]
    • ?: "Learning Concise and Descriptive Attributes for Visual Recognition", ICCV, 2023 (UCSD). [Paper]
    • ?: "Linear Spaces of Meanings: Compositional Structures in Vision-Language Models", ICCV, 2023 (Amazon). [Paper]
    • ?: "Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models", EMNLP, 2023 (University of Copenhagen, Denmark). [Paper][PyTorch]
    • GVIL: "Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?", EMNLP, 2023 (UMich). [Paper][GitHub][Website]
    • LICO: "LICO: Explainable Models with Language-Image Consistency", NeurIPS, 2023 (Fudan). [Paper][Code (in construction)]
    • MultiMon: "Mass-Producing Failures of Multimodal Systems with Language Models", NeurIPS, 2023 (Berkeley). [Paper][PyTorch]
    • ?: "Kiki or Bouba? Sound Symbolism in Vision-and-Language Models", NeurIPS, 2023 (Tel Aviv). [Paper][PyTorch][Website]
    • M2IB: "Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution", NeurIPS, 2023 (NYU). [Paper]
    • ?: "Interpreting CLIP's Image Representation via Text-Based Decomposition", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
    • vit-interpret: "Interpreting and Controlling Vision Foundation Models via Text Explanations", arXiv, 2023 (Columbia). [Paper][PyTorch]
    • ?: "Probing the 3D Awareness of Visual Foundation Models", CVPR, 2024 (Google). [Paper][PyTorch]
    • MMVP: "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs", arXiv, 2024 (NYU). [Paper][PyTorch][Website]
    • ?: "Exploring Perceptual Limitation of Multimodal Large Language Models", arXiv, 2024 (USC). [Paper][PyTorch]
    • DejaVu-Memorization: "Déjà Vu Memorization in Vision-Language Models", arXiv, 2024 (Meta). [Paper]
    • ?: "Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies", arXiv, 2024 (DeepMind). [Paper]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", NeurIPS, 2022 (Google). [Paper]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", ICLR, 2023 (AI2). [Paper][JAX][Website]
    • ImageBind: "ImageBind: One Embedding Space To Bind Them All", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • EgoT2: "Egocentric Video Task Translation", CVPR, 2023 (Meta). [Paper][Website]
    • VTAGML: "Vision Transformer Adapters for Generalizable Multitask Learning", ICCV, 2023 (EPFL). [Paper][Website]
    • VisionLLM: "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks", NeurIPS, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • CoCoCon: "Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models", arXiv, 2023 (AI2). [Paper][PyTorch][Website]
    • ONE-PEACE: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities", arXiv, 2023 (Alibaba). [Paper][PyTorch (in construction)]
    • VideoLLM: "VideoLLM: Modeling Video Sequence with Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • i-Code-Studio: "i-Code Studio: A Configurable and Composable Framework for Integrative AI", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • Tag2Text: "Tag2Text: Guiding Vision-Language Model via Image Tagging", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • RAM: "Recognize Anything: A Strong Image Tagging Model", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • InstructDiffusion: "InstructDiffusion: A Generalist Modeling Interface for Vision Tasks", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • SPHINX: "SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • UniLSeg: "Universal Segmentation at Arbitrary Granularity with Language Instruction", arXiv, 2023 (Tsinghua). [Paper]
    • APE: "Aligning and Prompting Everything All at Once for Universal Visual Perception", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • Alpha-CLIP: "Alpha-CLIP: A CLIP Model Focusing on Wherever You Want", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • VistaLLM: "Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model", arXiv, 2023 (Meta). [Paper][Website]
    • VCoder: "VCoder: Versatile Vision Encoders for Multimodal Large Language Models", arXiv, 2023 (Georgia Tech). [Paper][PyTorch][Website]
    • Unified-IO-2: "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action", arXiv, 2023 (AI2). [Paper][JAX][Website]
    • InstructCV: "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists", ICLR, 2024 (Peking + Berkeley). [Paper][PyTorch]
    • MAD: "Masked AutoDecoder is Effective Multi-Task Vision Generalist", CVPR, 2024 (NTU, Singapore). [Paper][Code (in construction)]
    • UniBind: "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All", CVPR, 2024 (HKUST). [Paper][PyTorch][Website]
    • GLEE: "General Object Foundation Model for Images and Videos at Scale", CVPR, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch][Website]
    • VLP: "Using Left and Right Brains Together: Towards Vision and Language Planning", arXiv, 2024 (Microsoft). [Paper]
    • V2T-Tokenizer: "Beyond Text: Frozen Large Language Models in Visual Signal Comprehension", arXiv, 2024 (Peking). [Paper][PyTorch]
    • Lumen: "Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models", arXiv, 2024 (Meituan). [Paper][Code (in construction)]
    • GiT: "GiT: Towards Generalist Vision Transformer through Universal Language Interface", arXiv, 2024 (Peking). [Paper][PyTorch]
    • GLID: "GLID: Pre-training a Generalist Encoder-Decoder Vision Model", CVPR, 2024 (CUHK). [Paper]
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
    • Video-P2P: "Video-P2P: Video Editing with Cross-attention Control", arXiv, 2023 (CUHK). [Paper][Website]
    • FateZero: "FateZero: Fusing Attentions for Zero-shot Text-based Video Editing", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Make-A-Protagonist: "Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts", arXiv, 2023 (Huawei). [Paper][PyTorch][Website]
    • RAVA: "Reframe Anything: LLM Agent for Open World Video Reframing", arXiv, 2024 (Opus Research, Minnesota). [Paper]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
    • QD-DETR: "Query-Dependent Video Representation for Moment Retrieval and Highlight Detection", CVPR, 2023 (Sungkyunkwan University, Korea). [Paper][PyTorch]
    • A2Summ: "Align and Attend: Multimodal Summarization with Dual Contrastive Losses", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CLC: "Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
    • VideoXum: "VideoXum: Cross-modal Visual and Textural Summarization of Videos", arXiv, 2023 (OPPO). [Paper][Website]
    • MH-DETR: "MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer", arXiv, 2023 (Nanjing University). [Paper]
    • VisionaryVid: "Joint Moment Retrieval and Highlight Detection Via Natural Language Queries", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
    • VIEWS: "Video Summarization: Towards Entity-Aware Captions", arXiv, 2023 (Google). [Paper]
    • TR-DETR: "TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection", AAAI, 2024 (Central China Normal University). [Paper][PyTorch]
    • TGT: "Towards Automated Movie Trailer Generation", CVPR, 2024 (KAUST). [Paper]
    • LfVS: "Scaling Up Video Summarization Pretraining with Large Language Models", CVPR, 2024 (Adobe). [Paper]
    • TaleSumm: ""Previously on ..." From Recaps to Story Summarization", CVPR, 2024 (IIIT Hyderabad, India). [Paper][Website]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
    • VLMbench: "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation", NeurIPS (Datasets and Benchmarks), 2022 (UC Santa Cruz). [Paper][PyTorch][Website]
    • Surgical-VQLA: "Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery", ICRA, 2023 (CUHK). [Paper][PyTorch]
    • ?: "Distilling Internet-Scale Vision-Language Models into Embodied Agents", ICML, 2023 (DeepMind). [Paper]
    • LIV: "LIV: Language-Image Representations and Rewards for Robotic Control", ICML, 2023 (UPenn). [Paper][PyTorch][Website]
    • PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 (Google). [Paper][Website]
    • VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • GVCCI: "GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation", IROS, 2023 (SNU, Korea). [Paper]
    • ARNOLD: "ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes", ICCV, 2023 (UCLA). [Paper][PyTorch][Website]
    • LACO: "Language-Conditioned Path Planning", CoRL, 2023 (Berkeley). [Paper][Code (in construction)][Website]
    • CROG: "Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter", CoRL, 2023 (University of Groningen, Netherlands). [Paper][PyTorch]
    • DiffVL: "DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics", NeurIPS, 2023 (UCSD). [Paper][PyTorch][Website]
    • HiP: "Compositional Foundation Models for Hierarchical Planning", NeurIPS, 2023 (MIT). [Paper][PyTorch][Website]
    • Grounded-Decoding: "Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control", arXiv, 2023 (Google). [Paper][Website]
    • MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Vision-Language Models as Success Detectors", arXiv, 2023 (DeepMind). [Paper]
    • VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", arXiv, 2023 (Meta). [Paper][Website]
    • HomeRobot: "HomeRobot: Open-Vocabulary Mobile Manipulation", arXiv, 2023 (Georgia Tech + Meta). [Paper][PyTorch][Website]
    • TaPA: "Embodied Task Planning with Large Language Models", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch][Website]
    • VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", arXiv, 2023 (Stanford). [Paper][Website]
    • RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv, 2023 (DeepMind). [Paper][Website]
    • VLP: "Video Language Planning", arXiv, 2023 (DeepMind). [Paper][Code (in construction)][Website]
    • RoboFlamingo: "Vision-Language Foundation Models as Effective Robot Imitators", arXiv, 2023 (ByteDance). [Paper][Website]
    • ?: "GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration", arXiv, 2023 (Microsoft). [Paper][Website]
    • AffordanceLLM: "AffordanceLLM: Grounding Affordance from Vision Language Models", arXiv, 2024 (Amazon). [Paper][Website]
    • MultiPLY: "MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World", arXiv, 2024 (UMass). [Paper][Code (in construction)][Website]
    • seeing-unseen: "Seeing the Unseen: Visual Common Sense for Semantic Placement", arXiv, 2024 (AI2). [Paper][Website]
    • PIVOT: "PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs", arXiv, 2024 (DeepMind). [Paper][Website]
    • VPDD: "Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning", arXiv, 2024 (Shanghai AI Lab). [Paper][Website]
    • DecisionNCE: "DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning", arXiv, 2024 (Tsinghua). [Paper][PyTorch][Website]
    • 3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model", arXiv, 2024 (UMass). [Paper][Code (in construction)][Website]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
    • CDDFuse: "CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion", CVPR, 2023 (ETHZ). [Paper][PyTorch]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
    • PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
    • VL-SAT: "VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • LERF: "LERF: Language Embedded Radiance Fields", ICCV, 2023 (Berkeley). [Paper][Website]
    • ConceptFusion: "ConceptFusion: Open-set Multimodal 3D Mapping", arXiv, 2023 (MIT). [Paper][Website]
    • CG3D: "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition", arXiv, 2023 (JHU). [Paper][PyTorch][Website]
    • DiffCLIP: "DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification", arXiv, 2023 (Beijing Institute of Technology). [Paper]
    • LLM-Grounder: "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent", arXiv, 2023 (UMich). [Paper][PyTorch][Website]
    • ShapeGPT: "ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model", arXiv, 2023 (Tencent). [Paper][Website]
    • LEGaussians: "Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding", arXiv, 2023 (Beihang). [Paper]
    • Gaussian-Grouping: "Gaussian Grouping: Segment and Edit Anything in 3D Scenes", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • LangSplat: "LangSplat: 3D Language Gaussian Splatting", arXiv, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • Open-NeRF: "Open-NeRF: Towards Open Vocabulary NeRF Decomposition", WACV, 2024 (UIUC). [Paper]
    • OpenNeRF: "OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views", ICLR, 2024 (Google). [Paper][PyTorch][Website]
    • PPT: "Parameter-efficient Prompt Learning for 3D Point Cloud Understanding", ICRA, 2024 (Renmin University of China). [Paper][PyTorch]
    • TAMM: "TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding", CVPR, 2024 (UIUC). [Paper][PyTorch][Website]
    • FMGS: "FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding", arXiv, 2024 (Google). [Paper]
    • EgoLifter: "EgoLifter: Open-world 3D Segmentation for Egocentric Perception", arXiv, 2024 (Meta). [Paper][Website]
    • Gaga: "Gaga: Group Any Gaussians via 3D-aware Memory Bank", arXiv, 2024 (UC Merced). [Paper][Code (in construction)][Website]
  • 3D Segmentation:
    • OpenScene: "OpenScene: 3D Scene Understanding with Open Vocabularies", CVPR, 2023 (Google). [Paper][PyTorch][Website]
    • PartSLIP: "PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models", CVPR, 2023 (Qualcomm). [Paper]
    • CLIP2Scene: "CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • 3D-Highlighter: "3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions", CVPR, 2023 (University of Chicago). [Paper][PyTorch][Website]
    • CLIP-FO3D: "CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP", ICCVW, 2023 (Tsinghua University). [Paper]
    • OpenSUN3D: "OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding", ICCVW, 2023 (ETHZ). [Paper][Website]
    • OVSG: "Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs", CoRL, 2023 (Rutgers). [Paper][PyTorch]
    • OVIR-3D: "OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data", CoRL, 2023 (Rutgers). [Paper][PyTorch]
    • 3D-OVS: "Weakly Supervised 3D Open-vocabulary Segmentation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
    • OpenMask3D: "OpenMask3D: Open-Vocabulary 3D Instance Segmentation", NeurIPS, 2023 (ETHZ). [Paper][PyTorch][Website]
    • Seal: "Segment Any Point Cloud Sequences by Distilling Vision Foundation Models", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
    • POP-3D: "POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images", NeurIPS, 2023 (valeo.ai, France). [Paper][PyTorch][Website]
    • OVO: "OVO: Open-Vocabulary Occupancy", arXiv, 2023 (Fudan). [Paper]
    • SAM3D: "SAM3D: Segment Anything in 3D Scenes", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Lowis3D: "Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding", arXiv, 2023 (HKU). [Paper]
    • OpenIns3D: "OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation", arXiv, 2023 (Cambridge). [Paper][Website]
    • ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", arXiv, 2023 (University of Toronto + Universite de Montreal). [Paper][PyTorch][Website]
    • OmniSeg3D: "OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning", arXiv, 2023 (Tsinghua). [Paper][Website]
    • SAMPro3D: "SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • LL3DA: "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning", arXiv, 2023 (Fudan). [Paper][Code (in construction)][Website]
    • PartSLIP++: "PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation", arXiv, 2023 (UCSD). [Paper][PyTorch]
    • Uni3DL: "Uni3DL: Unified Model for 3D and Language Understanding", arXiv, 2023 (KAUST). [Paper][Website]
    • Open3DIS: "Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance", arXiv, 2023 (VinAI). [Paper][Website]
    • Segment3D: "Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels", arXiv, 2023 (Tsinghua). [Paper][Website]
    • PartDistill: "PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation", CVPR, 2024 (NYCU). [Paper]
    • ?: "3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation", arXiv, 2024 (Waymo). [Paper]
    • PartSTAD: "PartSTAD: 2D-to-3D Part Segmentation Task Adaptation", arXiv, 2024 (KAIST). [Paper]
    • MaskClustering: "MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation", arXiv, 2024 (Peking). [Paper][Website]
    • GARField: "GARField: Group Anything with Radiance Fields", arXiv, 2024 (Berkeley). [Paper][PyTorch][Website]
    • SA-GS: "Segment Anything in 3D Gaussians", arXiv, 2024 (The Hong Kong Polytechnic University). [Paper]
    • OV-NeRF: "OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding", arXiv, 2024 (Peking). [Paper]
    • Open3DSG: "Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships", arXiv, 2024 (Bosch). [Paper][Website]
    • PointSeg: "PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models", arXiv, 2024 (Tencent). [Paper]
    • SOLE: "Segment Any 3D Object with Language", arXiv, 2024 (NUS). [Paper][Code (in construction)][Website]
    • PARIS3D: "PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model", arXiv, 2024 (MBZUAI). [Paper][PyTorch]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
    • AVFormer: "AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR", CVPR, 2023 (Google). [Paper]
    • AV-RelScore: "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring", CVPR, 2023 (KAIST). [Paper][PyTorch]
    • SynthVSR: "SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision", CVPR, 2023 (Meta). [Paper]
    • Lip2Vec: "Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping", ICCV, 2023 (Technology Innovation Institute (TII), UAE). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
    • DMD: "Decoupled Multimodal Distilling for Emotion Recognition", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • Sound Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
    • iQuery: "iQuery: Instruments as Queries for Audio-Visual Sound Separation", CVPR, 2023 (UCSD). [Paper][Code (in construction)]
    • VAST: "Language-Guided Audio-Visual Source Separation via Trimodal Consistency", CVPR, 2023 (Boston University). [Paper][Website]
    • AVIN: "Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization", ACMMM, 2023 (Northwestern Polytechnical University). [Paper][Code (in construction)]
    • GAVS: "Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer", arXiv, 2023 (Renmin University of China). [Paper]
  • Audio-Visual:
    • AV-HuBERT: "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction", ICLR, 2022 (Meta). [Paper][PyTorch]
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
    • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
    • ANGIE: "Audio-Driven Co-Speech Gesture Video Generation", NeurIPS, 2022 (CUHK). [Paper][Website]
    • MGN: "Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing", NeurIPS, 2022 (CMU + UT Austin). [Paper][PyTorch]
    • FS-RIR: "Few-Shot Audio-Visual Learning of Environment Acoustics", NeurIPS, 2022 (UT Austin). [Paper][Website]
    • u-HuBERT: "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality", NeurIPS, 2022 (Meta). [Paper]
    • PC-VAE: "Multimodal Transformer for Parallel Concatenated Variational Autoencoders", NeurIPSW, 2022 (USC). [Paper]
    • AV-CAT: "Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers", SIGGRAPH Asia, 2022 (Tokyo Institute of Technology + Baidu). [Paper][Website]
    • MTD: "Multimodal Transformer Distillation for Audio-Visual Synchronization", arXiv, 2022 (NTU). [Paper]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
    • CLIPSep: "CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos", ICLR, 2023 (Sony). [Paper]
    • CAV-MAE: "Contrastive Audio-Visual Masked Autoencoder", ICLR, 2023 (MIT + IBM). [Paper]
    • UnAV: "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline", CVPR, 2023 (Southern University of Science and Technology). [Paper][PyTorch][Website]
    • LAVISH: "Vision Transformers are Parameter-Efficient Audio-Visual Learners", CVPR, 2023 (UNC). [Paper][PyTorch][Website]
    • OneAVM: "A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition", ICML, 2023 (CMU + UW Madison). [Paper][Code (in construction)]
    • AdVerb: "AdVerb: Visually Guided Audio Dereverberation", ICCV, 2023 (Maryland). [Paper][Website]
    • CIGN: "Class-Incremental Grouping Network for Continual Audio-Visual Learning", ICCV, 2023 (UT Dallas). [Paper][PyTorch]
    • AV-CIL: "Audio-Visual Class-Incremental Learning", ICCV, 2023 (UT Dallas). [Paper][PyTorch]
    • Audiovisual-MAE: "Audiovisual Masked Autoencoders", ICCV, 2023 (Google). [Paper]
    • MAViL: "MAViL: Masked Audio-Video Learners", NeurIPS, 2023 (Meta). [Paper][Code (in construction)]
    • LSLD: "Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective", NeurIPS, 2023 (Wuhan University). [Paper][PyTorch]
    • DG-SCT: "Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks", NeurIPS, 2023 (Zhejiang). [Paper][PyTorch]
    • VALOR: "Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser", NeurIPS, 2023 (NTU). [Paper][PyTorch]
    • GestureDiffuCLIP: "GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents", arXiv, 2023 (Peking University). [Paper]
    • MMViT: "MMViT: Multiscale Multiview Vision Transformers", arXiv, 2023 (Meta). [Paper]
    • ?: "Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos" arXiv, 2023 (Meta). [Paper]
    • AVSiam: "Siamese Vision Transformers are Scalable Audio-visual Learners", arXiv, 2024 (UNC). [Paper]
  • Audio-Visual Localization/Segmentation:
    • AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
    • ECMVAE: "Multimodal Variational Auto-encoder based Audio-Visual Segmentation", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • WS-AVS: "Weakly-Supervised Audio-Visual Segmentation", NeurIPS, 2023 (CMU + MBZUAI). [Paper]
    • AV-SAM: "AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation", arXiv, 2023 (CMU + UT Dallas). [Paper]
    • AUSS: "Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation", arXiv, 2023 (Fudan). [Paper]
    • AuTR: "Annotation-free Audio-Visual Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • AVSegFormer: "AVSegFormer: Audio-Visual Segmentation with Transformer", arXiv, 2023 (Nanjing University). [Paper][PyTorch]
    • SQD: "Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition", arXiv, 2023 (CMU). [Paper]
    • DiffMAViL: "Diffusion Models as Masked Audio-Video Learners", arXiv, 2023 (Apple). [Paper]
    • AVIS: "Audio-Visual Instance Segmentation", arXiv, 2023 (Peking). [Paper]
    • COMBO: "Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation", arXiv, 2023 (CAS). [Paper][Code (in construction)][Website]
    • UFE: "Audio-Visual Segmentation via Unlabeled Frame Exploitation", CVPR, 2024 (SJTU). [Paper]
  • Audio Description:
    • AutoAD: "AutoAD: Movie Description in Context", CVPR, 2023 (Oxford). [Paper][PyTorch][Website]
    • AutoAD-II: "AutoAD II: The Sequel - Who, When, and What in Movie Audio Description", ICCV, 2023 (Oxford). [Paper][PyTorch][Website]
    • AutoAD-III: "AutoAD III: The Prequel -- Back to the Pixels", CVPR, 2024 (Oxford). [Paper][PyTorch][Website]
  • Sound Localization:
    • TURN: "Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch (in construction)]
    • AVGN: "Audio-Visual Grouping Network for Sound Localization from Mixtures", CVPR, 2023 (CMU). [Paper][PyTorch]
    • ?: "Sound Source Localization is All about Cross-Modal Alignment", ICCV, 2023 (KAIST). [Paper]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
    • OVEN: "Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities", ICCV, 2023 (Google). [Paper][PyTorch][Website]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]
  • Object Captioning:
    • GRiT: "GRiT: A Generative Region-to-text Transformer for Object Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Conversation:
    • VisProg: "Visual Programming: Compositional visual reasoning without training", CVPR, 2023 (AI2). [Paper][PyTorch][Website]
    • ViperGPT: "ViperGPT: Visual Inference via Python Execution for Reasoning", ICCV, 2023 (Columbia). [Paper][PyTorch][Website]
    • LaVIN: "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models", NeurIPS, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • LLaVA: "Visual Instruction Tuning", NeurIPS, 2023 (UW-Madison). [Paper][PyTorch][Website]
    • LAMM: "LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark", NeurIPS (Datasets and Benchmarks), 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", NeurIPS, 2023 (HKU). [Paper][PyTorch (in construction)][Website]
    • InstructBLIP: "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", NeurIPS, 2023 (Salesforce). [Paper][PyTorch]
    • AmadeusGPT: "AmadeusGPT: a natural language interface for interactive animal behavioral analysis", NeurIPS, 2023 (EPFL). [Paper][PyTorch][Website]
    • Visual-ChatGPT: "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models", arXiv, 2023 (Microsoft). [Paper]
    • MM-REACT: "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action", arXiv, 2023 (Microsoft). [Paper][Code][Website]
    • Chameleon: "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models", arXiv, 2023 (UCLA + Microsoft). [Paper][PyTorch][Website]
    • MiniGPT-4: "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv, 2023 (KAUST). [Paper][PyTorch][Website]
    • LLaMA-Adapter: "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LLaMA-Adapter-V2: "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Otter: "Otter: A Multi-Modal Model with In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • LMEye: "LMEye: An Interactive Perception Network for Large Language Models", arXiv, 2023 (Meituan). [Paper]
    • MultiModal-GPT: "MultiModal-GPT: A Vision and Language Model for Dialogue with Humans", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • InternChat: "InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ArtGPT-4: "ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4", arXiv, 2023 (Anhui Polytechnic University). [Paper][PyTorch]
    • PandaGPT: "PandaGPT: One Model To Instruction-Follow Them All", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • MIMIC-IT: "MIMIC-IT: Multi-Modal In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • ?: "Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models", arXiv, 2023 (Huawei). [Paper]
    • AssistGPT: "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • Macaw-LLM: "Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • Shikra: "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic", arXiv, 2023 (SenseTime). [Paper][Code (in construction)]
    • LLaVAR: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding", arXiv, 2023 (Stanford). [Paper][PyTorch][Website]
    • Polite-Flamingo: "Visual Instruction Tuning with Polite Flamingo", arXiv, 2023 (Xiaobing.AI). [Paper]
    • Lynx: "What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?", arXiv, 2023 (ByteDance). [Paper][Website]
    • GPT4RoI: "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SVIT: "SVIT: Scaling up Visual Instruction Tuning", arXiv, 2023 (BAAI). [Paper]
    • ChatSpot: "ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning", arXiv, 2023 (Megvii). [Paper][Demo]
    • ?: "How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges", arXiv, 2023 (ETHZ). [Paper][GitHub (in construction)]
    • ?: "Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models", arXiv, 2023 (Google). [Paper]
    • MM-Vet: "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities", arXiv, 2023 (Microsoft). [Paper][Code]
    • StableLLaVA: "StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • PVIT: "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
    • PointLLM: "PointLLM: Empowering Large Language Models to Understand Point Clouds", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • Point-Bind: "Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • ImageBind-LLM: "ImageBind-LLM: Multi-modality Instruction Tuning", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • ?: "An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models", arXiv, 2023 (Microsoft). [Paper][GitHub]
    • InternLM-XComposer: "InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LLaVA-RLHF: "Aligning Large Multimodal Models with Factually Augmented RLHF", arXiv, 2023 (Berkeley + CMU + UIUC). [Paper][Code (in construction)][Website]
    • Muffin: "Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants", arXiv, 2023 (Tsinghua). [Paper]
    • Pink: "Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs", arXiv, 2023 (Ant). [Paper][Code (in construction)]
    • LLaVA-1.5: "Improved Baselines with Visual Instruction Tuning", arXiv, 2023 (UW Madison). [Paper][PyTorch][Website]
    • MiniGPT-5: "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
    • MiniGPT-v2: "MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning", arXiv, 2023 (Meta). [Paper][PyTorch][Website]
    • Woodpecker: "Woodpecker: Hallucination Correction for Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • LLaVA-Interactive: "LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • NExT-Chat: "NExT-Chat: An LMM for Chat, Detection and Segmentation", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • mPLUG-Owl: "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • mPLUG-Owl2: "mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • LLaVA-Plus: "LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • u-LLaVA: "u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model", arXiv, 2023 (OPPO). [Paper]
    • LVIS-Instruct4V: "To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning", arXiv, 2023 (Fudan). [Paper][GitHub]
    • InfMLLM: "InfMLLM: A Unified Framework for Visual-Language Tasks", arXiv, 2023 (?). [Paper][PyTorch]
    • Q-Instruct: "Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][Website]
    • DRESS: "DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback", arXiv, 2023 (SRI). [Paper][Dataset]
    • LION: "LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge", arXiv, 2023 (Harbin Institute of Technology). [Paper][Code (in construction)][Website]
    • VCD: "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • OPERA: "OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • CG-VLM: "Contrastive Vision-Language Alignment Makes Efficient Instruction Learner", arXiv, 2023 (South China University of Technology). [Paper][Code (in construction)]
    • X-InstructBLIP: "X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning", arXiv, 2023 (Salesforce). [Paper]
    • ViP-LLaVA: "Making Large Multimodal Models Understand Arbitrary Visual Prompts", arXiv, 2023 (Cruise). [Paper][PyTorch][Website]
    • Prompt-Highlighter: "Prompt Highlighter: Interactive Control for Multi-Modal LLMs", arXiv, 2023 (CUHK). [Paper][PyTorch][Website]
    • Honeybee: "Honeybee: Locality-enhanced Projector for Multimodal LLM", arXiv, 2023 (Kakao). [Paper][Code (in construction)]
    • Osprey: "Osprey: Pixel Understanding with Visual Instruction Tuning", arXiv, 2023 (Zhejiang). [Paper][PyTorch]
    • Gemini: "Gemini: A Family of Highly Capable Multimodal Models", arXiv, 2023 (Google). [Paper]
    • V*: "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs", arXiv, 2023 (NYU). [Paper][PyTorch][Website]
    • ?: "Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases", arXiv, 2023 (Shanghai AI Lab). [Paper][GitHub]
    • TinyGPT-V: "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones", arXiv, 2023 (Anhui Polytechnic University). [Paper][PyTorch]
    • Ferret: "Ferret: Refer and Ground Anything Anywhere at Any Granularity", ICLR, 2024 (Apple). [Paper][PyTorch]
    • SNIFFER: "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection", CVPR, 2024 (NUS). [Paper][Code (in construction)][Website]
    • ChartAssistant: "ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning", arXiv, 2024 (Shanghai AI Lab). [Paper]
    • LLaVA-ϕ: "LLaVA-ϕ: Efficient Multi-Modal Assistant with Small Language Model", arXiv, 2024 (Midea Group, China). [Paper][Code (in construction)]
    • CaMML: "CaMML: Context-Aware Multimodal Learner for Large Models", arXiv, 2024 (Amazon). [Paper]
    • InternLM-XComposer2: "InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • MARINE: "Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance", arXiv, 2024 (UCLA). [Paper]
    • Prismatic-VLM: "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models", arXiv, 2024 (Toyota). [Paper][PyTorch]
    • SPHINX-X: "SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • ChartVLM: "ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • Vision-Flan: "Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning", arXiv, 2024 (Virginia Tech). [Paper]
    • CoLLaVO: "CoLLaVO: Crayon Large Language and Vision mOdel", arXiv, 2024 (KAIST). [Paper]
    • LLaVA-HR: "Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models", arXiv, 2024 (Xiamen University). [Paper][PyTorch]
    • InfiMM-HD: "InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding", arXiv, 2024 (CAS). [Paper][PyTorch]
    • DeepSeek-VL: "DeepSeek-VL: Towards Real-World Vision-Language Understanding", arXiv, 2024 (DeepSeek, China). [Paper][PyTorch]
    • Gemini-1.5: "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", arXiv, 2024 (Google). [Paper]
    • FastV: "An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models", arXiv, 2024 (Peking). [Paper][PyTorch (in construction)]
    • MM1: "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training", arXiv, 2024 (Apple). [Paper]
    • LLaVA-UHD: "LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images", arXiv, 2024 (Tsinghua). [Paper][PyTorch]
    • LLaVA-PruMerge: "LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models", arXiv, 2024 (UW-Madison). [Paper][Code (in construction)][Website]
    • VT: "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models", arXiv, 2024 (CUHK). [Paper][Code (in construction)]
    • Mini-Gemini: "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models", arXiv, 2024 (CUHK). [Paper][PyTorch]
    • Gemma: "Gemma: Open Models Based on Gemini Research and Technology", arXiv, 2024 (Google). [Paper]
    • SQ-LLaVA: "SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant", arXiv, 2024 (Salesforce). [Paper][PyTorch]
    • Cobra: "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference", arXiv, 2024 (Westlake University, China). [Paper][PyTorch][Website]
    • TinyLLaVA: "TinyLLaVA: A Framework of Small-scale Large Multimodal Models", arXiv, 2024 (Beihang University). [Paper][PyTorch]
    • MathVerse: "MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • GroundingGPT: "GroundingGPT: Language Enhanced Multi-modal Grounding Model", arXiv, 2024 (ByteDance). [Paper][PyTorch][Website]
    • InternLM-XComposer2-4KHD: "InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
    • Ferret-UI: "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs", arXiv, 2024 (Apple). [Paper]
    • Ferret-v2: "Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models", arXiv, 2024 (Apple). [Paper]
    • LocVLM: "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs", arXiv, 2024 (Meta). [Paper]
    • Reka: "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models", arXiv, 2024 (Reka.ai). [Paper][Website]
    • Cantor: "Cantor: Inspiring Multimodal Chain-of-Thought of MLLM", arXiv, 2024 (Tencent). [Paper][Code (in construction)][Website]
    • CuMo: "CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts", arXiv, 2024 (ByteDance). [Paper][PyTorch]
    • Chameleon: "Chameleon: Mixed-Modal Early-Fusion Foundation Models", arXiv, 2024 (Meta). [Paper]
    • VoCo-LLaMA: "VoCo-LLaMA: Towards Vision Compression with Large Language Models", arXiv, 2024 (Tencent). [Paper]
  • Conversation (Video):
    • Video-ChatCaptioner: "Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions", arXiv, 2023 (KAUST). [Paper][PyTorch]
    • ChatVideo: "ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System", arXiv, 2023 (Fudan). [Paper][Website]
    • VideoChat: "VideoChat: Chat-Centric Video Understanding", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Video-LLaMA: "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • Video-ChatGPT: "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
    • AntGPT: "AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?", arXiv, 2023 (Brown). [Paper][Website]
    • Video-LLaVA: "Video-LLaVA: Learning United Visual Representation by Alignment Before Projection", arXiv, 2023 (Peking). [Paper][PyTorch]
    • PG-Video-LLaVA: "PG-Video-LLaVA: Pixel Grounding Large Video-Language Models", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)][Website]
    • MVBench: "MVBench: A Comprehensive Multi-modal Video Understanding Benchmark", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Valley: "Valley: Video Assistant with Large Language model Enhanced abilitY", arXiv, 2023 (ByteDance). [Paper]
    • GPT4Video: "GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation", arXiv, 2023 (Tencent). [Paper][Code (in construction)][Website]
    • Merlin: "Merlin: Empowering Multimodal LLMs with Foresight Minds", arXiv, 2023 (Huazhong University of Science and Technology). [Paper]
    • TimeChat: "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding", arXiv, 2023 (Peking). [Paper][Code (in construction)]
    • VaQuitA: "VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding", arXiv, 2023 (Adobe). [Paper]
    • ?: "Audio-Visual LLM for Video Understanding", arXiv, 2023 (Alibaba). [Paper]
    • MovieChat: "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding", CVPR, 2024 (Zhejiang University). [Paper][PyTorch][Website]
    • Mementos: "Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences", arXiv, 2024 (Maryland). [Paper][GitHub]
    • LVChat: "LVCHAT: Facilitating Long Video Comprehension", arXiv, 2024 (UCSD). [Paper][PyTorch]
    • Momentor: "Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning", arXiv, 2024 (Zhejiang University). [Paper][Code (in construction)]
    • IVA: "LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs", arXiv, 2024 (Harbin Institute of Technology). [Paper]
    • VLM-RLAIF: "Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback", arXiv, 2024 (Yonsei). [Paper]
    • Video-LaVIT: "Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization", arXiv, 2024 (Kuaishou). [Paper][PyTorch][Website]
    • MovieLLM: "MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies", arXiv, 2024 (Tencent). [Paper][PyTorch][Website]
    • VURF: "VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding", arXiv, 2024 (MBZUAI). [Paper]
    • VideoAgent: "VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding", arXiv, 2024 (BIGAI). [Paper][Website]
    • MiniGPT4-Video: "MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens", arXiv, 2024 (KAUST). [Paper][PyTorch][Website]
    • Elysium: "Elysium: Exploring Object-level Perception in Videos via MLLM", arXiv, 2024 (ByteDance). [Paper][Code (in construction)]
    • RED-VILLM: "From Image to Video, what do we need in multimodal LLMs?", arXiv, 2024 (Xiaohongshu). [Paper]
    • Pegasus-1: "Pegasus-1 Technical Report", arXiv, 2024 (Twelve Labs, CA). [Paper][Blog]
    • MovieChat+: "MovieChat+: Question-aware Sparse Memory for Long Video Question Answering", arXiv, 2024 (Zhejiang). [Paper][PyTorch][Website]
    • PLLaVA: "PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning", arXiv, 2024 (ByteDance). [Paper][PyTorch]
    • CVRR-ES: "Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs", arXiv, 2024 (MBZUAI). [Paper][Website][PyTorch]
    • FreeVA: "FreeVA: Offline MLLM as Training-Free Video Assistant", arXiv, 2024 (The University of Sydney). [Paper][PyTorch]
    • ?: "Streaming Long Video Understanding with Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper]
  • Conversation (3D):
    • 3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", NeurIPS, 2023 (UCLA). [Paper][PyTorch][Website]
    • Chat-3D: "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • Chat-3D-v2: "Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • LiDAR-LLM: "LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding", arXiv, 2023 (Peking). [Paper][Code (in construction)][Website]
    • GPT4Point: "GPT4Point: A Unified Framework for Point-Language Understanding and Generation", CVPR, 2024 (HKU). [Paper][PyTorch][Website]
    • Uni3D-LLM: "Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper]
    • ShapeLLM: "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction", arXiv, 2024 (Megvii). [Paper][Code (in construction)][Website]
    • Agent3D-Zero: "Agent3D-Zero: An Agent for Zero-shot 3D Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][Website]
    • ?: "Can 3D Vision-Language Models Truly Understand Natural Language?", arXiv, 2024 (HKU). [Paper][Code (in construction)]
    • Scene-LLM: "Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning", arXiv, 2024 (Meta). [Paper]
    • Uni3DR2: "Unified Scene Representation and Reconstruction for 3D Large Language Models", arXiv, 2024 (Shanghai AI Lab). [Paper]
    • MiniGPT-3D: "MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
    • Cube-LLM: "Language-Image Models with 3D Understanding", arXiv, 2024 (NVIDIA). [Paper][Website]
    • Grounded-3D-LLM: "Grounded 3D-LLM with Referent Tokens", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
  • Conversation (Multi):
    • AnyMAL: "AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model", arXiv, 2023 (Meta). [Paper]
    • OneLLM: "OneLLM: One Framework to Align All Modalities with Language", arXiv, 2023 (CUHK + Shanghai AI Lab). [Paper][PyTorch]
    • CREMA: "CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion", arXiv, 2024 (UNC). [Paper][PyTorch][Website]
    • AnyGPT: "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling", arXiv, 2024 (Fudan). [Paper][Code (in construction)][Website]
  • Visual Reasoning:
    • BDC-Adapter: "BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning", BMVC, 2023 (SUSTech). [Paper]
    • LSKD: "Localized Symbolic Knowledge Distillation for Visual Commonsense Models", NeurIPS, 2023 (UW). [Paper][Code (in construction)]
    • RPT: "Fine-Grained Regional Prompt Tuning for Visual Abductive Reasoning", arXiv, 2023 (A*STAR). [Paper]
    • LRR: "Look, Remember and Reason: Visual Reasoning with Grounded Rationales", arXiv, 2023 (Qualcomm). [Paper]
    • SDS-CLIP: "Augmenting CLIP with Improved Visio-Linguistic Reasoning", arXiv, 2023 (Maryland). [Paper]
    • ?: "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models", arXiv, 2023 (George Mason University). [Paper]
    • ViCor: "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models", arXiv, 2023 (UC Santa Cruz). [Paper]
    • GENOME: "GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs", arXiv, 2023 (IBM). [Paper][PyTorch][Website]
    • ?: "How Far Are We from Intelligent Visual Deductive Reasoning?", ICLRW, 2024 (Apple). [Paper]
    • SpatialVLM: "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities", arXiv, 2024 (DeepMind). [Paper][Website]
  • Tracking:
    • JointNLT: "Joint Visual Grounding and Tracking with Natural Language Specification", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • MMTrack: "Towards Unified Token Learning for Vision-Language Tracking", arXiv, 2023 (Guangxi Normal University). [Paper]
  • Scene Graph:
    • CaCao: "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World", ICCV, 2023 (Zhejiang University). [Paper]
  • Egocentric Video:
    • MMG-Ego4D: "MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition", CVPR, 2023 (Meta). [Paper]
    • EgoTV: "EgoTV: Egocentric Task Verification from Natural Language Task Descriptions", ICCV, 2023 (Meta). [Paper]
  • Dance Generation:
  • Conceptual Understanding:
    • ?: "Text-To-Concept (and Back) via Cross-Model Alignment", ICML, 2023 (Maryland). [Paper]
    • EAC: "Explain Any Concept: Segment Anything Meets Concept-Based Explanation", NeurIPS, 2023 (HKUST). [Paper]
    • ?: "Probing Conceptual Understanding of Large Visual-Language Models", arXiv, 2023 (UCF + SRI). [Paper]
  • Model Merging:
    • VL-merging: "An Empirical Study of Multimodal Model Merging", arXiv, 2023 (Microsoft). [Paper][PyTorch]
  • Visual Word Sense Disambiguation (VWSD):
    • CADG: "Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information", ACL, 2023 (UMass). [Paper]
  • Object Hallucination:
    • POPE: "Evaluating Object Hallucination in Large Vision-Language Models", arXiv, 2023 (Renmin University of China). [Paper][Code (in construction)]
  • Social Interaction:
    • HIINT: "HIINT: Historical, Intra- and Inter-personal Dynamics Modeling with Cross-person Memory Transformer", arXiv, 2023 (MIT). [Paper]
  • Evaluation:
    • HEIM: "Holistic Evaluation of Text-to-Image Models", NeurIPS (Datasets and Benchmarks), 2023 (Stanford). [Paper][PyTorch][Website]
    • VisIT-Bench: "VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use", NeurIPS (Datasets and Benchmarks), 2023 (UW). [Paper][PyTorch][Website]
    • Perception-Test: "Perception Test: A Diagnostic Benchmark for Multimodal Video Models", NeurIPS (Datasets and Benchmarks), 2023 (DeepMind). [Paper][GitHub]
    • VLM-Probing: "Scalable Performance Analysis for Vision-Language Models", Joint Conference on Lexical and Computational Semantics (*SEM), 2023 (UMich). [Paper][PyTorch]
    • VisualGPTScore: "VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • LVLM-eHub: "LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
    • VisoGender: "VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution", arXiv, 2023 (Oxford). [Paper][PyTorch]
    • MME: "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • MMBench: "MMBench: Is Your Multi-modal Model an All-around Player?", arXiv, 2023 (Shanghai AI Lab). [Paper][Website]
    • Tiny-LVLM-eHub: "Tiny LVLM-eHub: Early Multimodal Experiments with Bard", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • MODE: "An Examination of the Compositionality of Large Generative Vision-Language Models", arXiv, 2023 (HKUST). [Paper]
    • TouchStone: "TouchStone: Evaluating Vision-Language Models by Language Models", arXiv, 2023 (Alibaba). [Paper]
    • Q-Bench: "Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision", arXiv, 2023 (NTU, Singapore). [Paper]
    • PCA-EVAL: "Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond", arXiv, 2023 (Peking). [Paper][Code (in construction)]
    • ReForm-Eval: "ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks", arXiv, 2023 (Fudan). [Paper]
    • ?: "Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models", arXiv, 2023 (Zhejiang). [Paper][Code (in construction)]
    • HallusionBench: "HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models", arXiv, 2023 (Maryland). [Paper][Code (in construction)]
    • ?: "GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks", arXiv, 2023 (UCSB). [Paper]
    • ChEF: "ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ViLMA: "ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models", arXiv, 2023 (Koç University, Turkey). [Paper][Website]
    • VLM-Eval: "VLM-Eval: A General Evaluation on Video Large Language Models", arXiv, 2023 (Megvii). [Paper]
    • Auto-Bench: "Large Language Models as Automated Aligners for benchmarking Vision-Language Models", arXiv, 2023 (HKU). [Paper][Website]
    • AutoEval-Video: "AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering", arXiv, 2023 (ByteDance). [Paper][PyTorch]
    • Video-Bench: "Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models", arXiv, 2023 (Peking). [Paper][PyTorch]
    • ?: "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs", arXiv, 2023 (UC Santa Cruz + UNC). [Paper][PyTorch]
    • VITATECS: "VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models", arXiv, 2023 (Peking). [Paper][PyTorch]
    • SEED-Bench-2: "SEED-Bench-2: Benchmarking Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • VBench: "VBench: Comprehensive Benchmark Suite for Video Generative Models", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • MERLIM: "Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models", arXiv, 2023 (KAUST). [Paper]
    • BenchLMM: "BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • M3DBench: "M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts", arXiv, 2023 (Fudan). [Paper][Code (in construction)][Website]
    • THRONE: "THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models", CVPR, 2024 (Amazon). [Paper]
    • MM-SAP: "MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)]
    • MLLM-as-a-Judge: "MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark", arXiv, 2024 (Lehigh University, Pennsylvania). [Paper][GitHub]
    • TempCompass: "TempCompass: Do Video LLMs Really Understand Videos?", arXiv, 2024 (Peking). [Paper][Code (in construction)]
    • Ch3Ef: "Assessment of Multimodal Large Language Models in Alignment with Human Values", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • UPD: "Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models", arXiv, 2024 (University of Tokyo). [Paper][PyTorch]
    • MMStar: "Are We on the Right Way for Evaluating Large Vision-Language Models?", arXiv, 2024 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • BLINK: "BLINK: Multimodal Large Language Models Can See but Not Perceive", arXiv, 2024 (AI2). [Paper][Code][Website]
    • MMT-Bench: "MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI", arXiv, 2024 (Shanghai AI Lab). [Paper]
    • MileBench: "MileBench: Benchmarking MLLMs in Long Context", arXiv, 2024 (CUHK). [Paper][PyTorch][Website]
    • Vibe-Eval: "Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models", arXiv, 2024 (Reka). [Paper][Code]
  • Robustness:
    • Hierarchy-CLIP: "Improving Zero-shot Generalization and Robustness of Multi-modal Models", CVPR, 2023 (Google). [Paper][JAX][Website]
    • ?: "Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning", ICML, 2023 (UCLA). [Paper]
    • SGA: "Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models", ICCV, 2023 (Southern University of Science and Technology). [Paper]
    • VLAttack: "VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models", NeurIPS, 2023 (Pennsylvania State University). [Paper]
    • DAD: "Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models", NeurIPS, 2023 (UIUC). [Paper]
    • AttackVLM: "On Evaluating Adversarial Robustness of Large Vision-Language Models", NeurIPS, 2023 (Singapore University of Technology and Design (SUTD)). [Paper][PyTorch]
    • RoCLIP: "Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks", NeurIPS, 2023 (UCLA). [Paper][PyTorch]
    • ?: "Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models", NeurIPS (Datasets and Benchmarks), 2023 (LMU Munich). [Paper][PyTorch][Website]
    • OGEN: "Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization", ICLR, 2024 (Apple). [Paper]
    • CroPA: "An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models", ICLR, 2024 (Oxford). [Paper][PyTorch]
    • APT: "One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models", CVPR, 2024 (King's College London). [Paper][PyTorch]
    • Robust-CLIP: "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models", arXiv, 2024 (University of Tübingen, Germany). [Paper][PyTorch]
    • AVIBench: "AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions", arXiv, 2024 (Shanghai AI Lab). [Paper]
  • Compositional Reasoning:
    • SugarCrepe: "SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality", NeurIPS, 2023 (AI2). [Paper][PyTorch]
    • DAC: "Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models", NeurIPS, 2023 (IBM). [Paper]
    • CoVLM: "CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding", arXiv, 2023 (UMass). [Paper][PyTorch][Website]
  • Vocabulary-free:
    • CaSED: "Vocabulary-free Image Classification", NeurIPS, 2023 (University of Trento, Italy). [Paper][PyTorch]
    • CaSED: "Vocabulary-free Image Classification and Semantic Segmentation", arXiv, 2024 (University of Trento, Italy). [Paper][PyTorch]
  • Retrieval-Augmented Methods:
    • ?: "Improving Image Recognition by Retrieving from Web-Scale Image-Text Data", CVPR, 2023 (Google). [Paper]
  • NeRF:
    • NeRDi: "NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors", CVPR, 2023 (Waymo). [Paper]
  • Model Selection:
    • LOVM: "LOVM: Language-Only Vision Model Selection", NeurIPS, 2023 (Stanford). [Paper]
    • EMMS: "Foundation Model is Efficient Multimodal Multitask Model Selector", NeurIPS, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • Multimodal Interaction:
    • ?: "Learning Unseen Modality Interaction", arXiv, 2023 (University of Amsterdam). [Paper]
  • Multimodal Translation:
    • CLIPTrans: "CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation", ICCV, 2023 (Boston College). [Paper][PyTorch]
  • Noisy Label Detection:
    • VDC: "VDC: Versatile Data Cleanser for Detecting Dirty Samples via Visual-Linguistic Inconsistency", arXiv, 2023 (CUHK). [Paper]
  • Model Compression:
    • ECoFLaP: "ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models", arXiv, 2023 (UNC). [Paper][PyTorch][Website]
    • MoPE-CLIP: "MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric", CVPR, 2024 (CAS). [Paper]
    • MULTIFLOW: "MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning", CVPR, 2024 (University of Trento). [Paper][PyTorch]
  • Relation Extraction:
  • Applications:
    • MM-Navigator: "GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
    • ?: "GPT-4V(ision) as A Social Media Analysis Engine", arXiv, 2023 (University of Rochester). [Paper][GitHub]
  • X-Supervised:
    • CAPro: "CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes", NeurIPS, 2023 (Tencent). [Paper][PyTorch]
    • MetaMAE: "Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder", NeurIPS, 2023 (KAIST). [Paper][PyTorch]
  • Correction/Verification:
    • VFC: "Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation", CVPR, 2024 (NVIDIA). [Paper][Website]
    • ?: "Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?", arXiv, 2024 (NVIDIA). [Paper][Website]

[Back to Overview]