A curated list of papers on Visual Captioning and related areas.
- Survey Papers
- Research Papers
- Dataset
- Popular Codebase
- Reference and Acknowledgement
- From Show to Tell: A Survey on Image Captioning. [paper]
- Compact Bidirectional Transformer for Image Captioning. [paper] [code]
- ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer. [paper]
- I-Tuning: Tuning Language Models with Image for Caption Generation. [paper]
- CaMEL: Mean Teacher Learning for Image Captioning. [paper] [code]
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition. [paper]
- Discourse Analysis for Evaluating Coherence in Video Paragraph Captions. [paper]
- Cross-modal Contrastive Distillation for Instructional Activity Anticipation. [paper]
- End-to-end Generative Pretraining for Multimodal Video Captioning. [paper]
- Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation. [paper] [code]
- Dual-Level Decoupled Transformer for Video Captioning. [paper]
- Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information. [paper] [code]
- Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds. [paper]
- X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning. [paper]
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. [paper]
- What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics. [paper] (Workshop)
- Image Difference Captioning with Pre-training and Contrastive Learning. [paper]
- Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation. [paper]
- FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark. [paper] [code]
- Multi-modal Dependency Tree for Video Captioning. [paper]
- Visual News: Benchmark and Challenges in News Image Captioning. [paper] [code]
- R3Net: Relation-embedded Representation Reconstruction Network for Change Captioning. [paper] [code]
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. [paper]
- Journalistic Guidelines Aware News Image Captioning. [paper]
- Understanding Guided Image Captioning Performance across Domains. [paper] [code] (CoNLL)
- Language Resource Efficient Learning for Captioning. [paper] (Findings)
- Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning. [paper] (Findings)
- QACE: Asking Questions to Evaluate an Image Caption. [paper] (Findings)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions. [paper] (Findings)
- Auto-Parsing Network for Image Captioning and Visual Question Answering. [paper]
- Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning. [paper]
- Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. [paper]
- Partial Off-Policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning. [paper]
- Topic Scene Graph Generation by Attention Distillation from Caption. [paper]
- Understanding and Evaluating Racial Biases in Image Captioning. [paper] [code]
- In Defense of Scene Graphs for Image Captioning. [paper] [code]
- Viewpoint-Agnostic Change Captioning with Cycle Consistency. [paper]
- Visual-Textual Attentive Semantic Consistency for Medical Report Generation. [paper]
- Semi-Autoregressive Transformer for Image Captioning. [paper] (Workshop)
- End-to-End Dense Video Captioning with Parallel Decoding. [paper] [code]
- Motion Guided Region Message Passing for Video Captioning. [paper]
- Distributed Attention for Grounded Image Captioning. [paper]
- Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning. [paper] [code]
- Group-based Distinctive Image Captioning with Memory Attention. [paper]
- Direction Relation Transformer for Image Captioning. [paper]
- Question-controlled Text-aware Image Captioning. [paper]
- Hybrid Reasoning Network for Video-based Commonsense Captioning. [paper]
- Discriminative Latent Semantic Graph for Video Captioning. [paper] [code]
- Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention. [paper]
- CLIP4Caption: CLIP for Video Caption. [paper]
- Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. [paper]
- Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. [paper]
- Competence-based Multimodal Curriculum Learning for Medical Report Generation.
- Control Image Captioning Spatially and Temporally. [paper]
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis. [paper]
- Enhancing Descriptive Image Captioning with Natural Language Inference.
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. [paper] [code]
- Cross-modal Memory Networks for Radiology Report Generation.
- Hierarchical Context-aware Network for Dense Video Event Captioning.
- Video Paragraph Captioning as a Text Summarization Task.
- TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [paper]
- Quality Estimation for Image Captions Based on Large-scale Human Evaluations. [paper]
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. [paper]
- DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [paper]
- Connecting What to Say With Where to Look by Modeling Human Attention Traces. [paper] [code]
- Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. [paper]
- Image Change Captioning by Learning From an Auxiliary Task. [paper]
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. [paper] [code]
- FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation. [paper]
- RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. [paper]
- Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles. [paper]
- Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship. [paper]
- TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. [paper]
- Towards Accurate Text-Based Image Captioning With Content Diversity Exploration. [paper]
- Open-Book Video Captioning With Retrieve-Copy-Generate Network. [paper]
- Towards Diverse Paragraph Captioning for Untrimmed Videos. [paper]
- Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. [paper]
- Cascade Attention Fusion for Fine-grained Image Captioning based on Multi-layer LSTM. [paper]
- Triple Sequence Generative Adversarial Nets for Unsupervised Image Captioning. [paper]
- Partially Non-Autoregressive Image Captioning. [paper] [code]
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. [paper]
- Object Relation Attention for Image Paragraph Captioning. [paper]
- Dual-Level Collaborative Transformer for Image Captioning. [paper] [code]
- Memory-Augmented Image Captioning. [paper]
- Image Captioning with Context-Aware Auxiliary Guidance. [paper]
- Consensus Graph Representation Learning for Better Grounded Image Captioning. [paper]
- FixMyPose: Pose Correctional Captioning and Retrieval. [paper] [code]
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning. [paper]
- Non-Autoregressive Coarse-to-Fine Video Captioning. [paper] [code]
- Semantic Grouping Network for Video Captioning. [paper] [code]
- Augmented Partial Mutual Learning with Frame Masking for Video Captioning. [paper]
- Saying the Unseen: Video Descriptions via Dialog Agents. [paper]
- CapWAP: Captioning with a Purpose. [paper] [code]
- Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements. [paper] [code]
- Visually Grounded Continual Learning of Compositional Phrases. [paper]
- Pragmatic Issue-Sensitive Image Captioning. [paper]
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game. [paper]
- Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze. [paper]
- ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization. [paper]
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. [paper]
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [paper]
- Diverse Image Captioning with Context-Object Split Latent Spaces. [paper]
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [paper]
- Structural Semantic Adversarial Active Learning for Image Captioning. [paper]
- Iterative Back Modification for Faster Image Captioning. [paper]
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [paper]
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [paper]
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [paper]
- ICECAP: Information Concentrated Entity-aware Image Captioning. [paper]
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [paper]
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [paper]
- Controllable Video Captioning with an Exemplar Sentence. [paper]
- Poet: Product-oriented Video Captioner for E-commerce. [paper] [code]
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [paper]
- Relational Graph Learning for Grounded Video Description Generation. [paper]
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets. [paper]
- Towards Unique and Informative Captioning of Images. [paper]
- Learning Visual Representations with Caption Annotations. [paper]
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. [paper] [code]
- Length Controllable Image Captioning. [paper] [code]
- Comprehensive Image Captioning via Scene Graph Decomposition. [paper]
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning. [paper]
- Captioning Images Taken by People Who Are Blind. [paper]
- Learning to Generate Grounded Visual Captions without Localization Supervision. [paper] [code]
- Describing Textures using Natural Language. [paper]
- Connecting Vision and Language with Localized Narratives. [paper] [code]
- Character Grounding and Re-Identification in Story of Videos and Text Descriptions. [paper] [code]
- SODA: Story Oriented Dense Video Captioning Evaluation Framework. [paper] [code]
- In-Home Daily-Life Captioning Using Radio Signals. [paper]
- TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. [paper] [code]
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos. [paper]
- Identity-Aware Multi-Sentence Video Description. [paper]
- Human Consensus-Oriented Image Captioning. [paper]
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [paper]
- Recurrent Relational Memory Network for Unsupervised Image Captioning. [paper]
- Learning to Discretely Compose Reasoning Module Networks for Video Captioning. [paper] [code]
- SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [paper]
- Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [paper]
- Clue: Cross-modal Coherence Modeling for Caption Generation. [paper]
- Improving Image Captioning Evaluation by Considering Inter References Variance. [paper] [code]
- Improving Image Captioning with Better Use of Caption. [paper] [code]
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [paper] [code]
- Context-Aware Group Captioning via Self-Attention and Contrastive Features. [paper] [code]
- Show, Edit and Tell: A Framework for Editing Image Captions. [paper] [code]
- Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs. [paper] [code]
- Normalized and Geometry-Aware Self-Attention Network for Image Captioning. [paper]
- Meshed-Memory Transformer for Image Captioning. [paper] [code]
- X-Linear Attention Networks for Image Captioning. [paper] [code]
- Transform and Tell: Entity-Aware News Image Captioning. [paper] [code]
- More Grounded Image Captioning by Distilling Image-Text Matching Model. [paper] [code]
- Better Captioning With Sequence-Level Exploration. [paper]
- Object Relational Graph With Teacher-Recommended Learning for Video Captioning. [paper]
- Spatio-Temporal Graph for Video Captioning With Knowledge Distillation. [paper] [code]
- Syntax-Aware Action Targeting for Video Captioning. [paper] [code]
- Screencast Tutorial Video Understanding. [paper]
- Unified Vision-Language Pre-Training for Image Captioning and VQA. [paper] [code]
- Reinforcing an Image Caption Generator using Off-line Human Feedback. [paper]
- Memorizing Style Knowledge for Image Captioning. [paper]
- Joint Commonsense and Relation Reasoning for Image and Video Captioning. [paper]
- Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption. [paper]
- Show, Recall, and Tell: Image Captioning with Recall Mechanism. [paper]
- Interactive Dual Generative Adversarial Networks for Image Captioning. [paper]
- Feature Deformation Meta-Networks in Image Captioning of Novel Objects. [paper]
- An Efficient Framework for Dense Video Captioning. [paper]
- Adaptively Aligned Image Captioning via Adaptive Attention Time. [paper] [code]
- Image Captioning: Transforming Objects into Words. [paper] [code]
- Variational Structured Semantic Inference for Diverse Image Captioning. [paper]
- Robust Change Captioning. [paper]
- Attention on Attention for Image Captioning. [paper]
- Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style. [paper]
- Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. [paper]
- Hierarchy Parsing for Image Captioning. [paper]
- Generating Diverse and Descriptive Image Captions Using Visual Paraphrases. [paper]
- Learning to Collocate Neural Modules for Image Captioning. [paper]
- Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. [paper]
- Towards Unsupervised Image Captioning With Shared Multimodal Embeddings. [paper]
- Human Attention in Image Captioning: Dataset and Analysis. [paper]
- Reflective Decoding Network for Image Captioning. [paper]
- Joint Optimization for Cooperative Image Captioning. [paper]
- Entangled Transformer for Image Captioning. [paper]
- nocaps: novel object captioning at scale. [paper]
- Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection. [paper]
- Unpaired Image Captioning via Scene Graph Alignments. [paper]
- Learning to Caption Images Through a Lifetime by Asking Questions. [paper]
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. [paper]
- Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network. [paper]
- Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning. [paper]
- Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning. [paper]
- Informative Image Captioning with External Sources of Information. [paper]
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning. [paper]
- Generating Question Relevant Captions to Aid Visual Question Answering. [paper]
- Dense Procedure Captioning in Narrated Instructional Videos. [paper]
- Auto-Encoding Scene Graphs for Image Captioning. [paper] [code]
- Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech. [paper]
- Describing like Humans: On Diversity in Image Captioning. [paper]
- MSCap: Multi-Style Image Captioning With Unpaired Stylized Text. [paper]
- Leveraging Captioning to Boost Semantics for Salient Object Detection. [paper] [code]
- Context and Attribute Grounded Dense Captioning. [paper]
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning. [paper]
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. [paper]
- Self-Critical N-step Training for Image Captioning. [paper]
- Look Back and Predict Forward in Image Captioning. [paper]
- Intention Oriented Image Captions with Guiding Objects. [paper]
- Adversarial Semantic Alignment for Improved Image Captions. [paper]
- Good News, Everyone! Context driven entity-aware captioning for news images. [paper] [code]
- Pointing Novel Objects in Image Captioning. [paper]
- Engaging Image Captioning via Personality. [paper]
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables. [paper]
- Streamlined Dense Video Captioning. [paper]
- Grounded Video Description. [paper]
- Adversarial Inference for Multi-Sentence Video Description. [paper]
- Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning. [paper]
- Memory-Attended Recurrent Network for Video Captioning. [paper]
- Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. [paper]
- Improving Image Captioning with Conditional Generative Adversarial Nets. [paper]
- Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding. [paper]
- Meta Learning for Image Captioning. [paper]
- Deliberate Residual based Attention Network for Image Captioning. [paper]
- Hierarchical Attention Network for Image Captioning. [paper]
- Learning Object Context for Dense Captioning. [paper]
- Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning. [paper] [code]
- Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning. [paper]
- Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention. [paper]
- Motion Guided Spatial Attention for Video Captioning. [paper]
- Unpaired Image Captioning by Language Pivoting. [paper] [code]
- Exploring Visual Relationship for Image Captioning. [paper]
- Recurrent Fusion Network for Image Captioning. [paper] [code]
- Boosted Attention: Leveraging Human Attention for Image Captioning. [paper]
- Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data. [paper]
- "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. [paper]
- Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. [paper]
- Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. [paper] [code]
- Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning. [paper]
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [paper] [code]
- Neural Baby Talk. [paper]
- GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints. [paper]
- Boosting Image Captioning with Attributes. [paper]
- Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner. [paper] [code]
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. [paper] [code]
- When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [paper] [code]
- Self-critical Sequence Training for Image Captioning. [paper]
- Semantic Compositional Networks for Visual Captioning. [paper] [code]
- StyleNet: Generating Attractive Visual Captions with Styles. [paper] [code]
- BreakingNews: Article Annotation by Image and Text Processing. [paper]
- SPICE: Semantic Propositional Image Caption Evaluation. [paper] [code]
- Generating Visual Explanations. [paper] [code]
- Image Captioning with Semantic Attention. [paper] [code]
- Learning Deep Representations of Fine-grained Visual Descriptions. [paper] [code]
- Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. [paper] [code]
- Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts. [paper] [code]
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [paper]
- Guiding Long-Short Term Memory for Image Caption Generation. [paper]
- Show and Tell: A Neural Image Caption Generator. [paper]
- Deep Visual-Semantic Alignments for Generating Image Descriptions. [paper] [code]
- CIDEr: Consensus-based Image Description Evaluation. [paper] [code]
- Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). [paper]
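Several entries above are caption evaluation metrics (CIDEr, SPICE, CLIPScore, UMIC, SMURF). To illustrate the consensus-based idea behind CIDEr, here is a minimal, self-contained sketch: score a candidate caption by its TF-IDF-weighted n-gram cosine similarity to each reference. The function `cider_like` is a hypothetical helper written for this list, not the official implementation (which averages over n = 1..4 and adds a length penalty), and the IDF smoothing is an assumption.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_like(candidate, references, corpus, n=2):
    """Simplified CIDEr-style score: average TF-IDF-weighted cosine
    similarity between candidate and reference n-gram vectors.
    `corpus` is a list of reference-caption lists used for IDF.
    Illustrative sketch only, NOT the official CIDEr metric."""
    num_docs = len(corpus)
    # Document frequency of each n-gram over the reference sets.
    df = Counter()
    for refs in corpus:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)

    def tfidf_vec(sentence):
        counts = Counter(ngrams(sentence.split(), n))
        # Smoothed IDF (an assumption; the paper uses log(|corpus| / df)).
        return {g: c * math.log((1.0 + num_docs) / (1.0 + df.get(g, 0)))
                for g, c in counts.items()}

    def cosine(u, v):
        dot = sum(u[g] * v.get(g, 0.0) for g in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    cand_vec = tfidf_vec(candidate)
    return sum(cosine(cand_vec, tfidf_vec(r)) for r in references) / len(references)
```

For research use, prefer the reference implementations linked above (e.g. the official CIDEr and SPICE code) rather than a re-implementation like this one.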
- MSCOCO
- Flickr30K
- Flickr8K
- VizWiz
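The datasets above typically ship captions as a JSON file in the COCO annotation layout: an `images` list (with `id` and `file_name`) and an `annotations` list (with `image_id` and `caption`). As a minimal sketch of how such a file can be grouped into per-image caption lists (the function name `load_coco_captions` is our own; error handling is omitted):

```python
import json
from collections import defaultdict

def load_coco_captions(path):
    """Group captions by image file name for a COCO-style annotation
    file (e.g. captions_train2014.json). Assumes the standard layout:
    {"images": [{"id", "file_name", ...}],
     "annotations": [{"image_id", "caption", ...}]}."""
    with open(path) as f:
        data = json.load(f)
    file_names = {img["id"]: img["file_name"] for img in data["images"]}
    captions = defaultdict(list)
    for ann in data["annotations"]:
        captions[ann["image_id"]].append(ann["caption"])
    return {file_names[i]: caps for i, caps in captions.items()}
```

Flickr8K/Flickr30K and VizWiz use their own release formats, but they are commonly converted to this COCO-style JSON (e.g. via the widely used Karpathy splits) so that the same evaluation tooling applies.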
We really appreciate their contributions to this area.