A curated list of deep learning resources for video-text retrieval.
Please feel free to submit pull requests to add papers.
Markdown format:
- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
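For example, an entry for a paper already in this list would read:

- `[Dong et al. TPAMI21]` Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [[paper]](link) [[code]](link)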
Contents:

- Implementations
- Ad-hoc Video Search
- Papers
  - 2021
  - 2020
  - 2019
  - 2018
  - Before
- Other Related
- Datasets
## Implementations

- hybrid_space
- dual_encoding
- w2vvpp
- Mixture-of-Embedding-Experts
- howto100m
- collaborative
- hgr
- coot
- mmt
- ClipBERT
- w2vv (Keras)
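All of these repositories implement variants of the same joint-embedding paradigm: a video encoder and a text encoder map both modalities into a shared vector space, and retrieval reduces to ranking videos by cosine similarity to the query. Below is a minimal PyTorch sketch of that paradigm; the encoder architectures, feature dimensions, and mean-pooling are illustrative assumptions, not the design of any specific repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingRetriever(nn.Module):
    """Toy dual encoder: projects video and text features into a shared space."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (num_videos, num_frames, video_dim) pre-extracted frame features
        # text_feats:  (num_queries, text_dim) pre-extracted sentence features
        v = self.video_proj(video_feats.mean(dim=1))  # mean-pool frames, then project
        t = self.text_proj(text_feats)
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        return t @ v.T                                # (num_queries, num_videos) cosine similarities

model = JointEmbeddingRetriever()
videos = torch.randn(100, 32, 2048)   # 100 videos, 32 frames each
queries = torch.randn(5, 768)         # 5 text queries
sims = model(videos, queries)
ranking = sims.argsort(dim=1, descending=True)  # per-query video ranking
```

In practice the projections are deeper (multi-level encoding, attention, experts) and trained with a ranking loss such as the triplet loss, but the retrieval interface is the same similarity matrix.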
## Ad-hoc Video Search

- [Mukai D, Utsunomiya R, Utsuki S, et al.] Kindai University and Osaka Gakuin University at TRECVID 2020 AVS and ActEV Tasks. TRECVID, 2020. [paper]
- [Cui K, Liu H, Wang C, et al.] TRECVID 2020 AVS: Solution of ZY_BJLAB Team. TRECVID, 2020. [paper]
- [Sharma R, Mishra D, Bhatt H.] DECU, ISRO Ahmedabad, India. TRECVID, 2020. [paper]
- [Francis D, Nguyen P A, Huet B, et al.] EURECOM at TRECVid AVS 2019. TRECVID, 2019. [paper]
- [Shirahama K, Sakurai D, Matsubara T, et al.] Kindai University and Kobe University at TRECVID 2019 AVS Task. TRECVID, 2019. [paper]
- [Lokoč J, Souček T, Mejzlík F, et al.] VIRET Tool Keyword Search at TRECVID 2019 AVS Task. TRECVID, 2019. [paper]
- [Nguyen P A, Wu J, Ngo C W, et al.] Vireo-EURECOM @ TRECVID 2019: Ad-hoc Video Search. TRECVID, 2019. [paper]
## Papers

### 2021

- [Luo H, Ji L, Zhong M, et al.] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv preprint arXiv:2104.08860, 2021. [paper] [code] (see the CLIP sketch after this list)
- [Dzabraev M, Kalashnikov M, Komkov S, et al.] MDMMT: Multidomain Multimodal Transformer for Video Retrieval. arXiv preprint arXiv:2103.10699, 2021. [paper] [code]
- [Li L, Chen Y C, Cheng Y, et al.] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. arXiv preprint arXiv:2005.00200, 2020. [paper] [code]
- [Dong et al. TPAMI21] Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
- [Lei et al. CVPR21] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
- [Wray et al. CVPR21] On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
- [Patrick et al. ICLR21] Support-set Bottlenecks for Video-Text Representation Learning. ICLR, 2021. [paper]
- [Qi et al. TIP21] Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
- [Dong et al. NEUCOM21] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper]
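CLIP4Clip above transfers the image-text model CLIP to video by encoding sampled frames and pooling them into a single video embedding. Here is a minimal sketch of the simplest (mean-pooling) variant using OpenAI's `clip` package; the random frames, frame count, and example captions are illustrative stand-ins for frames actually sampled from a video.

```python
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in for 8 frames sampled uniformly from a video (illustrative: random noise).
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]

with torch.no_grad():
    frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_emb = model.encode_image(frame_batch)      # (8, 512) per-frame embeddings
    video_emb = frame_emb.mean(dim=0, keepdim=True)  # mean-pool frames into one video vector
    text_emb = model.encode_text(clip.tokenize(
        ["a dog catching a frisbee", "a person cooking pasta"]).to(device))

video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((text_emb @ video_emb.T).squeeze(1))  # cosine similarity of each caption to the video
```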
### 2020

- [Yang et al. SIGIR20] Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
- [Ging et al. NeurIPS20] COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
- [Gabeur et al. ECCV20] Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage] (see the expert-weighting sketch after this list)
- [Li et al. TMM20] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
- [Wang et al. TMM20] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
- [Chen et al. TMM20] Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
- [Wu et al. ACMMM20] Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
- [Feng et al. IJCAI20] Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
- [Wei et al. CVPR20] Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
- [Doughty et al. CVPR20] Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
- [Chen et al. CVPR20] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
- [Zhu et al. CVPR20] ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
- [Zhao et al. ICME20] Stacked Convolutional Deep Encoding Network for Video-Text Retrieval. ICME, 2020. [paper]
- [Luo et al. ARXIV20] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]
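Several entries here, e.g. the multi-modal transformer above and the Mixture-of-Embedding-Experts and collaborative-experts implementations, score a query against multiple per-modality video "experts" (appearance, motion, audio, ...) and combine the per-expert similarities with text-conditioned weights. Below is a toy sketch of that weighted aggregation, with illustrative dimensions and a plain linear gating head rather than any paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsSimilarity(nn.Module):
    """Toy text-conditioned weighting over per-expert similarities,
    in the spirit of MoEE / collaborative experts / MMT."""
    def __init__(self, text_dim=768, embed_dim=256, num_experts=3):
        super().__init__()
        self.text_projs = nn.ModuleList(
            nn.Linear(text_dim, embed_dim) for _ in range(num_experts))
        self.weight_head = nn.Linear(text_dim, num_experts)  # query-dependent expert weights

    def forward(self, text_feat, expert_feats):
        # text_feat: (num_queries, text_dim); expert_feats: list of (num_videos, embed_dim)
        weights = F.softmax(self.weight_head(text_feat), dim=-1)  # (Q, E)
        sims = []
        for i, vid in enumerate(expert_feats):
            t = F.normalize(self.text_projs[i](text_feat), dim=-1)
            v = F.normalize(vid, dim=-1)
            sims.append(t @ v.T)                                  # (Q, V) per-expert similarity
        sims = torch.stack(sims, dim=-1)                          # (Q, V, E)
        return (sims * weights[:, None, :]).sum(-1)               # weighted sum over experts

model = MixtureOfExpertsSimilarity()
queries = torch.randn(4, 768)
experts = [torch.randn(50, 256) for _ in range(3)]  # e.g. appearance, motion, audio features
print(model(queries, experts).shape)  # torch.Size([4, 50])
```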
### 2019

- [Dong et al. CVPR19] Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
- [Song et al. CVPR19] Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. CVPR, 2019. [paper]
- [Wray et al. ICCV19] Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
- [Xiong et al. ICCV19] A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
- [Li et al. ACMMM19] W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
- [Liu et al. BMVC19] Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
- [Choi et al. BigMM19] From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]
### 2018

- [Dong et al. TMM18] Predicting Visual Features from Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
- [Zhang et al. ECCV18] Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
- [Yu et al. ECCV18] A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
- [Shao et al. ECCV18] Find and Focus: Retrieve and Localize Video Events with Natural Language Queries. ECCV, 2018. [paper]
- [Mithun et al. ICMR18] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
- [Miech et al. arXiv18] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]
### Before

- [Yu et al. CVPR17] End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. CVPR, 2017. [paper] [code]
- [Otani et al. ECCVW16] Learning Joint Representations of Videos and Sentences with Web Image Search. ECCV Workshop, 2016. [paper]
- [Xu et al. AAAI15] Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI, 2015. [paper]
## Other Related

- [Li et al. arXiv20] Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]
- [Miech et al. CVPR20] End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR, 2020. [paper]
## Datasets

- [MSVD] Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
- [MSR-VTT] Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
- [TGIF] Li et al. TGIF: A New Dataset and Benchmark on Animated GIF Description. CVPR, 2016. [paper] [homepage]
- [AVS] Awad et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
- [LSMDC] Rohrbach et al. Movie Description. IJCV, 2017. [paper] [dataset]
- [ActivityNet Captions] Krishna et al. Dense-Captioning Events in Videos. ICCV, 2017. [paper] [dataset]
- [DiDeMo] Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
- [HowTo100M] Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
- [VATEX] Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]
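Retrieval quality on these benchmarks is conventionally reported as Recall@K (the fraction of queries whose ground-truth video ranks in the top K) plus median rank. Here is a small self-contained sketch of that evaluation, assuming one ground-truth video per query located at the matching index of the similarity matrix.

```python
import numpy as np

def recall_at_k(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix[i, j] = similarity of text query i to video j.
    Assumes the ground-truth video for query i sits at column i."""
    # Rank of the ground-truth item for each query (0 = ranked first).
    order = np.argsort(-sim_matrix, axis=1)
    gt_rank = np.argmax(order == np.arange(len(sim_matrix))[:, None], axis=1)
    metrics = {f"R@{k}": float(np.mean(gt_rank < k) * 100) for k in ks}
    metrics["MedR"] = float(np.median(gt_rank) + 1)  # 1-based median rank
    return metrics

sims = np.random.randn(1000, 1000)  # e.g. 1000 test captions scored against 1000 videos
print(recall_at_k(sims))
```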
To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.