A curated list of deep learning resources for video-text retrieval.
Please feel free to submit pull requests to add papers.
Markdown format:
- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
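For example, an entry for a paper already in this list would read:

- `[Dong et al. TPAMI21]` Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [[paper]](link) [[code]](link)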
Contents:

- Implementations
- Ad-hoc Video Search
- Papers
  - 2021
  - 2020
  - 2019
  - 2018
  - Before
- Other Related
- Datasets
## Implementations

- hybrid_space
- dual_encoding
- w2vvpp
- Mixture-of-Embedding-Experts
- howto100m
- collaborative
- hgr
- coot
- mmt
- ClipBERT
- w2vv (Keras)
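All of these repositories implement variants of the same joint-embedding paradigm: a video encoder and a text encoder map both modalities into a shared vector space, and retrieval reduces to ranking videos by cosine similarity to the query. Below is a minimal PyTorch sketch of that paradigm; the encoder architectures, feature dimensions, and mean-pooling are illustrative assumptions, not the design of any specific repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingRetriever(nn.Module):
    """Toy dual encoder: projects video and text features into a shared space."""
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (num_videos, num_frames, video_dim) pre-extracted frame features
        # text_feats:  (num_queries, text_dim) pre-extracted sentence features
        v = self.video_proj(video_feats.mean(dim=1))  # mean-pool frames, then project
        t = self.text_proj(text_feats)
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        return t @ v.T                                # (num_queries, num_videos) cosine similarities

model = JointEmbeddingRetriever()
videos = torch.randn(100, 32, 2048)   # 100 videos, 32 frames each
queries = torch.randn(5, 768)         # 5 text queries
sims = model(videos, queries)
ranking = sims.argsort(dim=1, descending=True)  # per-query video ranking
```

In practice the projections are deeper (multi-level encoding, attention, experts) and trained with a ranking loss such as the triplet loss, but the retrieval interface is the same similarity matrix.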
## Ad-hoc Video Search

- [Mukai D, Utsunomiya R, Utsuki S, et al.] Kindai University and Osaka Gakuin University at TRECVID 2020 AVS and ActEV Tasks. TRECVID, 2020. [paper]
- [Cui K, Liu H, Wang C, et al.] TRECVID 2020 AVS: Solution of ZY_BJLAB Team. TRECVID, 2020. [paper]
- [Sharma R, Mishra D, Bhatt H.] DECU, ISRO Ahmedabad, India. TRECVID, 2020. [paper]
- [Francis D, Nguyen P A, Huet B, et al.] EURECOM at TRECVid AVS 2019. TRECVID, 2019. [paper]
- [Shirahama K, Sakurai D, Matsubara T, et al.] Kindai University and Kobe University at TRECVID 2019 AVS Task. TRECVID, 2019. [paper]
- [Lokoč J, Souček T, Mejzlík F, et al.] VIRET Tool Keyword Search at TRECVID 2019 AVS Task. TRECVID, 2019. [paper]
- [Nguyen P A, Wu J, Ngo C W, et al.] Vireo-EURECOM @ TRECVID 2019: Ad-hoc Video Search. TRECVID, 2019. [paper]
## Papers

### 2021

- [Luo H, Ji L, Zhong M, et al.] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv preprint arXiv:2104.08860, 2021. [paper] [code] (see the CLIP sketch after this list)
- [Dzabraev M, Kalashnikov M, Komkov S, et al.] MDMMT: Multidomain Multimodal Transformer for Video Retrieval. arXiv preprint arXiv:2103.10699, 2021. [paper] [code]
- [Li L, Chen Y C, Cheng Y, et al.] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. arXiv preprint arXiv:2005.00200, 2020. [paper] [code]
- [Dong et al. TPAMI21] Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
- [Lei et al. CVPR21] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
- [Wray et al. CVPR21] On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
- [Patrick et al. ICLR21] Support-set Bottlenecks for Video-Text Representation Learning. ICLR, 2021. [paper]
- [Qi et al. TIP21] Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
- [Dong et al. NEUCOM21] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper]
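CLIP4Clip above transfers the image-text model CLIP to video by encoding sampled frames and pooling them into a single video embedding. Here is a minimal sketch of the simplest (mean-pooling) variant using OpenAI's `clip` package; the random frames, frame count, and example captions are illustrative stand-ins for frames actually sampled from a video.

```python
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Stand-in for 8 frames sampled uniformly from a video (illustrative: random noise).
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]

with torch.no_grad():
    frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_emb = model.encode_image(frame_batch)      # (8, 512) per-frame embeddings
    video_emb = frame_emb.mean(dim=0, keepdim=True)  # mean-pool frames into one video vector
    text_emb = model.encode_text(clip.tokenize(
        ["a dog catching a frisbee", "a person cooking pasta"]).to(device))

video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((text_emb @ video_emb.T).squeeze(1))  # cosine similarity of each caption to the video
```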
### 2020

- [Yang et al. SIGIR20] Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
- [Ging et al. NeurIPS20] COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
- [Gabeur et al. ECCV20] Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage] (see the expert-weighting sketch after this list)
- [Li et al. TMM20] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
- [Wang et al. TMM20] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
- [Chen et al. TMM20] Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
- [Wu et al. ACMMM20] Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
- [Feng et al. IJCAI20] Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
- [Wei et al. CVPR20] Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
- [Doughty et al. CVPR20] Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
- [Chen et al. CVPR20] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
- [Zhu et al. CVPR20] ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
- [Zhao et al. ICME20] Stacked Convolutional Deep Encoding Network for Video-Text Retrieval. ICME, 2020. [paper]
- [Luo et al. ARXIV20] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]
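Several entries here, e.g. the multi-modal transformer above and the Mixture-of-Embedding-Experts and collaborative-experts implementations, score a query against multiple per-modality video "experts" (appearance, motion, audio, ...) and combine the per-expert similarities with text-conditioned weights. Below is a toy sketch of that weighted aggregation, with illustrative dimensions and a plain linear gating head rather than any paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsSimilarity(nn.Module):
    """Toy text-conditioned weighting over per-expert similarities,
    in the spirit of MoEE / collaborative experts / MMT."""
    def __init__(self, text_dim=768, embed_dim=256, num_experts=3):
        super().__init__()
        self.text_projs = nn.ModuleList(
            nn.Linear(text_dim, embed_dim) for _ in range(num_experts))
        self.weight_head = nn.Linear(text_dim, num_experts)  # query-dependent expert weights

    def forward(self, text_feat, expert_feats):
        # text_feat: (num_queries, text_dim); expert_feats: list of (num_videos, embed_dim)
        weights = F.softmax(self.weight_head(text_feat), dim=-1)  # (Q, E)
        sims = []
        for i, vid in enumerate(expert_feats):
            t = F.normalize(self.text_projs[i](text_feat), dim=-1)
            v = F.normalize(vid, dim=-1)
            sims.append(t @ v.T)                                  # (Q, V) per-expert similarity
        sims = torch.stack(sims, dim=-1)                          # (Q, V, E)
        return (sims * weights[:, None, :]).sum(-1)               # weighted sum over experts

model = MixtureOfExpertsSimilarity()
queries = torch.randn(4, 768)
experts = [torch.randn(50, 256) for _ in range(3)]  # e.g. appearance, motion, audio features
print(model(queries, experts).shape)  # torch.Size([4, 50])
```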
### 2019

- [Dong et al. CVPR19] Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
- [Song et al. CVPR19] Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. CVPR, 2019. [paper]
- [Wray et al. ICCV19] Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
- [Xiong et al. ICCV19] A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
- [Li et al. ACMMM19] W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
- [Liu et al. BMVC19] Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
- [Choi et al. BigMM19] From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]
### 2018

- [Dong et al. TMM18] Predicting Visual Features from Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
- [Zhang et al. ECCV18] Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
- [Yu et al. ECCV18] A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
- [Shao et al. ECCV18] Find and Focus: Retrieve and Localize Video Events with Natural Language Queries. ECCV, 2018. [paper]
- [Mithun et al. ICMR18] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
- [Miech et al. arXiv18] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]
### Before

- [Yu et al. CVPR17] End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. CVPR, 2017. [paper] [code]
- [Otani et al. ECCVW16] Learning Joint Representations of Videos and Sentences with Web Image Search. ECCV Workshop, 2016. [paper]
- [Xu et al. AAAI15] Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI, 2015. [paper]
## Other Related

- [Li et al. arXiv20] Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]
- [Miech et al. CVPR20] End-to-End Learning of Visual Representations from Uncurated Instructional Videos. CVPR, 2020. [paper]
## Datasets

- [MSVD] Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
- [MSR-VTT] Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
- [TGIF] Li et al. TGIF: A New Dataset and Benchmark on Animated GIF Description. CVPR, 2016. [paper] [homepage]
- [AVS] Awad et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
- [LSMDC] Rohrbach et al. Movie Description. IJCV, 2017. [paper] [dataset]
- [ActivityNet Captions] Krishna et al. Dense-Captioning Events in Videos. ICCV, 2017. [paper] [dataset]
- [DiDeMo] Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
- [HowTo100M] Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
- [VATEX] Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]
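Retrieval quality on these benchmarks is conventionally reported as Recall@K (the fraction of queries whose ground-truth video ranks in the top K) plus median rank. Here is a small self-contained sketch of that evaluation, assuming one ground-truth video per query located at the matching index of the similarity matrix.

```python
import numpy as np

def recall_at_k(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix[i, j] = similarity of text query i to video j.
    Assumes the ground-truth video for query i sits at column i."""
    # Rank of the ground-truth item for each query (0 = ranked first).
    order = np.argsort(-sim_matrix, axis=1)
    gt_rank = np.argmax(order == np.arange(len(sim_matrix))[:, None], axis=1)
    metrics = {f"R@{k}": float(np.mean(gt_rank < k) * 100) for k in ks}
    metrics["MedR"] = float(np.median(gt_rank) + 1)  # 1-based median rank
    return metrics

sims = np.random.randn(1000, 1000)  # e.g. 1000 test captions scored against 1000 videos
print(recall_at_k(sims))
```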
To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.