Multimodal
Multimodal large models explained: the CLIP, BLIP, BLIP2, LLaVA, miniGPT4, and InstructBLIP series
https://zhuanlan.zhihu.com/p/653902791
NeurIPS 2023 | ContextWM: Unlocking real-world video pre-training for world models
https://zhuanlan.zhihu.com/p/664144940
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning
https://arxiv.org/abs/2305.18499
An overview of the latest advances in multimodal video pre-training
https://zhuanlan.zhihu.com/p/626527774
An overview of the latest advances in multimodal video pre-training (supplement)
https://zhuanlan.zhihu.com/p/633326924
[CVPR 2023] VideoMAE V2: A scalable pre-training paradigm for video foundation models, producing the first billion-parameter self-supervised video model
https://zhuanlan.zhihu.com/p/618887786
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
A survey of multimodal pre-training for video understanding
https://zhuanlan.zhihu.com/p/584532145
From Microsoft: a 100+ page survey of vision-language pre-training (VLP) for readers at all levels
https://zhuanlan.zhihu.com/p/649124369
VideoCLIP: contrastive pre-training for video-text understanding, open-sourced by Facebook & CMU, with SOTA performance and zero-shot capability
https://zhuanlan.zhihu.com/p/423553168
LAION-5B explained: the world's largest public image-text dataset, with 5.85 billion pairs totaling 80 TB
https://zhuanlan.zhihu.com/p/571741834
Playground v2.5: an open-source text-to-image model that surpasses Midjourney v5.2
https://zhuanlan.zhihu.com/p/684287454
https://playground.com/create
https://huggingface.co/spaces/playgroundai/playground-v2.5
https://huggingface.co/spaces/stabilityai/stable-diffusion
https://github.com/Yuliang-Liu/VimTS
https://huggingface.co/spaces/modelscope/ReplaceAnything
https://github.com/Yuliang-Liu/Monkey
https://civitai.com/
https://huggingface.co/spaces/levihsu/OOTDiffusion
A 2024 summary of multimodal large model surveys
https://zhuanlan.zhihu.com/p/713777861
Alibaba open-sources the multimodal vision model Qwen2-VL: how capable is it?
https://www.zhihu.com/question/665704731
Multimodal large models: the road to combining vision models with LLMs (part 6, Qwen2VL)
https://zhuanlan.zhihu.com/p/720112307
Multimodal semantic matching models for video and image retrieval: principles, insights, applications, and outlook
https://zhuanlan.zhihu.com/p/611433243
A long-form analysis of building multimodal capabilities in video search systems
https://zhuanlan.zhihu.com/p/706294003
[Survey] Video understanding | Egocentric video understanding
https://zhuanlan.zhihu.com/p/490778815
A survey of video understanding (2024)
https://zhuanlan.zhihu.com/p/699932060
VidEgoThink: Assessing egocentric video understanding capabilities for embodied AI
https://zhuanlan.zhihu.com/p/2130170069
https://arxiv.org/abs/2410.11623
A million-scale high-quality video dataset that tops the Hugging Face dataset leaderboard, from USTC, Shanghai AI Lab, and others
https://zhuanlan.zhihu.com/p/704825276
https://sharegpt4video.github.io/
https://arxiv.org/abs/2406.04325v1
A long-form summary of the latest progress in multimodal large models (video edition)
https://zhuanlan.zhihu.com/p/704246896
SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
https://arxiv.org/abs/2412.06878
Grounded Video Caption Generation
https://arxiv.org/abs/2411.07584
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
https://arxiv.org/abs/2411.18211
When, Where, and What? A Novel Benchmark for Accident Anticipation and Localization with Large Language Models
https://arxiv.org/abs/2407.16277
Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection
https://arxiv.org/abs/2407.19493
ReplanVLM: Replanning Robotic Tasks with Visual Language Models
https://arxiv.org/abs/2407.21762
Addressing Failures in Robotics using Vision-Based Language Models (VLMs) and Behavior Trees (BT)
https://arxiv.org/abs/2411.01568
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
https://arxiv.org/abs/2311.12015
ChatSUMO: Large Language Model for Automating Traffic Scenario Generation in Simulation of Urban MObility
https://arxiv.org/abs/2409.09040
Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators
https://arxiv.org/abs/2411.05854
How Well Can Vision Language Models See Image Details?
https://arxiv.org/abs/2408.03940
GUI Action Narrator: Where and When Did That Action Take Place?
https://arxiv.org/abs/2406.13719
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
https://arxiv.org/abs/2406.10819
Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
https://arxiv.org/abs/2405.00181
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
https://arxiv.org/abs/2407.21794
[Translated survey] Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
https://zhuanlan.zhihu.com/p/720333862
Hawk: Learning to Understand Open-World Video Anomalies
https://arxiv.org/abs/2405.16886
Video Anomaly Detection and Explanation via Large Language Models
https://arxiv.org/abs/2401.05702
RelationVLM: Making Large Vision-Language Models Understand Visual Relations
https://arxiv.org/abs/2403.12801
Visual Prompting in Multimodal Large Language Models: A Survey
https://arxiv.org/abs/2409.15310
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
https://arxiv.org/abs/2403.20271
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
https://arxiv.org/abs/2407.10299
Sharingan: Extract User Action Sequence from Desktop Recordings
https://arxiv.org/abs/2411.08768
Interpretable Action Recognition on Hard to Classify Actions
https://arxiv.org/abs/2409.13091
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
https://jianzongwu.github.io/projects/diffsensei/
https://github.com/jianzongwu/DiffSensei
The Key of Understanding Vision Tasks: Explanatory Instructions
https://arxiv.org/abs/2412.18525
From efficient multimodal models to world models: a survey
https://zhuanlan.zhihu.com/p/7635656841
Commonly used multimodal (VLM) datasets
https://zhuanlan.zhihu.com/p/701404377
The Qwen-VL series (Qwen-VL and Qwen2-VL papers explained)
https://zhuanlan.zhihu.com/p/894766012