<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="VisualWebBench: how far have multimodal LLMs evolved in web page understanding and grounding?">
<meta name="keywords" content="MLLM, VLM, LMM, VisualWebBench, Large Language Model, Multimodal Large Language Model, MLLM Evaluation, Benchmark">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/logo.ico">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">
<img src="static/images/logo.jpeg" style="width:2em;vertical-align: middle">
<span>VisualWebBench</span>
</h1>
<h2 class="subtitle is-3 publication-title">How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?</h2>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://github.com/Junpliu" target="_blank">Junpeng Liu</a><sup>◦*</sup>,
</span>
<span class="author-block">
<a href="https://github.com/Yifan-Song793" target="_blank">Yifan Song</a><sup>§*</sup>,
</span>
<span class="author-block">
<a href="https://yuchenlin.xyz/" target="_blank">Bill Yuchen Lin</a><sup>♠</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=ewA4NAcAAAAJ&hl=en" target="_blank">Wai Lam</a><sup>◦</sup>,
</span>
<span class="author-block">
<a href="https://www.phontron.com/" target="_blank">Graham Neubig</a><sup>♣</sup>,
</span>
<span class="author-block">
<a href="https://www.andrew.cmu.edu/user/yuanzhil/" target="_blank">Yuanzhi Li</a><sup>♢</sup>,
</span>
<span class="author-block">
<a href="https://xiangyue9607.github.io/" target="_blank">Xiang Yue</a><sup>♣</sup>,
</span>
</div>
<br>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>♣</sup>Carnegie Mellon University</span>
<span class="author-block"><sup>◦</sup>The Chinese University of Hong Kong</span>
<span class="author-block"><sup>§</sup>Peking University</span><br>
<span class="author-block"><sup>♢</sup>MBZUAI</span>
<span class="author-block"><sup>♠</sup>Allen Institute for AI</span>
</div>
<br>
<div class="is-size-5 thanks">
<span class="author-block"><sup>*</sup>Equal Contribution</span><br>
<span class="author-block">†Corresponding to:</span>
<span class="author-block"><a href="mailto:[email protected]">[email protected]</a>,</span>
<span class="author-block"><a href="mailto:[email protected]">[email protected]</a></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://arxiv.org/abs/2404.05955" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<span class="link-block">
<a href="https://huggingface.co/datasets/visualwebbench/VisualWebBench" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<p style="font-size:18px">🤗</p>
</span>
<span>Dataset</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/VisualWebBench/VisualWebBench" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Twitter Link. -->
<span class="link-block">
<a href="https://x.com/xiangyue96/status/1778265120633745435"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-twitter"></i>
</span>
<span>Twitter</span>
</a>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="content has-text-centered">
<img src="static/images/main.png" alt="geometric reasoning" width="90%"/>
</div>
</div>
</section>
<section class="section hero is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="content has-text-justified">
<p>
We introduce <strong>VisualWebBench</strong>, a multimodal benchmark designed to assess the <strong>understanding and grounding capabilities of MLLMs in web scenarios</strong>. VisualWebBench consists of <strong>seven tasks</strong> and comprises <strong>1.5K</strong> human-curated instances from <strong>139</strong> real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs as well as Gemini Pro, Claude 3, and GPT-4V(ision) on VisualWebBench, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe VisualWebBench will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.
</p>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<div>
<video controls>
<source src="static/images/demo.mp4" type="video/mp4">
</video>
</div>
<h2 class="title is-3">Update</h2>
<div class="content has-text-justified">
<p><strong>2024/10/18</strong>: We introduce <a href="https://huggingface.co/datasets/neulab/MultiUI"><strong>🤗 MultiUI</strong></a>, 7.3M general multimodal instructions synthesized from webUIs using text-based LLMs, enhancing both UI-related and Doc/OCR/chart understanding tasks.</p>
</div>
<h2 class="title is-3">Overview</h2>
<div class="content has-text-justified">
We introduce VisualWebBench, a comprehensive multimodal benchmark designed to assess the capabilities of MLLMs in the web domain. Inspired by the human interaction process with web browsers, VisualWebBench consists of seven tasks that map to core abilities required for web tasks: captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding, as detailed in the figure. The benchmark comprises 1.5K instances, all uniformly formulated in the QA style, making it easy to evaluate and compare the performance of different MLLMs.
</div>
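<div class="content has-text-justified">
<p>
Because the benchmark is hosted on Hugging Face, a single <code>load_dataset</code> call is enough to start exploring it. The snippet below is only a sketch: the per-task config name <code>"web_caption"</code> and the <code>image</code>/<code>question</code>/<code>answer</code>-style fields are assumptions inferred from the task list above, so please check the dataset card for the exact config names and schema.
</p>
<pre><code># Minimal sketch: load one VisualWebBench task from Hugging Face.
# Assumption: the dataset exposes one config per task (e.g. "web_caption");
# see the dataset card for the actual config names and fields.
from datasets import load_dataset

data = load_dataset("visualwebbench/VisualWebBench", "web_caption")
print(data)                          # DatasetDict keyed by split
first_split = next(iter(data.values()))
print(first_split[0].keys())         # inspect the real field names first
</code></pre>
</div>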
<img id="model" width="100%" src="static/images/compare.png">
<br>
<div class="content has-text-justified">
The proposed VisualWebBench possesses the following features:
<ul>
<li>
<strong>Comprehensiveness</strong>: VisualWebBench spans 139 websites with 1.5K samples, encompassing 12 different domains (e.g., travel, sports, hobby, lifestyle, animals, and science) and 87 sub-domains.
</li>
<li>
<strong>Multi-granularity</strong>: VisualWebBench assesses MLLMs at three levels: website-level, element-level, and action-level.
</li>
<li>
<strong>Multi-task</strong>: VisualWebBench encompasses seven tasks designed to evaluate the understanding, OCR, grounding, and reasoning capabilities of MLLMs.
</li>
<li>
<strong>High quality</strong>: Quality is ensured through careful human verification and curation efforts.
</li>
</ul>
</div>
<img id="model" width="100%" src="static/images/detail.png">
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-full">
<h2 class="title is-3">Experimental Results</h2>
<div class="content has-text-justified">
<p>
We evaluate 14 open-source general MLLMs on VisualWebBench. By default, for each model family, we use the largest available checkpoint. For model scaling analysis, we consider three scales of LLaVA: 7B, 13B, and 34B. Several strong closed-source MLLMs, namely Gemini Pro, the Claude series, and GPT-4V(ision), are also included in the evaluation. In addition, we evaluate two GUI agent MLLMs, CogAgent and SeeClick, on VisualWebBench.
</p>
</div>
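<div class="content has-text-justified">
<p>
Since every instance is formulated in the QA style, a bare-bones evaluation loop only needs to prompt a model with the screenshot and question and score the returned answer. The sketch below is illustrative rather than the official protocol: <code>answer_fn</code> is a hypothetical callable wrapping the MLLM under test, the field names and <code>test</code> split are assumptions, and exact match stands in for the task-specific metrics used in the paper.
</p>
<pre><code># Minimal sketch of a QA-style evaluation loop over one VisualWebBench task.
# "answer_fn(image, question)" is a hypothetical wrapper around the MLLM under
# test; exact match is a placeholder for the benchmark's task-specific metrics.
from datasets import load_dataset

def evaluate(answer_fn, config="web_caption", split="test"):
    ds = load_dataset("visualwebbench/VisualWebBench", config, split=split)
    correct = 0
    for ex in ds:
        pred = answer_fn(ex["image"], ex["question"])     # assumed field names
        correct += int(pred.strip() == str(ex["answer"]).strip())
    return correct / len(ds)

# Usage: score = evaluate(my_mllm_answer_fn)
</code></pre>
</div>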
<img id="model" width="100%" src="static/images/exp.png">
<div class="content has-text-justified">
We highlight the following findings:
<ul>
<li>
<strong>Challenging Nature of Web Tasks</strong>: Even the most powerful MLLMs, GPT-4V and Claude Sonnet, achieve average scores of 64.6 and 65.8, respectively, leaving ample room for improvement.
</li>
<li>
<strong>Disparity between Open-source and Proprietary MLLMs</strong>: GPT-4V and Claude outperform open-source MLLMs, including GUI agent MLLMs, by a large margin, highlighting a discernible gap between the capabilities of current open-source MLLMs and their proprietary counterparts.
</li>
<li>
<strong>Relatively strong correlation with general understanding benchmarks like MMMU, but weak correlation with web agent benchmarks like Mind2Web</strong>: MLLMs' abilities on web agent benchmarks such as Mind2Web correlate only weakly with their performance on VisualWebBench, highlighting the importance of dedicated web understanding benchmarks like VisualWebBench.
</li>
<li>
<strong>Importance of Image Resolution</strong>: The limited image resolution handling capabilities of most open-source MLLMs restrict their utility in web scenarios, where rich text and elements are prevalent.
</li>
<li>
<strong>Weak Grounding Ability</strong>: Grounding ability, a crucial skill for developing MLLM-based web applications like autonomous web agents, is a weakness for most MLLMs.
</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-full">
<h2 class="title is-3">Case Study</h2>
<div id="results-carousel" class="carousel results-carousel" data-slides-to-scroll="2">
<div class="item">
<img src="static/images/case1.png" width="60%">
</div>
<div class="item">
<img src="static/images/case2.png" width="60%">
</div>
<div class="item">
<img src="static/images/case3.png" width="60%">
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@misc{liu2024visualwebbench,
title={VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?},
author={Junpeng Liu and Yifan Song and Bill Yuchen Lin and Wai Lam and Graham Neubig and Yuanzhi Li and Xiang Yue},
year={2024},
eprint={2404.05955},
archivePrefix={arXiv},
primaryClass={cs.CL}
}</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link" target="_blank"
href="https://arxiv.org/abs/2404.05955">
<i class="fas fa-file-pdf"></i>
</a>
<a class="icon-link" href="https://github.com/VisualWebBench/VisualWebBench" target="_blank" class="external-link" disabled>
<i class="fab fa-github"></i>
</a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is licensed under a <a rel="license" target="_blank"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
This means you are free to borrow the <a target="_blank"
href="https://github.com/nerfies/nerfies.github.io">source code</a> of this website;
we just ask that you link back to this page in the footer.
Please remember to remove any analytics code included in the header of the website
that you do not want on your own website.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>