<!DOCTYPE html>
<!--
Plain-Academic by Vasilios Mavroudis
Released under the Simplified BSD License/FreeBSD (2-clause) License.
https://github.com/mavroudisv/plain-academic
-->
<html lang="en">
<head>
<meta name="viewport" content="width=800">
<meta name="generator" content="HTML Tidy for Linux/x86 (vers 11 February 2007), see www.w3.org">
<style type="text/css">
a {
color: #1772d0;
text-decoration:none;
}
a:focus, a:hover {
color: #f09228;
text-decoration:none;
}
body,td,th,tr,p,a {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 15px
}
table, th, td {
border: 10px;
padding: 15px;
}
table {
border-spacing: 35px;
}
strong {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 15px;
}
heading {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 25px;
}
papertitle {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 18px;
font-weight: 700
}
name {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 32px;
}
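/* Helpers for stacking two images for a hover cross-fade: .one is the relative container, .two sits on top of it, and .fade animates the opacity change. */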
.one
{
width: 160px;
height: 160px;
position: relative;
}
.two
{
width: 160px;
height: 160px;
position: absolute;
transition: opacity .2s ease-in-out;
-moz-transition: opacity .2s ease-in-out;
-webkit-transition: opacity .2s ease-in-out;
}
.fade {
transition: opacity .2s ease-in-out;
-moz-transition: opacity .2s ease-in-out;
-webkit-transition: opacity .2s ease-in-out;
}
span.highlight {
background-color: #ffffd0;
}
</style>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-585B7WN');</script>
<!-- End Google Tag Manager -->
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-QW6NPLQPRW"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-QW6NPLQPRW');
</script>
<title>Erhan Gundogdu</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
<link href='https://fonts.googleapis.com/css?family=Oswald:700' rel='stylesheet' type='text/css'>
<link rel="apple-touch-icon" sizes="180x180" href="files/favicon_package/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="files/favicon_package/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="files/favicon_package/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff">
</head>
<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-585B7WN"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<!-- Navigation -->
<nav class="navbar navbar-inverse">
<div class="container">
<ul class="nav navbar-nav">
<li><a href="index.html">Home</a></li>
<li><a style="color:#e01709" href="research.html">Research and Publications</a></li>
<li><a href="other_activities.html">News and Activities</a></li>
</ul>
</div>
</nav>
<!-- Page Content -->
<div class="container">
<div class="row">
<!-- Publications -->
<div class="col-md-8" style="min-height: 100vh; height: auto;">
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="100%" valign="middle">
<heading style = "font-size:30px"><b>Research Interests</b></heading>
<p style = "font-size:15px">
My research interests include but not limited to video understanding, multi-modal image/video representation learning, (visible and infrared) object tracking, recognition and (weakly-supervised) detection, deep metric learning, 3D object understanding (3D cloth fitting, 3D shape recognition and extraction).
<p style = "font-size:15px"> For my full publication list, please visit <a style = "font-size:15px" target="_blank" href="https://scholar.google.ch/citations?user=nZD_5vsAAAAJ&hl=en&oi=ao">my Google Scholar Page</a>.
My Ph.D. thesis is about visual object tracking (<a style = "font-size:15px" target="_blank" href="http://etd.lib.metu.edu.tr/upload/12621448/index.pdf">lib.metu</a>) and my M.Sc. thesis is about local feature detection and description learning for fast image matching (<a style = "font-size:15px" target="_blank" href="https://etd.lib.metu.edu.tr/upload/12614618/index.pdf">lib.metu</a>).
</td>
</tr>
</table>
<table width="1000" border="0" align="center" cellspacing="30" cellpadding="0">
<tr>
<td>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='iEDIT.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Generative AI</heading><br>
<papertitle>iEdit: Localised Text-guided Image Editing with Weak Supervision</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/2305.05947.pdf">arXiv</a>)
<br>
R. Bodur, <strong>E. Gundogdu</strong>, B. Bhattarai, T.K. Kim, M. Donoser, L. Bazzani,
<em>arXiv preprint</em>, 2023 <br>
<p></p>
<p id="textAreaiEDIT" align="justify" style = "font-size:15px">Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose ...
</p><a id="toggleButtoniEDIT" onclick="toggleTextiEDIT()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<table width="1000" border="0" align="center" cellspacing="30" cellpadding="0">
<tr>
<td>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='CLAP.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Video Representation Learning</heading><br>
<papertitle>Contrastive Language-Action Pre-training for Temporal Localization</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/2204.12293.pdf">arXiv</a>)
<br>
M. Xu, <strong>E. Gundogdu</strong>, M. Lapin, B. Ghanem, M. Donoser, L. Bazzani,
<em>arXiv preprint</em>, 2022 <br>
<p></p>
<p id="textAreaCLAP" align="justify" style = "font-size:15px">In this work, we address the limitations of using pre-trained video backbones on trimmed action recognition datasets which do not have sufficient temporal sensitivity to distinguish foreground and background. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.
</p><a id="toggleButtonCLAP" onclick="toggleTextCLAP()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='ABO.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Object Retrieval Benchmark</heading><br>
<papertitle>ABO: Dataset and Benchmarks for Real-World 3D Object Understanding</papertitle>
<br>
(<a target="_blank" href="https://openaccess.thecvf.com/content/CVPR2022/papers/Collins_ABO_Dataset_and_Benchmarks_for_Real-World_3D_Object_Understanding_CVPR_2022_paper.pdf">CVF</a>)
(<a target="_blank" href="https://amazon-berkeley-objects.s3.amazonaws.com/index.html">Dataset</a>)<br>
J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, <strong>E. Gundogdu</strong>, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, M. Guillaumin, J. Malik,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2022 <br>
<p></p>
<p id="textAreaABO" align="justify" style = "font-size:15px">We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.
</p><a id="toggleButtonABO" onclick="toggleTextABO()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='food.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Cross-Modal Recipe Retrieval</heading><br>
<papertitle>Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning</papertitle>
<br>
(<a target="_blank" href="https://openaccess.thecvf.com/content/CVPR2021/papers/Salvador_Revamping_Cross-Modal_Recipe_Retrieval_With_Hierarchical_Transformers_and_Self-Supervised_Learning_CVPR_2021_paper.pdf">CVF</a>)
(<a target="_blank" href="https://github.com/amzn/image-to-recipe-transformers">Code</a>)<br>
A. Salvador, <strong>E. Gundogdu</strong>, L. Bazzani, M. Donoser,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2021 <br>
<p></p>
<p id="textAreaFOOD" align="justify" style = "font-size:15px">In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We leverage transformers more effectively with a hierarchical design and exploit self-supervised text representation learning where we support different food descriptions to be similar but not the same. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.
</p><a id="toggleButtonFOOD" onclick="toggleTextFOOD()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<video width="300" class="border" controls loop>
<source src="output.mp4" type="video/mp4">
</video>
<video width="300" class="border" controls loop>
<source src="output2.mp4" type="video/mp4">
</video>
<video width="300" class="border" controls loop>
<source src="output3.mp4" type="video/mp4">
</video><p></p><p></p>
</td>
<td valign="top" width="75%">
<heading>3D Cloth Draping by Deep Learning</heading><br>
<ul>
<li>
<papertitle>GarNet++: Improving Fast and Accurate Static 3D Cloth Draping by Curvature Loss</papertitle>
(<a target="_blank" href="https://ieeexplore.ieee.org/document/9145703">ieee.org</a>, <a target="_blank" href="https://arxiv.org/pdf/2007.10867.pdf">arXiv Preprint</a>)
<strong>E. Gundogdu</strong>, V. Constantin, S. Parashar, A. Seifoddini, M. Dang, M. Salzmann, P. Fua,
<em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 2020
(<a target="_blank" href="garnet.bib">bibtex</a>, <a target="_blank" href="https://cvlab.epfl.ch/research/garment-simulation/garnet/">webpage</a>)
<p></p>
<li>
<papertitle>GarNet: A Two-stream Network for Fast and Accurate 3D Cloth Draping</papertitle>
(<a target="_blank" href="http://openaccess.thecvf.com/content_ICCV_2019/papers/Gundogdu_GarNet_A_Two-Stream_Network_for_Fast_and_Accurate_3D_Cloth_ICCV_2019_paper.pdf">thecvf.com</a>, <a target="_blank" href="https://arxiv.org/abs/1811.10983">arXiv Preprint</a>)
<strong>E. Gundogdu</strong>, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, P. Fua,
<em>IEEE International Conference on Computer Vision</em>, 2019
(<a target="_blank" href="garnet.bib">bibtex</a>, <a target="_blank" href="https://cvlab.epfl.ch/research/garment-simulation/garnet/">webpage</a>)
<p></p>
</ul>
<p id="textAreaGAR" align="justify" style = "font-size:15px"> In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time.
</p><a id="toggleButtonGAR" onclick="toggleTextGAR()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='uv_space.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Shape Reconstruction</heading><br>
<papertitle>Shape Reconstruction by Learning Differentiable Surface Representations</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/1911.11227.pdf">arXiv Preprint</a>)<br>
J. Bednarik, S. Parashar, <strong>E. Gundogdu</strong>, M. Salzmann, P. Fua,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2020 <br>
<p></p>
<p id="textAreaSHAPE" align="justify" style = "font-size:15px">In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap.
</p><a id="toggleButtonSHAPE" onclick="toggleTextSHAPE()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='visuals.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Deep Learning for Correlation Filters</heading><br>
<papertitle>Good Features to Correlate for Visual Tracking</papertitle>
<br>
(<a target="_blank" href="https://ieeexplore.ieee.org/document/8291524/">ieee.org</a>,
<a target="_blank" href="https://arxiv.org/pdf/1704.06326.pdf">arXiv Preprint</a>)<br>
<strong>E. Gundogdu</strong>, A. A. Alatan,
<em>IEEE Transactions on Image Processing</em>, 2018 <br>
<a target="_blank" href="https://github.com/egundogdu/CFCF">code</a>
<a target="_blank" href="CFCF.bib">bibtex</a>
<p></p>
<p align="justify" style = "font-size:15px">In this work, the problem of learning deep fully convolutional features for the
CFB visual tracking is formulated. To learn the proposed model, a novel and efficient backpropagation algorithm is presented
based on the loss function of the network. The proposed learning framework enables the network model to be flexible
for a custom design. Moreover, it alleviates the dependency on the network trained for classification. The proposed tracking method is the winner of
<a target="_blank" href="http://www.votchallenge.net/">VOT2017</a> Challenge, organized by IEEE ICCV 2017.</p>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='ensemble.png' width="300" height="200">
<img src='spatialWindowing.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Improving Correlation Filters</heading><br>
<ul>
<li><papertitle>Extending Correlation Filter based Visual Tracking by Tree-Structured Ensemble and Spatial Windowing</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7995133/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, A. A. Alatan,
<em>IEEE Transactions on Image Processing</em>, 2017 <br>
</li><li><papertitle>Spatial Windowing for Correlation Filter Based Visual Tracking</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7532645/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, A. A. Alatan,
<em>IEEE International Conference on Image Processing (ICIP), 2016</em> <br>
</li><li><papertitle>Ensemble of Adaptive Correlation Filters for Robust Visual Tracking</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7738031/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, A. A. Alatan,
<em>IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), 2016</em> <br>
</li></ul>
<a target="_blank" href="ENSEMBLE.bib">bibtex</a>
<p></p>
<p align="justify" style = "font-size:15px">In the studies above, we improve upon the conventional correlation filters by proposing two methods. First, we present an approach to learn a spatial window at each frame during the course of the tracking. When the learned window is element-wise multiplied by the object patch/correlation filter, it can suppress the irrelevant regions of the object patch. Second, a tree-structured ensemble of trackers algorithm is proposed to combine multiple correaltion filter-based trackers while hierarchically keeping the appearance model of the object at the tree nodes. At each frame, only the relevant node trackers are activated to be combined as the final tracking decision. The combination of these two approaches also yield a better performance.</p>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='MarvelDataset.jpg' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Visual Recognition for Maritime Vessels</heading><br>
<ul>
<li><papertitle>MARVEL: A Large-Scale Image Dataset for Maritime Vessels</papertitle> (<a target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-54193-8_11">SpringerLink</a>)<br>
<strong>E. Gundogdu</strong>, B. Solmaz, V. Yucesoy, A. Koc,
<em>Asian Conference on Computer Vision</em>, 2016 <br>
</li>
<li><papertitle>Generic and Attribute-specific Deep Representations for Maritime Vessels </papertitle>(<a target="_blank" href="https://ipsjcva.springeropen.com/articles/10.1186/s41074-017-0033-4">SpringerOpen</a>)<br>
B. Solmaz, <strong>E. Gundogdu</strong>, V. Yucesoy, A. Koc,
<em>IPSJ Transactions on Computer Vision and Applications, 2017</em> <br>
</li>
<li><papertitle>Fine-Grained Recognition of Maritime Vessels and Land Vehicles by Deep Feature Embedding </papertitle>(<a target="_blank" href="http://digital-library.theiet.org/content/journals/10.1049/iet-cvi.2018.5187">IET Digital Lib.</a>)<br>
B. Solmaz, <strong>E. Gundogdu</strong>, V. Yucesoy, A. Koc, A. A. Alatan,
<em>IET Computer Vision, 2018</em> <br>
</li>
</ul>
<a target="_blank" href="VESSELS.bib">bibtex</a>
/
<a target="_blank" href="https://github.com/avaapm/marveldataset2016">dataset page</a>
<p></p>
<p id="textAreaVES" align="justify" style = "font-size:15px">In the studies above, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results.
</p> <a id="toggleButtonVES" onclick="toggleTextVES()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='InfraredFeats.png' width="300" height="200">
<img src='TBoost.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Tracking and Recognition in Infrared Spectrum</heading><br>
<ul>
<li><papertitle>Comparison of Infrared and Visible Imagery for Object Tracking: Toward Trackers with Superior IR Performance</papertitle> (<a target="_blank" href="http://openaccess.thecvf.com/content_cvpr_workshops_2015/W05/papers/Gundogdu_Comparison_of_Infrared_2015_CVPR_paper.pdf">thecvf.com</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, H. S. Demir, H. Ergezer, E. Akagunduz, S. K. Pakin<br>
<em>IEEE Computer Vision and Pattern Recognition Workshops</em>, 2015 <br>
</li>
<li><papertitle>Object classification in infrared images using deep representations</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/abstract/document/7532521/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, A. Koc, A. A. Alatan <br>
<em>IEEE International Conference on Image Processing (ICIP), 2016</em> <br>
</li>
<li><papertitle>Evaluation of Feature Channels for Correlation-Filter-Based Visual Object Tracking in Infrared Spectrum</papertitle> (<a target="_blank" href="http://openaccess.thecvf.com/content_cvpr_2016_workshops/w9/papers/Gundogdu_Evaluation_of_Feature_CVPR_2016_paper.pdf">thecvf.com</a>)<br>
<strong>E. Gundogdu</strong>, A. Koc, B. Solmaz, R. I. Hammoud, A. A. Alatan<br>
<em>IEEE Computer Vision and Pattern Recognition Workshops</em>, 2016 <br>
</li>
</ul>
<a target="_blank" href="INFRARED.bib">bibtex</a>
<p></p>
<p id="textAreaIR" align="justify" style = "font-size:15px">Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented.
</p> <a id="toggleButtonIR" onclick="toggleTextIR()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
</td>
</tr>
</table>
</div>
</div>
<script>
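// Each publication entry pairs a short excerpt (<p id="textArea...">) with a
// "See More"/"See Less" link (id="toggleButton..."). The toggleText... functions
// below swap the excerpt and the full abstract in place, tracking the current
// state in a status... variable ("less" = excerpt shown, "more" = full abstract).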
var statusIR = "less";
function toggleTextIR()
{
var text="Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented. Second, a deep learning based classification network is trained in an in-house dataset (consisting of more than 70 real-world IR videos) to learn IR specific features. Finally, these IR specific features are utilized for IR object tracking, and a significant amount of performance increase is observed with respect to the manually designed features of visible spectrum.";
if (statusIR == "less") {
document.getElementById("textAreaIR").innerHTML=text;
document.getElementById("toggleButtonIR").innerHTML = "See Less";
statusIR = "more";
} else if (statusIR == "more") {
document.getElementById("textAreaIR").innerHTML = "Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented.";
document.getElementById("toggleButtonIR").innerHTML = "See More";
statusIR = "less"
}
}
var statusVES = "less";
function toggleTextVES()
{
var text="In the above studies, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results. Furthermore, we attempted interesting problems of visual marine surveillance such as predicting and classifying maritime vessel attributes such as length, summer deadweight, draught, and gross tonnage by solely interpreting the visual content in the wild, where no additional cues such as scale, orientation, or location are provided. By utilizing generic and attribute-specific deep representations for maritime vessels, we obtained promising results for the aforementioned applications.";
if (statusVES == "less") {
document.getElementById("textAreaVES").innerHTML=text;
document.getElementById("toggleButtonVES").innerHTML = "See Less";
statusVES = "more";
} else if (statusVES == "more") {
document.getElementById("textAreaVES").innerHTML = "In the above studies, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results.";
document.getElementById("toggleButtonVES").innerHTML = "See More";
statusVES = "less"
}
}
var statusGAR = "less";
function toggleTextGAR()
{
var text="In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time. To train the network, we introduce loss terms inspired by PBS to produce plausible results and make the model collision-aware. To increase the details of the draped garment, we introduce two loss functions that penalize the difference between the curvature of the predicted cloth and PBS. Particularly, we study the impact of mean curvature and a novel detail-preserving loss both qualitatively and quantitatively. Our new curvature loss computes the local covariance matrices of the 3D points, and compares the Rayleigh quotients of the prediction and PBS. This leads to more details while performing favorably or comparably against the loss that considers mean curvature vectors in the 3D triangulated meshes. We validate our framework on four garment types for various body shapes and poses. Finally, we achieve superior performance against a recently proposed data-driven method.";
if (statusGAR == "less") {
document.getElementById("textAreaGAR").innerHTML=text;
document.getElementById("toggleButtonGAR").innerHTML = "See Less";
statusGAR = "more";
} else if (statusGAR == "more") {
document.getElementById("textAreaGAR").innerHTML = "In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time.";
document.getElementById("toggleButtonGAR").innerHTML = "See More";
statusGAR = "less"
}
}
var statusSHAPE = "less";
function toggleTextSHAPE()
{
var text="Generative models that produce point clouds have emerged as a powerful tool to represent 3D surfaces, and the best current ones rely on learning an ensemble of parametric representations. Unfortunately, they offer no control over the deformations of the surface patches that form the ensemble and thus fail to prevent them from either overlapping or collapsing into single points or lines. As a consequence, computing shape properties such as surface normals and curvatures becomes difficult and unreliable. In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap. Furthermore, this lets us reliably compute quantities such as surface normals and curvatures. We will demonstrate on several tasks that this yields more accurate surface reconstructions than the state-of-the-art methods in terms of normals estimation and amount of collapsed and overlapped patches.";
if (statusSHAPE == "less") {
document.getElementById("textAreaSHAPE").innerHTML=text;
document.getElementById("toggleButtonSHAPE").innerHTML = "See Less";
statusSHAPE = "more";
} else if (statusSHAPE == "more") {
document.getElementById("textAreaSHAPE").innerHTML = "In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap.";
document.getElementById("toggleButtonSHAPE").innerHTML = "See More";
statusSHAPE = "less"
}
}
var statusFOOD = "less";
function toggleTextFOOD()
{
var text="Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives, as well as the availability of vast amounts of digital cooking recipes and food images to train machine learning models. In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We introduce a hierarchical recipe Transformer which attentively encodes individual recipe components (titles, ingredients and instructions). Further, we propose a self-supervised loss function computed on top of pairs of individual recipe components, which is able to leverage semantic relationships within recipes, and enables training using both image-recipe and recipe-only samples. We conduct a thorough analysis and ablation studies to validate our design choices. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.";
if (statusFOOD == "less") {
document.getElementById("textAreaFOOD").innerHTML=text;
document.getElementById("toggleButtonFOOD").innerHTML = "See Less";
statusFOOD = "more";
} else if (statusFOOD == "more") {
document.getElementById("textAreaFOOD").innerHTML = "In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We leverage transformers more effectively with a hierarchical design and exploit self-supervised text representation learning where we support different food descriptions to be similar but not the same. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.";
document.getElementById("toggleButtonFOOD").innerHTML = "See More";
statusFOOD = "less"
}
}
var statusABO = "less";
function toggleTextABO()
{
var text="We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.";
if (statusABO == "less") {
document.getElementById("textAreaABO").innerHTML=text;
document.getElementById("toggleButtonABO").innerHTML = "See Less";
statusABO = "more";
} else if (statusABO == "more") {
document.getElementById("textAreaABO").innerHTML = "We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.";
document.getElementById("toggleButtonABO").innerHTML = "See More";
statusABO = "less"
}
}
var statusCLAP = "less";
function toggleTextCLAP()
{
var text="Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and lack of temporal annotations at large-scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents to capture the relations between different action categories and the background context in a video clip which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training approach without freezing the video encoder which leverages language. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.";
if (statusCLAP == "less") {
document.getElementById("textAreaCLAP").innerHTML=text;
document.getElementById("toggleButtonCLAP").innerHTML = "See Less";
statusCLAP = "more";
} else if (statusCLAP == "more") {
document.getElementById("textAreaCLAP").innerHTML = "In this work, we address the limitations of using pre-trained video backbones on trimmed action recognition datasets which do not have sufficient temporal sensitivity to distinguish foreground and background. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.";
document.getElementById("toggleButtonCLAP").innerHTML = "See More";
statusCLAP = "less"
}
}
var statusiEDIT = "less";
function toggleTextiEDIT()
{
var text="Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose to automatically construct a dataset derived from LAION-5B, containing pseudo-target images with their descriptive edit prompts given input image-caption pairs. This dataset gives us the flexibility of introducing a weakly-supervised loss function to generate the pseudo-target image from the latent noise of the source image conditioned on the edit prompt. To encourage localised editing and preserve or modify spatial structures in the image, we propose a loss function that uses segmentation masks to guide the editing during training and optionally at inference. Our model is trained on the constructed dataset with 200K samples and constrained GPU resources. It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.";
if (statusiEDIT == "less") {
document.getElementById("textAreaiEDIT").innerHTML=text;
document.getElementById("toggleButtoniEDIT").innerHTML = "See Less";
statusiEDIT = "more";
} else if (statusiEDIT == "more") {
document.getElementById("textAreaiEDIT").innerHTML = "Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose ...";
document.getElementById("toggleButtoniEDIT").innerHTML = "See More";
statusiEDIT = "less"
}
}
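// The eight toggleText... functions above repeat the same pattern. A possible
// consolidation is sketched below; it is not wired into the page (nothing calls
// it), and the element ids and texts passed to it are illustrative only.
function toggleAbstract(areaId, buttonId, shortText, fullText) {
  var area = document.getElementById(areaId);
  var button = document.getElementById(buttonId);
  var expanded = (button.innerHTML === "See Less");
  // Swap between the excerpt and the full abstract, and relabel the link.
  area.innerHTML = expanded ? shortText : fullText;
  button.innerHTML = expanded ? "See More" : "See Less";
}
// Hypothetical usage for one entry:
// toggleAbstract("textAreaIR", "toggleButtonIR", "<short excerpt>", "<full abstract>");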
</script>
</body>
</html>