<!DOCTYPE html>
<!--
Plain-Academic by Vasilios Mavroudis
Released under the Simplified BSD License/FreeBSD (2-clause) License.
https://github.com/mavroudisv/plain-academic
-->
<html lang="en">
<head>
<meta name="viewport" content="width=800">
<meta name="generator" content="HTML Tidy for Linux/x86 (vers 11 February 2007), see www.w3.org">
<style type="text/css">
a {
color: #1772d0;
text-decoration:none;
}
a:focus, a:hover {
color: #f09228;
text-decoration:none;
}
body,td,th,tr,p,a {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 15px
}
table, th, td {
border: 10px;
padding: 15px;
}
table {
border-spacing: 35px;
}
strong {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 15px;
}
heading {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 25px;
}
papertitle {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 18px;
font-weight: 700
}
name {
font-family: 'Lato', Verdana, Helvetica, sans-serif;
font-size: 32px;
}
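/* Helpers for stacking two images for a hover cross-fade: .one is the relative container, .two sits on top of it, and .fade animates the opacity change. */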
.one
{
width: 160px;
height: 160px;
position: relative;
}
.two
{
width: 160px;
height: 160px;
position: absolute;
transition: opacity .2s ease-in-out;
-moz-transition: opacity .2s ease-in-out;
-webkit-transition: opacity .2s ease-in-out;
}
.fade {
transition: opacity .2s ease-in-out;
-moz-transition: opacity .2s ease-in-out;
-webkit-transition: opacity .2s ease-in-out;
}
span.highlight {
background-color: #ffffd0;
}
</style>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-585B7WN');</script>
<!-- End Google Tag Manager -->
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-QW6NPLQPRW"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-QW6NPLQPRW');
</script>
<title>Erhan Gundogdu</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
<link href='https://fonts.googleapis.com/css?family=Oswald:700' rel='stylesheet' type='text/css'>
<link rel="apple-touch-icon" sizes="180x180" href="files/favicon_package/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="files/favicon_package/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="files/favicon_package/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff">
</head>
<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-585B7WN"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<!-- Navigation -->
<nav class="navbar navbar-inverse">
<div class="container">
<ul class="nav navbar-nav">
<li><a href="index.html">Home</a></li>
<li><a style="color:#e01709" href="research.html">Research and Publications</a></li>
<li><a href="other_activities.html">News and Activities</a></li>
</ul>
</div>
</nav>
<!-- Page Content -->
<div class="container">
<div class="row">
<!-- Publications -->
<div class="col-md-8" style="min-height: 100vh; height: auto;">
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="100%" valign="middle">
<heading style = "font-size:30px"><b>Research Interests</b></heading>
<p style = "font-size:15px">
My research interests include but not limited to video understanding, multi-modal image/video representation learning, (visible and infrared) object tracking, recognition and (weakly-supervised) detection, deep metric learning, 3D object understanding (3D cloth fitting, 3D shape recognition and extraction).
<p style = "font-size:15px"> For my full publication list, please visit <a style = "font-size:15px" target="_blank" href="https://scholar.google.ch/citations?user=nZD_5vsAAAAJ&hl=en&oi=ao">my Google Scholar Page</a>.
My Ph.D. thesis is about visual object tracking (<a style = "font-size:15px" target="_blank" href="http://etd.lib.metu.edu.tr/upload/12621448/index.pdf">lib.metu</a>) and my M.Sc. thesis is about local feature detection and description learning for fast image matching (<a style = "font-size:15px" target="_blank" href="https://etd.lib.metu.edu.tr/upload/12614618/index.pdf">lib.metu</a>).
</td>
</tr>
</table>
<table width="1000" border="0" align="center" cellspacing="30" cellpadding="0">
<tr>
<td>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='iEDIT.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Generative AI</heading><br>
<papertitle>iEdit: Localised Text-guided Image Editing with Weak Supervision</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/2305.05947.pdf">arXiv</a>)
<br>
R. Bodur, <strong>E. Gundogdu</strong>, B. Bhattarai, T.K. Kim, M. Donoser, L. Bazzani,
<em>arXiv preprint</em>, 2023 <br>
<p></p>
<p id="textAreaiEDIT" align="justify" style = "font-size:15px">Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose ...
</p><a id="toggleButtoniEDIT" onclick="toggleTextiEDIT()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<table width="1000" border="0" align="center" cellspacing="30" cellpadding="0">
<tr>
<td>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='CLAP.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Video Representation Learning</heading><br>
<papertitle>Contrastive Language-Action Pre-training for Temporal Localization</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/2204.12293.pdf">arXiv</a>)
<br>
M. Xu, <strong>E. Gundogdu</strong>, M. Lapin, B. Ghanem, M. Donoser, L. Bazzani,
<em>arXiv preprint</em>, 2022 <br>
<p></p>
<p id="textAreaCLAP" align="justify" style = "font-size:15px">In this work, we address the limitations of using pre-trained video backbones on trimmed action recognition datasets which do not have sufficient temporal sensitivity to distinguish foreground and background. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.
</p><a id="toggleButtonCLAP" onclick="toggleTextCLAP()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='ABO.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Object Retrieval Benchmark</heading><br>
<papertitle>ABO: Dataset and Benchmarks for Real-World 3D Object Understanding</papertitle>
<br>
(<a target="_blank" href="https://openaccess.thecvf.com/content/CVPR2022/papers/Collins_ABO_Dataset_and_Benchmarks_for_Real-World_3D_Object_Understanding_CVPR_2022_paper.pdf">CVF</a>)
(<a target="_blank" href="https://amazon-berkeley-objects.s3.amazonaws.com/index.html">Dataset</a>)<br>
J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, <strong>E. Gundogdu</strong>, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, M. Guillaumin, J. Malik,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2022 <br>
<p></p>
<p id="textAreaABO" align="justify" style = "font-size:15px">We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.
</p><a id="toggleButtonABO" onclick="toggleTextABO()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='food.png' width="300">
</td>
<td valign="top" width="60%">
<heading>Cross-Modal Recipe Retrieval</heading><br>
<papertitle>Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning</papertitle>
<br>
(<a target="_blank" href="https://openaccess.thecvf.com/content/CVPR2021/papers/Salvador_Revamping_Cross-Modal_Recipe_Retrieval_With_Hierarchical_Transformers_and_Self-Supervised_Learning_CVPR_2021_paper.pdf">CVF</a>)
(<a target="_blank" href="https://github.com/amzn/image-to-recipe-transformers">Code</a>)<br>
A. Salvador, <strong>E. Gundogdu</strong>, L. Bazzani, M. Donoser,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2021 <br>
<p></p>
<p id="textAreaFOOD" align="justify" style = "font-size:15px">In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We leverage transformers more effectively with a hierarchical design and exploit self-supervised text representation learning where we support different food descriptions to be similar but not the same. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.
</p><a id="toggleButtonFOOD" onclick="toggleTextFOOD()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<video width="300" class="border" controls loop>
<source src="output.mp4" type="video/mp4">
</video>
<video width="300" class="border" controls loop>
<source src="output2.mp4" type="video/mp4">
</video>
<video width="300" class="border" controls loop>
<source src="output3.mp4" type="video/mp4">
</video><p></p><p></p>
</td>
<td valign="top" width="75%">
<heading>3D Cloth Draping by Deep Learning</heading><br>
<ul>
<li>
<papertitle>GarNet++: Improving Fast and Accurate Static 3D Cloth Draping by Curvature Loss</papertitle>
(<a target="_blank" href="https://ieeexplore.ieee.org/document/9145703">ieee.org</a>, <a target="_blank" href="https://arxiv.org/pdf/2007.10867.pdf">arXiv Preprint</a>)
<strong>E. Gundogdu</strong>, V. Constantin, S. Parashar, A. Seifoddini, M. Dang, M. Salzmann, P. Fua,
<em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 2020
(<a target="_blank" href="garnet.bib">bibtex</a>, <a target="_blank" href="https://cvlab.epfl.ch/research/garment-simulation/garnet/">webpage</a>)
<p></p>
<li>
<papertitle>GarNet: A Two-stream Network for Fast and Accurate 3D Cloth Draping</papertitle>
(<a target="_blank" href="http://openaccess.thecvf.com/content_ICCV_2019/papers/Gundogdu_GarNet_A_Two-Stream_Network_for_Fast_and_Accurate_3D_Cloth_ICCV_2019_paper.pdf">thecvf.com</a>, <a target="_blank" href="https://arxiv.org/abs/1811.10983">arXiv Preprint</a>)
<strong>E. Gundogdu</strong>, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, P. Fua,
<em>IEEE International Conference on Computer Vision</em>, 2019
(<a target="_blank" href="garnet.bib">bibtex</a>, <a target="_blank" href="https://cvlab.epfl.ch/research/garment-simulation/garnet/">webpage</a>)
<p></p>
</ul>
<p id="textAreaGAR" align="justify" style = "font-size:15px"> In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time.
</p><a id="toggleButtonGAR" onclick="toggleTextGAR()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="20%">
<img src='uv_space.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Shape Reconstruction</heading><br>
<papertitle>Shape Reconstruction by Learning Differentiable Surface Representations</papertitle>
<br>
(<a target="_blank" href="https://arxiv.org/pdf/1911.11227.pdf">arXiv Preprint</a>)<br>
J. Bednarik, S. Parashar, <strong>E. Gundogdu</strong>, M. Salzmann, P. Fua,
<em>published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2020 <br>
<p></p>
<p id="textAreaSHAPE" align="justify" style = "font-size:15px">In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap.
</p><a id="toggleButtonSHAPE" onclick="toggleTextSHAPE()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='visuals.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Deep Learning for Correlation Filters</heading><br>
<papertitle>Good Features to Correlate for Visual Tracking</papertitle>
<br>
(<a target="_blank" href="https://ieeexplore.ieee.org/document/8291524/">ieee.org</a>,
<a target="_blank" href="https://arxiv.org/pdf/1704.06326.pdf">arXiv Preprint</a>)<br>
<strong>E. Gundogdu</strong>, A. A. Alatan,
<em>IEEE Transactions on Image Processing</em>, 2018 <br>
<a target="_blank" href="https://github.com/egundogdu/CFCF">code</a>
<a target="_blank" href="CFCF.bib">bibtex</a>
<p></p>
<p align="justify" style = "font-size:15px">In this work, the problem of learning deep fully convolutional features for the
CFB visual tracking is formulated. To learn the proposed model, a novel and efficient backpropagation algorithm is presented
based on the loss function of the network. The proposed learning framework enables the network model to be flexible
for a custom design. Moreover, it alleviates the dependency on the network trained for classification. The proposed tracking method is the winner of
<a target="_blank" href="http://www.votchallenge.net/">VOT2017</a> Challenge, organized by IEEE ICCV 2017.</p>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='ensemble.png' width="300" height="200">
<img src='spatialWindowing.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Improving Correlation Filters</heading><br>
<ul>
<li><papertitle>Extending Correlation Filter based Visual Tracking by Tree-Structured Ensemble and Spatial Windowing</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7995133/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, A. A. Alatan,
<em>IEEE Transactions on Image Processing</em>, 2017 <br>
</li><li><papertitle>Spatial Windowing for Correlation Filter Based Visual Tracking</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7532645/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, A. A. Alatan,
<em>IEEE International Conference on Image Processing (ICIP), 2016</em> <br>
</li><li><papertitle>Ensemble of Adaptive Correlation Filters for Robust Visual Tracking</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/document/7738031/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, A. A. Alatan,
<em>IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), 2016</em> <br>
</li></ul>
<a target="_blank" href="ENSEMBLE.bib">bibtex</a>
<p></p>
<p align="justify" style = "font-size:15px">In the studies above, we improve upon the conventional correlation filters by proposing two methods. First, we present an approach to learn a spatial window at each frame during the course of the tracking. When the learned window is element-wise multiplied by the object patch/correlation filter, it can suppress the irrelevant regions of the object patch. Second, a tree-structured ensemble of trackers algorithm is proposed to combine multiple correaltion filter-based trackers while hierarchically keeping the appearance model of the object at the tree nodes. At each frame, only the relevant node trackers are activated to be combined as the final tracking decision. The combination of these two approaches also yield a better performance.</p>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='MarvelDataset.jpg' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Visual Recognition for Maritime Vessels</heading><br>
<ul>
<li><papertitle>MARVEL: A Large-Scale Image Dataset for Maritime Vessels</papertitle> (<a target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-319-54193-8_11">SpringerLink</a>)<br>
<strong>E. Gundogdu</strong>, B. Solmaz, V. Yucesoy, A. Koc,
<em>Asian Conference on Computer Vision</em>, 2016 <br>
</li>
<li><papertitle>Generic and Attribute-specific Deep Representations for Maritime Vessels </papertitle>(<a target="_blank" href="https://ipsjcva.springeropen.com/articles/10.1186/s41074-017-0033-4">SpringerOpen</a>)<br>
B. Solmaz, <strong>E. Gundogdu</strong>, V. Yucesoy, A. Koc,
<em>IPSJ Transactions on Computer Vision and Applications, 2017</em> <br>
</li>
<li><papertitle>Fine-Grained Recognition of Maritime Vessels and Land Vehicles by Deep Feature Embedding </papertitle>(<a target="_blank" href="http://digital-library.theiet.org/content/journals/10.1049/iet-cvi.2018.5187">IET Digital Lib.</a>)<br>
B. Solmaz, <strong>E. Gundogdu</strong>, V. Yucesoy, A. Koc, A. A. Alatan,
<em>IET Computer Vision, 2018</em> <br>
</li>
</ul>
<a target="_blank" href="VESSELS.bib">bibtex</a>
/
<a target="_blank" href="https://github.com/avaapm/marveldataset2016">dataset page</a>
<p></p>
<p id="textAreaVES" align="justify" style = "font-size:15px">In the studies above, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results.
</p> <a id="toggleButtonVES" onclick="toggleTextVES()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
<table width="100%" align="center" border="0" cellspacing="0" cellpadding="20">
<tr>
<td width="25%">
<img src='InfraredFeats.png' width="300" height="200">
<img src='TBoost.png' width="300" height="200">
</td>
<td valign="top" width="60%">
<heading>Tracking and Recognition in Infrared Spectrum</heading><br>
<ul>
<li><papertitle>Comparison of Infrared and Visible Imagery for Object Tracking: Toward Trackers with Superior IR Performance</papertitle> (<a target="_blank" href="http://openaccess.thecvf.com/content_cvpr_workshops_2015/W05/papers/Gundogdu_Comparison_of_Infrared_2015_CVPR_paper.pdf">thecvf.com</a>)<br>
<strong>E. Gundogdu</strong>, H. Ozkan, H. S. Demir, H. Ergezer, E. Akagunduz, S. K. Pakin<br>
<em>IEEE Computer Vision and Pattern Recognition Workshops</em>, 2015 <br>
</li>
<li><papertitle>Object classification in infrared images using deep representations</papertitle> (<a target="_blank" href="https://ieeexplore.ieee.org/abstract/document/7532521/">ieee.org</a>)<br>
<strong>E. Gundogdu</strong>, A. Koc, A. A. Alatan <br>
<em>IEEE International Conference on Image Processing (ICIP), 2016</em> <br>
</li>
<li><papertitle>Evaluation of Feature Channels for Correlation-Filter-Based Visual Object Tracking in Infrared Spectrum</papertitle> (<a target="_blank" href="http://openaccess.thecvf.com/content_cvpr_2016_workshops/w9/papers/Gundogdu_Evaluation_of_Feature_CVPR_2016_paper.pdf">thecvf.com</a>)<br>
<strong>E. Gundogdu</strong>, A. Koc, B. Solmaz, R. I. Hammoud, A. A. Alatan<br>
<em>IEEE Computer Vision and Pattern Recognition Workshops</em>, 2016 <br>
</li>
</ul>
<a target="_blank" href="INFRARED.bib">bibtex</a>
<p></p>
<p id="textAreaIR" align="justify" style = "font-size:15px">Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented.
</p> <a id="toggleButtonIR" onclick="toggleTextIR()" href="javascript:void(0);">See More</a>
</td>
</tr>
</table>
<br>
</td>
</tr>
</table>
</div>
</div>
<script>
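// Each publication entry pairs a short excerpt (<p id="textArea...">) with a
// "See More"/"See Less" link (id="toggleButton..."). The toggleText... functions
// below swap the excerpt and the full abstract in place, tracking the current
// state in a status... variable ("less" = excerpt shown, "more" = full abstract).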
var statusIR = "less";
function toggleTextIR()
{
var text="Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented. Second, a deep learning based classification network is trained in an in-house dataset (consisting of more than 70 real-world IR videos) to learn IR specific features. Finally, these IR specific features are utilized for IR object tracking, and a significant amount of performance increase is observed with respect to the manually designed features of visible spectrum.";
if (statusIR == "less") {
document.getElementById("textAreaIR").innerHTML=text;
document.getElementById("toggleButtonIR").innerHTML = "See Less";
statusIR = "more";
} else if (statusIR == "more") {
document.getElementById("textAreaIR").innerHTML = "Unlike the visible spectrum, the problem of object recognition and tracking are not extensively studied in Infrared (IR) Spectrum. In these studies, we first provide the first benchmark comparison work where the available tracking methods are evaluated in IR and Visible pairs of 20 videos and a novel ensemble of trackers method is presented.";
document.getElementById("toggleButtonIR").innerHTML = "See More";
statusIR = "less"
}
}
var statusVES = "less";
function toggleTextVES()
{
var text="In the above studies, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results. Furthermore, we attempted interesting problems of visual marine surveillance such as predicting and classifying maritime vessel attributes such as length, summer deadweight, draught, and gross tonnage by solely interpreting the visual content in the wild, where no additional cues such as scale, orientation, or location are provided. By utilizing generic and attribute-specific deep representations for maritime vessels, we obtained promising results for the aforementioned applications.";
if (statusVES == "less") {
document.getElementById("textAreaVES").innerHTML=text;
document.getElementById("toggleButtonVES").innerHTML = "See Less";
statusVES = "more";
} else if (statusVES == "more") {
document.getElementById("textAreaVES").innerHTML = "In the above studies, we first construct a large-scale maritime vessel dataset by distilling 2M annotated vessel images. Based on a semi-supervised clustering scheme, 26 hyper-classes for vessel types are construced. Four potential applications are introduced; namely, vessel classification, verification, retrieval and recognition with their provided baseline results.";
document.getElementById("toggleButtonVES").innerHTML = "See More";
statusVES = "less"
}
}
var statusGAR = "less";
function toggleTextGAR()
{
var text="In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time. To train the network, we introduce loss terms inspired by PBS to produce plausible results and make the model collision-aware. To increase the details of the draped garment, we introduce two loss functions that penalize the difference between the curvature of the predicted cloth and PBS. Particularly, we study the impact of mean curvature and a novel detail-preserving loss both qualitatively and quantitatively. Our new curvature loss computes the local covariance matrices of the 3D points, and compares the Rayleigh quotients of the prediction and PBS. This leads to more details while performing favorably or comparably against the loss that considers mean curvature vectors in the 3D triangulated meshes. We validate our framework on four garment types for various body shapes and poses. Finally, we achieve superior performance against a recently proposed data-driven method.";
if (statusGAR == "less") {
document.getElementById("textAreaGAR").innerHTML=text;
document.getElementById("toggleButtonGAR").innerHTML = "See Less";
statusGAR = "more";
} else if (statusGAR == "more") {
document.getElementById("textAreaGAR").innerHTML = "In this work, we tackle the problem of static 3D cloth draping on virtual human bodies. We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes. Our network learns to mimic a Physics-Based Simulation (PBS) method while requiring two orders of magnitude less computation time.";
document.getElementById("toggleButtonGAR").innerHTML = "See More";
statusGAR = "less"
}
}
var statusSHAPE = "less";
function toggleTextSHAPE()
{
var text="Generative models that produce point clouds have emerged as a powerful tool to represent 3D surfaces, and the best current ones rely on learning an ensemble of parametric representations. Unfortunately, they offer no control over the deformations of the surface patches that form the ensemble and thus fail to prevent them from either overlapping or collapsing into single points or lines. As a consequence, computing shape properties such as surface normals and curvatures becomes difficult and unreliable. In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap. Furthermore, this lets us reliably compute quantities such as surface normals and curvatures. We will demonstrate on several tasks that this yields more accurate surface reconstructions than the state-of-the-art methods in terms of normals estimation and amount of collapsed and overlapped patches.";
if (statusSHAPE == "less") {
document.getElementById("textAreaSHAPE").innerHTML=text;
document.getElementById("toggleButtonSHAPE").innerHTML = "See Less";
statusSHAPE = "more";
} else if (statusSHAPE == "more") {
document.getElementById("textAreaSHAPE").innerHTML = "In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap.";
document.getElementById("toggleButtonSHAPE").innerHTML = "See More";
statusSHAPE = "less"
}
}
var statusFOOD = "less";
function toggleTextFOOD()
{
var text="Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives, as well as the availability of vast amounts of digital cooking recipes and food images to train machine learning models. In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We introduce a hierarchical recipe Transformer which attentively encodes individual recipe components (titles, ingredients and instructions). Further, we propose a self-supervised loss function computed on top of pairs of individual recipe components, which is able to leverage semantic relationships within recipes, and enables training using both image-recipe and recipe-only samples. We conduct a thorough analysis and ablation studies to validate our design choices. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.";
if (statusFOOD == "less") {
document.getElementById("textAreaFOOD").innerHTML=text;
document.getElementById("toggleButtonFOOD").innerHTML = "See Less";
statusFOOD = "more";
} else if (statusFOOD == "more") {
document.getElementById("textAreaFOOD").innerHTML = "In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high performing encoders for text and images. We leverage transformers more effectively with a hierarchical design and exploit self-supervised text representation learning where we support different food descriptions to be similar but not the same. As a result, our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset. We make code and models publicly available.";
document.getElementById("toggleButtonFOOD").innerHTML = "See More";
statusFOOD = "less"
}
}
var statusABO = "less";
function toggleTextABO()
{
var text="We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.";
if (statusABO == "less") {
document.getElementById("textAreaABO").innerHTML=text;
document.getElementById("toggleButtonABO").innerHTML = "See Less";
statusABO = "more";
} else if (statusABO == "more") {
document.getElementById("textAreaABO").innerHTML = "We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.";
document.getElementById("toggleButtonABO").innerHTML = "See More";
statusABO = "less"
}
}
var statusCLAP = "less";
function toggleTextCLAP()
{
var text="Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and lack of temporal annotations at large-scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents to capture the relations between different action categories and the background context in a video clip which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training approach without freezing the video encoder which leverages language. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.";
if (statusCLAP == "less") {
document.getElementById("textAreaCLAP").innerHTML=text;
document.getElementById("toggleButtonCLAP").innerHTML = "See Less";
statusCLAP = "more";
} else if (statusCLAP == "more") {
document.getElementById("textAreaCLAP").innerHTML = "In this work, we address the limitations of using pre-trained video backbones on trimmed action recognition datasets which do not have sufficient temporal sensitivity to distinguish foreground and background. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.";
document.getElementById("toggleButtonCLAP").innerHTML = "See More";
statusCLAP = "less"
}
}
var statusiEDIT = "less";
function toggleTextiEDIT()
{
var text="Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose to automatically construct a dataset derived from LAION-5B, containing pseudo-target images with their descriptive edit prompts given input image-caption pairs. This dataset gives us the flexibility of introducing a weakly-supervised loss function to generate the pseudo-target image from the latent noise of the source image conditioned on the edit prompt. To encourage localised editing and preserve or modify spatial structures in the image, we propose a loss function that uses segmentation masks to guide the editing during training and optionally at inference. Our model is trained on the constructed dataset with 200K samples and constrained GPU resources. It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.";
if (statusiEDIT == "less") {
document.getElementById("textAreaiEDIT").innerHTML=text;
document.getElementById("toggleButtoniEDIT").innerHTML = "See Less";
statusiEDIT = "more";
} else if (statusiEDIT == "more") {
document.getElementById("textAreaiEDIT").innerHTML = "Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues on preserving the fidelity of the source image. We propose ...";
document.getElementById("toggleButtoniEDIT").innerHTML = "See More";
statusiEDIT = "less"
}
}
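// The eight toggleText... functions above repeat the same pattern. A possible
// consolidation is sketched below; it is not wired into the page (nothing calls
// it), and the element ids and texts passed to it are illustrative only.
function toggleAbstract(areaId, buttonId, shortText, fullText) {
  var area = document.getElementById(areaId);
  var button = document.getElementById(buttonId);
  var expanded = (button.innerHTML === "See Less");
  // Swap between the excerpt and the full abstract, and relabel the link.
  area.innerHTML = expanded ? shortText : fullText;
  button.innerHTML = expanded ? "See More" : "See Less";
}
// Hypothetical usage for one entry:
// toggleAbstract("textAreaIR", "toggleButtonIR", "<short excerpt>", "<full abstract>");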
</script>
</body>
</html>