
Commit

3_20
jzhzhang committed Mar 20, 2024
1 parent 57ee535 commit d8b49f5
Showing 2 changed files with 14 additions and 9 deletions.
index.html — 23 changes: 14 additions & 9 deletions
@@ -72,7 +72,7 @@
ul {
text-align: left; /* ensure the text is left-aligned */
list-style-position: inside; /* align list markers with the text */
font-size: 17px;
font-size: 16px;
list-style-type: circle; /* use circles as bullets */

}
@@ -156,7 +156,7 @@ <h1 class="title is-1 publication-title">NaVid: Video-based VLM Plans the Next S
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code (soon)</span>
<span>Code (TBD)</span>
</a>
</span> -->

@@ -202,7 +202,9 @@ <h2>
<!-- <p> <b>Real-world demos by following simple instructions, such as walking to a single landmark.</b></p> -->
<ul>
<li><b>NaVid is the first video-based Vision-Language Model (VLM) for the task of vision-and-language navigation (VLN).</b></li>
<li><b>NaVid is trained using 763k web-scale caption samples and 510k simulated samples, including VLN action samples (500k) and reasoning samples (10k).</b></li>
<li><b>NaVid leverages only RGB sequences, eliminating the need for location, orientation, depth, or maps.</b></li>
<li><b>NaVid is co-trained with real-world caption data (763k) and simulated VLN data (510k). The VLN capability is learned in simulation environments, while real-world understanding is gained from real-world caption data.</b></li>
<li><b>NaVid demonstrates strong generalizability and achieves state-of-the-art (SOTA) performance in both simulated and real-world environments.</b></li>
</ul>

@@ -228,7 +230,7 @@ <h2>



<h2 class="title is-3">Simple Instruction VLN</h2>
<h2 class="title is-3">(Sim-to-Real) Simple Instruction VLN</h2>
<p> <b>Real-world demos by following simple instructions, such as walking to a single landmark.</b></p>

<div id="results-carousel-teaser1" class="carousel results-carousel">
@@ -266,7 +268,7 @@ <h2 class="title is-3">Simple Instruction VLN</h2>
<section class="hero is-small">
<div class="hero-body">
<div class="container is-max-desktop has-text-centered">
<h2 class="title is-3">Complex Instruction VLN</h2>
<h2 class="title is-3">(Sim-to-Real) Complex Instruction VLN</h2>
<p> <b>Real-world demos by following complex instructions, which consist of several simple instructions.</b></p>

<div id="results-carousel-teaser2" class="carousel results-carousel">
@@ -325,7 +327,7 @@ <h2 class="title is-3">Method Overview</h2>
<img src="static/images/method.png" alt="NaVid" class="center-image blend-img-background">
<div class="level-set has-text-justified">
<p class="has-text-justified">
<b>The overview of NaVid.</b> The inputs to NaVid are the RGB frames of the online video observation {x0, · · · , xt} along with the human instruction I. For each frame, we use an observation encoder to extract the visual information with the instruction, yielding observation tokens that comprise instruction-queried tokens (orange blocks) and instruction-agnostic tokens (blue blocks). At the current step t, the history frames and the current frame xt are encoded as observation tokens, with 4 and 64 instruction-agnostic tokens for history frames and the current frame, respectively. In addition, our method obtains language tokens with a text encoder. Finally, delimited by the special tokens [HIS], [OBS], and [NAV], the observation tokens and language tokens are concatenated and sent to Vicuna-7B, which produces the next-step action.
<b>The overview of NaVid.</b> The inputs to NaVid are the RGB frames of the online video observation {x<sub>0</sub>, · · · , x<sub>t</sub>} along with the human instruction I. For each frame, we use an observation encoder to extract the visual information with the instruction, yielding observation tokens that comprise instruction-queried tokens (orange blocks) and instruction-agnostic tokens (blue blocks). At the current step t, the history frames and the current frame x<sub>t</sub> are encoded as observation tokens, with 4 and 64 instruction-agnostic tokens for history frames and the current frame, respectively. In addition, our method obtains language tokens with a text encoder. Finally, delimited by the special tokens [HIS], [OBS], and [NAV], the observation tokens and language tokens are concatenated and sent to Vicuna-7B, which produces the next-step action.
</p>
</div>
</div>
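The token pipeline described above can be made concrete with a short sketch. The Python below is illustrative only: the encoder interfaces (obs_encoder, text_encoder, special_embed), the function name build_navid_input, and the tensor handling are assumptions made for exposition, not the released NaVid implementation; only the token counts (4 per history frame, 64 for the current frame) and the [HIS]/[OBS]/[NAV] layout come from the description.

```python
import torch

NUM_HIST_TOKENS = 4   # instruction-agnostic tokens per history frame
NUM_CURR_TOKENS = 64  # instruction-agnostic tokens for the current frame


def build_navid_input(frames, instruction, obs_encoder, text_encoder, special_embed):
    """frames: RGB observations x_0 ... x_t; instruction: natural-language string I.

    All callables are hypothetical and assumed to return (num_tokens, dim) tensors.
    """
    history, current = frames[:-1], frames[-1]

    # Each frame yields observation tokens: instruction-queried tokens plus
    # instruction-agnostic tokens (4 for a history frame, 64 for the current frame).
    hist_tokens = [obs_encoder(f, instruction, num_tokens=NUM_HIST_TOKENS) for f in history]
    curr_tokens = obs_encoder(current, instruction, num_tokens=NUM_CURR_TOKENS)

    # Language tokens from the text encoder.
    lang_tokens = text_encoder(instruction)

    # Concatenate, delimited by the special tokens [HIS], [OBS], and [NAV];
    # the resulting sequence is fed to Vicuna-7B to decode the next-step action.
    return torch.cat(
        [special_embed("[HIS]"), *hist_tokens,
         special_embed("[OBS]"), curr_tokens,
         lang_tokens, special_embed("[NAV]")],
        dim=0,
    )
```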
@@ -344,9 +346,12 @@ <h2 class="title is-3">Method Overview</h2>
<h2 class="title is-3">Data Collection</h2>
<img src="static/images/data.png" alt="NaVid" class="center-image blend-img-background">
<div class="level-set has-text-justified">
<!-- <p class="has-text-justified">
<b>The overview of NaVid.</b> The inputs of NaVid consist of the RGB frames from the online video observation {x0, · · · , xt} along with the human instruction I. For each frame, we use an observation encoder to extract the visual information with the instruction to obtain observation tokens, including, instruction-queried tokens (orange blocks) and instruction-agnostic tokens (blue blocks). At the current step t, the history frames and current frame xt are encoded as observation tokens, with 4 and 64 instruction-agnostic tokens for history frames and current frames, respectively. Besides, our method obtains language tokens by a text encoder. Finally, split by the special tokens [HIS], [OBS], and [NAV], we concatenate the observation tokens and language tokens and send the tokens to the Vicuna-7B then obtain the next-step action.
</p> -->
<p class="has-text-justified">
<b>We co-train NaVid using real-world caption data (763k) and simulated VLN data (510k). The simulated VLN data consists of 500k action planning samples and 10k instruction reasoning samples.</b>
</p>
<p class="has-text-justified">
<b>We initialize the encoders and Vicuna-7B from pre-trained weights, and our model requires only one training epoch.</b>
</p>
</div>
</div>
</div>
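As a rough illustration of the co-training mixture and single-epoch schedule described above, the sketch below builds one combined dataset from the three sources. The StubSamples class, batch size, and loader setup are placeholders invented for this example, not the actual NaVid data pipeline; only the sample counts come from the text.

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class StubSamples(Dataset):
    """Placeholder data source; in practice each item would hold video frames,
    an instruction, and the target text (caption, action, or reasoning)."""

    def __init__(self, name, num_samples):
        self.name, self.num_samples = name, num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return {"source": self.name, "index": idx}


# Real-world caption data (763k) + simulated VLN data (500k action planning
# + 10k instruction reasoning) are mixed into a single training set.
train_set = ConcatDataset([
    StubSamples("web_caption", 763_000),
    StubSamples("vln_action", 500_000),
    StubSamples("vln_reasoning", 10_000),
])
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# Encoders and Vicuna-7B start from pre-trained weights; the mixture is
# traversed once (a single epoch).
for batch in loader:
    pass  # one forward/backward step of the co-trained VLM would go here
```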
Binary file modified static/images/data.png
