Commit

Permalink
Update index.html
boyuanzheng010 committed Dec 30, 2023
1 parent 92a6744 commit 4c40ae7
Showing 1 changed file with 23 additions and 5 deletions.
28 changes: 23 additions & 5 deletions index.html
@@ -363,16 +363,34 @@ <h2 class="title is-3">Action Grounding</h2>

<h2 class="title is-3">Experiments and Results</h2>
<div class="content has-text-justified">
<!-- <p>-->
<!-- To compare with SEEACT, we also implement methods based on text-only LLMs and BLIP2 following the two-stage strategy of MindAct.-->
<!-- Firstly, we employ the ranker above to pick the top 50 elements.-->
<!-- Subsequently, the action generation problem is formulated as a multi-choice question answering problem, with the candidate elements as options, including a "None" option if the target element is absent.-->
<!-- During inference, elements are clustered into groups of 5 elements, with iterative refinement, until a single choice is made or all options are discarded.-->
<!-- We evaluate supervised fine-tuning (SFT) methods using FLAN-T5 and BLIP2-T5 and in-context learning (ICL) methods using GPT-3.5, GPT-4. The experiment results are shown in the following table.-->
<!-- </p>-->
<p>
To compare with SEEACT, we also implement methods based on text-only LLMs and BLIP2 following the two-stage strategy of MindAct.
Firstly, we employ the ranker above to pick the top 50 elements.
Subsequently, the action generation problem is formulated as a multi-choice question answering problem, with the candidate elements as options, including a "None" option if the target element is absent.
During inference, elements are clustered into groups of 5 elements, with iterative refinement, until a single choice is made or all options are discarded.
We compare SEEACT with other models following the two-stage strategy of MindAct.
We evaluate supervised fine-tuning (SFT) methods using FLAN-T5 and BLIP2-T5, and in-context learning (ICL) methods using GPT-3.5 and GPT-4. The experimental results are shown in the following table.
</p>

<div class="content has-text-centered">
<img src="static/images/main_table.png" alt="algebraic reasoning" class="center" style="width: 90%; height: auto;">
</div>
<ul>
We observe the following results in the experiments:
<li>
(1) SEEACT with GPT-4V is a strong generalist web agent if oracle grounding is provided, substantially outperforming existing methods such as GPT-4 (20%) and FLAN-T5 (18%).
</li>
<li>
(2) Grounding is still a major challenge. The best grounding strategy still trails oracle grounding by 20-25%.
</li>
<li>
(3) In-context learning with large models (both LMMs and LLMs) shows better generalization to unseen websites, while supervised fine-tuning still has an edge on websites seen during training.
</li>

</ul>
</div>
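The two-stage baseline described in this hunk (rank the elements, then answer multi-choice questions over groups of 5 until one choice remains or all are discarded) can be sketched roughly as follows. This is only an illustrative sketch, not the MindAct or SEEACT code: the function name `refine_candidates` and the `choose` callback (standing in for the model call, returning `None` for the "None" option) are assumptions introduced here.

```python
from typing import Callable, List, Optional, Sequence


def refine_candidates(
    candidates: Sequence[str],
    choose: Callable[[Sequence[str]], Optional[str]],
    group_size: int = 5,
) -> Optional[str]:
    """Iteratively narrow ranked candidates to one element, or None.

    `choose` is a hypothetical stand-in for the model: given one group of
    options it returns the selected option, or None for the "None" answer.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    remaining: List[str] = list(candidates)
    while remaining:
        winners: List[str] = []
        # Ask one multi-choice question per group of `group_size` options.
        for start in range(0, len(remaining), group_size):
            group = remaining[start:start + group_size]
            picked = choose(group)
            if picked is not None and picked in group:
                winners.append(picked)
        # A round with a single group yields the final answer.
        if len(remaining) <= group_size:
            return winners[0] if winners else None
        # Otherwise the per-group winners compete in the next round.
        remaining = winners
    # Every group answered "None": all options were discarded.
    return None
```

In use, the first argument would be the top 50 elements from the ranker, for example `refine_candidates(ranked[:50], choose=ask_model)`, where `ranked` and `ask_model` are placeholder names.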


@@ -604,7 +622,7 @@ <h1 class="title is-1 mmmu">
<p>
We develop a new online evaluation tool using Playwright to evaluate web agents on live websites.
Our tool can convert the predicted action into a browser event and execute it on the website.
To adhere to ethical standards, our experiments are restricted to non-login tasks in compliance with user agreements, and we closely monitor agent activities during online evaluation to prevent any actions that have potentially harmful impact, like placing an order or modifying the user profile.
To adhere to ethical standards, our experiments are restricted to non-login tasks in compliance with user agreements, and we closely monitor agent activities during online evaluation to prevent any actions that have potentially harmful impact.
</p>
</div>
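As a rough illustration of what converting a predicted action into a browser event might look like, the sketch below dispatches an action onto a Playwright `Page`. The project's actual tool is not shown in this commit, so the `PredictedAction` schema, the operation names (`CLICK`, `TYPE`, `SELECT`), and the `execute_action` helper are all assumptions; only `page.locator(...)`, `click()`, `fill()`, and `select_option()` are real Playwright calls.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PredictedAction:
    """Hypothetical action schema; the field names are assumptions."""
    operation: str   # "CLICK", "TYPE", or "SELECT"
    selector: str    # CSS selector of the grounded element
    value: str = ""  # text to type or option to select, if any


def execute_action(page: Any, action: PredictedAction) -> None:
    """Turn one predicted action into a browser event on a Playwright page."""
    target = page.locator(action.selector)
    operation = action.operation.upper()
    if operation == "CLICK":
        target.click()
    elif operation == "TYPE":
        target.fill(action.value)
    elif operation == "SELECT":
        target.select_option(action.value)
    else:
        raise ValueError(f"unsupported operation: {action.operation}")
```

With Playwright installed, `page` would come from `sync_playwright()` (launch a browser, open a new page, call `page.goto(...)`) before passing it to `execute_action`.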
