From 92a6744477b51ed7533d68072e098139628f6f0f Mon Sep 17 00:00:00 2001
From: Boyuan Zheng
SEEACT is a generalist web agent based on GPT-4V.
Specifically, given a web-based task (e.g., “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” in Apple homepage),
the agent first performs Action Generation to produce an action description at each step towards completing the task (e.g., “Navigate to the iPhone category”),
- and then Element Grounding to identify an HTML element (e.g., “[button] iPhone”) at the current step on the webpage.
+ and then Action Grounding to identify an HTML element (e.g., “[button] iPhone”) at the current step on the webpage.
- SEEACT can successfully compete 50% of the tasks on live websites given an oracle element grounding method.
+ SEEACT can successfully complete 50% of the tasks on live websites given an oracle action grounding method. It also exhibits remarkable capabilities, ranging from long-range action planning and webpage content reasoning to error correction.
@@ -298,7 +298,8 @@
- SEEACT leverages an LMM like GPT-4V to visually perceive websites and generate plans in textual forms (Action Generation). The textual plans are then grounded onto the HTML elements and operations to act on the website (Action Grounding).
+ SEEACT first performs Action Generation by leveraging an LMM like GPT-4V to visually perceive websites and generate plans in textual form,
+ and then Action Grounding to ground the textual plans onto the HTML elements and operations to act on the website.