diff --git a/README.md b/README.md index 660bfdeac..40cccfe70 100644 --- a/README.md +++ b/README.md @@ -40,6 +40,8 @@ Start building LLM-empowered multi-agent applications in an easier way. ## News +- new**[2024-09-03]** AgentScope supports **Web Browser Control** now! Refer to our [example](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent) for more details. + - new**[2024-07-18]** AgentScope supports streaming mode now! Refer to our [tutorial](https://modelscope.github.io/agentscope/en/tutorial/203-stream.html) and example [conversation in stream mode](https://github.com/modelscope/agentscope/tree/main/examples/conversation_in_stream_mode) for more details.
diff --git a/README_ZH.md b/README_ZH.md index 4d607df93..089875af6 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -41,6 +41,8 @@ ## 新闻 +- new**[2024-09-03]** AgentScope 已更新浏览器控制模块,利用 vision 模型实现智能体对浏览器的控制。请参考[**样例**](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent) + - new**[2024-07-18]** AgentScope 已支持模型流式输出。请参考我们的 [**教程**](https://modelscope.github.io/agentscope/zh_CN/tutorial/203-stream.html) 和 [**流式对话样例**](https://github.com/modelscope/agentscope/tree/main/examples/conversation_in_stream_mode)!
diff --git a/docs/sphinx_doc/en/source/tutorial/211-web.md b/docs/sphinx_doc/en/source/tutorial/211-web.md index f91901c4b..e77b1d919 100644 --- a/docs/sphinx_doc/en/source/tutorial/211-web.md +++ b/docs/sphinx_doc/en/source/tutorial/211-web.md @@ -1,95 +1,90 @@ (211-web-en)= -# Enabling Web Browsing Ability in AgentScope +# Web Browser Control -Here we introduce how we implemented the web browsing ability in AgentScope. +AgentScope supports web browser control with the `agentscope.service.WebBrowser` module. +It allows agent to interact with web pages, and take actions like clicking, typing and scrolling. -The two modules we implemented are simple and straightforward: a **web browser** and a **web browsing agent**. -The web browser serves as an interface to the web, and the web browsing agent, equipped with the web browser, can perform web browsing tasks given the powerful capabilities with vision-based LLM. +> Note the current web browser module requires a vision LLM to work properly. We will provide text-based vision in the future. -Note that the web browsing abilities provided in AgentScope are still in beta. -We will refine the module and provide corresponding updates both in our codebase and the documentation here. +> Note the web browser module is still in beta, which will be updated frequently. -Now, let's take a closer look at the web browser and the web browsing agent. -## Web Browser Interface +## Prerequisites -The `WebBrowser` class we implemented is a simple interface to the web. It can open a webpage, click on elements, type in elements, take screenshots, etc. -The `WebBrowser` class is implemented in [web_browser.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py). +The `WebBrowser` module is implemented based on [Playwright](https://playwright.dev/). 
+You need to install the latest AgentScope, as well as the playwright packages as follows: -### Prerequisites +```bash +# Install the latest AgentScope from source +git clone https://github.com/modelscope/agentscope.git +cd agentscope +pip install -e . -The `WebBrowser` class operates your browser through Playwright. -To use the web browser, you need to install the necessary Playwright packages: - -- Run `pip install playwright` to set up the Python environment. -- Run `playwright install` to install the required browser for Playwright. - -### Usage - -The web browser we provided has simple interfaces. To use the web browser, we can simply create it and use it with the defined methods. +# Install playwright +pip install playwright +playwright install +``` -#### How to Create a Web Browser Instance +## Guidance -You can initialize the `WebBrowser` instance with the `__init__` function of `WebBrowser`: +Initialize the `WebBrowser` module as follows: ```python +from agentscope.service import WebBrowser + browser = WebBrowser() ``` -(Optional) You can also set specific attributes of the browser instance when initializing: -- `headless`: Determines whether to run the browser in headless mode. Defaults to `False`. When set to `False`, the browser will be visible to the user. -- `timeout`: The time for the browser to wait for the page to load. Defaults to `60000`. -- `default_width`: The default width of the browser. Defaults to `1280`. -- `default_height`: The default height of the browser. Defaults to `1080`. - -#### Defined Browser Interface -Currently, the defined methods of the `WebBrowser` are: `visit_page`, `crawl_page`, `click`, `type`, `scroll`, `press_key`, `close`, etc. The properties of the web browser include `url`, `page_html`, `page_title`, etc. -Due to space constraints, detailed introductions for each method interface are not provided here, but you can refer to the Sphinx API docs and inline comments for more details. 
-Instead, we will provide general guidance on how to use the web browser module. +The `WebBrowser` module facilitates browser control and state retrieval. +The names of the control functions are all prefixed with "action_", e.g. `action_visit_url` +and `action_click`. To see the full list of functions, call the `get_action_functions` method. -#### How Does Our `WebBrowser` Work? -To use our web browser, you need to first visit a webpage, then crawl the page to retrieve elements, and finally interact with those elements. +```python +# To see full supported actions +print(browser.get_action_functions()) -##### Visit Webpage -**First**, you can use `browser.visit_page(url)` to start the browser with the page you are interested in. Webpage navigation can also be triggered by using the `click` or `type` method. +# Visit a new webpage +browser.action_visit_url("https://www.bing.com") +``` -##### Crawl the Page -**Then**, you should call the `crawl_page` method to gather the elements of the current web page. A key feature of our web browser is that we label the interactive elements on the webpage with numbers, implemented in the `crawl_page` method. With the labeled elements, you can ensure the browser interacts with the correct elements on the page using methods like `click`, `type`, and `focus_element`. After the `crawl_page` method, the interactive elements will be labeled with numbers and stored in the `self.page_elements` property of the `WebBrowser` instance. You can then use methods like `click`, `type`, and `focus_element` to interact with these elements. +To monitor the current state of the browser, you can read the attributes prefixed with `page_`, e.g. `page_url`, `page_title`, and `page_html`. -The `crawl_page` method has three input arguments and four return values. You can choose whether to use vision to add a Set-of-Marks to the webpage and whether to include the `meta_data` field in the returned formatted text. 
+```python +# The url +print(browser.page_url) -> Set-of-Mark is a visual prompting method that partitions an image into numbered regions to improve the visual grounding ability of LLMs. You can refer to the paper https://arxiv.org/abs/2310.11441 for details. Or you can check out [our example](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/README.md) for a demonstration. +# The page title +print(browser.page_title) -`crawl_page` input arguments: -- `vision` (`bool`): Adds a set-of-marks to the webpage if vision is enabled and takes a screenshot. Instead of using a segmentation model, we use native JavaScript to bound the interactive elements on the webpage. -- `with_meta` (`bool`): Includes the `meta_data` field in the returned formatted text. -- `with_select` (`bool`): Returns only the selected interactive elements or all the numbered interactive elements. +# The page in MarkDown format (parsed by markdownify) +print(browser.page_markdown) -`crawl_page` returns: -- `elements`: The handler from Playwright of interactive elements, also stored in the `self.page_elements` property of the `WebBrowser` instance. -- `format_ele_text`: A list of the formatted elements' text descriptions, labeled with numbers starting from zero. -- `screenshot_bytes`: The screenshot of the webpage with the Set-of-Marks, in bytes. If you want to use vision-based MLLM, save the screenshot to a file using the `file_manager` module provided and provide the image path using the format function. -- `web_ele_infos`: The info dictionary of interactive elements. 
+# The page html (maybe too long) +print(browser.page_html) +``` -##### Perform Action -**Finally**, you can call methods like `click`, `type`, `focus_element` to interact with the labeled interactive elements, for example: +Besides, to help vision models to understand the webpage better, we provide `set_interactive_marks` function, +which will mark all the interactive elements on the current webpage with index labels. +After calling `set_interactive_marks` function, more actions can be performed on the webpage. +For example, clicking a button, typing in a text box, etc. ```python -browser.click(element_id) -``` +# Set interactive marks with index labels +browser.set_interactive_marks() -```python -browser.type(element_id, "Hello, World!") +# Remove interactive marks +# browser.remove_interactive_marks() ``` -you can also call methods other methods like `scroll`, `press_key`, `close`. You may refer to the API documentation or the [original code](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py) for more details. +## Work with Agent -### Web Browsing Agent -The `WebVoyagerAgent` we implemented is a simple agent that can perform web browsing tasks. It is implemented in [web_voyager_agent.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/agents/web_voyager_agent.py). The agent's reply function is implemented in the `reply` method of the `WebVoyagerAgent` class, and it follows these steps: 1. visit web page -> 2. crawl the page -> 3. perform action -> 4. repeat iteration until the goal is achieved. +The above functions provide basic operations for interactive web browser control. +You can use them to build your own web browsing agent. -You can try interacting with the agent in [our example](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/). 
Since the module is still in beta and the agent is far from perfect, you can try to improve it or even build an agent by yourself if you are capable. +In AgentScope, the web browser is also some kind of tool functions, so you can use it together with the service toolkit module to build your own agent. +We also provide a [web browser agent](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent)) in our example. +You can refer to it for more details. -Enjoy exploring the web with our agent! [[Back to the top]](#211-web-en) diff --git a/docs/sphinx_doc/zh_CN/source/tutorial/211-web.md b/docs/sphinx_doc/zh_CN/source/tutorial/211-web.md index a709a8b64..935aeb887 100644 --- a/docs/sphinx_doc/zh_CN/source/tutorial/211-web.md +++ b/docs/sphinx_doc/zh_CN/source/tutorial/211-web.md @@ -1,108 +1,92 @@ (211-web-cn)= -# 在AgentScope中启用网页浏览功能 -在这里,我们介绍了如何在AgentScope中实现网页浏览功能。 -我们实现的两个模块简单明了:**web browser** 和 **web browsing agent**。 -web browser作为网页的接口,而配备了web browser的web browsing agent可以利用基于视觉的LLM的强大功能执行网页浏览任务。 -请注意,AgentScope所提供的网页浏览功能仍处于beta阶段。 -我们将不断完善该模块,并在我们的代码库和文档中提供相应的更新。 +AgentScope 支持使用 `agentscope.service.WebBrowser` 模块进行 Web 浏览器控制。 +它允许代理与网页进行交互,并执行点击、输入和滚动等网页操作。 -现在,让我们详细看看web browser和web browsing agent。 +> 注意当前的 Web 浏览器模块仍处于测试阶段,在未来的一段时间内将会频繁更新和优化。 -## Web Browser接口 +## 预备 -我们实现的`WebBrowser`类是一个简单的网页接口。它可以打开一个网页,点击元素,在元素中输入内容,截图等。 +`WebBrowser` 模块基于 [Playwright](https://playwright.dev/) 实现,需要安装最新版本的 AgentScope 和 playwright 环境: -`WebBrowser`类实现在[web_browser.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py) 中。 +```bash +# 从源码安装最新版本的 AgentScope +git clone https://github.com/modelscope/agentscope.git +cd agentscope +pip install -e . 
-### 前提条件 - -`WebBrowser`类通过Playwright操作浏览器。 -要使用web browser,你需要安装必要的Playwright包: -- 运行 `pip install playwright` 设置Python环境。 -- 运行 `playwright install` 安装Playwright所需的浏览器。 - -### 用法 +# 安装 playwright +pip install playwright +playwright install +``` -我们提供的web browser拥有简单的接口。要使用web browser,我们只需创建并使用定义好的方法接口。 +## Guidance -#### 如何创建Web Browser实例 +通过以下方式初始化一个 `WebBrowser` 模块实例: -你可以使用`WebBrowser`的`__init__`函数初始化`WebBrowser`实例: ```python +from agentscope.service import WebBrowser + browser = WebBrowser() ``` -(可选)你还可以在初始化时设置浏览器实例的特定属性: -- `headless`:决定是否以headless模式运行浏览器。默认值为`False`。设置为`False`时,浏览器对用户可见。 -- `timeout`:浏览器等待页面加载的时间。默认值为`60000`。 -- `default_width`:浏览器的默认宽度。默认值为`1280`。 -- `default_height`:浏览器的默认高度。默认值为`1080`。 - -#### 定义的浏览器接口 - -目前,`WebBrowser`定义的方法有:`visit_page`,`crawl_page`,`click`,`type`,`scroll`,`press_key`,`close`等。浏览器的属性包括`url`,`page_html`,`page_title`等。 - -由于空间限制,这里不提供每个方法接口的详细介绍,但你可以参考Sphinx API文档和内联注释了解更多细节。 -相反,我们将提供有关如何使用web browser模块的一般指导。 - -#### 我们的`WebBrowser`如何工作? - -要使用我们的web browser,你首先需要访问一个网页,然后爬取页面以获取元素,最后与这些元素进行交互。 - -##### 访问网页 +The `WebBrowser` module facilitates browser control and state retrieval. +The name of the control functions are all prefixed by "action_", e.g. `action_visit_url`, +and `action_click`. To see the full list of functions, calling the `get_action_functions` method. 
-**首先**,你可以使用`browser.visit_page(url)`启动浏览器并访问你感兴趣的页面。通过使用`click`或`type`方法也可以触发网页导航。 +`WebBrowser` 模块提供了浏览器控制和状态检索的功能。 +其中控制函数的名称都以 "action_" 为前缀,例如 `action_visit_url` 和 `action_click`。可以通过调用 `get_action_functions` 方法查看完整的函数列表。 -##### 爬取页面 - -**然后**,你应该调用`crawl_page`方法来获取当前网页的元素。 -我们web browser的一个关键特性是,我们用数字标记网页上的交互元素,这在`crawl_page`方法中实现。 +```python +# 查看所有支持的操作 +print(browser.get_action_functions()) -通过标记的元素,你可以确保浏览器与页面上的正确元素进行交互,使用类似`click`,`type`和`focus_element`的方法。`crawl_page`方法后,交互元素将被数字标记,并存储在`WebBrowser`实例的`self.page_elements`属性中。然后你可以使用类似`click`,`type`和`focus_element`的方法与这些元素进行交互。 +# 访问新的网页 +browser.action_visit_url("https://www.bing.com") +``` -`crawl_page` 方法有三个输入参数和四个返回值。你可以选择是否使用vision向网页添加Set-of-Marks,以及是否在返回的格式化文本中包含`meta_data`字段。 +为了获取当前浏览器的状态,可以调用以 `"page_"` 为前缀的函数,例如 `page_url`、`page_title` 和 `page_html`。 -> Set-of-Mark是一种视觉提示方法,它将图像分成编号的区域,以提高LLMs的视觉定位能力。你可以参阅论文https://arxiv.org/abs/2310.11441了解详细信息。或者可以查看[我们的示例](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/README.md)中的演示。 +```python +# 当前网页的url +print(browser.page_url) -`crawl_page` 的输入参数: +# 当前网页的标题 +print(browser.page_title) -- `vision`(`bool`):如果启用vision,则向网页添加set-of-marks并截图。我们使用本地JavaScript将网页上的交互元素进行边界划分,而不是使用分割模型。 -- `with_meta`(`bool`):包括`meta_data`字段在返回的格式化文本中。 -- `with_select`(`bool`):只返回选定的交互元素或所有编号的交互元素。 +# 以 MarkDown 的格式获取当前的页面信息(通过markdownify进行解析) +print(browser.page_markdown) -`crawl_page` 的返回值: +# 当前网页的 html 源码(可能会太长) +print(browser.page_html) +``` -- `elements`:来源于Playwright的交互元素处理程序,也存储在`WebBrowser`实例的`self.page_elements`属性中。 -- `format_ele_text`:格式化的元素文本描述列表,从零开始编号。 -- `screenshot_bytes`:网页的带Set-of-Marks的截图,以字节形式。如果你想使用基于视觉的MLLM,请使用我们提供的`file_manager`模块将截图保存到文件,并使用格式函数提供图像路径。 -- `web_ele_infos`:交互元素的信息字典。 +Besides, to help vision models to understand the webpage better, we provide `set_interactive_marks` function, +which will mark all the interactive elements on the current webpage with index labels. 
+After calling `set_interactive_marks` function, more actions can be performed on the webpage. +For example, clicking a button, typing in a text box, etc. -##### 执行操作 +此外,为了帮助视觉模型更好地理解网页,我们提供了 `set_interactive_marks` 函数,该函数会把当前网页上所有的可交互元素标记出来,并用序号标签进行标注(从0开始)。 +调用 `set_interactive_marks` 函数标记网页后,我们就可以在网页上执行更多的操作,例如点击指定序号的按钮、在指定序号的文本框中进行输入等。 -**最后**,你可以调用诸如`click`,`type`,`focus_element`等方法与标记的交互元素进行交互。 -例如: ```python -browser.click(element_id) -``` +# 为网页上的交互元素添加序号标签 +browser.set_interactive_marks() -```python -browser.type(element_id, "Hello, World!") +# 删除交互标记 +# browser.remove_interactive_marks() ``` -还可以调用其他方法如`scroll`,`press_key`,`close`。 -你可以参考API文档或[原始代码](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py)了解更多细节。 - -### Web Browsing Agent - -我们实现的`WebVoyagerAgent`是一个简单的Agent,可以执行网页浏览任务。它实现在[web_voyager_agent.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/agents/web_voyager_agent.py)。 - -Agent的响应函数实现在`WebVoyagerAgent`类的 `reply` 方法中,遵循以下步骤:1. 访问网页 -> 2. 爬取页面 -> 3. 执行操作 -> 4. 重复迭代直到实现目标。 +## 与智能体结合 -你可以在[我们的示例](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/)中尝试与Agent交互。 +上述的所有函数为交互式的 Web 浏览器控制提供了基本操作接口。开发者可以使用这些接口来构建自己的 Web 浏览代理。 -由于该模块仍处于beta阶段,目前实现的Agent远非完美。如果你有能力的话,你可以尝试改进它,甚至可以自己构建一个Agent。 +In AgentScope, the web browser is also some kind of tool functions, so you can use it together with the service toolkit module to build your own agent. +We also provide a [web browser agent](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent)) in our example. +You can refer to it for more details. -希望你可以享受使用我们的Agent浏览网页吧! 
+在 AgentScope 中,Web 浏览器也是一种工具函数,因此可以使用 `agentscope.service.ServiceToolkit` 来处理 `WebBrowser` 模块提供的函数,并构建自己的智能体。 +我们在示例中提供了一个[Web 浏览器智能体](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent)的样例。 +可以参考该样例了解更多细节。 [[回到顶部]](#211-web-cn) diff --git a/examples/conversation_with_web_browser_agent/README.md b/examples/conversation_with_web_browser_agent/README.md new file mode 100644 index 000000000..5462e8a15 --- /dev/null +++ b/examples/conversation_with_web_browser_agent/README.md @@ -0,0 +1,140 @@ +# WebBrowsing in AgentScope + +This example demonstrates how to utilize AgentScope to build a web browsing agent. Throughout this tutorial, you will gain insights into the following features of AgentScope: + +- How to use the [WebBrowser](https://github.com/modelscope/agentscope/blob/main/src/agentscope/service/browser/web_browser.py) module in AgentScope +- How to build a conversation with an agent that can browse the web + +Refer to our tutorial for more details on the `WebBrowser` module. + +> Note: The `WebBrowser` module is currently in beta, and we are continuously working on enhancements, such as +> - allowing the agent to make long-term plan +> - handling web pages with CAPTCHA. +> - enabling text-based models to interact with web pages + +## Tested Models + +This example has been tested with the following model: +- GPT-4o + +We plan to test additional vision models in the near future. Additionally, we will enable web browsing capabilities for text-only models. + +## Prerequisites + +To run this example, you need to: + +1. Ensure you have access to a vision model that can handle vision tasks, and set your api key in the model config. +2. Install the necessary Playwright packages: + - Run `pip install playwright` to set up the Python environment. + - Run `playwright install` to install the required browser for Playwright. +3. 
[Optional] For a better understanding of how web browsing is implemented, refer to the original code in [web_browser.py](../../src/agentscope/browser/web_browser.py) and [web_voyager_agent.py](../../src/agentscope/agents/web_voyager_agent.py). + +## Running the Example + +Follow the steps below to run the example: +1. Fill your OpenAI API key in `main.py`, or providing a new configuration for vision models. +2. Run the `main.py` file directly: +```bash +python main.py +``` + +> Note +> - The screenshots of the web pages will be saved locally. + +## Code Snippets + +The `webact_agent.py` provides an agent `WebActAgent` that integrates the web browsing module into [ReAct algorithm](https://arxiv.org/abs/2210.03629). +It will interact with web pages in a reasoning-acting loop, and provide the final answer to the user by calling a built-in `finish` function. + +The major difference with the traditional ReAct algorithm is that the agent will set interactive marks and obtain the web page screenshot in the reasoning phase. +This allows the agent to interact with the web page more naturally. + +```python + # ... + + def _reasoning(self) -> Union[dict, None]: + """The reasoning process of the agent. + + Returns: + `Union[dict, None]`: + Return `None` if meet parsing error, otherwise return the + parsed function call dictionary. + """ + + # Mark the current interactive elements in the web page + self.browser.set_interactive_marks() + + # After marking, take a screenshot and save it locally + path_img = FileManager.get_instance().save_image( + self.browser.page_screenshot, + ) + + # Assemble the prompt + prompt = self.model.format( + self.memory.get_memory(), + # The observation message won't be stored in memory to avoid too + # many images in prompt + Msg( + "user", + _HINT_PROMPT.format( + url=self.browser.url, + format_instruction=self.parser.format_instruction, + ), + role="user", + url=path_img, + echo=True, + ), + ) + + # ... 
+``` + +To save image tokens, the message with the screenshot won't be saved to the agent's memory. +That means, in each reasoning phase, the agent will only have the latest screenshot. +Developers can modify the code to implement a more complex memory mechanism. + +### Example Demonstration + + +https://github.com/user-attachments/assets/6d03caab-6193-4ac6-8b1c-36f152ec02ec + + +In the first iteration of our web browsing agent, the agent opens the default webpage, in this case, the Google webpage. + +We can see from the saved screenshot here that the interactive elements in this webpage are marked with numbers. This is called set-of-mark prompting ([github link](https://github.com/microsoft/SoM), [paper link](https://arxiv.org/abs/2310.11441)). Utilizing set-of-mark prompting, the agent can interact with the webpage more naturally by selecting the elements with the corresponding numbers. + +After receiving the observation, the agent will give its thought and corresponding action. +In this case, the agent selects the search bar (numbered as [4]) and types in it. + +![screenshot_1](https://github.com/garyzhang99/agentscope/assets/46197280/9de208b8-4ef4-4b4f-9328-2f7bb500fcb2) + + +``` +Thought: To find out how many stars the project "agentscope" has on GitHub, I need to search for "agentscope GitHub" on Google first. + +Action: Type [4]; agentscope GitHub +``` + + +In the next iteration, we see that the agent is presented with the search results page, and the agent selects the official GitHub link. +![screenshot_2](https://github.com/garyzhang99/agentscope/assets/46197280/9b6708c6-eced-4d8b-8ebe-cdbd197b40ea) + +``` +Thought: The search results from Google have populated, and I found a link that likely leads to the "agentscope" GitHub project page. + +Action: Click [18] +``` + +When the agent views the GitHub page of agentscope, it notes the star count and answers our question. 
+ +![screenshot_3](https://github.com/garyzhang99/agentscope/assets/46197280/5cad5472-b45b-4ef3-a8fa-324d5a20073a) + + +``` +Thought: I can see from the screenshot that the star count for the "agentscope" project on GitHub is listed as "2.9k" stars. + +Action: ANSWER; The project agentscope has received 2.9k stars on GitHub. +``` + +The above content provides a simple example of using the web browsing agent in AgentScope. Feel free to try it out yourself and explore the capabilities of web browsing with AgentScope! + diff --git a/examples/conversation_with_web_voyager_agent/main.py b/examples/conversation_with_web_browser_agent/main.py similarity index 96% rename from examples/conversation_with_web_voyager_agent/main.py rename to examples/conversation_with_web_browser_agent/main.py index 3d867475b..d3462f78d 100644 --- a/examples/conversation_with_web_voyager_agent/main.py +++ b/examples/conversation_with_web_browser_agent/main.py @@ -28,6 +28,7 @@ agent = WebActAgent( name="assistant", model_config_name="gpt-4o_config", + verbose=True, ) user = UserAgent( @@ -36,7 +37,7 @@ ) x = None -while x is not None: +while True: x = user(x) if x.content == "exit": break diff --git a/examples/conversation_with_web_voyager_agent/webact_agent.py b/examples/conversation_with_web_browser_agent/webact_agent.py similarity index 97% rename from examples/conversation_with_web_voyager_agent/webact_agent.py rename to examples/conversation_with_web_browser_agent/webact_agent.py index fbea98752..9d9aeb1d8 100644 --- a/examples/conversation_with_web_voyager_agent/webact_agent.py +++ b/examples/conversation_with_web_browser_agent/webact_agent.py @@ -68,7 +68,7 @@ def __init__( # Init the browser self.browser = WebBrowser() - self.browser.action_visit_url("https://www.google.com") + self.browser.action_visit_url("https://www.bing.com") # Init the service toolkit with the browser commands. 
Since they don't # require developers to specify parameters, we directly place them into @@ -165,7 +165,7 @@ def _reasoning(self) -> Union[dict, None]: """ # Mark the current interactive elements in the web page - self.browser.mark_interactive_elements() + self.browser.set_interactive_marks() # After marking, take a screenshot and save it locally path_img = FileManager.get_instance().save_image( @@ -185,7 +185,7 @@ def _reasoning(self) -> Union[dict, None]: ), role="user", url=path_img, - echo=True, + echo=self.verbose, ), ) @@ -221,11 +221,9 @@ def _acting(self, function_call: dict) -> None: }, ] - execute_results = self.toolkit.parse_and_call_func( + msg_res = self.toolkit.parse_and_call_func( formatted_function_call, ) - - msg_res = Msg("system", execute_results, "system") self.speak(msg_res) self.memory.add(msg_res) diff --git a/examples/conversation_with_web_voyager_agent/README.md b/examples/conversation_with_web_voyager_agent/README.md deleted file mode 100644 index 46c122b6f..000000000 --- a/examples/conversation_with_web_voyager_agent/README.md +++ /dev/null @@ -1,134 +0,0 @@ -# WebBrowsing with AgentScope - -This example demonstrates how to utilize AgentScope to enable web browsing capabilities. Throughout this tutorial, you will gain insights into the following features of AgentScope: - -- How to use the [WebBrowser](../../src/agentscope/browser/web_browser.py) component in AgentScope to enable web browsing capabilities. -- Utilizing the [WebVoyagerAgent](../../src/agentscope/agents/web_voyager_agent.py) to perform web browsing tasks. - -**Note: Still in Beta** -- The [WebBrowser](../../src/agentscope/browser/web_browser.py) component is currently in beta, and we are continuously working on enhancements. 
-- The current implementation of the [WebVoyagerAgent](../../src/agentscope/agents/web_voyager_agent.py), referenced from the [WebVoyager GitHub repository](https://github.com/MinorJerry/WebVoyager/tree/main), serves as a demonstration of enabling agents with web browsing capabilities. - -The existing implementation has several limitations, including: -- The agent is not yet equipped with planning or critic modules, which impairs its ability to manage complex tasks requiring strong reasoning and self-correction while performing web browsing tasks. -- The agent is unable to handle webpages with CAPTCHA. - -We are actively developing a more advanced web browsing agent, focusing on improving performance, reducing error rates, and minimizing latency. Please stay tuned for updates. - -## Tested Models - -This example has been tested with the following model: -- GPT-4o - -We plan to test additional vision models in the near future. Additionally, we will enable web browsing capabilities for text-only models. - -## Prerequisites - -To run this example, you need to: - -1. Ensure you have access to a vision model that can handle vision tasks, and set your api key in the model config. -2. Install the necessary Playwright packages: - - Run `pip install playwright` to set up the Python environment. - - Run `playwright install` to install the required browser for Playwright. -3. [Optional] For a better understanding of how web browsing is implemented, refer to the original code in [web_browser.py](../../src/agentscope/browser/web_browser.py) and [web_voyager_agent.py](../../src/agentscope/agents/web_voyager_agent.py). - - -## Code Snippets and Example Demonstration - -Here is a demo of the how web browsing currently works in AgentScope. - -### Code Snippets - -First we init agentcope and the model configs. 
- -```python -import agentscope - -# Fill in your OpenAI API key -YOUR_OPENAI_API_KEY = "xxx" - -model_config = { - "config_name": "gpt-4o_config", - "model_type": "openai_chat", - "model_name": "gpt-4o", - "api_key": YOUR_OPENAI_API_KEY, - "generate_args": { - "temperature": 0.7, - }, -} - -agentscope.init( - model_configs="gpt-4o_config", - project="Conversation with WebVoyagerAgent", -) -``` - -Then we init the browser and the agent. - -``` python -from agentscope.browser import WebBrowser -from agentscope.agents import WebVoyagerAgent - - -browser = WebBrowser() -agent = WebVoyagerAgent( - browser=browser, - model_config_name="gpt-4o", - name="Browser Agent") -``` - -Finally, we can use the agent to perform web browsing tasks. -Here, we ask the agent how many stars have our agentscope project received on github. - -```python -question = "How many stars have the project agentscope recieved on the github?" -msg = Msg(name="user", content=question, role="user") - -ans_msg = agent.reply(msg) -``` - -### Example Demonstration - - -https://github.com/user-attachments/assets/6d03caab-6193-4ac6-8b1c-36f152ec02ec - - -In the first iter of our web browsing agent, the agent opens the default webpage, in this case, the google webpage. - -We can see from the saved screenshot here that the interactive elements in this webpage are marked with numbers. This is called the set-of-mark prompting([github link](https://github.com/microsoft/SoM), [paper link](https://arxiv.org/abs/2310.11441)). Utilizing the set-of-mark prompting, the agent can interact with the webpage more naturally by selecting the elements with the corresponding numbers. - -After recieving the observation, the agent will give it's thought and corresponding action. -In this case, the agent select the search bar (numbered as [4]) and type in it. 
- -![screenshot_1](https://github.com/garyzhang99/agentscope/assets/46197280/9de208b8-4ef4-4b4f-9328-2f7bb500fcb2) - - -``` -Thought: To find out how many stars the project "agentscope" has on GitHub, I need to search for "agentscope GitHub" on Google first. - -Action: Type [4]; agentscope GitHub -``` - - -In the next iter, we see that the agent is presented with the searching result page, and the agent select the offical github link. -![screenshot_2](https://github.com/garyzhang99/agentscope/assets/46197280/9b6708c6-eced-4d8b-8ebe-cdbd197b40ea) - -``` -Thought: The search results from Google have populated, and I found a link that likely leads to the "agentscope" GitHub project page. - -Action: Click [18] -``` - -As the agent view the github page of agentscope, it note the github stars, hence it answer our question. - -![screenshot_3](https://github.com/garyzhang99/agentscope/assets/46197280/5cad5472-b45b-4ef3-a8fa-324d5a20073a) - - -``` -Thought: I can see from the screenshot that the star count for the "agentscope" project on GitHub is listed as "2.9k" stars. - -Action: ANSWER; The project agentscope has received 2.9k stars on GitHub. -``` - -The above content provides a simple example of using the web browsing agent in AgentScope. Feel free to try it out yourself and explore the capabilities of web browsing with AgentScope! 
- diff --git a/src/agentscope/models/post_model.py b/src/agentscope/models/post_model.py index 385881fd2..7cb1fc25c 100644 --- a/src/agentscope/models/post_model.py +++ b/src/agentscope/models/post_model.py @@ -154,14 +154,21 @@ def __call__(self, input_: str, **kwargs: Any) -> ModelResponse: # step3: record model invocation # record the model api invocation, which will be skipped if # `FileManager.save_api_invocation` is `False` + try: + response_json = response.json() + except requests.exceptions.JSONDecodeError as e: + raise RuntimeError( + f"Fail to serialize the response to json: \n{str(response)}", + ) from e + self._save_model_invocation( arguments=request_kwargs, - response=response.json(), + response=response_json, ) # step4: parse the response if response.status_code == requests.codes.ok: - return self._parse_response(response.json()) + return self._parse_response(response_json) else: logger.error(json.dumps(request_kwargs, indent=4)) raise RuntimeError( diff --git a/src/agentscope/service/browser/web_browser.py b/src/agentscope/service/browser/web_browser.py index 312986a27..b0958eb3c 100644 --- a/src/agentscope/service/browser/web_browser.py +++ b/src/agentscope/service/browser/web_browser.py @@ -1,8 +1,9 @@ # -*- coding: utf-8 -*- # pylint: disable=C0301 """The web browser module for agent to interact with web pages.""" +import time from pathlib import Path -from typing import Union, Callable +from typing import Union, Callable, Optional import requests from loguru import logger @@ -125,7 +126,7 @@ class WebBrowser: def __init__( self, - timeout: int = 60000, + timeout: int = 30, browser_visible: bool = True, browser_width: int = 1280, browser_height: int = 1080, @@ -133,8 +134,8 @@ def __init__( """Initialize the web browser module. 
Args: - timeout (`int`, defaults to `60000`): - The timeout (in milliseconds) for the browser to wait for the + timeout (`int`, defaults to `30`): + The timeout (in seconds) for the browser to wait for the page to load, defaults to 60s. browser_visible (`bool`, defaults to `True`): Whether the browser is visible. @@ -162,7 +163,7 @@ def __init__( ) self._page = self.browser.new_page() - self._page.set_default_timeout(timeout) + self._page.set_default_timeout(timeout * 1000) self._page.set_viewport_size( { "width": browser_width, @@ -239,7 +240,6 @@ def action_click(self, element_id: int) -> ServiceResponse: self._wait_for_load( "Wait for click event", "Finished", - 5, ) return ServiceResponse( @@ -247,7 +247,12 @@ def action_click(self, element_id: int) -> ServiceResponse: content=f"Click on element {element_id} done", ) - def action_type(self, element_id: int, text: str) -> ServiceResponse: + def action_type( + self, + element_id: int, + text: str, + submit: bool, + ) -> ServiceResponse: """Type text into the element with the given id. Args: @@ -255,6 +260,8 @@ def action_type(self, element_id: int, text: str) -> ServiceResponse: The id of the element to type text into. text (`str`): The text to type into the element. + submit (`bool`): + If press the "Enter" after typing text. Returns: `ServiceResponse`: @@ -288,9 +295,11 @@ def action_type(self, element_id: int, text: str) -> ServiceResponse: self._wait_for_load( "Wait for finish typing", "Finished", - 1, ) + if submit: + self.action_press_key("Enter") + return ServiceResponse( status=ServiceExecStatus.SUCCESS, content="Typing done", @@ -324,15 +333,21 @@ def action_press_key(self, key: str) -> ServiceResponse: Chosen from `F1` - `F12`, `Digit0`- `Digit9`, `KeyA`- `KeyZ`, `Backquote`, `Minus`, `Equal`, `Backslash`, `Backspace`, `Tab`, `Delete`, `Escape`, `ArrowDown`, `End`, `Enter`, `Home`, `Insert`, `PageDown`, `PageUp`, `ArrowRight`, `ArrowUp`, etc. 
""" # noqa self._page.keyboard.press(key) + + # TODO: in a more elegant way to wait for the page to be loaded rather + # then using time.sleep + # Wait for the page to be loaded + time.sleep(2) + self._wait_for_load( f"Wait for press key: {key}", "Finished", - 5, ) - return ServiceResponse( + response = ServiceResponse( status=ServiceExecStatus.SUCCESS, content=f"Press key: {key} done", ) + return response # ------ Actions which are performed to change the web page --------------- def action_visit_url(self, url: str) -> ServiceResponse: @@ -345,7 +360,6 @@ def action_visit_url(self, url: str) -> ServiceResponse: self._page.goto(url) self._wait_for_load( f"Wait for page {url} to load.", - timeout=10, ) return ServiceResponse( @@ -385,7 +399,7 @@ def set_interactive_marks(self) -> list[WebElementInfo]: .values() ) self._interactive_elements = [ - item.get_property("element").as_element for item in js_handles + item.get_property("element").as_element() for item in js_handles ] # Get the interactive items @@ -412,7 +426,7 @@ def _wait_for_load( self, hint_s: str, hint_e: str = "Page loaded.", - timeout: int = 10, + timeout: Optional[int] = None, ) -> None: """Wait to ensure the page is loaded after certain actions. @@ -421,12 +435,16 @@ def _wait_for_load( The hint message before waiting. hint_e (`str`): The hint message after waiting. - timeout (`int`): - The timeout for the page to load. + timeout (`Optional[int]`, defaults to `None`) + The timeout for the page to load (in seconds) """ - logger.info(hint_s) + logger.debug(hint_s) + + if timeout is not None: + timeout = timeout * 1000 + self._page.wait_for_load_state("load", timeout=timeout) - logger.info(hint_e) + logger.debug(hint_e) def _verify_element_id(self, element_id: int) -> bool: """Verify the given element id is valid or not."""
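Note on the timeout change in `web_browser.py`: the public `timeout` argument is now given in seconds, while Playwright's `set_default_timeout` and `wait_for_load_state` take milliseconds. `_wait_for_load` passes `None` through so that the page-level default applies. The unit handling can be sketched in isolation as below. This is a minimal illustration, not the module's code: `to_playwright_timeout_ms` and `FakePage` are hypothetical names introduced here, and `FakePage` only records its arguments so the example runs without a browser.

```python
from typing import Optional


def to_playwright_timeout_ms(timeout_s: Optional[float]) -> Optional[float]:
    """Convert a timeout in seconds to the milliseconds Playwright expects.

    `None` is passed through unchanged so that the caller falls back to the
    page's default timeout (the one set via `set_default_timeout`).
    """
    if timeout_s is None:
        return None
    return timeout_s * 1000


class FakePage:
    """Hypothetical stand-in for a Playwright page; it only records calls."""

    def __init__(self) -> None:
        self.default_timeout_ms: Optional[float] = None
        self.calls: list = []

    def set_default_timeout(self, timeout_ms: float) -> None:
        self.default_timeout_ms = timeout_ms

    def wait_for_load_state(
        self,
        state: str,
        timeout: Optional[float] = None,
    ) -> None:
        self.calls.append((state, timeout))


page = FakePage()

# Constructor path: 30 seconds becomes a 30000 ms page default.
page.set_default_timeout(to_playwright_timeout_ms(30))

# Per-action path: no explicit timeout, so None is forwarded as-is.
page.wait_for_load_state("load", timeout=to_playwright_timeout_ms(None))

# Per-action path with an explicit 5 second override.
page.wait_for_load_state("load", timeout=to_playwright_timeout_ms(5))

print(page.default_timeout_ms, page.calls)
```

Keeping the conversion in one helper means callers never mix units: every public value is in seconds, and milliseconds appear only at the Playwright boundary.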