Finish refactor web browser module
DavdGao committed Sep 6, 2024
1 parent 9c4e138 commit fcba69a
Showing 10 changed files with 307 additions and 294 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -40,6 +40,8 @@ Start building LLM-empowered multi-agent applications in an easier way.

## News

- <img src="https://img.alicdn.com/imgextra/i3/O1CN01SFL0Gu26nrQBFKXFR_!!6000000007707-2-tps-500-500.png" alt="new" width="30" height="30"/>**[2024-09-03]** AgentScope supports **Web Browser Control** now! Refer to our [example](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent) for more details.

- <img src="https://img.alicdn.com/imgextra/i3/O1CN01SFL0Gu26nrQBFKXFR_!!6000000007707-2-tps-500-500.png" alt="new" width="30" height="30"/>**[2024-07-18]** AgentScope supports streaming mode now! Refer to our [tutorial](https://modelscope.github.io/agentscope/en/tutorial/203-stream.html) and example [conversation in stream mode](https://github.com/modelscope/agentscope/tree/main/examples/conversation_in_stream_mode) for more details.

<h5 align="left">
2 changes: 2 additions & 0 deletions README_ZH.md
@@ -41,6 +41,8 @@

## News

- <img src="https://img.alicdn.com/imgextra/i3/O1CN01SFL0Gu26nrQBFKXFR_!!6000000007707-2-tps-500-500.png" alt="new" width="30" height="30"/>**[2024-09-03]** AgentScope has updated its browser control module, which uses vision models to let agents control the browser. Refer to our [**example**](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent).

- <img src="https://img.alicdn.com/imgextra/i3/O1CN01SFL0Gu26nrQBFKXFR_!!6000000007707-2-tps-500-500.png" alt="new" width="30" height="30"/>**[2024-07-18]** AgentScope now supports streaming model output. Refer to our [**tutorial**](https://modelscope.github.io/agentscope/zh_CN/tutorial/203-stream.html) and the [**streaming conversation example**](https://github.com/modelscope/agentscope/tree/main/examples/conversation_in_stream_mode).

<h5 align="left">
115 changes: 55 additions & 60 deletions docs/sphinx_doc/en/source/tutorial/211-web.md
@@ -1,95 +1,90 @@
(211-web-en)=

# Enabling Web Browsing Ability in AgentScope
# Web Browser Control

Here we introduce how we implemented the web browsing ability in AgentScope.
AgentScope supports web browser control with the `agentscope.service.WebBrowser` module.
It allows agents to interact with web pages and take actions such as clicking, typing, and scrolling.

The two modules we implemented are simple and straightforward: a **web browser** and a **web browsing agent**.
The web browser serves as an interface to the web, and the web browsing agent, equipped with the web browser, can perform web browsing tasks by leveraging the capabilities of a vision-based LLM.
> Note that the current web browser module requires a vision LLM to work properly. We will provide a text-based version in the future.
Note that the web browsing abilities provided in AgentScope are still in beta.
We will refine the module and provide corresponding updates both in our codebase and the documentation here.
> Note that the web browser module is still in beta and will be updated frequently.
Now, let's take a closer look at the web browser and the web browsing agent.

## Web Browser Interface
## Prerequisites

The `WebBrowser` class we implemented is a simple interface to the web. It can open a webpage, click on elements, type in elements, take screenshots, etc.
The `WebBrowser` class is implemented in [web_browser.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py).
The `WebBrowser` module is built on [Playwright](https://playwright.dev/).
You need to install the latest AgentScope as well as the Playwright packages, as follows:

### Prerequisites
```bash
# Install the latest AgentScope from source
git clone https://github.com/modelscope/agentscope.git
cd agentscope
pip install -e .

The `WebBrowser` class operates your browser through Playwright.
To use the web browser, you need to install the necessary Playwright packages:

- Run `pip install playwright` to set up the Python environment.
- Run `playwright install` to install the required browser for Playwright.

### Usage

The web browser we provided has simple interfaces. To use the web browser, we can simply create it and use it with the defined methods.
# Install playwright
pip install playwright
playwright install
```

#### How to Create a Web Browser Instance
## Guidance

You can initialize the `WebBrowser` instance with the `__init__` function of `WebBrowser`:
Initialize the `WebBrowser` module as follows:

```python
from agentscope.service import WebBrowser

browser = WebBrowser()
```

(Optional) You can also set specific attributes of the browser instance when initializing:
- `headless`: Determines whether to run the browser in headless mode. Defaults to `False`. When set to `False`, the browser will be visible to the user.
- `timeout`: The time for the browser to wait for the page to load. Defaults to `60000`.
- `default_width`: The default width of the browser. Defaults to `1280`.
- `default_height`: The default height of the browser. Defaults to `1080`.

#### Defined Browser Interface
Currently, the defined methods of the `WebBrowser` are: `visit_page`, `crawl_page`, `click`, `type`, `scroll`, `press_key`, `close`, etc. The properties of the web browser include `url`, `page_html`, `page_title`, etc.
Due to space constraints, detailed introductions for each method interface are not provided here, but you can refer to the Sphinx API docs and inline comments for more details.
Instead, we will provide general guidance on how to use the web browser module.
The `WebBrowser` module facilitates browser control and state retrieval.
The names of the control functions are all prefixed with `action_`, e.g. `action_visit_url`
and `action_click`. To see the full list of functions, call the `get_action_functions` method.

#### How Does Our `WebBrowser` Work?
To use our web browser, you need to first visit a webpage, then crawl the page to retrieve elements, and finally interact with those elements.
```python
# List all supported actions
print(browser.get_action_functions())

##### Visit Webpage
**First**, you can use `browser.visit_page(url)` to start the browser with the page you are interested in. Webpage navigation can also be triggered by using the `click` or `type` method.
# Visit a new webpage
browser.action_visit_url("https://www.bing.com")
```
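
The `action_` naming convention can be illustrated with a minimal, self-contained sketch. `FakeBrowser` below is a hypothetical stand-in, not the real `WebBrowser`; its method signatures and the dictionary returned by its `get_action_functions` are assumptions made only to show how prefix-based discovery of actions could work.

```python
import inspect


class FakeBrowser:
    """Hypothetical stand-in for illustration; NOT the real WebBrowser."""

    def action_visit_url(self, url: str) -> str:
        return f"visited {url}"

    def action_click(self, element_id: int) -> str:
        return f"clicked element {element_id}"

    def page_info(self) -> str:
        # Not prefixed with "action_", so it is not listed as an action.
        return "state"

    def get_action_functions(self) -> dict:
        """Collect every bound method whose name starts with 'action_'."""
        return {
            name: method
            for name, method in inspect.getmembers(self, inspect.ismethod)
            if name.startswith("action_")
        }


browser = FakeBrowser()
actions = browser.get_action_functions()

# Names of the discovered actions, in alphabetical order.
print(sorted(actions))

# An action can then be looked up by name and called.
print(actions["action_visit_url"]("https://www.bing.com"))
```

Looking actions up by name in this way is what makes it convenient to expose them to a model as a list of callable tools.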

##### Crawl the Page
**Then**, you should call the `crawl_page` method to gather the elements of the current web page. A key feature of our web browser is that `crawl_page` labels the interactive elements on the webpage with numbers and stores them in the `self.page_elements` property of the `WebBrowser` instance. With the labeled elements, you can ensure the browser interacts with the correct elements on the page using methods like `click`, `type`, and `focus_element`.
To monitor the current state of the browser, you can read the properties prefixed with `page_`, e.g. `page_url`, `page_title`, and `page_html`.

The `crawl_page` method has three input arguments and four return values. You can choose whether to use vision to add a Set-of-Marks to the webpage and whether to include the `meta_data` field in the returned formatted text.
```python
# The url
print(browser.page_url)

> Set-of-Mark is a visual prompting method that partitions an image into numbered regions to improve the visual grounding ability of LLMs. You can refer to the paper https://arxiv.org/abs/2310.11441 for details. Or you can check out [our example](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/README.md) for a demonstration.
# The page title
print(browser.page_title)

`crawl_page` input arguments:
- `vision` (`bool`): Adds a set-of-marks to the webpage if vision is enabled and takes a screenshot. Instead of using a segmentation model, we use native JavaScript to bound the interactive elements on the webpage.
- `with_meta` (`bool`): Includes the `meta_data` field in the returned formatted text.
- `with_select` (`bool`): Returns only the selected interactive elements or all the numbered interactive elements.
# The page in Markdown format (parsed by markdownify)
print(browser.page_markdown)

`crawl_page` returns:
- `elements`: The handler from Playwright of interactive elements, also stored in the `self.page_elements` property of the `WebBrowser` instance.
- `format_ele_text`: A list of the formatted elements' text descriptions, labeled with numbers starting from zero.
- `screenshot_bytes`: The screenshot of the webpage with the Set-of-Marks, in bytes. If you want to use vision-based MLLM, save the screenshot to a file using the `file_manager` module provided and provide the image path using the format function.
- `web_ele_infos`: The info dictionary of interactive elements.
# The page HTML (may be too long)
print(browser.page_html)
```

##### Perform Action
**Finally**, you can call methods like `click`, `type`, `focus_element` to interact with the labeled interactive elements, for example:
In addition, to help vision models understand the webpage better, we provide the `set_interactive_marks` function,
which marks all interactive elements on the current webpage with index labels.
After calling `set_interactive_marks`, more actions can be performed on the webpage,
for example, clicking a button or typing in a text box.

```python
browser.click(element_id)
```
# Set interactive marks with index labels
browser.set_interactive_marks()

```python
browser.type(element_id, "Hello, World!")
# Remove interactive marks
# browser.remove_interactive_marks()
```
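
The idea of labeling interactive elements with index numbers can be sketched without a browser. The example below is only an illustration using Python's standard `html.parser`; the module itself marks elements inside the live page, and the set of tags treated as interactive here (`INTERACTIVE_TAGS`) is an assumption.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as interactive; the real module may differ.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementIndexer(HTMLParser):
    """Assign a running index to each interactive element, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append(
                {"index": len(self.elements), "tag": tag, "attrs": dict(attrs)}
            )


html = """
<a href="/home">Home</a>
<p>Some text</p>
<input type="text" name="q">
<button id="go">Search</button>
"""

indexer = InteractiveElementIndexer()
indexer.feed(html)

# Each interactive element now carries an index that an action can refer to.
for element in indexer.elements:
    print(element["index"], element["tag"], element["attrs"])
```

Once every interactive element has a stable index, an action such as a click or a keystroke only needs that index to identify its target.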

You can also call other methods like `scroll`, `press_key`, and `close`. You may refer to the API documentation or the [original code](https://github.com/modelscope/agentscope/blob/main/src/agentscope/browser/web_browser.py) for more details.
## Work with Agent

### Web Browsing Agent
The `WebVoyagerAgent` we implemented is a simple agent that can perform web browsing tasks. It is implemented in [web_voyager_agent.py](https://github.com/modelscope/agentscope/blob/main/src/agentscope/agents/web_voyager_agent.py). The agent's reply function is implemented in the `reply` method of the `WebVoyagerAgent` class, and it follows these steps: 1. visit web page -> 2. crawl the page -> 3. perform action -> 4. repeat iteration until the goal is achieved.
The above functions provide basic operations for interactive web browser control.
You can use them to build your own web browsing agent.

You can try interacting with the agent in [our example](https://github.com/modelscope/agentscope/blob/main/examples/conversation_with_web_voyager_agent/). Since the module is still in beta and the agent is far from perfect, you can try to improve it or even build an agent by yourself if you are capable.
In AgentScope, the web browser functions are also tool functions, so you can use them together with the service toolkit module to build your own agent.
We also provide a [web browser agent](https://github.com/modelscope/agentscope/tree/main/examples/conversation_with_web_browser_agent) in our example.
You can refer to it for more details.
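
As a rough sketch of how such an agent might be organized, the loop below observes the browser state, picks an action, and executes it until the task is finished. Everything here is hypothetical: `StubBrowser` stands in for the real browser, its method signatures are assumed, and `choose_action` replaces the model call with a fixed script.

```python
class StubBrowser:
    """Hypothetical stand-in for a browser; NOT the real WebBrowser API."""

    def __init__(self):
        self.page_url = "about:blank"
        self.history = []

    def action_visit_url(self, url):
        self.page_url = url
        self.history.append(("visit", url))

    def action_click(self, element_id):
        self.history.append(("click", element_id))


def choose_action(observation, step):
    """Placeholder for the model call: returns a scripted (name, kwargs) pair."""
    script = [
        ("action_visit_url", {"url": "https://www.bing.com"}),
        ("action_click", {"element_id": 3}),
        ("finish", {}),
    ]
    return script[step]


def run_agent(browser, max_steps=5):
    """Observe, decide, act; stop when the policy says 'finish'."""
    for step in range(max_steps):
        observation = {"url": browser.page_url}  # 1. observe the state
        name, kwargs = choose_action(observation, step)  # 2. decide
        if name == "finish":
            return step
        getattr(browser, name)(**kwargs)  # 3. act
    return max_steps


browser = StubBrowser()
steps_taken = run_agent(browser)
print(steps_taken, browser.history)
```

In a real agent, `choose_action` would be replaced by a call to a vision model that receives the page state and returns the next action.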

Enjoy exploring the web with our agent!

[[Back to the top]](#211-web-en)