chore: switch to pdm #157

Open · wants to merge 8 commits into `main`

Changes from all commits
60 changes: 35 additions & 25 deletions .github/workflows/upload-package.yml
@@ -1,34 +1,44 @@
 name: Upload Python Package

 on:
-  push:
-    tags:
-      - 'v*'
+  push:
+    tags:
+      - "v*"

 jobs:
-  deploy:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-
-      - name: Set up Python
-        uses: actions/setup-python@v3
-        with:
-          python-version: '3.8'
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install setuptools wheel twine
-
-      - name: Build and check package
-        run: |
-          python setup.py sdist bdist_wheel
-          twine check dist/*
-
-      - name: Upload to PyPi
-        uses: pypa/[email protected]
-        with:
-          user: __token__
-          password: ${{ secrets.PYPI_API_TOKEN }}
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        # Versions are quoted so YAML does not parse 3.10 as the float 3.1
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        os: [ubuntu-latest, macOS-latest, windows-latest]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up PDM
+        uses: pdm-project/setup-pdm@v3
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          pdm sync -d
+      - name: Run Tests
+        run: |
+          pdm run -v pytest tests
+
+  pypi-publish:
+    name: upload release to PyPI
+    runs-on: ubuntu-latest
+    permissions:
+      # This permission is needed for private repositories.
+      contents: read
+      # IMPORTANT: this permission is mandatory for trusted publishing
+      id-token: write
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: pdm-project/setup-pdm@v3
+
+      - name: Publish package distributions to PyPI
+        run: pdm publish
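
For contributors who want to mirror this pipeline locally, the workflow's steps correspond roughly to the following commands (a sketch assuming PDM is installed; unlike CI, which uses trusted publishing, a local `pdm publish` will ask for PyPI credentials):

```bash
# Install all dependencies, including development ones (the "Install dependencies" step)
pdm sync -d

# Run the test suite (the "Run Tests" step)
pdm run -v pytest tests

# Build and upload the distribution to PyPI (the "pypi-publish" job)
pdm publish
```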
1 change: 1 addition & 0 deletions .gitignore
@@ -108,6 +108,7 @@ ipython_config.py
 # in version control.
 # https://pdm.fming.dev/#use-with-ide
 .pdm.toml
+.pdm-python

 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
143 changes: 89 additions & 54 deletions README.md
@@ -4,7 +4,7 @@
<strong>A framework to enable multimodal models to operate a computer.</strong>
</p>
<p align="center">
  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
</p>

<div align="center">
@@ -16,139 +16,172 @@
**This model is currently experiencing an outage so the self-operating computer may not work as expected.**
-->

## Key Features

- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4v, Gemini Pro Vision, and LLaVA**.
- **Future Plans**: Support for additional models.

## Ongoing Development

At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

## Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).

## Demo

<https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0>

## Run `Self-Operating Computer`

1. **Install the project**

   ```bash
   pip install self-operating-computer
   ```

2. **Run the project**

   ```bash
   operate
   ```

3. **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys)

   <div align="center">
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/key.png" width="300" style="margin: 10px;"/>
   </div>

4. **Give Terminal app the required permissions**: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".

   <div align="center">
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-1.png" width="300" style="margin: 10px;"/>
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-2.png" width="300" style="margin: 10px;"/>
   </div>

### Alternative installation with `.sh`

1. **Clone the repo** to a directory on your computer:

   ```bash
   git clone https://github.com/OthersideAI/self-operating-computer.git
   ```

2. **Cd into directory**:

   ```bash
   cd self-operating-computer
   ```

3. **Run the installation script**:

   ```bash
   ./run.sh
   ```

## Development

We use [PDM](https://pdm-project.org/latest/) as our package and dependency manager. You can find instructions for installation and usage [here](https://pdm-project.org/latest/#recommended-installation-method).
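
Once PDM is installed, day-to-day development looks roughly like this (a sketch of standard PDM commands, not project-specific tooling; the package name in `pdm add` is only an example):

```bash
# Install the project and its development dependencies from the lock file
pdm sync -d

# Add a new runtime dependency (updates pyproject.toml and pdm.lock)
pdm add requests

# Run the test suite inside the PDM-managed environment
pdm run pytest tests
```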

## Using `operate` Modes

### Multimodal Models `-m`

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's `gemini-pro-vision` by following the instructions below.

Start `operate` with the Gemini model

```bash
operate -m gemini-pro-vision
```

**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

### Locally Hosted LLaVA Through Ollama
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
*Note: Ollama currently only supports macOS and Linux.*

First, install Ollama on your machine from <https://ollama.ai/download>.

Once Ollama is installed, pull the LLaVA model:

```bash
ollama pull llava
```

This will download the model to your machine; it takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

```bash
ollama serve
```

That's it! Now start `operate` and select the LLaVA model:

```bash
operate -m llava
```

**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama in its [GitHub repository](https://www.github.com/ollama/ollama).

### Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below.

**Clone the repo** to a directory on your computer:

```bash
git clone https://github.com/OthersideAI/self-operating-computer.git
```

**Cd into directory**:

```bash
cd self-operating-computer
```

Install the additional dependencies in `requirements-audio.txt`:

```bash
pip install -r requirements-audio.txt
```

**Install device requirements**
For Mac users:

```bash
brew install portaudio
```

For Linux users:

```bash
sudo apt install portaudio19-dev python3-pyaudio
```

Run with voice mode:

```bash
operate --voice
```

### Optical Character Recognition Mode `-m gpt-4-with-ocr`

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` an element by its text, and the code then references the hash map to get the coordinates of the element it wanted to click.

Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode you can simply run `operate`; `operate -m gpt-4-with-ocr` will also work.
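
To illustrate the idea, here is a minimal, hypothetical sketch of such a lookup; the element names and coordinates are invented for clarity, and this is not the project's actual implementation:

```python
from typing import Dict, Optional, Tuple

# Hypothetical OCR output: element text mapped to (x, y) screen coordinates.
clickable_elements: Dict[str, Tuple[int, int]] = {
    "Submit": (412, 630),
    "Cancel": (298, 630),
    "Search": (512, 88),
}

def coordinates_for(text: str) -> Optional[Tuple[int, int]]:
    """Look up the coordinates of the element the model chose to click."""
    return clickable_elements.get(text)

print(coordinates_for("Submit"))  # (412, 630)
```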

### Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed [arXiv paper](https://arxiv.org/abs/2310.11441).
@@ -157,35 +190,37 @@ For this initial version, a simple YOLOv8 model is trained for button detection,

Start `operate` with the SoM model

```bash
operate -m gpt-4-with-som
```

## Contributions are Welcome

If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.

## Join Our Discord Community

For real-time discussions and community support, join our Discord server.

- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to the [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157) channel.

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).

## Compatibility

- This project is compatible with macOS, Windows, and Linux (with X server installed).

## OpenAI Rate Limiting Note

The `gpt-4-vision-preview` model is required. To unlock access to this model, your account needs to spend at least \$5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum \$5.

Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**.