chore: switch to pdm #157

Open · wants to merge 8 commits into `main`

Changes from all commits
60 changes: 35 additions & 25 deletions .github/workflows/upload-package.yml
@@ -1,34 +1,44 @@
 name: Upload Python Package

 on:
-  push:
-    tags:
-      - 'v*'
+  push:
+    tags:
+      - "v*"

 jobs:
-  deploy:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-
-      - name: Set up Python
-        uses: actions/setup-python@v3
-        with:
-          python-version: '3.8'
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install setuptools wheel twine
-
-      - name: Build and check package
-        run: |
-          python setup.py sdist bdist_wheel
-          twine check dist/*
-
-      - name: Upload to PyPi
-        uses: pypa/[email protected]
-        with:
-          user: __token__
-          password: ${{ secrets.PYPI_API_TOKEN }}
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        # Versions are quoted so YAML does not parse 3.10 as the float 3.1
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        os: [ubuntu-latest, macOS-latest, windows-latest]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up PDM
+        uses: pdm-project/setup-pdm@v3
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          pdm sync -d
+      - name: Run Tests
+        run: |
+          pdm run -v pytest tests
+
+  pypi-publish:
+    name: upload release to PyPI
+    runs-on: ubuntu-latest
+    permissions:
+      # This permission is needed for private repositories.
+      contents: read
+      # IMPORTANT: this permission is mandatory for trusted publishing
+      id-token: write
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: pdm-project/setup-pdm@v3
+
+      - name: Publish package distributions to PyPI
+        run: pdm publish
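
For contributors who want to mirror this pipeline locally, the workflow's steps correspond roughly to the following commands (a sketch assuming PDM is installed; unlike CI, which uses trusted publishing, a local `pdm publish` will ask for PyPI credentials):

```bash
# Install all dependencies, including development ones (the "Install dependencies" step)
pdm sync -d

# Run the test suite (the "Run Tests" step)
pdm run -v pytest tests

# Build and upload the distribution to PyPI (the "pypi-publish" job)
pdm publish
```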
1 change: 1 addition & 0 deletions .gitignore
@@ -108,6 +108,7 @@ ipython_config.py
 # in version control.
 # https://pdm.fming.dev/#use-with-ide
 .pdm.toml
+.pdm-python

 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
143 changes: 89 additions & 54 deletions README.md
@@ -4,7 +4,7 @@
<strong>A framework to enable multimodal models to operate a computer.</strong>
</p>
<p align="center">
  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
</p>

<div align="center">
@@ -16,139 +16,172 @@
**This model is currently experiencing an outage so the self-operating computer may not work as expected.**
-->

## Key Features

- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4v, Gemini Pro Vision, and LLaVA**.
- **Future Plans**: Support for additional models.

## Ongoing Development

At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

## Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).

## Demo

<https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0>

## Run `Self-Operating Computer`

1. **Install the project**

   ```bash
   pip install self-operating-computer
   ```

2. **Run the project**

   ```bash
   operate
   ```

3. **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys)

   <div align="center">
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/key.png" width="300" style="margin: 10px;"/>
   </div>

4. **Give Terminal app the required permissions**: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".

   <div align="center">
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-1.png" width="300" style="margin: 10px;"/>
     <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-2.png" width="300" style="margin: 10px;"/>
   </div>

### Alternative installation with `.sh`

1. **Clone the repo** to a directory on your computer:

   ```bash
   git clone https://github.com/OthersideAI/self-operating-computer.git
   ```

2. **Cd into directory**:

   ```bash
   cd self-operating-computer
   ```

3. **Run the installation script**:

   ```bash
   ./run.sh
   ```

## Development

We use [PDM](https://pdm-project.org/latest/) as our package and dependency manager. You can find instructions for installation and usage [here](https://pdm-project.org/latest/#recommended-installation-method).
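
Once PDM is installed, day-to-day development looks roughly like this (a sketch of standard PDM commands, not project-specific tooling; the package name in `pdm add` is only an example):

```bash
# Install the project and its development dependencies from the lock file
pdm sync -d

# Add a new runtime dependency (updates pyproject.toml and pdm.lock)
pdm add requests

# Run the test suite inside the PDM-managed environment
pdm run pytest tests
```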

## Using `operate` Modes

### Multimodal Models `-m`

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's `gemini-pro-vision` by following the instructions below.

Start `operate` with the Gemini model

```bash
operate -m gemini-pro-vision
```

**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

### Locally Hosted LLaVA Through Ollama
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
*Note: Ollama currently only supports macOS and Linux.*

First, install Ollama on your machine from <https://ollama.ai/download>.

Once Ollama is installed, pull the LLaVA model:

```bash
ollama pull llava
```

This will download the model to your machine; it takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

```bash
ollama serve
```

That's it! Now start `operate` and select the LLaVA model:

```bash
operate -m llava
```

**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama in its [GitHub repository](https://www.github.com/ollama/ollama).

### Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below.

**Clone the repo** to a directory on your computer:

```bash
git clone https://github.com/OthersideAI/self-operating-computer.git
```

**Cd into directory**:

```bash
cd self-operating-computer
```

Install the additional dependencies in `requirements-audio.txt`:

```bash
pip install -r requirements-audio.txt
```

**Install device requirements**
For Mac users:

```bash
brew install portaudio
```

For Linux users:

```bash
sudo apt install portaudio19-dev python3-pyaudio
```

Run with voice mode:

```bash
operate --voice
```

### Optical Character Recognition Mode `-m gpt-4-with-ocr`

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` an element by its text, and the code then references the hash map to get the coordinates of the element it wanted to click.

Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode you can simply run `operate`; `operate -m gpt-4-with-ocr` will also work.
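
To illustrate the idea, here is a minimal, hypothetical sketch of such a lookup; the element names and coordinates are invented for clarity, and this is not the project's actual implementation:

```python
from typing import Dict, Optional, Tuple

# Hypothetical OCR output: element text mapped to (x, y) screen coordinates.
clickable_elements: Dict[str, Tuple[int, int]] = {
    "Submit": (412, 630),
    "Cancel": (298, 630),
    "Search": (512, 88),
}

def coordinates_for(text: str) -> Optional[Tuple[int, int]]:
    """Look up the coordinates of the element the model chose to click."""
    return clickable_elements.get(text)

print(coordinates_for("Submit"))  # (412, 630)
```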

### Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed [arXiv paper](https://arxiv.org/abs/2310.11441).
@@ -157,35 +190,37 @@ For this initial version, a simple YOLOv8 model is trained for button detection,

Start `operate` with the SoM model

```bash
operate -m gpt-4-with-som
```

## Contributions are Welcome

If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter.

## Join Our Discord Community

For real-time discussions and community support, join our Discord server.

- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to the [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157) channel.

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).

## Compatibility

- This project is compatible with macOS, Windows, and Linux (with X server installed).

## OpenAI Rate Limiting Note

The `gpt-4-vision-preview` model is required. To unlock access to this model, your account needs to spend at least \$5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum \$5.

Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**.