
Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.


Key Features

  • Compatibility: Designed for various multimodal models.
  • Centralized Model Management: All model configurations are now managed in a single, easy-to-update file (operate/models/model_configs.py), simplifying the addition and management of new models.
  • Expanded Model Support: Now integrated with the latest OpenAI o3, o4-mini, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano models; Gemini 2.5 Pro and Gemini 2.5 Flash; and Gemma 3n (including the e2b and e4b variants) and Gemma 3:12b, alongside existing support for GPT-4o, Claude 3, Qwen-VL, and LLaVA.
  • Enhanced Ollama Integration: Improved handling for Ollama models, including default host configuration and more informative error messages.
  • Future Plans: Support for additional models.

Debugging

To gain deeper insights into the model's behavior and troubleshoot issues, you can run operate with the -d or --verbose flag. This will enable verbose mode, providing detailed debugging information directly in your terminal.

operate -d # or operate --verbose

When verbose mode is active, you will see:

  • Prompt Sent to AI: The exact JSON prompt that is constructed and sent to the language model for its decision.
  • Response from AI: The raw JSON response received back from the language model.

This detailed output is invaluable for understanding why the model made a particular decision, identifying issues with prompt construction, or debugging unexpected model behaviors.
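For orientation, the prompt printed in verbose mode follows the provider's chat-message format. Below is a minimal illustration of the general shape of an OpenAI-style vision message with a screenshot attached; the exact fields and prompt text the framework constructs may differ.

    # Illustrative only: the general shape of an OpenAI-style vision message.
    # The actual prompt content the framework sends may differ.
    example_prompt = [
        {"role": "system", "content": "You are operating a computer..."},  # hypothetical text
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Objective: open the browser"},  # hypothetical objective
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
            ],
        },
    ]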

Adding New Models

With the introduction of centralized model management, adding new models to the Self-Operating Computer Framework is straightforward. All model configurations are now defined in a single file: operate/models/model_configs.py.

To add a new model:

  1. Locate model_configs.py: Navigate to operate/models/model_configs.py in your project directory.

  2. Understand the MODELS Dictionary: This file contains a Python dictionary named MODELS. Each key in this dictionary is the internal identifier for a model (e.g., "gpt-4o", "llava"), and its value is another dictionary containing the model's configuration.

    Each model configuration dictionary has the following keys:

    • "api_key" (string): The name of the environment variable that stores the API key required for this model (e.g., "OPENAI_API_KEY", "GOOGLE_API_KEY", "OLLAMA_HOST"). If no API key is needed, you can omit this key or set its value to None.
    • "provider" (string): The service provider for the model (e.g., "openai", "google", "ollama", "anthropic", "qwen"). For OpenRouter models, the provider is internally set to "openrouter_internal" to ensure proper routing. This string directly corresponds to the logic implemented in operate/models/apis.py that handles API calls for that provider.
    • "display_name" (string): The user-friendly name that will appear in the interactive model selection menu when you run operate without specifying a model.
  3. Add Your New Model Entry: Append a new entry to the MODELS dictionary following the existing format. For example, to add a hypothetical new Ollama model:

    MODELS = {
        # ... existing models ...
        "my-new-ollama-model": {
            "api_key": "OLLAMA_HOST",
            "provider": "ollama",
            "display_name": "My New Ollama Model (via Ollama)",
        },
    }
  4. Ensure Provider Logic Exists: The "provider" you specify (e.g., "ollama") must have corresponding API call logic implemented in operate/models/apis.py. If you are adding a model from an existing provider (like OpenAI, Google, Ollama, Anthropic, or Qwen), the existing logic should handle it. If it's a completely new provider, you would need to add the necessary API integration code in operate/models/apis.py.

By following these steps, you can easily extend the framework to support new models without modifying core application logic in multiple places.
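For orientation, the provider dispatch in operate/models/apis.py can be pictured along the following lines. This is a minimal, self-contained sketch with stub helpers; the names are illustrative, not the framework's actual functions.

    # Hypothetical sketch of provider dispatch; names and stubs are
    # illustrative, not the framework's actual API.
    MODELS = {
        "gpt-4o": {"api_key": "OPENAI_API_KEY", "provider": "openai",
                   "display_name": "GPT-4o"},
        "llava": {"api_key": "OLLAMA_HOST", "provider": "ollama",
                  "display_name": "LLaVA (via Ollama)"},
    }

    def call_openai(messages):  # stub standing in for the real OpenAI call
        return f"openai response to {messages!r}"

    def call_ollama(messages):  # stub standing in for the real Ollama call
        return f"ollama response to {messages!r}"

    def call_model(model_id, messages):
        # route the request based on the model's configured provider
        provider = MODELS[model_id]["provider"]
        if provider == "openai":
            return call_openai(messages)
        if provider == "ollama":
            return call_ollama(messages)
        raise ValueError(f"No API logic implemented for provider: {provider}")

    print(call_model("gpt-4o", "click the search bar"))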

Demo

Demo video: final-low.mp4

Run Self-Operating Computer

Run from PyPI

To run the application by installing it from PyPI, follow these steps:

  1. Install the project:
    pip install self-ai-operating-computer
  2. Run the project:
    operate

Run from Source Code

To run the application from your local copy (after making changes), follow these steps:

  1. Uninstall previous installations (if any):

    pip uninstall self-ai-operating-computer

    Confirm uninstallation when prompted.

  2. Install in editable mode: Navigate to the project's root directory (where setup.py is located) and run:

    pip install -e .

    This links your local source code to your Python environment, so changes are immediately reflected.

  3. Run the project:

    operate

    If you run operate without any arguments, a welcome screen will be displayed, followed by an interactive model selection menu. If a model requires an API key that is not found in your environment variables or .env file, you will be prompted to enter it.

    You can also specify the OpenRouter model directly via an environment variable to bypass the interactive selection:

    export OPENROUTER_MODEL="openai/gpt-4o-mini" operate -m openrouter

    Or for other models:

    export OPENROUTER_MODEL="google/gemini-2.5-pro" operate -m openrouter

Once you select a model, you will be prompted to provide a custom system prompt if the CUSTOM_SYSTEM_PROMPT environment variable is not set. If the environment variable is set, the prompt screen will be skipped and the value from the environment variable will be used automatically. You can also load the prompt from a file if needed.
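A rough sketch of that resolution order is below; resolve_system_prompt and prompt_file are hypothetical names, and only the CUSTOM_SYSTEM_PROMPT variable name comes from the framework itself.

    # Hypothetical sketch of the custom-prompt resolution order described above.
    import os

    def resolve_system_prompt(prompt_file=None):
        env_prompt = os.environ.get("CUSTOM_SYSTEM_PROMPT")
        if env_prompt:                  # env var set: skip the prompt screen
            return env_prompt
        if prompt_file:                 # optionally load the prompt from a file
            with open(prompt_file) as f:
                return f.read()
        return input("Enter a custom system prompt (or leave blank): ")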

  1. Enter your OpenAI key: If you don't have one, you can obtain an OpenAI key here. If you need to change your key at a later point, run vim .env to open the .env file and replace the old key.
  2. Give the Terminal app the required permissions: As a last step, the Terminal app will ask for the "Screen Recording" and "Accessibility" permissions in the "Security & Privacy" pane of macOS's "System Preferences".

Using operate Modes

OpenAI models

The default model for the project is gpt-4o, which you can use by simply typing operate. To try running OpenAI's new o1 model, use the command below.

operate -m o1-with-ocr 

To experiment with OpenAI's latest gpt-4.1 model, run:

operate -m gpt-4.1 

To experiment with OpenAI's latest gpt-4.1 mini model, run:

operate -m gpt-4.1-mini 

To experiment with OpenAI's latest gpt-4.1 nano model, run:

operate -m gpt-4.1-nano 

To experiment with OpenAI's latest o3 model, run:

operate -m o3 

To experiment with OpenAI's latest o4-mini model, run:

operate -m o4-mini 

Multimodal Models -m

Try Google's gemini-1.5-pro-latest, gemini-2.5-pro, or gemini-2.5-flash by following the instructions below. Start operate with a Gemini model:

operate -m gemini-2.5-pro 

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

Try Claude -m claude-3

Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.

operate -m claude-3 

Try Qwen -m qwen-vl

Use Qwen-VL with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Qwen dashboard to get an API key and run the command below to try it.

operate -m qwen-vl 

Try LLaVA Hosted Through Ollama -m llava

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

ollama pull llava 

This will download the model to your machine; it takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

ollama serve 

That's it! Now start operate and select the LLaVA model:

operate -m llava 

Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Try Gemma 3n Hosted Through Ollama -m gemma3n

If you wish to experiment with the Self-Operating Computer Framework using Gemma 3n on your own machine, you can with Ollama!
Note: Ollama currently supports only macOS and Linux; Windows support is in preview.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the Gemma 3n model:

ollama pull gemma3n 

This will download the model on your machine. You can also pull specific versions like ollama pull gemma3n:e2b or ollama pull gemma3n:e4b.

When Ollama has finished pulling Gemma 3n, start the server:

ollama serve 

That's it! Now start operate and select the Gemma 3n model. You can specify gemma3n, gemma3n:e2b, or gemma3n:e4b:

operate -m gemma3n:e2b 

Try Gemma 3:12b Hosted Through Ollama -m gemma3:12b

If you wish to experiment with the Self-Operating Computer Framework using Gemma 3:12b on your own machine, you can with Ollama!
Note: Ollama currently supports only macOS and Linux; Windows support is in preview.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the Gemma 3:12b model:

ollama pull gemma3:12b 

This will download the model on your machine.

When Ollama has finished pulling Gemma 3:12b, start the server:

ollama serve 

That's it! Now start operate and select the Gemma 3:12b model:

operate -m gemma3:12b 

Learn more about Ollama at its GitHub Repository

Voice Mode --voice

The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:

git clone https://github.com/malah-code/self-ai-operating-computer.git 

Change into the directory:

cd self-ai-operating-computer 

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt 

Install the device requirements. For Mac users:

brew install portaudio 

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio 

Run with voice mode:

operate --voice 

Optical Character Recognition Mode -m gpt-4-with-ocr

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements keyed by their on-screen text, with each entry storing that element's coordinates. GPT-4 can decide to click an element by its text, and the code then looks that text up in the hash map to get the coordinates of the element GPT-4 wanted to click.
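A minimal sketch of that lookup is below; the dictionary contents and function name are hypothetical, not the framework's actual data structures.

    # Hypothetical sketch of the OCR text-to-coordinates lookup described above.
    clickable_elements = {
        "Submit": (412, 388),  # OCR-recognized text -> (x, y) screen coordinates
        "Cancel": (521, 388),
    }

    def click_by_text(text):
        if text not in clickable_elements:
            raise KeyError(f"OCR found no clickable element with text: {text!r}")
        x, y = clickable_elements[text]
        # a real implementation would move the mouse and click at (x, y)
        print(f"Clicking {text!r} at ({x}, {y})")

    click_by_text("Submit")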

Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply run:

operate

operate -m gpt-4-with-ocr will also work.

Set-of-Mark Prompting -m gpt-4-with-som

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper.

For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
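As a rough sketch, such a detector can be loaded and run with the ultralytics package roughly as follows, assuming a screenshot saved as screenshot.png; the framework's actual SoM integration may differ.

    # Rough sketch: run the bundled YOLOv8 button detector on a screenshot.
    # Requires the ultralytics package; screenshot.png is a hypothetical input.
    from ultralytics import YOLO

    model = YOLO("model/weights/best.pt")  # weights path included in the repo
    results = model("screenshot.png")
    for box in results[0].boxes.xyxy:      # one (x1, y1, x2, y2) per detected button
        x1, y1, x2, y2 = box.tolist()
        print(f"Button at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f})")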

Start operate with the SoM model:

operate -m gpt-4-with-som 

Contributions are Welcome!

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments.

Compatibility

  • This project is compatible with macOS, Windows, and Linux (with an X server installed).

OpenAI Rate Limiting Note

The gpt-4o model is required. To unlock access to this model, your account needs to have spent at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5. Learn more here.

Supported Models Summary

Here's a summary of all currently supported models and how to run them:

  • OpenAI GPT-4o (Default):
    operate 
  • OpenAI o1-with-ocr:
    operate -m o1-with-ocr 
  • OpenAI o3:
    operate -m o3 
  • OpenAI o4-mini:
    operate -m o4-mini 
  • OpenAI gpt-4.1:
    operate -m gpt-4.1 
  • OpenAI gpt-4.1 mini:
    operate -m gpt-4.1-mini 
  • OpenAI gpt-4.1 nano:
    operate -m gpt-4.1-nano 
  • OpenAI gpt-4-with-ocr:
    operate -m gpt-4-with-ocr 
  • OpenAI gpt-4-with-som:
    operate -m gpt-4-with-som 
  • Google Gemini 1.5 Pro (latest):
    operate -m gemini-1.5-pro-latest 
  • Google Gemini 2.5 Pro:
    operate -m gemini-2.5-pro 
  • Google Gemini 2.5 Flash:
    operate -m gemini-2.5-flash 
  • Anthropic Claude 3:
    operate -m claude-3 
  • Qwen-VL:
    operate -m qwen-vl 
  • LLaVA (via Ollama):
    operate -m llava 
  • Gemma 3n (via Ollama):
    operate -m gemma3n 
  • Gemma 3n:e2b (via Ollama):
    operate -m gemma3n:e2b 
  • Gemma 3n:e4b (via Ollama):
    operate -m gemma3n:e4b 
  • Gemma 3:12b (via Ollama):
    operate -m gemma3:12b 
  • OpenRouter:
    operate -m openrouter 
    (Selecting this option will prompt you to enter the full OpenRouter model name, e.g., google/gemini-2.0-flash-001.)

Release Notes

Version 0.0.X (Latest)

New Features:

  • Interactive Model Selection: When running operate without specifying a model, a welcome screen is displayed, followed by an interactive menu to select your desired model.
  • Dynamic API Key Prompting: The application now intelligently prompts for required API keys (e.g., OpenAI, Google, Anthropic) only when a model requiring that key is selected and the key is not found in your environment variables or .env file.
  • Custom System Prompt: Users can now provide a custom system prompt from a file or an environment variable (CUSTOM_SYSTEM_PROMPT). If the environment variable is set, the option to load from it will be hidden.

Improvements:

  • Expanded Google Gemini Support: Added full support for gemini-2.5-pro and gemini-2.5-flash models.
  • Enhanced Ollama Integration: Improved handling for Ollama models, including setting http://localhost:11434 as the default host and providing more informative error messages when Ollama models are not found (see the sketch after this list).
  • Gemma 3n Model Support: Integrated support for gemma3n, gemma3n:e2b, and gemma3n:e4b models via Ollama.
  • Robust Error Handling: Improved error handling for API calls to prevent unexpected fallbacks and provide clearer error messages.
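The default-host behavior can be pictured as a one-line fallback; this is a hypothetical sketch, not the framework's actual code.

    # Hypothetical sketch of the default-host fallback described above.
    import os

    def get_ollama_host():
        # use OLLAMA_HOST if set, otherwise the documented default
        return os.environ.get("OLLAMA_HOST", "http://localhost:11434")

    print(get_ollama_host())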

Bug Fixes:

  • Resolved an issue where the application would incorrectly prompt for an OpenAI API key when a Google Gemini model was selected.
  • Fixed an issue where the application would attempt to use an incorrect model name for gemini-2.5-flash-lite (which is not a valid model name).
  • Addressed the "Extra data" JSON parsing error when receiving responses from Gemini models.
