Computer Use Framework (Operative)
Framework enabling AI agents to control computers through visual observation and programmatic actions: mouse, keyboard, browser, terminal
ZIP-0422: Computer Use Framework (Operative)
Abstract
This proposal specifies Operative, a framework that enables AI agents to use computers the way humans do: by observing screen contents (screenshots, DOM trees), deciding on actions (click, type, scroll, navigate), and executing them through programmatic control of mouse, keyboard, and browser. Operative bridges the gap between AI tool use (ZIP-0412) and the millions of GUI-based applications that have no API.
Motivation
MCP (ZIP-0412) gives agents access to 260+ tools via API. But most of the world's software is GUI-only: web applications, desktop software, mobile apps. When a conservation researcher needs to:
- Log into a government wildlife database (GUI-only web portal)
- Download species survey data (click through menus)
- Process it in a desktop GIS application (GUI-only)
- Submit results to a conservation platform (web form)
...the agent needs computer use capabilities, not just API access.
Specification
Architecture
Agent (Zen-VL + MCP) ─────────> Operative Controller
                                         │
                              ┌──────────┼──────────┐
                              │          │          │
                         ┌────┴────┐ ┌───┴───┐ ┌────┴────┐
                         │ Browser │ │Desktop│ │Terminal │
                         │ Control │ │Control│ │ Control │
                         └────┬────┘ └───┬───┘ └────┬────┘
                              │          │          │
                         Playwright  PyAutoGUI     PTY
                             CDP   Accessibility subprocess
                                        API
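
The fan-out from the controller to the three control backends can be sketched as a small dispatcher. This is a minimal illustration, not the hanzo/operative implementation: the `Backend` protocol, the `RecordingBackend` stand-in, and the `dispatch(surface, action, ...)` signature are all assumptions made for the example. A real backend would wrap Playwright, PyAutoGUI, or a PTY instead of recording calls.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol, Tuple


class Backend(Protocol):
    """Hypothetical interface that each control backend would satisfy."""

    def execute(self, action: str, **params: Any) -> str: ...


@dataclass
class RecordingBackend:
    """Stand-in backend that records calls instead of driving real software."""

    name: str
    calls: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)

    def execute(self, action: str, **params: Any) -> str:
        self.calls.append((action, params))
        return f"{self.name}:{action}"


class OperativeController:
    """Forwards each agent action to the backend for the chosen surface."""

    def __init__(self, backends: Dict[str, Backend]) -> None:
        # Assumed surface names: "browser", "desktop", "terminal".
        self.backends = dict(backends)

    def dispatch(self, surface: str, action: str, **params: Any) -> str:
        try:
            backend = self.backends[surface]
        except KeyError:
            raise LookupError(f"no backend registered for {surface!r}") from None
        return backend.execute(action, **params)


browser = RecordingBackend("browser")
controller = OperativeController({"browser": browser})
result = controller.dispatch("browser", "navigate", url="https://example.org")
```

Keeping the controller ignorant of what each backend does internally means a browser, desktop, or terminal driver can be swapped or mocked without touching the agent side.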
Observation Space
| Observation Type | Source | Use |
|---|---|---|
| Screenshot | Screen capture | Visual understanding via Zen-VL |
| DOM snapshot | Browser CDP | Structured page understanding |
| Accessibility tree | OS API | Widget identification |
| Terminal output | PTY | Command result parsing |
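
One way to carry the four observation types to the agent is a single container in which each source is optional. The sketch below is illustrative only: the field names, the `Observation` class, and the preference order in `preferred_source` are assumptions for the example and are not defined by this proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """One snapshot of what the agent can see; every source is optional."""

    screenshot_png: Optional[bytes] = None     # screen capture for the vision model
    dom_snapshot: Optional[dict] = None        # structured page data from the browser
    accessibility_tree: Optional[dict] = None  # widget hierarchy from the OS API
    terminal_output: Optional[str] = None      # text read from the PTY

    def available(self) -> List[str]:
        """Names of the observation types present in this snapshot."""
        return [name for name, value in vars(self).items() if value is not None]

    def preferred_source(self) -> Optional[str]:
        """Pick a structured source before falling back to pixels.

        The ordering here is an assumption for illustration.
        """
        order = ("dom_snapshot", "accessibility_tree", "terminal_output", "screenshot_png")
        for name in order:
            if getattr(self, name) is not None:
                return name
        return None


obs = Observation(screenshot_png=b"\x89PNG", dom_snapshot={"tag": "html"})
```

Here `obs.available()` lists both the screenshot and the DOM snapshot, and `obs.preferred_source()` selects the DOM snapshot because structured data is checked first.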
Action Space
| Action | Parameters | Description |
|---|---|---|
| click | (x, y, button) | Mouse click at coordinates |
| type | (text) | Keyboard input |
| key | (key_combo) | Special key combination (Ctrl+C, etc.) |
| scroll | (x, y, delta) | Mouse scroll |
| navigate | (url) | Browser navigation |
| wait | (condition, timeout) | Wait for element/condition |
| screenshot | () | Capture current screen state |
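
The action table maps naturally onto typed records that can be validated before anything touches the mouse or keyboard. The following is a sketch under stated assumptions: the dict payload shape (`{"action": "click", ...}`), the default `button="left"`, and the timeout being in seconds are choices made for the example, not part of the specification.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int
    button: str = "left"  # assumed default


@dataclass(frozen=True)
class Type:
    text: str


@dataclass(frozen=True)
class Key:
    key_combo: str


@dataclass(frozen=True)
class Scroll:
    x: int
    y: int
    delta: int


@dataclass(frozen=True)
class Navigate:
    url: str


@dataclass(frozen=True)
class Wait:
    condition: str
    timeout: float  # assumed to be seconds


@dataclass(frozen=True)
class Screenshot:
    pass


Action = Union[Click, Type, Key, Scroll, Navigate, Wait, Screenshot]

ACTION_TYPES = {
    "click": Click, "type": Type, "key": Key, "scroll": Scroll,
    "navigate": Navigate, "wait": Wait, "screenshot": Screenshot,
}


def parse_action(payload: dict) -> Action:
    """Build a typed action from a dict, rejecting unknown names or parameters."""
    data = dict(payload)  # copy so the caller's dict is not mutated
    name = data.pop("action", None)
    cls = ACTION_TYPES.get(name)
    if cls is None:
        raise ValueError(f"unknown action: {name!r}")
    try:
        return cls(**data)
    except TypeError as exc:
        raise ValueError(f"bad parameters for {name!r}: {exc}") from exc


click = parse_action({"action": "click", "x": 10, "y": 20})
```

Rejecting malformed payloads at parse time keeps model output errors, such as a misspelled parameter, from reaching the control backends.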
Safety
- Sandboxed execution: All computer use happens in isolated containers
- Action approval: Destructive actions (delete, submit, purchase) require user approval
- Undo capability: All actions are logged and reversible where possible
- Rate limiting: Maximum actions per minute to prevent runaway agents
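
Three of these safeguards (approval, logging, rate limiting) can be combined in a single gate that every action passes through before execution. This is a minimal sketch, not the deployed mechanism: the `SafetyGate` class, the `intent` label supplied by the caller, the `approve` callback, and the default of 30 actions per minute are all assumptions for illustration. Sandboxing and actual undo are outside its scope; the log only records what was decided.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple

# Categories named in the Safety list above.
DESTRUCTIVE = {"delete", "submit", "purchase"}


class SafetyGate:
    """Decides whether an action may run, and logs every decision."""

    def __init__(
        self,
        approve: Callable[[str, str], bool],
        max_per_minute: int = 30,  # assumed limit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.approve = approve
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._recent: Deque[float] = deque()  # timestamps of allowed actions
        self.log: List[Tuple[float, str, str, str]] = []

    def check(self, action: str, intent: str = "") -> bool:
        now = self.clock()
        # Drop timestamps that have left the 60-second sliding window.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        if len(self._recent) >= self.max_per_minute:
            verdict = "rate_limited"
        elif intent in DESTRUCTIVE and not self.approve(action, intent):
            verdict = "denied"
        else:
            verdict = "allowed"
            self._recent.append(now)
        self.log.append((now, action, intent, verdict))
        return verdict == "allowed"


fake_time = [0.0]
gate = SafetyGate(approve=lambda a, i: False, max_per_minute=2, clock=lambda: fake_time[0])
outcomes = [
    gate.check("click"),
    gate.check("click", intent="purchase"),
    gate.check("type"),
    gate.check("scroll"),
]
```

In this usage the approver rejects everything, so the purchase click is denied, and the fourth action is refused because two actions were already allowed within the same minute. Injecting the clock makes the rate limit testable without sleeping.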
Research Papers
- hanzo-operative: Operative framework specification
- hanzo-operate-computer: Computer use architecture
- zen-voyager: Zen-Voyager web navigation model
Implementation
- hanzo/operative: Production computer use framework
- hanzo/mcp: MCP integration for computer use actions
- hanzo/chat: Chat interface with computer use mode
Timeline
- Originated: September 2024 (Operative architecture)
- Research: hanzo-operative published Q4 2024
- Implementation: Operative framework deployed Q4 2024