ZIP-0274

Computer Use Framework (Operative)

Final

Framework enabling AI agents to control computers through visual observation and programmatic actions -- mouse, keyboard, browser, terminal

Type: Standards Track
Category: AI
Author: Zoo Labs Foundation
Created: 2024-09-01
Tags: computer-use, operative, gui-agent, browser-automation, desktop-automation

ZIP-0422: Computer Use Framework (Operative)

Abstract

This proposal specifies Operative, a framework that enables AI agents to use computers the way humans do: by observing screen contents (screenshots, DOM trees), deciding on actions (click, type, scroll, navigate), and executing them through programmatic control of mouse, keyboard, and browser. Operative bridges the gap between AI tool use (ZIP-0412) and the millions of GUI-based applications that have no API.

Motivation

MCP (ZIP-0412) gives agents access to 260+ tools via API. But most of the world's software is GUI-only: web applications, desktop software, mobile apps. When a conservation researcher needs to:

  1. Log into a government wildlife database (GUI-only web portal)
  2. Download species survey data (click through menus)
  3. Process it in a desktop GIS application (GUI-only)
  4. Submit results to a conservation platform (web form)

...the agent needs computer use capabilities, not just API access.

Specification

Architecture

Agent (Zen-VL + MCP) ─────────> Operative Controller
                                      │
                           ┌──────────┼──────────┐
                           │          │          │
                      ┌────┴────┐ ┌───┴───┐ ┌────┴────┐
                      │ Browser │ │Desktop│ │Terminal │
                      │ Control │ │Control│ │ Control │
                      └────┬────┘ └───┬───┘ └────┬────┘
                           │          │          │
                      Playwright   PyAutoGUI       PTY
                      CDP          Accessibility   subprocess
                                   API
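
The routing in the diagram can be sketched as a controller that forwards each action to whichever backend is registered for the target surface. This is a minimal illustration, not the Operative implementation: the names OperativeController, Backend, RecordingBackend, register, and dispatch are hypothetical, and the stand-in backend only records calls rather than driving Playwright, PyAutoGUI, or a PTY.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface that each control surface would implement."""

    def execute(self, action: str, **params: object) -> str: ...


@dataclass
class RecordingBackend:
    """Stand-in backend: records calls instead of driving real software."""

    name: str
    calls: list = field(default_factory=list)

    def execute(self, action: str, **params: object) -> str:
        self.calls.append((action, params))
        return f"{self.name}:{action}"


class OperativeController:
    """Routes each action to the backend registered for its surface."""

    def __init__(self) -> None:
        self._backends: dict = {}

    def register(self, surface: str, backend: Backend) -> None:
        self._backends[surface] = backend

    def dispatch(self, surface: str, action: str, **params: object) -> str:
        if surface not in self._backends:
            raise KeyError(f"no backend registered for surface {surface!r}")
        return self._backends[surface].execute(action, **params)


controller = OperativeController()
browser = RecordingBackend("browser")
controller.register("browser", browser)
controller.register("desktop", RecordingBackend("desktop"))
controller.register("terminal", RecordingBackend("terminal"))

result = controller.dispatch("browser", "navigate", url="https://example.org")
```

Keeping the backends behind one small interface means a real browser, desktop, or terminal driver could replace the recording stand-in without changing the agent-facing dispatch call.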

Observation Space

Observation Type     Source           Use
Screenshot           Screen capture   Visual understanding via Zen-VL
DOM snapshot         Browser CDP      Structured page understanding
Accessibility tree   OS API           Widget identification
Terminal output      PTY              Command result parsing
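
One way to represent these observations in code is a tagged record per observation type. The sketch below is illustrative only: Observation, ObservationType, SOURCE, and split_for_model are hypothetical names, and the idea of sending screenshots as images and everything else as text is an assumption about how an agent prompt might be assembled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class ObservationType(Enum):
    SCREENSHOT = "screenshot"
    DOM_SNAPSHOT = "dom_snapshot"
    ACCESSIBILITY_TREE = "accessibility_tree"
    TERMINAL_OUTPUT = "terminal_output"


# Source of each observation type, as listed in the table above.
SOURCE = {
    ObservationType.SCREENSHOT: "Screen capture",
    ObservationType.DOM_SNAPSHOT: "Browser CDP",
    ObservationType.ACCESSIBILITY_TREE: "OS API",
    ObservationType.TERMINAL_OUTPUT: "PTY",
}


@dataclass(frozen=True)
class Observation:
    kind: ObservationType
    payload: Union[bytes, str]  # assumed: raw image bytes or serialized text


def split_for_model(observations):
    """Separate image payloads from text payloads (assumed prompt layout)."""
    images = [o.payload for o in observations
              if o.kind is ObservationType.SCREENSHOT]
    texts = [o.payload for o in observations
             if o.kind is not ObservationType.SCREENSHOT]
    return images, texts


batch = [
    Observation(ObservationType.SCREENSHOT, b"\x89PNG..."),
    Observation(ObservationType.DOM_SNAPSHOT, "<html><body>Login</body></html>"),
    Observation(ObservationType.TERMINAL_OUTPUT, "exit code 0"),
]
images, texts = split_for_model(batch)
```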

Action Space

Action       Parameters             Description
click        (x, y, button)         Mouse click at coordinates
type         (text)                 Keyboard input
key          (key_combo)            Special key combination (Ctrl+C, etc.)
scroll       (x, y, delta)          Mouse scroll
navigate     (url)                  Browser navigation
wait         (condition, timeout)   Wait for element/condition
screenshot   ()                     Capture current screen state
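
The action space above lends itself to a small schema check before any action reaches a backend. The following is a sketch under assumptions: ACTION_PARAMS, Action, and make_action are hypothetical names, and treating every listed parameter as required is a simplification, since the specification does not say which parameters are optional.

```python
from dataclasses import dataclass

# Action names and parameter names, copied from the table above.
ACTION_PARAMS = {
    "click": ("x", "y", "button"),
    "type": ("text",),
    "key": ("key_combo",),
    "scroll": ("x", "y", "delta"),
    "navigate": ("url",),
    "wait": ("condition", "timeout"),
    "screenshot": (),
}


@dataclass
class Action:
    name: str
    params: dict


def make_action(name: str, **params: object) -> Action:
    """Build an Action, rejecting unknown names and mismatched parameters."""
    if name not in ACTION_PARAMS:
        raise ValueError(f"unknown action: {name!r}")
    expected = set(ACTION_PARAMS[name])
    given = set(params)
    if given != expected:
        missing = sorted(expected - given)
        extra = sorted(given - expected)
        raise ValueError(f"{name}: missing={missing} unexpected={extra}")
    return Action(name, dict(params))


click = make_action("click", x=120, y=340, button="left")
shot = make_action("screenshot")
```

Validating at construction time keeps malformed model output, such as a click without coordinates, from ever reaching the mouse or keyboard layer.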

Safety

  • Sandboxed execution: All computer use happens in isolated containers
  • Action approval: Destructive actions (delete, submit, purchase) require user approval
  • Undo capability: All actions are logged and reversible where possible
  • Rate limiting: Maximum actions per minute to prevent runaway agents
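
Three of these safeguards (action approval, logging, and rate limiting) can be combined into a single gate that every action passes through. The sketch below is an illustration, not the Operative implementation: SafetyGate, is_destructive, and the keyword-matching rule for detecting destructive actions are assumptions, and container sandboxing is out of scope for this example.

```python
import time
from collections import deque

# Assumed heuristic: an action is destructive if its description
# mentions one of the verbs named in the list above.
DESTRUCTIVE_KEYWORDS = ("delete", "submit", "purchase")


def is_destructive(description: str) -> bool:
    lowered = description.lower()
    return any(word in lowered for word in DESTRUCTIVE_KEYWORDS)


class SafetyGate:
    """Applies a per-minute rate limit and an approval check, logging each decision."""

    def __init__(self, max_per_minute, approve, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.approve = approve      # callback: description -> bool
        self.clock = clock          # injectable for testing
        self._timestamps = deque()  # times of recently allowed actions
        self.log = []               # (description, decision) pairs

    def check(self, description: str) -> bool:
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self._timestamps and now - self._timestamps[0] >= 60.0:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_per_minute:
            self.log.append((description, "rate_limited"))
            return False
        if is_destructive(description) and not self.approve(description):
            self.log.append((description, "denied"))
            return False
        self._timestamps.append(now)
        self.log.append((description, "allowed"))
        return True


# Usage: a gate that allows 30 actions per minute and denies every approval request.
gate = SafetyGate(max_per_minute=30, approve=lambda description: False)
allowed = gate.check("click the search button")
blocked = gate.check("submit the results form")
```

In a real deployment the approve callback would prompt the user rather than return a fixed value, and the log would need enough detail per action to support the undo capability described above.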

Research Papers

Implementation

  • hanzo/operative: Production computer use framework
  • hanzo/mcp: MCP integration for computer use actions
  • hanzo/chat: Chat interface with computer use mode

Timeline

  • Originated: September 2024 (Operative architecture)
  • Research: hanzo-operative published Q4 2024
  • Implementation: Operative framework deployed Q4 2024