computer-use.md

Computer Use Tool

Status: Planned

The most powerful tool: it gives the daemon direct control over the computer's screen, mouse, and keyboard.

Capabilities

  • Take screenshots
  • Move mouse cursor
  • Click (left, right, double)
  • Type text
  • Press keyboard shortcuts
  • Scroll
  • Drag and drop

Interface

typescript
interface ComputerUseTool extends Tool {
  name: "computer_use";
  actions: {
    // Vision
    screenshot(): Promise<{ image: Buffer; dimensions: Size }>;

    // Mouse
    click(x: number, y: number): Promise<void>;
    doubleClick(x: number, y: number): Promise<void>;
    rightClick(x: number, y: number): Promise<void>;
    moveMouse(x: number, y: number): Promise<void>;
    drag(fromX: number, fromY: number, toX: number, toY: number): Promise<void>;
    scroll(direction: "up" | "down", amount: number): Promise<void>;

    // Keyboard
    type(text: string): Promise<void>;
    keyPress(key: string, modifiers?: string[]): Promise<void>;
  };
}
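As a rough illustration of how a caller might use these actions, the sketch below runs a short click, type, and submit flow against an in-memory stub. `InputActions`, `RecordingActions`, and `submitQuery` are hypothetical names introduced only for this example; `InputActions` restates a subset of the `actions` members above so the snippet stands alone. Because the tool is still planned, nothing here touches real input devices.

```typescript
// Hypothetical subset of ComputerUseTool["actions"], restated so the sketch is self-contained.
interface InputActions {
  click(x: number, y: number): Promise<void>;
  type(text: string): Promise<void>;
  keyPress(key: string, modifiers?: string[]): Promise<void>;
}

// In-memory stub that records calls instead of driving the real mouse or keyboard.
class RecordingActions implements InputActions {
  readonly calls: string[] = [];

  async click(x: number, y: number): Promise<void> {
    this.calls.push(`click(${x}, ${y})`);
  }

  async type(text: string): Promise<void> {
    this.calls.push(`type(${JSON.stringify(text)})`);
  }

  async keyPress(key: string, modifiers: string[] = []): Promise<void> {
    this.calls.push(`keyPress(${[...modifiers, key].join("+")})`);
  }
}

// Example flow: focus a text field, enter a query, and submit it with Enter.
async function submitQuery(
  actions: InputActions,
  x: number,
  y: number,
  query: string,
): Promise<void> {
  await actions.click(x, y);
  await actions.type(query);
  await actions.keyPress("Enter");
}
```

A stub like this could also make it possible to test higher-level daemon logic before any platform backend exists.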

Implementation Options

macOS

  • AppleScript - Basic automation
  • Accessibility APIs - Full control (requires permissions)
  • CGEvent - Low-level input events

Cross-Platform

  • nut-tree/nut.js - Node.js native automation
  • RobotJS - Older but stable

Visual Understanding

  • MLX Grounding Model - Understand what's on screen
  • OCR - Extract text from screenshots

Security Considerations

  • Requires accessibility permissions on macOS
  • Should confirm with the user before sensitive actions
  • Should log all actions for an audit trail
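One possible way to combine confirmation and audit logging is a wrapper around whatever function actually executes an action. The sketch below is a minimal illustration under stated assumptions: the `Action` shape, the `isSensitive` policy, and the `guarded` helper are all hypothetical, and the policy shown (typed text and modified shortcuts count as sensitive) is only a placeholder for whatever rules are eventually chosen.

```typescript
// Hypothetical action description; the field names are assumptions for this sketch.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "keyPress"; key: string; modifiers: string[] };

interface AuditEntry {
  timestamp: string; // ISO 8601
  action: Action;
  outcome: "allowed" | "denied";
}

// Placeholder policy: typed text and shortcuts that use a modifier are treated as sensitive.
function isSensitive(action: Action): boolean {
  if (action.kind === "type") return true;
  if (action.kind === "keyPress") return action.modifiers.length > 0;
  return false;
}

// Wraps an executor so every request is logged and sensitive ones need confirmation first.
// The returned function resolves to true if the action was allowed to run.
function guarded(
  execute: (action: Action) => Promise<void>,
  confirm: (action: Action) => Promise<boolean>,
  log: AuditEntry[],
): (action: Action) => Promise<boolean> {
  return async (action: Action): Promise<boolean> => {
    const allowed = !isSensitive(action) || (await confirm(action));
    log.push({
      timestamp: new Date().toISOString(),
      action,
      outcome: allowed ? "allowed" : "denied",
    });
    if (allowed) await execute(action);
    return allowed;
  };
}
```

Note that this sketch logs typed text verbatim; a real audit trail would probably need to redact it, since it may contain passwords.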

Open Questions

  1. How to handle multi-monitor setups?
  2. How to deal with high-DPI/Retina displays?
  3. Should we support window-specific actions?
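To make the first two questions more concrete, here is a small sketch of the coordinate arithmetic they imply. It assumes screenshots are captured in physical pixels per display while the input backend expects global logical coordinates (points), which is typically the case on Retina displays but would need to be verified for whichever backend is chosen. The `Display` type and `screenshotToGlobal` function are hypothetical.

```typescript
// Hypothetical display description; the daemon would need to obtain this from the OS.
interface Display {
  origin: { x: number; y: number }; // top-left corner in global logical coordinates
  scaleFactor: number;              // 2 on a typical Retina display, 1 on a standard one
}

// Converts a pixel position inside one display's screenshot to a global logical position.
function screenshotToGlobal(
  px: number,
  py: number,
  display: Display,
): { x: number; y: number } {
  return {
    x: display.origin.x + px / display.scaleFactor,
    y: display.origin.y + py / display.scaleFactor,
  };
}
```

For example, on a secondary display with origin (1440, 0) and scale factor 2, the screenshot pixel (200, 100) would map to the global point (1540, 50).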