Engineering a macOS AI Agent: Lessons from Building Fazm with ScreenCaptureKit and Swift

What We Learned Building a macOS AI Agent in Swift (ScreenCaptureKit, Accessibility APIs, Async Pipelines)

Matthew Diakonov and his team spent six months building Fazm, an open-source macOS agent controlled by voice. Fazm uses ScreenCaptureKit to receive hardware-accelerated screen frames as raw pixel buffers for UI monitoring. It runs WhisperKit inference locally on Apple Silicon, so audio never leaves the machine and transcription avoids a network round trip.

Why This Matters

Most AI agents rely on screenshot-and-OCR methods, which are fragile in production because they depend on resolution and cannot resolve visual ambiguity. A button that is 40px wide on a Retina display has different pixel dimensions and coordinates at another scale factor, and identical-looking elements often serve different functions. The macOS Accessibility APIs instead expose a stable, hierarchical representation of the UI. Agents can query roles and labels directly, so automation keeps working even when an application undergoes a visual redesign or a theme change.
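The role-and-label query described above can be sketched in Swift. This is a minimal sketch under stated assumptions, not Fazm's implementation: `UINode`, `findNodes`, `axAttribute`, and `snapshot` are hypothetical names introduced here. The search runs on a plain value-type snapshot so it can be exercised on any platform. The adapter that reads live elements through AXUIElementCopyAttributeValue compiles only where ApplicationServices is available, and it assumes the process has already been granted Accessibility permission.

```swift
#if canImport(ApplicationServices)
import ApplicationServices
#endif

/// Platform-neutral snapshot of one accessibility element (hypothetical type).
struct UINode {
    let role: String        // e.g. "AXButton"
    let title: String?      // value of the AXTitle attribute, if any
    let children: [UINode]
}

/// Pre-order depth-first search for nodes with the given role and,
/// when `title` is non-nil, an exactly matching title.
func findNodes(in root: UINode, role: String, title: String? = nil) -> [UINode] {
    var matches: [UINode] = []
    var stack = [root]
    while let node = stack.popLast() {
        if node.role == role && (title == nil || node.title == title) {
            matches.append(node)
        }
        // Reverse so the first child is popped (visited) first.
        stack.append(contentsOf: node.children.reversed())
    }
    return matches
}

#if canImport(ApplicationServices)
/// Reads one attribute from a live element; returns nil on any AXError.
func axAttribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    let status = AXUIElementCopyAttributeValue(element, name as CFString, &value)
    return status == .success ? value : nil
}

/// Copies a live accessibility subtree into UINode values.
/// `maxDepth` is an assumed safety bound for very deep hierarchies.
func snapshot(_ element: AXUIElement, maxDepth: Int = 10) -> UINode {
    let role = axAttribute(element, kAXRoleAttribute) as? String ?? "AXUnknown"
    let title = axAttribute(element, kAXTitleAttribute) as? String
    var children: [UINode] = []
    if maxDepth > 0,
       let kids = axAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        children = kids.map { snapshot($0, maxDepth: maxDepth - 1) }
    }
    return UINode(role: role, title: title, children: children)
}
#endif
```

On macOS, a root element for a running application can be obtained with AXUIElementCreateApplication(pid) and passed to `snapshot`. Calling `findNodes(in: tree, role: "AXButton", title: "Save")` then returns every button titled Save, regardless of where or how large it is drawn.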

Key Insights

ScreenCaptureKit, introduced in macOS 12.3, replaces the legacy CGWindowListCreateImage API and offers hardware-accelerated capture with per-window filtering.
CMSampleBuffer output from ScreenCaptureKit allows raw pixel buffers to be fed directly to vision models without image format conversion overhead.
Accessibility APIs provide a hierarchical tree of elements whose roles (such as AXButton) and attributes (such as AXTitle) remain stable across visual UI updates.
Swift’s AsyncStream carries the continuous flow of frames out of the capture pipeline, while actor isolation keeps shared state thread-safe.
WhisperKit enables local voice transcription on Apple Silicon, avoiding the privacy risks and network latency of cloud-based audio processing.
Adaptive capture frequency strategies balance real-time UI monitoring during active automation with battery preservation during idle states.
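The AsyncStream and actor pairing from the list above can be illustrated with a small, self-contained sketch. It is an assumption-laden illustration rather than the project's code: `Frame`, `AgentState`, `makeFrameStream`, and `consume` are hypothetical names, and `Frame` stands in for the CMSampleBuffer a real ScreenCaptureKit output callback would deliver. The `.bufferingNewest(1)` policy is one possible choice that drops stale frames when the consumer falls behind.

```swift
/// Hypothetical stand-in for a captured frame.
struct Frame: Sendable {
    let index: Int
}

/// Actor that owns the mutable agent state, so access from
/// concurrent tasks is serialized without manual locking.
actor AgentState {
    private(set) var framesSeen = 0
    private(set) var lastIndex: Int?

    func record(_ frame: Frame) {
        framesSeen += 1
        lastIndex = frame.index
    }
}

/// Bridges a push-style producer into an AsyncStream.
/// In a real pipeline the `yield` call would live inside the
/// capture callback instead of a loop.
func makeFrameStream(count: Int) -> AsyncStream<Frame> {
    AsyncStream(Frame.self, bufferingPolicy: .bufferingNewest(1)) { continuation in
        for i in 0..<count {
            continuation.yield(Frame(index: i))
        }
        continuation.finish()
    }
}

/// Pulls frames until the stream finishes and records each one.
func consume(_ stream: AsyncStream<Frame>, into state: AgentState) async {
    for await frame in stream {
        await state.record(frame)
    }
}
```

Because the buffer keeps only the newest frame, a slow consumer processes the most recent screen state instead of working through a backlog. Whether dropping frames is acceptable depends on the agent; a policy of `.unbounded` would preserve every frame at the cost of memory.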

Practical Applications

Local Voice Transcription: Use WhisperKit on Apple Silicon for low-latency, on-device commands. Pitfall: Using cloud-based services compromises privacy for agents with full screen access.
UI Element Targeting: Query the Accessibility Tree for roles like AXButton. Pitfall: Relying on vision-based OCR coordinates causes failure when UI scale factors change.
Performance Management: Implement adaptive ScreenCaptureKit frequencies. Pitfall: Maintaining high frame rates during idle periods leads to excessive battery drain.
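One way to picture the adaptive capture frequency mentioned above is a small policy function that maps the agent's activity to a target frame rate. This is a sketch only: `AgentActivity`, `targetFramesPerSecond`, and `applyFrameRate` are hypothetical names, and the specific rates are illustrative assumptions, not values taken from Fazm. The ScreenCaptureKit portion compiles only where that framework is available.

```swift
#if canImport(ScreenCaptureKit)
import ScreenCaptureKit
import CoreMedia
#endif

/// Hypothetical activity states for the agent.
enum AgentActivity {
    case automating   // the agent is actively driving the UI
    case listening    // a voice command is in progress or pending
    case idle         // nothing in flight
}

/// Returns a target capture rate in frames per second.
/// The numbers are illustrative assumptions for this sketch.
func targetFramesPerSecond(for activity: AgentActivity, onBattery: Bool) -> Int {
    let base: Int
    switch activity {
    case .automating: base = 10
    case .listening:  base = 2
    case .idle:       base = 1
    }
    // Halve the rate on battery, but never drop below 1 fps.
    return onBattery ? max(1, base / 2) : base
}

#if canImport(ScreenCaptureKit)
/// Applies the chosen rate to a running stream by updating the
/// minimum interval between delivered frames.
@available(macOS 12.3, *)
func applyFrameRate(_ fps: Int,
                    to stream: SCStream,
                    using config: SCStreamConfiguration) async throws {
    config.minimumFrameInterval = CMTime(value: 1, timescale: CMTimeScale(fps))
    try await stream.updateConfiguration(config)
}
#endif
```

Keeping the policy as a pure function separates the battery trade-off from the capture plumbing, which makes the thresholds easy to tune and test without a running stream.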

References:

https://dev.to/m13v/what-we-learned-building-a-macos-ai-agent-in-swift-screencapturekit-accessibility-apis-async-28fb
