Microsoft UFO (UI-Focused Agent) is an open-source, AI-powered agent framework that automates and orchestrates tasks in the Windows desktop environment. Users control and navigate multiple Windows applications through natural language commands. The system uses a multi-agent architecture: a HostAgent coordinates the overall task, and AppAgents operate individual applications by choosing between GUI actions and system API calls, which makes it possible to automate complex workflows that span several programs. UFO aims to change how users interact with Windows by reducing reliance on traditional user interfaces. The latest version, UFO², introduces the AgentOS concept, which integrates multiple agents for advanced task management in a secure, efficient way.
Key Features
Deep integration with Windows OS combining UI Automation (UIA), Win32, and WinCOM for precise control detection and native command execution.
Hybrid action execution that prefers native APIs for speed and robustness, but falls back to GUI clicks and keystrokes when needed.
Speculative Multi-Action bundles several predicted steps into a single LLM call, reducing latency and cutting LLM queries by up to 51%.
Continuous Knowledge Substrate applies retrieval-augmented generation (RAG) over documents, search results, user demonstrations, and execution traces so that agents improve over time.
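The hybrid execution and speculative multi-action ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not UFO's actual code: the `Action` class, `execute_hybrid`, `run_speculative_batch`, and the `state_is_valid` callback are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step an agent wants to perform (hypothetical structure)."""
    name: str
    native_api: Optional[Callable[[], bool]]  # None if the app exposes no API for this step
    gui_fallback: Callable[[], bool]          # simulated click / keystroke path


def execute_hybrid(action: Action) -> str:
    """Prefer the native API; fall back to GUI input if it is missing or fails."""
    if action.native_api is not None:
        try:
            if action.native_api():
                return "api"
        except Exception:
            pass  # any API error is treated as a signal to fall back
    return "gui" if action.gui_fallback() else "failed"


def run_speculative_batch(
    actions: List[Action],
    state_is_valid: Callable[[Action], bool],
) -> List[Tuple[str, str]]:
    """Run a batch of steps predicted by one model call.

    Execution stops at the first step whose precondition no longer holds,
    so the caller can ask the model for a fresh plan from that point.
    """
    results = []
    for action in actions:
        if not state_is_valid(action):
            break
        results.append((action.name, execute_hybrid(action)))
    return results
```

In this sketch the saving comes from requesting the whole batch in one model call and only re-querying when a precondition check fails, rather than querying once per step.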
Use Cases
Automate repetitive and multi-application Windows tasks such as file management, email composition, and data entry.
Seamlessly control and switch between different applications like Outlook, PowerPoint, File Explorer, and more via natural language.
Improve desktop productivity with AI agents that handle complex workflows beyond simple UI automation.
Technical Specifications
Written entirely in Python; runs on Windows 10 and later and requires Python 3.10 or newer.
Incorporates multiple agents including HostAgent (task coordination) and AppAgents (application-specific control with ReAct loops and multimodal perception).
Planned support for sandboxed execution in a Picture-in-Picture virtual desktop (coming soon), which isolates automation and keeps the main workspace free and secure.
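The division of labor between the HostAgent and the AppAgents can be pictured with a minimal sketch. It assumes the plan has already been produced (in a real system a language model would derive it from the user's request), and the `SubTask`, `AppAgent`, and `HostAgent` classes here are simplified stand-ins written for illustration, not UFO's real classes or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubTask:
    """A unit of work assigned to one application (hypothetical structure)."""
    app: str
    instruction: str


class AppAgent:
    """Stand-in for an application-specific agent."""

    def __init__(self, app: str) -> None:
        self.app = app
        self.log: List[str] = []

    def run(self, instruction: str) -> str:
        # A real agent would loop here: observe the UI, reason, then act.
        self.log.append(instruction)
        return f"[{self.app}] done: {instruction}"


class HostAgent:
    """Stand-in for the coordinator that routes subtasks to AppAgents."""

    def __init__(self) -> None:
        self.app_agents: Dict[str, AppAgent] = {}

    def run(self, plan: List[SubTask]) -> List[str]:
        results = []
        for sub in plan:
            # Reuse the agent for an app if one already exists.
            if sub.app not in self.app_agents:
                self.app_agents[sub.app] = AppAgent(sub.app)
            results.append(self.app_agents[sub.app].run(sub.instruction))
        return results


if __name__ == "__main__":
    host = HostAgent()
    plan = [
        SubTask("Outlook", "draft email"),
        SubTask("PowerPoint", "export slides"),
        SubTask("Outlook", "send email"),
    ]
    for line in host.run(plan):
        print(line)
```

Keeping one agent per application lets each agent accumulate its own context (here, the `log` list) while the coordinator only tracks the order of subtasks.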