Achieving 91.4% on Android World: A New Approach to Mobile UI Automation

We recently achieved a state-of-the-art result of 91.4% on Android World, marking a significant leap forward in mobile UI automation. Starting from scratch, but carrying over lessons from our original Droidrun agent, we built a fundamentally different architecture.
The Core Architectural Shift
The key innovation was replacing the traditional planner-executor pattern with a dynamic manager-executor feedback loop:
Old approach: Planner creates complete task list → Executor attempts all tasks sequentially
New approach: Manager → Executor (single action) → Manager → Executor → Manager → Complete
Instead of executing an entire plan blindly, the Manager creates high-level tasks, the Executor takes one concrete action toward the first task, then the Manager immediately reassesses and adjusts the plan based on what actually happened. This tight feedback loop allows for dynamic replanning as the environment changes.
While this requires more Manager invocations, the Executor can use a very fast model for single-action decisions, keeping overall performance reasonable.
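The feedback loop above can be sketched in a few lines of Python. The class and method names here are illustrative stand-ins, not Droidrun's actual API:

```python
# Minimal sketch of the manager-executor feedback loop.
# Manager, Executor, and Env are hypothetical interfaces.
from dataclasses import dataclass

@dataclass
class Plan:
    subgoals: list
    done: int = 0  # index of the next unfinished subgoal

    def is_complete(self) -> bool:
        return self.done >= len(self.subgoals)

    def current_subgoal(self) -> str:
        return self.subgoals[self.done]

def run_episode(manager, executor, env, max_steps=50) -> bool:
    """Alternate between planning and single-action execution."""
    obs = env.observe()
    plan = manager.create_plan(obs)
    for _ in range(max_steps):
        if plan.is_complete():
            return True
        # Executor takes exactly one concrete action toward the first subgoal.
        action = executor.decide(plan.current_subgoal(), obs)
        obs = env.step(action)
        # Manager immediately reassesses against what actually happened.
        plan = manager.revise(plan, action, obs)
    return False
```

Because the Executor only ever decides one action at a time, it can run on a much faster model than the Manager.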
Key Technical Improvements
1. Specialized Text Manipulation Agent
Text handling tasks (editing documents, composing messages, filling forms) were a major weakness. When the Manager identifies a text-intensive task, it marks the subgoal with Text_Task. This triggers routing to a specialized text manipulation agent that:
- Operates with a Python shell and has access to a function that clears and replaces all text atomically
- Receives the accessibility tree (without screenshots) plus the current focused element’s text
- Gets both the current subgoal and full plan as context
- Can programmatically construct and manipulate text before insertion
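A rough sketch of how that agent might build content in its Python shell before a single atomic insertion. `device.replace_text` stands in for the clear-and-replace primitive; the helper and the subgoal handling are assumptions for illustration:

```python
# Hypothetical sketch: construct text programmatically, then insert it
# atomically instead of typing character by character.

def compose_and_insert(device, focused_text: str, subgoal: str) -> str:
    """Build the target text from the focused element's current text."""
    lines = focused_text.splitlines()
    # Example manipulation: append a signature line required by the subgoal.
    if "add signature" in subgoal and (not lines or lines[-1] != "-- Sent from my phone"):
        lines.append("-- Sent from my phone")
    new_text = "\n".join(lines)
    # One atomic call clears the field and replaces all text, avoiding
    # the partial-edit states that plague per-keystroke typing.
    device.replace_text(new_text)
    return new_text
```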
2. Richer Contextual Awareness
Multiple improvements enhanced the agent’s understanding of state:
- Device date injection into the system prompt for temporal reasoning
- Screen stabilization: After each action, we wait 0.5s and compare screen states, repeating until stable or timeout to ensure the UI has fully updated
- Pointer location disabled: We turn off the developer option that shows touch coordinates on screen, reducing visual noise that could confuse the vision model
- Differential state tracking: The most recent user message contains the current screenshot and accessibility tree, while the previous message contains the prior accessibility tree, allowing the agent to observe exactly what changed
- App-specific knowledge: We built an agentic system that automatically extracts information about apps, generating descriptions of what each app does and providing the Manager with concrete knowledge of app capabilities
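The screen-stabilization step can be expressed as a simple polling loop: wait, re-capture, and stop once two consecutive captures match or a timeout expires. `capture_state` here is a hypothetical hook returning any comparable screen representation:

```python
import time

def wait_until_stable(capture_state, interval=0.5, timeout=5.0):
    """Return the first state observed twice in a row, or the last state seen."""
    deadline = time.monotonic() + timeout
    previous = capture_state()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = capture_state()
        if current == previous:
            return current  # UI has settled
        previous = current
    return previous  # timed out; proceed with the latest state anyway
```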
3. Transparent Communication Flow
The Executor now outputs three components for each action: thought process, chosen action, and description. All three are injected into the Manager’s context, ensuring the Manager understands not just what happened but why the Executor chose that action.
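A minimal sketch of that three-part output, assuming a simple dataclass; the field names are our illustration, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class ExecutorStep:
    thought: str      # why the Executor chose this action
    action: str       # the concrete action taken
    description: str  # human-readable summary of the intended effect

    def to_manager_context(self) -> str:
        """Render all three components for injection into the Manager's context."""
        return (f"Executor thought: {self.thought}\n"
                f"Action: {self.action}\n"
                f"Description: {self.description}")
```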
4. Enhanced Memory System
Rather than mentioning memory features once, we scattered guidance throughout the system prompt with repeated context in different sections. Memory is also injected into both the system prompt and the last user message, making it consistently available.
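Sketching the double injection, assuming an OpenAI-style chat message list; the helper itself is hypothetical:

```python
def inject_memory(messages, memory_items):
    """Append the memory block to the system prompt AND the last user message."""
    block = "Remembered facts:\n" + "\n".join(f"- {m}" for m in memory_items)
    out = [dict(m) for m in messages]  # shallow copies; originals untouched
    # System prompt: memory available from the top of context.
    if out and out[0]["role"] == "system":
        out[0]["content"] += "\n\n" + block
    # Last user message: repeat the block so it stays in recent context.
    for msg in reversed(out):
        if msg["role"] == "user":
            msg["content"] += "\n\n" + block
            break
    return out
```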
5. Expanded Action Space
The action primitives were significantly enhanced:
- click(index): Click on a UI element by index
- long_press(index): Long press on a UI element
- type(text, index): Type text into an input field, with index parameter to focus the correct element first
- system_button(button): Press system buttons (back, home, enter)
- swipe(coordinate, coordinate2): Swipe from the first coordinate to the second, e.g. to scroll a list
- open_app(text): Open an application by name
- copy(text): Copy specified text to clipboard
- paste(index, clear): Paste clipboard content into a text field, with option to clear existing text first
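The action space above maps naturally onto a small dispatcher. `device` is a hypothetical wrapper exposing the listed primitives; this is a sketch of the routing, not Droidrun's actual implementation:

```python
def dispatch(device, name: str, **kwargs):
    """Route a named action with its parameters to the device primitive."""
    actions = {
        "click": lambda: device.click(kwargs["index"]),
        "long_press": lambda: device.long_press(kwargs["index"]),
        "type": lambda: device.type(kwargs["text"], kwargs["index"]),
        "system_button": lambda: device.system_button(kwargs["button"]),
        "swipe": lambda: device.swipe(kwargs["coordinate"], kwargs["coordinate2"]),
        "open_app": lambda: device.open_app(kwargs["text"]),
        "copy": lambda: device.copy(kwargs["text"]),
        "paste": lambda: device.paste(kwargs["index"], kwargs.get("clear", False)),
    }
    if name not in actions:
        raise ValueError(f"unknown action: {name}")
    return actions[name]()
```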
6. Systematic Prompt Engineering
The most intensive work involved iterative prompt refinement: understanding model-specific optimization patterns, strategically distributing instructions across the system prompt rather than concentrating them, and extensive trial-and-error on phrasing that actually changed model behavior.
Results
Figure: Success rates of leading AI agents on the 116-task AndroidWorld benchmark (03.10.2025)
Insights
The 91.4% score on Android World demonstrates that mobile UI automation benefits from:
- Tight feedback loops between planning and execution
- Task-specific routing (general executor vs. text manipulation agent)
- Rich observability of state changes between actions
- Dynamic replanning based on actual outcomes rather than assumptions
The architecture trades some latency for dramatically improved reliability, showing that adaptive, context-aware systems can significantly outperform rigid plan-then-execute frameworks in complex, dynamic environments like mobile UIs.