Achieving 91.4% on Android World: A New Approach to Mobile UI Automation

We recently achieved a state-of-the-art result of 91.4% on Android World, marking a significant leap forward in mobile UI automation. Starting from scratch, but carrying over lessons from our original Droidrun agent, we built a fundamentally different architecture.
The Core Architectural Shift
The key innovation was replacing the traditional planner-executor pattern with a dynamic manager-executor feedback loop:
Old approach: Planner creates complete task list → Executor attempts all tasks sequentially
New approach: Manager → Executor (single action) → Manager → Executor → Manager → Complete
Instead of executing an entire plan blindly, the Manager creates high-level tasks, the Executor takes one concrete action toward the first task, then the Manager immediately reassesses and adjusts the plan based on what actually happened. This tight feedback loop allows for dynamic replanning as the environment changes.
While this requires more Manager invocations, the Executor can use a very fast model for single-action decisions, keeping overall performance reasonable.
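The feedback loop above can be sketched in a few lines of Python. The class and method names here are illustrative stand-ins, not Droidrun's actual API:

```python
# Minimal sketch of the manager-executor feedback loop.
# Manager, Executor, and Env are hypothetical interfaces.
from dataclasses import dataclass

@dataclass
class Plan:
    subgoals: list
    done: int = 0  # index of the next unfinished subgoal

    def is_complete(self) -> bool:
        return self.done >= len(self.subgoals)

    def current_subgoal(self) -> str:
        return self.subgoals[self.done]

def run_episode(manager, executor, env, max_steps=50) -> bool:
    """Alternate between planning and single-action execution."""
    obs = env.observe()
    plan = manager.create_plan(obs)
    for _ in range(max_steps):
        if plan.is_complete():
            return True
        # Executor takes exactly one concrete action toward the first subgoal.
        action = executor.decide(plan.current_subgoal(), obs)
        obs = env.step(action)
        # Manager immediately reassesses against what actually happened.
        plan = manager.revise(plan, action, obs)
    return False
```

Because the Executor only ever decides one action at a time, it can run on a much faster model than the Manager.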
Key Technical Improvements
1. Specialized Text Manipulation Agent
Text handling tasks (editing documents, composing messages, filling forms) were a major weakness. When the Manager identifies a text-intensive task, it marks the subgoal with Text_Task. This triggers routing to a specialized text manipulation agent that:
- Operates with a Python shell and has access to a function that clears and replaces all text atomically
- Receives the accessibility tree (without screenshots) plus the current focused element’s text
- Gets both the current subgoal and full plan as context
- Can programmatically construct and manipulate text before insertion
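A rough sketch of how that agent might build content in its Python shell before a single atomic insertion. `device.replace_text` stands in for the clear-and-replace primitive; the helper and the subgoal handling are assumptions for illustration:

```python
# Hypothetical sketch: construct text programmatically, then insert it
# atomically instead of typing character by character.

def compose_and_insert(device, focused_text: str, subgoal: str) -> str:
    """Build the target text from the focused element's current text."""
    lines = focused_text.splitlines()
    # Example manipulation: append a signature line required by the subgoal.
    if "add signature" in subgoal and (not lines or lines[-1] != "-- Sent from my phone"):
        lines.append("-- Sent from my phone")
    new_text = "\n".join(lines)
    # One atomic call clears the field and replaces all text, avoiding
    # the partial-edit states that plague per-keystroke typing.
    device.replace_text(new_text)
    return new_text
```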
2. Richer Contextual Awareness
Multiple improvements enhanced the agent’s understanding of state:
- Device date injection into the system prompt for temporal reasoning
- Screen stabilization: After each action, we wait 0.5s and compare screen states, repeating until stable or timeout to ensure the UI has fully updated
- Pointer location disabled: We turn off the developer option that shows touch coordinates on screen, reducing visual noise that could confuse the vision model
- Differential state tracking: The most recent user message contains the current screenshot and accessibility tree, while the previous message contains the prior accessibility tree, allowing the agent to observe exactly what changed
- App-specific knowledge: We built an agentic system that automatically extracts information about apps, generating descriptions of what each app does and providing the Manager with concrete knowledge of app capabilities
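The screen-stabilization step can be expressed as a simple polling loop: wait, re-capture, and stop once two consecutive captures match or a timeout expires. `capture_state` here is a hypothetical hook returning any comparable screen representation:

```python
import time

def wait_until_stable(capture_state, interval=0.5, timeout=5.0):
    """Return the first state observed twice in a row, or the last state seen."""
    deadline = time.monotonic() + timeout
    previous = capture_state()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = capture_state()
        if current == previous:
            return current  # UI has settled
        previous = current
    return previous  # timed out; proceed with the latest state anyway
```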
3. Transparent Communication Flow
The Executor now outputs three components for each action: thought process, chosen action, and description. All three are injected into the Manager’s context, ensuring the Manager understands not just what happened but why the Executor chose that action.
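A minimal sketch of that three-part output, assuming a simple dataclass; the field names are our illustration, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class ExecutorStep:
    thought: str      # why the Executor chose this action
    action: str       # the concrete action taken
    description: str  # human-readable summary of the intended effect

    def to_manager_context(self) -> str:
        """Render all three components for injection into the Manager's context."""
        return (f"Executor thought: {self.thought}\n"
                f"Action: {self.action}\n"
                f"Description: {self.description}")
```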
4. Enhanced Memory System
Rather than mentioning memory features once, we scattered guidance throughout the system prompt with repeated context in different sections. Memory is also injected into both the system prompt and the last user message, making it consistently available.
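Sketching the double injection, assuming an OpenAI-style chat message list; the helper itself is hypothetical:

```python
def inject_memory(messages, memory_items):
    """Append the memory block to the system prompt AND the last user message."""
    block = "Remembered facts:\n" + "\n".join(f"- {m}" for m in memory_items)
    out = [dict(m) for m in messages]  # shallow copies; originals untouched
    # System prompt: memory available from the top of context.
    if out and out[0]["role"] == "system":
        out[0]["content"] += "\n\n" + block
    # Last user message: repeat the block so it stays in recent context.
    for msg in reversed(out):
        if msg["role"] == "user":
            msg["content"] += "\n\n" + block
            break
    return out
```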
5. Expanded Action Space
The action primitives were significantly enhanced:
- click(index): Click on a UI element by index
- long_press(index): Long press on a UI element
- type(text, index): Type text into an input field, with index parameter to focus the correct element first
- system_button(button): Press system buttons (back, home, enter)
- swipe(coordinate, coordinate2): Swipe from the first coordinate to the second, e.g. to scroll a list
- open_app(text): Open an application by name
- copy(text): Copy specified text to clipboard
- paste(index, clear): Paste clipboard content into a text field, with option to clear existing text first
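The action space above maps naturally onto a small dispatcher. `device` is a hypothetical wrapper exposing the listed primitives; this is a sketch of the routing, not Droidrun's actual implementation:

```python
def dispatch(device, name: str, **kwargs):
    """Route a named action with its parameters to the device primitive."""
    actions = {
        "click": lambda: device.click(kwargs["index"]),
        "long_press": lambda: device.long_press(kwargs["index"]),
        "type": lambda: device.type(kwargs["text"], kwargs["index"]),
        "system_button": lambda: device.system_button(kwargs["button"]),
        "swipe": lambda: device.swipe(kwargs["coordinate"], kwargs["coordinate2"]),
        "open_app": lambda: device.open_app(kwargs["text"]),
        "copy": lambda: device.copy(kwargs["text"]),
        "paste": lambda: device.paste(kwargs["index"], kwargs.get("clear", False)),
    }
    if name not in actions:
        raise ValueError(f"unknown action: {name}")
    return actions[name]()
```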
6. Systematic Prompt Engineering
The most intensive work involved iterative prompt refinement: understanding model-specific optimization patterns, strategically distributing instructions across the system prompt rather than concentrating them, and extensive trial-and-error on phrasing that actually changed model behavior.
Results
Figure: Success rates of leading AI agents on the 116-task AndroidWorld benchmark (03.10.2025)
Insights
The 91.4% score on Android World demonstrates that mobile UI automation benefits from:
- Tight feedback loops between planning and execution
- Task-specific routing (general executor vs. text manipulation agent)
- Rich observability of state changes between actions
- Dynamic replanning based on actual outcomes rather than assumptions
The architecture trades some latency for dramatically improved reliability, showing that adaptive, context-aware systems can significantly outperform rigid plan-then-execute frameworks in complex, dynamic environments like mobile UIs.