We're thrilled to share that Droidrun has reached state-of-the-art performance on the AndroidWorld benchmark, with a remarkable 63.0% success rate across 116 varied Android automation tasks. And the best part? We are completely open source.
The AndroidWorld Challenge
AndroidWorld is a comprehensive benchmark featuring 116 diverse tasks across 20 real-world Android applications. These tasks span multiple categories of mobile automation, from recording audio clips with specific filenames to managing contacts, creating expense entries, taking photos, managing music playlists, and handling calendar events.
The benchmark tests an agent's ability to understand Android interfaces, plan appropriate actions, and execute complex workflows across apps like Audio Recorder, Camera, Contacts, File managers, Simple Calendar, SMS apps, Music players, Drawing apps, Browser automation, Maps, and more.
The Setup Nightmare
Let's be honest: AndroidWorld is very hard to set up and still has some rough edges. We faced significant challenges in setting up the environment that required custom solutions.
We encountered numerous errors while setting up the AndroidWorld environment and had to patch the official Docker image to make it work properly. The original setup had compatibility issues that prevented proper execution. This wasn't just a minor configuration issue; it required deep debugging and custom patches to get a working environment.
Additionally, we needed to manually verify the Data Retrieval tasks because the automated evaluation system had reliability issues and couldn't properly assess certain task completions. This added significant overhead to our evaluation process.
Our Method
We took a different approach than the standard AndroidWorld setup. Instead of using their Accessibility Forwarding Service, we used our own Droidrun implementation with the Droidrun APK for direct access to the Accessibility API. This gave us better performance and reliability.
The agent used Gemini 2.5 Pro as the base model and leveraged the reasoning and reflection capabilities of the Droidrun framework for enhanced decision making and error recovery. This combination proved crucial for handling the complex, multi-step tasks in AndroidWorld.
Our evaluation process involved loading the complete AndroidWorld Task Suite and integrating it with Droidrun, then evaluating performance using AndroidWorld's evaluation scripts with manual verification for edge cases and data retrieval tasks.
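As a rough illustration of the flow described above, the sketch below passes a task suite to an agent through a small adapter and records an automated verdict per task. Every name in it (`BenchmarkTask`, `to_agent_goal`, `run_suite`, the `agent` callable) is a hypothetical stand-in, not the AndroidWorld or Droidrun API, and the toy agent only echoes its goal. Treat it as a sketch of the general shape, not the actual integration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BenchmarkTask:
    """Hypothetical stand-in for one benchmark task (not the AndroidWorld class)."""
    name: str
    goal_template: str
    params: Dict[str, str]
    # Automated success check applied to the final state returned by the agent.
    check: Callable[[Dict[str, str]], bool]


def to_agent_goal(task: BenchmarkTask) -> str:
    """Adapter step: render the task template into a plain-language goal."""
    return task.goal_template.format(**task.params)


def run_suite(tasks: List[BenchmarkTask],
              agent: Callable[[str], Dict[str, str]]) -> Dict[str, bool]:
    """Run every task through the agent and record the automated verdict."""
    verdicts: Dict[str, bool] = {}
    for task in tasks:
        goal = to_agent_goal(task)
        try:
            final_state = agent(goal)
            verdicts[task.name] = task.check(final_state)
        except Exception:
            # A crashed episode counts as a failure instead of aborting the run.
            verdicts[task.name] = False
    return verdicts


def toy_agent(goal: str) -> Dict[str, str]:
    """Toy agent (assumption): echoes the goal back as its 'final state'."""
    return {"last_goal": goal}


demo_tasks = [
    BenchmarkTask(
        name="record_audio",
        goal_template="Record an audio clip and save it as {filename}",
        params={"filename": "memo.m4a"},
        check=lambda state: "memo.m4a" in state["last_goal"],
    ),
    BenchmarkTask(
        name="add_contact",
        goal_template="Create a contact named {contact}",
        params={"contact": "Ada"},
        check=lambda state: state.get("contact_saved") == "Ada",
    ),
]

demo_verdicts = run_suite(demo_tasks, toy_agent)
print(demo_verdicts)
```

Keeping the adapter as a separate function means the benchmark's task format and the agent's input format can change independently, which is the role the custom adapters play in an integration like this.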
We've made our complete setup available as a Docker image for easy reproduction. You can find the full implementation and setup instructions in our droidrun-android-world repository.
The Execution Process
Task Loading & Integration: We loaded the complete AndroidWorld Task Suite and gave it to Droidrun for execution. The integration required custom adapters to bridge AndroidWorld's task format with Droidrun's execution engine.
Automated Evaluation: Tasks were evaluated using AndroidWorld's evaluation scripts, which check for task completion based on specific success criteria and state verification.
Manual Verification: Some tasks required manual verification because they evaluated incorrectly or couldn't be automatically assessed (particularly data retrieval tasks). We manually reviewed execution traces and final states to determine success.
App Setup Handling: Sometimes our agent had to set up the apps independently because the AndroidWorld environment didn't initialize them correctly. This required additional app state management and configuration steps.
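To make the interplay between automated evaluation and manual verification concrete, here is a minimal sketch of how the two sets of verdicts could be merged into a final success rate. The function names, task names, and verdicts are illustrative assumptions; this is not the scoring code used for the reported result, and the numbers are toy data.

```python
from typing import Dict


def final_verdicts(automated: Dict[str, bool],
                   manual: Dict[str, bool]) -> Dict[str, bool]:
    """Merge verdicts: a manual review wins wherever a reviewer recorded one."""
    unknown = set(manual) - set(automated)
    if unknown:
        # Guard against typos in the manual review sheet.
        raise KeyError(f"manual verdicts for unknown tasks: {sorted(unknown)}")
    return {name: manual.get(name, auto) for name, auto in automated.items()}


def success_rate(verdicts: Dict[str, bool]) -> float:
    """Fraction of tasks judged successful."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts.values()) / len(verdicts)


# Toy data (assumption): the automated check missed one data retrieval task.
automated = {
    "record_audio": True,
    "read_calendar_event": False,
    "add_expense": False,
    "take_photo": True,
}
manual = {"read_calendar_event": True}

merged = final_verdicts(automated, manual)
rate = success_rate(merged)
print(f"{rate:.1%}")
```

Storing manual overrides separately from the automated verdicts keeps an audit trail of which results were changed by a human reviewer.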
The Results
Figure: AndroidWorld Benchmark Results. Success rates of leading AI agents on the 116-task AndroidWorld benchmark (2024-2025).
What's Next?
We're not stopping here! Our achievement on AndroidWorld demonstrates the potential of autonomous mobile agents, but there's still much work to be done. The mobile automation space is rapidly evolving, and we're committed to pushing the boundaries further.
Remember, these results were achieved with our current setup. Imagine what's possible as we continue to improve and optimize Droidrun! The fact that we're fully open source means the entire community can benefit from and contribute to these advances.