SPA-BENCH: A Comprehensive Benchmark for Smartphone Agent Evaluation

Derek Yuen1 *, Jingxuan Chen1 *, Bin Xie2, Yuhao Yang1, Gongwei Chen2, Zhihao Wu1, Yixing Li2, Xurui Zhou2, Weiwen Liu1, Shuai Wang1, Kaiwen Zhou1, Rui Shao2 †, Liqiang Nie2, Yasheng Wang1, Jianye Hao1, Jun Wang3, Kun Shao1 †
1 Huawei Noah's Ark Lab
2 Harbin Institute of Technology, Shenzhen
3 University College London

* Equal Contribution

† Corresponding authors: shaorui@hit.edu.cn, shaokun2@huawei.com

Comparison to Existing Benchmarks:

| Dataset | Third-party app? | Cross-app? | Chinese app? | Difficulty level? | Number of tasks | Number of agents | Number of metrics | Free of hand-crafted validation? | Information for success detection |
|---|---|---|---|---|---|---|---|---|---|
| AndroidArena | | | | | 221 | 1 | 4 | | Action only |
| AndroidWorld | | | | | 116 | 3 | 1 | | State only |
| LlamaTouch | | | | | 495 | 4 | 1 | | State only |
| B-MoCA | | | | | 60 | 3 | 1 | | State only |
| MobileAgentBench | | | | | 100 | 5 | 6 | | Action and State |
| SPA-Bench | ✓ | ✓ | ✓ | ✓ | 340 | 11 | 7 | ✓ | Action and State |

Abstract

Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-BENCH, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-BENCH offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.

📋 Diverse and Realistic Task Design

Key Features:

  • 📦 340 Tasks - 300 Single-app Tasks and 40 Cross-app Tasks
  • 🌐 66 Apps – 52 Third-party Apps, 7 Google Apps and 7 System Apps
  • 🌍 2 Languages – Chinese and English Apps
  • 📊 Increased Difficulty Levels
  • 🎨 Human Annotated Trajectories & Key Components (see the example task entry after this list)
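For illustration, a single-app task entry could look like the sketch below. This is a hypothetical Python representation, not the benchmark's actual schema; the field names are assumptions based on the features listed above (app, language, difficulty level, human-annotated reference trajectory, and key components).

```python
# Hypothetical single-app task entry (illustrative field names, toy values).
example_task = {
    "task_id": "single_app_en_001",            # made-up identifier
    "app": "Calendar",                          # system app example
    "language": "en",                           # "en" or "zh"
    "difficulty": 2,                            # higher level = longer / stricter task
    "instruction": "Create an event called 'Team sync' tomorrow at 10:00.",
    "max_steps": 15,                            # step budget before Maximum Steps Reached
    "reference_trajectory": [                   # human-annotated action sequence
        "open Calendar",
        "tap the create-event button",
        "enter the title 'Team sync'",
        "set the date and time",
        "save the event",
    ],
    "key_components": ["Team sync", "10:00"],   # strings expected on the final screen
}

if __name__ == "__main__":
    print(f"{example_task['task_id']}: {example_task['instruction']}")
```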

🤖 Plug-and-Play Agent Framework

Key Features:

  • 🧠 11 Smartphone Agents Ready for Evaluation
  • 🧩 Easy Integration of Your Own Agents with Minimal Code Changes (see the adapter sketch after this list)
  • 📱 Scalable Design – Multi-device support & Emulator Compatibility
  • 📸 Android Snapshot – Local Environment Setup and Data Reset for Consistent Testing
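What plugging in a new agent could look like, as a minimal sketch: an adapter that maps the current screen observation to a device action at each step. The class, field, and method names below are illustrative assumptions, not the framework's actual API.

```python
# Illustrative agent adapter; class and method names are assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Observation:
    screenshot_path: str        # path to the current screen capture
    ui_xml: str                 # accessibility / view-hierarchy dump
    step_index: int = 0


@dataclass
class Action:
    kind: str                   # e.g. "tap", "type", "swipe", "finish"
    argument: dict = field(default_factory=dict)


class BaseAgent(ABC):
    """Adapter that every integrated agent would implement."""

    @abstractmethod
    def step(self, instruction: str, obs: Observation) -> Action:
        """Return the next device action for the given task instruction."""


class MyAgent(BaseAgent):
    """Toy agent: a real one would call an (M)LLM with the screenshot and UI tree."""

    def step(self, instruction: str, obs: Observation) -> Action:
        return Action(kind="finish")    # terminate immediately (placeholder logic)


if __name__ == "__main__":
    agent = MyAgent()
    obs = Observation(screenshot_path="screen_0.png", ui_xml="<hierarchy/>")
    print(agent.step("Open Settings and enable Wi-Fi.", obs))
```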

An overview of the agent framework using a multi-processing architecture. Each worker process connects an agent to an Android emulator, and they interact multiple times throughout the task (i.e., step 3 is repeated) until completion. The emulators are reset after the agent has executed all assigned tasks.
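A rough sketch of that worker loop using Python multiprocessing, under the same caveat: every device helper below is a stub standing in for real ADB or emulator calls, and none of the names come from the framework itself.

```python
# Sketch of one worker process pairing an agent with one emulator.
# All device helpers are stubs; the real framework drives adb / emulators.
import queue as queue_module
from multiprocessing import Process, Queue


def capture_screenshot(device_serial: str) -> bytes:
    return b""                              # stub: e.g. screencap via adb for this device


def execute_action(device_serial: str, action: dict) -> None:
    pass                                    # stub: send a tap / type / swipe to the device


def reset_snapshot(device_serial: str) -> None:
    pass                                    # stub: reload a saved emulator snapshot


def decide_next_action(instruction: str, screenshot: bytes) -> dict:
    return {"kind": "finish"}               # stub: a real agent would query an (M)LLM


def run_worker(device_serial: str, tasks: Queue) -> None:
    while True:
        try:
            task = tasks.get_nowait()
        except queue_module.Empty:
            break
        for _ in range(task["max_steps"]):              # the agent/device step repeats
            screenshot = capture_screenshot(device_serial)
            action = decide_next_action(task["instruction"], screenshot)
            if action["kind"] == "finish":
                break
            execute_action(device_serial, action)
    reset_snapshot(device_serial)                        # reset after all assigned tasks


if __name__ == "__main__":
    task_queue: Queue = Queue()
    task_queue.put({"instruction": "Open Settings and enable Wi-Fi.", "max_steps": 10})
    workers = [Process(target=run_worker, args=(serial, task_queue))
               for serial in ("emulator-5554", "emulator-5556")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```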

✅ Automatic and Scalable Evaluation Pipeline

Key Features:

  • 🔍 7 Evaluation Metrics for a Comprehensive Analysis
  • 📐 Coarse-and-Fine Success Detection Pipeline – Requires No Further Human Effort (sketched after this list)
  • 🔀 Trajectory Splitting & Subtask Evaluation – Tailored for Long-Sequence Tasks
  • 🏆 Single-app Tasks – Success Detection F1-scores of 0.926 (English) and 0.884 (Chinese)
  • 🌟 Cross-app Tasks – Success Detection F1-scores of 0.833 (English) and 0.857 (Chinese)
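As a rough illustration of the coarse-and-fine idea (a sketch under stated assumptions, not the actual pipeline): a cheap coarse filter first checks whether the annotated key components appear in the text of the final screens, and only trajectories that pass are forwarded to an MLLM judge.

```python
# Rough coarse-to-fine success check. Coarse stage: key-component matching
# over OCR / UI text of the final screens. Fine stage: an (M)LLM judgement
# over the whole trajectory, left as a stub because the real prompts and
# pipeline are not shown here.
from typing import List


def coarse_check(final_screen_text: str, key_components: List[str]) -> bool:
    """Pass only if every annotated key component appears in the final screens."""
    text = final_screen_text.lower()
    return all(component.lower() in text for component in key_components)


def fine_check(instruction: str, trajectory_description: str) -> bool:
    """Stub for an MLLM verdict on the full trajectory (screens + actions)."""
    # A real implementation would send screenshots and actions to an MLLM
    # and parse a yes/no answer; this placeholder simply returns False.
    return False


def detect_success(instruction: str,
                   final_screen_text: str,
                   trajectory_description: str,
                   key_components: List[str]) -> bool:
    if not coarse_check(final_screen_text, key_components):
        return False                      # cheap rejection, no MLLM call needed
    return fine_check(instruction, trajectory_description)


if __name__ == "__main__":
    ok = detect_success(
        "Create an event called 'Team sync' at 10:00.",
        final_screen_text="Saved event: Team sync, 10:00",
        trajectory_description="opened Calendar, created event, saved",
        key_components=["Team sync", "10:00"],
    )
    print("success" if ok else "not verified")
```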

Results

| Agent | Success Rate | Mean Step Ratio on Success | SRC | MSR | Error | Premature | Overdue | Mean Exec Time per Step (sec) | Mean Token Cost per Step (USD) |
|---|---|---|---|---|---|---|---|---|---|
| Off-the-Shelf Model (GPT-4o) | | | | | | | | | |
| AppAgent | 0.340 | 1.33 | 0.327 | 0.507 | 0.166 | 0.347 | 0.197 | 26.5 | 0.014 |
| AutoDroid | 0.327 | 1.10 | 0.593 | 0.340 | 0.067 | 0.494 | 0.078 | 34.0 | 0.008 |
| MobileAgent | 0.387 | 1.24 | 0.367 | 0.633 | 0 | 0.109 | 0.095 | 27.1 | 0.053 |
| MobileAgentV2 | 0.433 | 1.05 | 0.580 | 0.420 | 0 | 0.333 | 0.111 | 56.1 | 0.067 |
| M3A | 0.640 | 0.92 | 0.847 | 0.153 | 0 | 0.244 | 0 | 19.3 | 0.092 |
| T3A | 0.487 | 1.04 | 0.707 | 0.293 | 0 | 0.368 | 0.136 | 9.6 | 0.116 |
| SeeAct | 0.393 | 1.60 | 0.200 | 0.773 | 0.027 | 0.100 | 0.276 | 41.2 | 0.046 |
| Fine-tuned Model | | | | | | | | | |
| Auto-UI | 0.013 | 1.50 | 0.060 | 0.940 | 0 | 1.000 | 0.015 | - | - |
| CogAgent | 0.020 | 1.67 | 0.147 | 0.820 | 0.033 | 1.000 | 0.024 | - | - |
| DigiRL | 0.020 | 1.52 | 0.227 | 0.607 | 0.166 | 0.971 | 0.022 | - | - |
| OdysseyAgent | 0.053 | 2.00 | 0 | 1.000 | 0 | - | 0.013 | - | - |

Task performance on single-app English tasks. All rates are reported as proportions in [0, 1]. SRC, MSR, and Error are termination reasons, where SRC and MSR refer to Self-Reported Completion and Maximum Steps Reached; Premature and Overdue measure termination inaccuracy. The execution time and token costs of the last four agents are omitted because they use locally hosted open-source models.
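As a reading aid for the columns above, the snippet below shows one plausible way per-task logs could be aggregated into such summary figures (success rate, mean step ratio on successful runs, termination breakdown). The record format and the step-ratio definition are assumptions for illustration, not the benchmark's evaluation code.

```python
# Illustrative aggregation of per-task logs into the table's summary columns.
# The record fields are assumptions; "step_ratio" is taken here to be the
# agent's step count divided by the human-annotated step count for that task.
from statistics import mean

records = [   # one entry per task run (toy data, not real results)
    {"success": True,  "step_ratio": 1.1, "termination": "SRC"},
    {"success": False, "step_ratio": 2.0, "termination": "MSR"},
    {"success": True,  "step_ratio": 0.9, "termination": "SRC"},
]

success_rate = mean(r["success"] for r in records)
mean_step_ratio_on_success = mean(r["step_ratio"] for r in records if r["success"])
termination_share = {
    reason: sum(r["termination"] == reason for r in records) / len(records)
    for reason in ("SRC", "MSR", "Error")
}

print(f"Success rate: {success_rate:.3f}")
print(f"Mean step ratio on success: {mean_step_ratio_on_success:.2f}")
print(f"Termination breakdown: {termination_share}")
```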

For full results and more details, please refer to our paper.

BibTeX


@inproceedings{chen2024spa,
  title={SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation},
  author={Chen*, Jingxuan and Yuen*, Derek and Xie, Bin and Yang, Yuhao and Chen, Gongwei and Wu, Zhihao and Li, Yixing and Zhou, Xurui and Liu, Weiwen and Wang, Shuai and Shao, Rui and Nie, Liqiang and Wang, Yasheng and Hao, Jianye and Wang, Jun and Shao, Kun},
  booktitle={NeurIPS 2024 Workshop on Open-World Agents},
  year={2024}
}