Comparison to Existing Benchmarks:

Dataset | Third-party app? | Cross-app? | Chinese app? | Difficulty level? | Number of tasks | Number of agents | Number of metrics | Free of hand-crafted validation? | Information for success detection |
---|---|---|---|---|---|---|---|---|---|
AndroidArena | ✗ | ✓ | ✗ | ✗ | 221 | 1 | 4 | ✗ | Action only |
AndroidWorld | ✓ | ✓ | ✗ | ✓ | 116 | 3 | 1 | ✗ | State only |
LlamaTouch | ✓ | ✗ | ✗ | ✓ | 495 | 4 | 1 | ✗ | State only |
B-MoCA | ✗ | ✗ | ✗ | ✗ | 60 | 3 | 1 | ✗ | State only |
MobileAgentBench | ✗ | ✗ | ✗ | ✓ | 100 | 5 | 6 | ✗ | Action and State |
SPA-Bench | ✓ | ✓ | ✓ | ✓ | 340 | 11 | 7 | ✓ | Action and State |