A benchmarking framework designed to evaluate 'unified' digital agents across cross-domain, long-horizon tasks involving software engineering, GUI automation, and deep research.
Defensibility
citations: 0
co_authors: 32
CocoaBench enters a crowded but fragmented field of agent evaluations. Its primary value proposition is the 'unified' aspect: testing whether an agent can pivot from writing code to browsing the web to manipulating a GUI within a single, coherent workflow. This goes a step beyond domain-specific benchmarks such as SWE-bench (coding) or WebArena (browsing).

The quantitative signals (0 stars but 32 forks in 3 days) strongly suggest a 'hot' paper release, with research labs cloning the repo to test their models even before any social media push. The defensibility of a benchmark is purely social and academic: it survives only if frontier labs use it to brag about their scores in technical reports. The moat is shallow, however, because benchmarks face 'saturation': as models improve, the tasks become trivial, necessitating a 'v2'.

Compared to GAIA or MLE-bench (OpenAI), CocoaBench has a niche in cross-capability integration. The high displacement risk stems from the fact that frontier labs (Anthropic with 'Computer Use', OpenAI with 'Operator') are building internal evaluation suites that they may release as the new industry standards, potentially sidelining independent academic benchmarks.
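As a rough illustration only (the task schema, class names, and scoring rule below are assumptions, not CocoaBench's actual format), a 'unified' cross-domain task could be modeled as an ordered chain of coding, browsing, and GUI steps that is scored all-or-nothing, which is what makes such benchmarks long-horizon and hard to saturate with single-domain agents:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: not CocoaBench's real API or data format.
# It models one workflow spanning code, web, and GUI domains,
# scored only if every step's verifier passes in order.

@dataclass
class Step:
    domain: str          # "code", "web", or "gui"
    instruction: str     # natural-language goal for this step
    check: callable      # verifier run against the agent's trace

@dataclass
class UnifiedTask:
    task_id: str
    steps: list = field(default_factory=list)

    def score(self, agent_trace: dict) -> float:
        # All-or-nothing, order-sensitive scoring: partial credit is withheld,
        # so the agent must integrate capabilities across the whole horizon.
        for step in self.steps:
            if not step.check(agent_trace.get(step.domain, {})):
                return 0.0
        return 1.0

# Example: patch a bug, research the changelog, then deploy via a GUI.
task = UnifiedTask(
    task_id="demo-001",
    steps=[
        Step("code", "Patch the failing unit test",
             lambda t: t.get("tests_pass", False)),
        Step("web", "Find the breaking change in the upstream changelog",
             lambda t: "changelog_url" in t),
        Step("gui", "Trigger the deploy button in the dashboard",
             lambda t: t.get("clicked") == "Deploy"),
    ],
)

print(task.score({
    "code": {"tests_pass": True},
    "web": {"changelog_url": "https://example.com/changelog"},
    "gui": {"clicked": "Deploy"},
}))  # -> 1.0
```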
TECH STACK
INTEGRATION: reference_implementation
READINESS