A structure-agnostic benchmark (CLI-Tool-Bench) designed to evaluate the capability of LLM agents to generate complete CLI applications from scratch ('0-to-1') using end-to-end black-box validation rather than predefined scaffolds.
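To make the evaluation protocol concrete, below is a minimal sketch of what black-box end-to-end validation of an agent-generated CLI tool could look like: each task supplies example invocations with expected stdout and exit codes, and the harness observes only the tool's external behaviour, never its repository structure. The spec layout, field names, and the `wordcount.py` entry point are illustrative assumptions, not CLI-Tool-Bench's actual format.

```python
# Minimal sketch of black-box E2E validation for an agent-generated CLI tool.
# Task names, spec layout, and the entry-point convention are assumptions for
# illustration, not the benchmark's real interface.
import subprocess
from dataclasses import dataclass, field


@dataclass
class CliTestCase:
    args: list[str]              # command-line arguments for one invocation
    stdin: str = ""              # text piped to the tool's stdin
    expected_stdout: str = ""    # exact stdout the tool must produce
    expected_exit_code: int = 0  # exit code the tool must return


@dataclass
class TaskSpec:
    entry_point: list[str]                      # e.g. ["python", "tool.py"] (assumed convention)
    cases: list[CliTestCase] = field(default_factory=list)


def run_black_box(spec: TaskSpec, timeout_s: float = 10.0) -> float:
    """Run every test case against the generated tool and return the pass rate.

    Only observable behaviour (stdout, exit code) is checked; the internal
    structure of the generated repository is never inspected.
    """
    passed = 0
    for case in spec.cases:
        try:
            result = subprocess.run(
                spec.entry_point + case.args,
                input=case.stdin,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a hang counts as a failed case
        if (result.returncode == case.expected_exit_code
                and result.stdout == case.expected_stdout):
            passed += 1
    return passed / len(spec.cases) if spec.cases else 0.0


if __name__ == "__main__":
    # Hypothetical task: a word-count CLI the agent was asked to build from scratch.
    spec = TaskSpec(
        entry_point=["python", "wordcount.py"],
        cases=[CliTestCase(args=["-"], stdin="one two three\n", expected_stdout="3\n")],
    )
    print(f"pass rate: {run_black_box(spec):.0%}")
```

Checking only stdout and exit codes is what makes the harness structure-agnostic: the agent is free to organise the generated repository however it likes, as long as the resulting tool behaves correctly end to end.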
Defensibility
citations: 0
co_authors: 5
CLI-Tool-Bench addresses a specific gap in current LLM evaluation: the transition from snippet-level coding (HumanEval) or bug-fixing (SWE-bench) to greenfield repository construction. Its defensibility is currently low (Score: 3) because, like most benchmarks, its value depends entirely on community adoption and 'SOTA-chasing' by frontier labs. With 0 stars but 5 forks within 9 days of the paper's release, it shows typical initial academic engagement. The 'moat' for a benchmark is purely social: if top labs start reporting CLI-Tool-Bench scores in their technical reports, it becomes infrastructure-grade; otherwise, it remains a niche research artifact. It faces competition from more established benchmarks such as SWE-bench and BigCodeBench, though its focus on '0-to-1' generation and black-box E2E testing gives it a unique angle. Platform-domination risk is medium: frontier labs (OpenAI/Anthropic) build these capabilities in-house but rely on third-party benchmarks for objective validation, and they tend to pivot to whichever benchmark accumulates the most academic citations.
TECH STACK
INTEGRATION: reference_implementation
READINESS