Identifies and filters out contaminated or unfair test cases in the SWE-bench benchmark to improve evaluation integrity for coding agents.
Defensibility
Stars: 1
Bench-cleanser addresses a real pain point in current AI research: the reliability of the SWE-bench benchmark. SWE-bench is currently the gold standard for evaluating autonomous coding agents, but it suffers from 'unfair' tests whose evaluation hinges on specific implementation details rather than functional correctness. However, with only 1 star and no forks after a month, the project has zero market traction. Frontier labs such as OpenAI and Anthropic, which rely heavily on SWE-bench for their technical reports, already perform similar cleaning and deduplication internally to keep their results robust. Furthermore, the official SWE-bench maintainers (Princeton and the University of Chicago) are the natural owners of this functionality; any significant cleaning logic would likely be absorbed into the main swe-bench repository or into the curated SWE-bench Verified subset. The project is essentially a niche utility script: easily reproduced, with no moat and no community support.
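To make the 'unfair test' failure mode concrete, here is a minimal sketch of the kind of heuristic such a cleaner could run: it scans each instance's test patch for assertions that pin exact string literals, a common way a test ends up checking implementation details instead of behavior. The dataset name and fields follow the public princeton-nlp/SWE-bench dataset on HuggingFace; the regex heuristic and the suspicious_instances helper are illustrative assumptions, not Bench-cleanser's actual code.

```python
# Sketch: flag SWE-bench instances whose added tests assert exact string
# literals, e.g.  assert str(err) == "unexpected keyword 'foo'"
# (a test like this checks an implementation detail, not behavior).
# Dataset/field names follow princeton-nlp/SWE-bench on HuggingFace;
# the heuristic and helper are illustrative, not Bench-cleanser's code.
import re

from datasets import load_dataset

# Assertions that compare against a string literal.
EXACT_STRING_ASSERT = re.compile(r'\bassert\b.*==\s*["\']')

def suspicious_instances(split: str = "test"):
    ds = load_dataset("princeton-nlp/SWE-bench", split=split)
    for inst in ds:
        patch = inst["test_patch"]  # diff that adds the FAIL_TO_PASS tests
        # Inspect only lines the patch adds (skip '+++' file headers).
        added = [l[1:] for l in patch.splitlines()
                 if l.startswith("+") and not l.startswith("+++")]
        hits = [l.strip() for l in added if EXACT_STRING_ASSERT.search(l)]
        if hits:
            yield inst["instance_id"], hits

if __name__ == "__main__":
    for instance_id, hits in suspicious_instances():
        print(instance_id)
        for line in hits[:3]:
            print("    " + line)
```

A real cleaner would need to combine several such signals, since many exact-string assertions are perfectly fair; that curation burden is part of why this work tends to end up with the benchmark maintainers, as it did with SWE-bench Verified.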
TECH STACK
INTEGRATION: cli_tool
READINESS