Identifies and filters out contaminated or unfair test cases in the SWE-bench benchmark to improve evaluation integrity for coding agents.
Defensibility
Stars: 1
Bench-cleanser addresses a real pain point in current AI research: the reliability of the SWE-bench benchmark. SWE-bench is currently the gold standard for evaluating autonomous coding agents, but it suffers from 'unfair' tests whose evaluation hinges on specific implementation details rather than functional correctness. However, with only 1 star and no forks after a month, the project has zero market traction. Frontier labs such as OpenAI and Anthropic, which rely heavily on SWE-bench for their technical reports, already perform similar cleaning and deduplication internally to keep their results robust. Furthermore, the official SWE-bench maintainers (Princeton and the University of Chicago) are the natural owners of this functionality; any significant cleaning logic would likely be absorbed into the main swe-bench repository or into the curated SWE-bench Verified subset. The project is essentially a niche utility script: easily reproduced, with no moat and no community support.
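To make the 'unfair test' failure mode concrete, here is a minimal sketch of the kind of heuristic such a cleaner could run: it scans each instance's test patch for assertions that pin exact string literals, a common way a test ends up checking implementation details instead of behavior. The dataset name and fields follow the public princeton-nlp/SWE-bench dataset on HuggingFace; the regex heuristic and the suspicious_instances helper are illustrative assumptions, not Bench-cleanser's actual code.

```python
# Sketch: flag SWE-bench instances whose added tests assert exact string
# literals, e.g.  assert str(err) == "unexpected keyword 'foo'"
# (a test like this checks an implementation detail, not behavior).
# Dataset/field names follow princeton-nlp/SWE-bench on HuggingFace;
# the heuristic and helper are illustrative, not Bench-cleanser's code.
import re

from datasets import load_dataset

# Assertions that compare against a string literal.
EXACT_STRING_ASSERT = re.compile(r'\bassert\b.*==\s*["\']')

def suspicious_instances(split: str = "test"):
    ds = load_dataset("princeton-nlp/SWE-bench", split=split)
    for inst in ds:
        patch = inst["test_patch"]  # diff that adds the FAIL_TO_PASS tests
        # Inspect only lines the patch adds (skip '+++' file headers).
        added = [l[1:] for l in patch.splitlines()
                 if l.startswith("+") and not l.startswith("+++")]
        hits = [l.strip() for l in added if EXACT_STRING_ASSERT.search(l)]
        if hits:
            yield inst["instance_id"], hits

if __name__ == "__main__":
    for instance_id, hits in suspicious_instances():
        print(instance_id)
        for line in hits[:3]:
            print("    " + line)
```

A real cleaner would need to combine several such signals, since many exact-string assertions are perfectly fair; that curation burden is part of why this work tends to end up with the benchmark maintainers, as it did with SWE-bench Verified.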
TECH STACK
INTEGRATION: cli_tool
READINESS