Expert-curated evaluation benchmark for AI agents performing complex STEM tasks in physics, biology, chemistry, and mathematics.
Defensibility
citations: 0
co_authors: 23
COMPOSITE-STEM addresses the critical 'benchmark saturation' problem: frontier models such as o1 and Claude 3.5 now score near the ceiling of existing STEM evals like GPQA and MATH. Its moat rests entirely on the high cost of the doctoral-level human labor required to generate and verify its 70 tasks. The unusual pattern of 0 stars but 23 forks within 7 days strongly suggests institutional distribution, likely through a research lab or a specialized competition (e.g., a NeurIPS or frontier-lab challenge), rather than organic open-source adoption. While difficult to create, the project is highly vulnerable to frontier-lab 'data-eating': labs such as OpenAI and Anthropic are aggressively building internal, private versions of exactly this type of benchmark to avoid data contamination. As an open-source project, its primary value is as a snapshot in time for academic comparison, but it faces a high risk of being superseded by more comprehensive, platform-integrated leaderboards, or of becoming contaminated by future model training runs, within 12-18 months.
TECH STACK
INTEGRATION: reference_implementation
READINESS