Expert-curated evaluation benchmark for AI agents performing complex STEM tasks in physics, biology, chemistry, and mathematics.
Defensibility
citations: 0
co_authors: 23
COMPOSITE-STEM addresses the critical 'benchmark saturation' problem: frontier models such as o1 and Claude 3.5 now score near the ceiling of existing STEM evals like GPQA and MATH. Its moat rests entirely on the high cost of the doctoral-level human labor required to generate and verify its 70 tasks. The unusual pattern of 0 stars but 23 forks within 7 days strongly suggests institutional distribution, likely through a research lab or a specialized competition (e.g., a NeurIPS or frontier-lab challenge), rather than organic open-source adoption. While difficult to create, the project is highly vulnerable to frontier-lab 'data-eating': labs such as OpenAI and Anthropic are aggressively building internal, private versions of exactly this type of benchmark to avoid data contamination. As an open-source project, its primary value is as a snapshot in time for academic comparison, but it faces a high risk of being superseded by more comprehensive, platform-integrated leaderboards, or of becoming contaminated by future model training runs, within 12-18 months.
TECH STACK
INTEGRATION: reference_implementation
READINESS