A benchmarking suite designed to measure cognitive abilities in AI models, specifically focusing on 'Novel Schema Acquisition' (learning new rules on the fly) and 'Override-and-Plan' (executive function and error correction).
Defensibility
stars
0
The project is a hackathon entry (Google DeepMind AGI Hackathon) with zero stars, forks, or community traction. While 'contamination-safe' benchmarks address a critical and valid niche in the LLM evaluation space, the project currently lacks the 'prestige' moat a benchmark needs to succeed: in AI evaluation, a benchmark's value is derived entirely from adoption by researchers and inclusion in model release reports (e.g., ARC-AGI, GSM8K, MMLU). Frontier labs like OpenAI and DeepMind are aggressively developing their own internal 'unseen' benchmarks to combat data contamination. Without a major push for community adoption or validation from major labs, this project remains a personal experiment/prototype. The 'Override-and-Plan' task is a clever way to test executive function, but it is easily replicable by any lab with a prompt-engineering team; its survival depends entirely on whether these specific tasks become a standard for measuring AGI, which is unlikely given the current lack of momentum.
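To illustrate how little engineering such a task requires, here is a minimal sketch of an 'Override-and-Plan'-style eval item: the model is taught a novel in-context rule, the rule is corrected mid-task, and the grader passes the response only if the corrected rule was applied. The `OverridePlanItem` structure, the '#' operator rule, and the grading logic are illustrative assumptions, not the project's actual task format.

```python
# Hypothetical sketch of an 'Override-and-Plan'-style item.
# All names and rule details are illustrative assumptions, not the project's tasks.

from dataclasses import dataclass


@dataclass
class OverridePlanItem:
    """One eval item: teach a novel rule, then override it mid-task."""
    novel_rule: str   # rule introduced for the first time in-context
    override: str     # mid-task correction the model must adopt
    query: str        # final question, answered under the *overridden* rule
    expected: str     # gold answer if the override was applied


def build_prompt(item: OverridePlanItem) -> str:
    """Assemble the prompt shown to the model under test."""
    return (
        f"New rule: {item.novel_rule}\n"
        f"Correction: {item.override}\n"
        f"Question: {item.query}\n"
        "Answer with the result only."
    )


def grade(item: OverridePlanItem, model_answer: str) -> bool:
    """Pass only if the model followed the corrected rule, not the stale one."""
    return model_answer.strip().lower() == item.expected.strip().lower()


if __name__ == "__main__":
    item = OverridePlanItem(
        novel_rule="The operator '#' adds its operands and then doubles the sum.",
        override="Ignore the doubling step from now on; '#' is plain addition.",
        query="What is 3 # 4?",
        expected="7",  # an answer of 14 would mean the model kept the stale rule
    )
    print(build_prompt(item))
    print("pass:", grade(item, "7"))
```

A handful of such items, plus a scoring loop over model completions, is roughly the scope of work a prompt-engineering team would need to reproduce the task, which is why the moat here is adoption rather than implementation difficulty.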
TECH STACK
INTEGRATION
reference_implementation
READINESS