A benchmarking suite designed to measure cognitive abilities in AI models, specifically focusing on 'Novel Schema Acquisition' (learning new rules on the fly) and 'Override-and-Plan' (executive function and error correction).
Defensibility
stars
0
The project is a hackathon entry (Google DeepMind AGI Hackathon) with zero stars, forks, or community traction. While 'contamination-safe' benchmarks address a critical and valid niche in the LLM evaluation space, the project currently lacks the 'prestige' moat a benchmark needs to succeed: in AI evaluation, a benchmark's value is derived entirely from adoption by researchers and inclusion in model release reports (e.g., ARC-AGI, GSM8K, MMLU). Frontier labs like OpenAI and DeepMind are aggressively developing their own internal 'unseen' benchmarks to combat data contamination. Without a major push for community adoption or validation from major labs, this project remains a personal experiment/prototype. The 'Override-and-Plan' task is a clever way to test executive function, but it is easily replicable by any lab with a prompt-engineering team; its survival depends entirely on whether these specific tasks become a standard for measuring AGI, which is unlikely given the current lack of momentum.
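To illustrate how little engineering such a task requires, here is a minimal sketch of an 'Override-and-Plan'-style eval item: the model is taught a novel in-context rule, the rule is corrected mid-task, and the grader passes the response only if the corrected rule was applied. The `OverridePlanItem` structure, the '#' operator rule, and the grading logic are illustrative assumptions, not the project's actual task format.

```python
# Hypothetical sketch of an 'Override-and-Plan'-style item.
# All names and rule details are illustrative assumptions, not the project's tasks.

from dataclasses import dataclass


@dataclass
class OverridePlanItem:
    """One eval item: teach a novel rule, then override it mid-task."""
    novel_rule: str   # rule introduced for the first time in-context
    override: str     # mid-task correction the model must adopt
    query: str        # final question, answered under the *overridden* rule
    expected: str     # gold answer if the override was applied


def build_prompt(item: OverridePlanItem) -> str:
    """Assemble the prompt shown to the model under test."""
    return (
        f"New rule: {item.novel_rule}\n"
        f"Correction: {item.override}\n"
        f"Question: {item.query}\n"
        "Answer with the result only."
    )


def grade(item: OverridePlanItem, model_answer: str) -> bool:
    """Pass only if the model followed the corrected rule, not the stale one."""
    return model_answer.strip().lower() == item.expected.strip().lower()


if __name__ == "__main__":
    item = OverridePlanItem(
        novel_rule="The operator '#' adds its operands and then doubles the sum.",
        override="Ignore the doubling step from now on; '#' is plain addition.",
        query="What is 3 # 4?",
        expected="7",  # an answer of 14 would mean the model kept the stale rule
    )
    print(build_prompt(item))
    print("pass:", grade(item, "7"))
```

A handful of such items, plus a scoring loop over model completions, is roughly the scope of work a prompt-engineering team would need to reproduce the task, which is why the moat here is adoption rather than implementation difficulty.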
TECH STACK
INTEGRATION
reference_implementation
READINESS