Benchmark suite for evaluating visual spatial reasoning and maze-solving capabilities in multimodal LLMs vs. textual brute-forcing.
stars: 2
forks: 0
This is a research-oriented evaluation set associated with an arXiv paper. With only 2 stars and 110 maze samples, it functions as a narrow experimental artifact rather than a robust tool. Frontier labs develop much larger internal benchmarks for spatial reasoning; the project's value lies in its specific inquiry into visual vs. token-space reasoning, but it lacks the scale and community adoption needed to become a standard.
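The "textual brute-forcing" baseline this benchmark contrasts against can be illustrated with a classic graph search. The repository's actual maze encoding and solver are not specified here, so the following is a minimal sketch under assumed conventions: a grid maze serialized as text with `#` for walls, `S` for the start, and `E` for the exit, solved exhaustively by breadth-first search in token space rather than by visual reasoning.

```python
from collections import deque

def solve_maze(maze_str):
    """Breadth-first search over a text-encoded grid maze.

    Assumed encoding (hypothetical, not the benchmark's actual format):
    '#' = wall, 'S' = start, 'E' = exit, anything else = open cell.
    Returns the shortest path as a list of (row, col) tuples, or None.
    """
    grid = maze_str.strip().split("\n")
    start = end = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                end = (r, c)

    # Standard BFS: explore cells level by level, tracking the path taken.
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == end:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), [*path, (nr, nc)]))
    return None  # no route from S to E
```

Because BFS enumerates cells mechanically, it solves any well-formed maze without spatial insight, which is precisely why it makes a strong control against multimodal models that must "see" the maze.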
TECH STACK
INTEGRATION
reference_implementation
READINESS