Benchmark suite for evaluating visual spatial reasoning and maze-solving capabilities in multimodal LLMs vs. textual brute-forcing.
stars: 2
forks: 0
This is a research-oriented evaluation set associated with an arXiv paper. With only 2 stars and 110 maze samples, it functions as a narrow experimental artifact rather than a robust tool. Frontier labs develop much larger internal benchmarks for spatial reasoning; the project's value lies in its specific inquiry into visual vs. token-space reasoning, but it lacks the scale and community adoption needed to become a standard.
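The "textual brute-forcing" baseline this benchmark contrasts against can be illustrated with a classic graph search. The repository's actual maze encoding and solver are not specified here, so the following is a minimal sketch under assumed conventions: a grid maze serialized as text with `#` for walls, `S` for the start, and `E` for the exit, solved exhaustively by breadth-first search in token space rather than by visual reasoning.

```python
from collections import deque

def solve_maze(maze_str):
    """Breadth-first search over a text-encoded grid maze.

    Assumed encoding (hypothetical, not the benchmark's actual format):
    '#' = wall, 'S' = start, 'E' = exit, anything else = open cell.
    Returns the shortest path as a list of (row, col) tuples, or None.
    """
    grid = maze_str.strip().split("\n")
    start = end = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                end = (r, c)

    # Standard BFS: explore cells level by level, tracking the path taken.
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == end:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), [*path, (nr, nc)]))
    return None  # no route from S to E
```

Because BFS enumerates cells mechanically, it solves any well-formed maze without spatial insight, which is precisely why it makes a strong control against multimodal models that must "see" the maze.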
TECH STACK
INTEGRATION
reference_implementation
READINESS