A checklist-based benchmark (MARCA) to evaluate multilingual (English/Portuguese) LLM performance on web-based information seeking: searching, selecting evidence, and synthesizing answers.
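To make the checklist idea concrete, the sketch below shows one plausible way a checklist-style task record and weighted score could be represented. The schema, field names, and scoring rule are illustrative assumptions, not MARCA's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    criterion: str        # e.g. "answer cites at least one Portuguese-language source"
    weight: float = 1.0   # relative importance of this criterion
    passed: bool = False  # set by a human annotator or an automated judge

@dataclass
class TaskResult:
    question: str
    language: str                                      # "en" or "pt"
    items: list[ChecklistItem] = field(default_factory=list)

    def score(self) -> float:
        """Weighted fraction of checklist criteria satisfied, in [0, 1]."""
        total = sum(i.weight for i in self.items)
        earned = sum(i.weight for i in self.items if i.passed)
        return earned / total if total else 0.0
```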
Defensibility
Citations: 0
Quantitative signals show extremely early maturity: ~0 stars, 9 forks, ~0 activity/hrs, and an age of ~2 days. Fork count without star velocity is a weak adoption indicator, often reflecting early sharing, template forking, or researcher experiments rather than sustained community pull. With no production traction signals (stars, velocity, documented usage) and a benchmark/paper framing, defensibility is limited.

Defensibility score (3/10): MARCA is a benchmark specification, useful but primarily an evaluation artifact. Defensibility would rely on (a) unique dataset/annotation quality, (b) strong ongoing maintenance and community adoption, and (c) a scoring/checklist methodology that becomes a de facto standard. From the provided information we only know it is bilingual (English/Portuguese) and checklist-based; there is no evidence of a large-scale dataset release, standard scorer tooling, continuous integration, or adoption by model providers. In benchmark markets, assets are typically easy to replicate once the rubric and evaluation harness are described: the code and data can be cloned, and model vendors can simply add benchmark-like evaluation internally.

Frontier risk (high): Frontier labs (OpenAI/Anthropic/Google) already build evaluation suites for web browsing, tool use, and multilinguality. A benchmark centered on multilingual web information seeking with evidence selection is highly adjacent to capabilities these labs want to measure and improve. Even if they do not build MARCA specifically, they can absorb the core idea of checklist-style evidence evaluation into their internal eval pipelines or create an equivalent benchmark quickly.

Three-axis threat profile:

1) Platform domination risk: HIGH. Large platforms can (and typically do) internalize benchmark logic by integrating rubric-driven evals into their evaluation harnesses. Likely competitors and adjacent projects include:
- General web-browsing / agent benchmarks (e.g., Toolformer-style agent evaluations and web-browsing benchmarks from major academic/industry efforts)
- Multilingual evaluation suites (various multilingual QA/search evals)
- Evidence-grounding / retrieval-augmented QA evaluations
Since MARCA is a checklist-based benchmark rather than proprietary infrastructure, a platform can replicate it as part of a larger eval suite; if MARCA becomes popular, platform teams can also publish their own variants quickly.

2) Market consolidation risk: MEDIUM. Benchmark ecosystems tend to consolidate around a few widely adopted leaderboards, but consolidation is not guaranteed because benchmarks serve different tasks (web search vs. general QA, multilingual specifics, rubric types). MARCA could either remain niche (a Portuguese-focused bilingual slice) or be absorbed into a broader multilingual web-search eval umbrella.

3) Displacement horizon: 1-2 years (high probability). Given benchmark replicability and the pace of LLM evaluation development, MARCA could be displaced within 1-2 years by either (a) a broader multilingual web-search benchmark from a major lab or community leader, or (b) internal platform eval suites that make external reproduction less relevant.

Opportunities: If the paper defines a genuinely novel checklist rubric and provides high-quality, well-documented bilingual evidence/answer sets (with robust scoring and clear failure modes), MARCA could become a useful standard for evaluating multilingual web information seeking.
The 9 forks suggest at least early researcher interest; turning that into (i) public dataset artifacts, (ii) a maintained eval harness, and (iii) leaderboard adoption could increase switching costs.

Key risks:
(1) Replicability: benchmarks are easy to clone if the scoring rubric is public and the dataset is not uniquely curated or locked.
(2) Lack of adoption signals: no stars and no velocity indicate limited community traction so far.
(3) Platform absorption: frontier labs can implement equivalent checklists internally.

Key factors that would raise defensibility (not observed yet from the provided info): a published dataset with clear licensing and ownership, a reference implementation with automated scoring (a minimal sketch of such a scorer follows below), demonstrated leaderboard uptake across multiple model families, and a measurable performance correlation that others rely on to set research direction.
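As noted above, a reference implementation with automated scoring is one factor that could raise defensibility. The sketch below shows one plausible shape for such a scorer over a simple dict-based task format; the check types, field names, and example task are assumptions for illustration and do not reflect MARCA's actual data or tooling.

```python
import re

def check_criterion(answer: str, criterion: dict) -> bool:
    """Apply one automatable checklist criterion to a model answer."""
    kind = criterion["type"]
    if kind == "must_mention":                       # answer must contain a phrase
        return criterion["text"].lower() in answer.lower()
    if kind == "must_cite_url":                      # answer must cite at least one URL
        return re.search(r"https?://\S+", answer) is not None
    return False                                     # unknown check types fail closed

def score_task(task: dict, answer: str) -> float:
    """Fraction of a task's checklist criteria that the answer satisfies."""
    results = [check_criterion(answer, c) for c in task["checklist"]]
    return sum(results) / len(results) if results else 0.0

if __name__ == "__main__":
    # Hypothetical bilingual task; not drawn from the MARCA dataset.
    task = {
        "question": "Quando foi fundada a cidade de São Paulo?",
        "checklist": [
            {"type": "must_mention", "text": "1554"},
            {"type": "must_cite_url"},
        ],
    }
    answer = "São Paulo foi fundada em 1554. Fonte: https://pt.wikipedia.org/wiki/São_Paulo"
    print(score_task(task, answer))  # -> 1.0
```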
TECH STACK
INTEGRATION: reference_implementation
READINESS