A framework for generating synthetic, contamination-free benchmarks designed to evaluate LLMs on multi-step function-calling tasks.
Stars: 4
Forks: 0
FuncBenchGen is essentially a research artifact supporting an ICLR 2026 paper submission. While its methodology for creating 'contamination-free' benchmarks is academically relevant given the data-leakage issues in LLM training, the project lacks any significant adoption or developer moat. With only 4 stars and 0 forks after six months, it functions as a reference implementation rather than a living tool. It faces intense competition from established leaderboards such as the Berkeley Function Calling Leaderboard (BFCL) and ToolBench, which have significantly higher community gravity and data diversity. Frontier labs (OpenAI, Anthropic) have also internalized function-calling evaluations as core product features, making a third-party synthetic generator redundant for anyone but researchers. The project's value lies in its methodology, which can easily be replicated or absorbed into more comprehensive evaluation platforms like Weights & Biases or LangSmith.
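To make the "easily replicated methodology" claim concrete, here is a minimal Python sketch of how a contamination-free, multi-step function-calling benchmark could be generated and scored. Everything below (the names make_tool_dag and valid_trace, the DAG structure, the scoring rule) is an illustrative assumption, not FuncBenchGen's actual API.

```python
"""Sketch: generate freshly named tools wired into a dependency DAG,
then score a model's call trace against that DAG. Because the tool
names are random strings, they cannot appear in any pretraining
corpus -- the core of the 'contamination-free' idea."""

import random
import string


def make_tool_dag(n_tools: int, seed: int = 0) -> dict[str, list[str]]:
    """Build a random DAG mapping each tool to the earlier tools whose
    outputs it consumes. Insertion order is a valid topological order
    by construction."""
    rng = random.Random(seed)
    names = [
        "tool_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        for _ in range(n_tools)
    ]
    deps: dict[str, list[str]] = {}
    for i, name in enumerate(names):
        # Each tool may depend on the outputs of up to two earlier tools.
        k = rng.randint(0, min(2, i))
        deps[name] = rng.sample(names[:i], k)
    return deps


def valid_trace(deps: dict[str, list[str]], trace: list[str]) -> bool:
    """A trace passes iff every call names a known tool and all of that
    tool's dependencies were called first (the multi-step constraint)."""
    seen: set[str] = set()
    for call in trace:
        if call not in deps or any(d not in seen for d in deps[call]):
            return False
        seen.add(call)
    return True


if __name__ == "__main__":
    dag = make_tool_dag(n_tools=5, seed=42)
    print("dependency DAG:", dag)
    # Calling tools in generation order respects every dependency.
    print("in-order trace valid?", valid_trace(dag, list(dag)))
    # A shuffled trace usually violates at least one dependency.
    shuffled = list(dag)
    random.Random(1).shuffle(shuffled)
    print("shuffled trace valid?", valid_trace(dag, shuffled))
```

The two halves mirror what any such benchmark needs: a generator whose artifacts provably cannot leak from training data, and a deterministic checker that scores whether a model sequenced its tool calls correctly. A full harness would also render the DAG into natural-language tool schemas and task prompts for the model under test.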
TECH STACK
INTEGRATION: reference_implementation
READINESS