CORE FUNCTION

A dynamic evaluation framework for Code LLMs designed to mitigate data contamination by generating novel, structurally varied programming tasks and test cases.

TRACTION

stars

236

0.0 velocity

forks

0.0 velocity

REASONING

DyCodeEval addresses a critical 'crisis of trust' in LLM evaluation: data leakage. With 236 stars and an ICML 2025 acceptance, it carries significant academic weight and early community adoption. Its primary moat is the specific methodology for generating dynamic test cases that bypass the memorization typical of static benchmarks like HumanEval or MBPP. However, as a research artifact, its defensibility is limited by the fast-moving nature of the field. Competitors like LiveCodeBench (which uses real-time contest data) and SWE-bench (which uses GitHub issues) solve the same problem via different vectors. Frontier labs (OpenAI, Anthropic) are highly likely to implement similar 'dynamic' or 'metamorphic' testing internally to validate their own models, potentially making public benchmarks less relevant for internal development but still vital for independent auditing. The project's displacement risk is moderate because benchmarking standards shift every 12-18 months as models saturate existing tasks. The low velocity suggests it is currently a static research release rather than a living software product, which lowers its long-term defensibility against more actively maintained evaluation platforms.

COMPOSABILITY

TECH STACK

PythonPyTorchTransformersOpenAI APICodeLLM backends

INTEGRATION

cli_tool

code_evaluationcontamination_detectiondynamic_benchmarkingllm_reasoning_analysis

READINESS

Composabilityframework

Depthreference_implementation

Noveltynovel_combination