Perturbation-based diagnostic framework for quantifying data leakage and memorization in Code LLMs across multiple benchmarks.
Defensibility
citations: 0
co_authors: 6
The project addresses a critical bottleneck in LLM development: data contamination. As models are increasingly trained on vast swaths of the internet (including GitHub), distinguishing 'reasoning' from 'memorization' is vital for trust. With 6 forks in just 2 days despite 0 stars, there is clear academic and technical interest in the methodology.

Defensibility is low (3) because the moat is purely the research methodology and the specific set of 19 benchmarks; the code itself is a reference implementation of the paper's findings rather than a production-grade tool. Frontier labs face medium risk here: while they care deeply about evaluation, they typically build proprietary, internal-only contamination-detection pipelines on raw training-data logs, which are more accurate than external perturbation-based methods. The tool is therefore most valuable to third-party auditors and open-source model developers.

The primary risk is displacement by more robust membership-inference or unlearning techniques, currently an active area of research (e.g., Min-K% Prob). It will likely remain an influential paper and benchmark suite rather than a standalone software product.
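The underlying idea is easy to illustrate. Below is a minimal Python sketch of a perturbation-based memorization probe; it is a hypothetical illustration, not the repository's actual pipeline. The names `score_fn`, `rename_identifiers`, and `memorization_gap` are assumptions: `score_fn` stands in for any per-item model score, such as mean token log-likelihood under the model being audited.

```python
import random
import re
from typing import Callable

# Tokens we should not rename in this toy perturbation (incomplete on purpose).
PY_KEYWORDS = {
    "def", "return", "if", "elif", "else", "for", "while", "in",
    "and", "or", "not", "None", "True", "False", "import", "from",
}

def rename_identifiers(code: str, rng: random.Random) -> str:
    """Semantics-preserving perturbation: consistently rename identifiers.

    Sketch only; a real implementation would rename via an AST to avoid
    touching string literals, attributes, or builtins.
    """
    names = [
        n for n in sorted(set(re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code)))
        if n not in PY_KEYWORDS
    ]
    rng.shuffle(names)  # vary the mapping across perturbations
    mapping = {n: f"v{i}" for i, n in enumerate(names)}
    return re.sub(
        r"\b[A-Za-z_][A-Za-z0-9_]*\b",
        lambda m: mapping.get(m.group(0), m.group(0)),
        code,
    )

def memorization_gap(
    item: str,
    score_fn: Callable[[str], float],  # e.g., mean token log-likelihood
    n_perturbations: int = 5,
    seed: int = 0,
) -> float:
    """Score drop from the original item to semantics-preserving variants.

    A gap near zero suggests the model generalizes; a large positive gap
    suggests the original surface form was memorized (likely contamination).
    """
    rng = random.Random(seed)
    original = score_fn(item)
    perturbed = [
        score_fn(rename_identifiers(item, rng)) for _ in range(n_perturbations)
    ]
    return original - sum(perturbed) / len(perturbed)
```

In practice, `score_fn` could wrap a local causal LM's per-token log-likelihood. Membership-inference baselines such as Min-K% Prob differ in that they compute statistics over the lowest-probability tokens of the unperturbed item, needing no perturbation at all, which is one reason they are cited above as a displacement risk.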
TECH STACK
INTEGRATION: reference_implementation
READINESS