Detects whether a given data sample was present in an LLM's training set (data contamination) by analyzing the model's internal hidden states rather than output probabilities alone.
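The exact DICE method is not spelled out here, but the core idea — that hidden states carry a membership signal that output probabilities miss — can be sketched as a linear probe trained to separate hidden states of known training members from non-members. Everything below (function names, the synthetic Gaussian "hidden states") is illustrative, not DICE's actual implementation.

```python
import numpy as np

def train_membership_probe(member_states, nonmember_states, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (plain gradient descent) that
    separates hidden states of training-set members from non-members."""
    X = np.vstack([member_states, nonmember_states])
    y = np.concatenate([np.ones(len(member_states)),
                        np.zeros(len(nonmember_states))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # dL/dlogits for log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def contamination_score(hidden_state, w, b):
    """Probe's probability that a sample's hidden state looks 'seen in training'."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w + b)))

# Toy stand-in for real model activations: members and non-members
# drawn from two separated Gaussians in an 8-dim "hidden" space.
rng = np.random.default_rng(0)
members = rng.normal(+1.0, 0.5, size=(100, 8))
nonmembers = rng.normal(-1.0, 0.5, size=(100, 8))
w, b = train_membership_probe(members, nonmembers)
print(contamination_score(rng.normal(+1.0, 0.5, size=8), w, b))
```

In practice the probe would be fit on activations extracted from a specific layer of the model under audit; the choice of layer and the need for labeled member/non-member calibration data are the method's main practical constraints.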
Defensibility
Stars: 11
DICE originates from the prestigious THU-KEG lab at Tsinghua University, which lends it academic credibility. As an open-source project, however, it functions purely as a static research artifact: with only 11 stars and zero forks over nearly two years, it has attracted no developer community or industry adoption. The project is effectively a 'ghost repo' with zero velocity, which in the rapidly evolving LLM space makes it largely legacy code.

From a competitive standpoint, training-data attribution and contamination detection are frontier problems that major labs (OpenAI, Google, Anthropic) address natively, with access to actual training logs and larger-scale internal-state analysis. Newer techniques such as Min-K% Prob scoring or 'goldfish'-style memorization mitigation offer more efficient or more robust alternatives.

Defensibility is near zero: the core logic is a specific algorithmic approach that any competent ML engineer could reimplement in a few hours from the associated paper. There is no moat, no data gravity, and no community lock-in. For an investor or user, this is a reference point for a specific technique rather than a viable tool or platform.
TECH STACK
INTEGRATION: reference_implementation
READINESS