Detects whether a given data sample was present in an LLM's training set (data contamination) by analyzing the model's internal hidden states rather than output probabilities alone.
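The exact DICE method is not spelled out here, but the core idea — that hidden states carry a membership signal that output probabilities miss — can be sketched as a linear probe trained to separate hidden states of known training members from non-members. Everything below (function names, the synthetic Gaussian "hidden states") is illustrative, not DICE's actual implementation.

```python
import numpy as np

def train_membership_probe(member_states, nonmember_states, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (plain gradient descent) that
    separates hidden states of training-set members from non-members."""
    X = np.vstack([member_states, nonmember_states])
    y = np.concatenate([np.ones(len(member_states)),
                        np.zeros(len(nonmember_states))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # dL/dlogits for log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def contamination_score(hidden_state, w, b):
    """Probe's probability that a sample's hidden state looks 'seen in training'."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ w + b)))

# Toy stand-in for real model activations: members and non-members
# drawn from two separated Gaussians in an 8-dim "hidden" space.
rng = np.random.default_rng(0)
members = rng.normal(+1.0, 0.5, size=(100, 8))
nonmembers = rng.normal(-1.0, 0.5, size=(100, 8))
w, b = train_membership_probe(members, nonmembers)
print(contamination_score(rng.normal(+1.0, 0.5, size=8), w, b))
```

In practice the probe would be fit on activations extracted from a specific layer of the model under audit; the choice of layer and the need for labeled member/non-member calibration data are the method's main practical constraints.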
Defensibility
Stars: 11
DICE originates from the prestigious THU-KEG lab at Tsinghua University, which lends it academic credibility. As an open-source project, however, it functions purely as a static research artifact: with only 11 stars and zero forks over nearly two years, it has attracted no developer community or industry adoption. The project is effectively a 'ghost repo' with zero velocity, which in the rapidly evolving LLM space makes it largely legacy code.

From a competitive standpoint, training-data attribution and contamination detection are frontier problems that major labs (OpenAI, Google, Anthropic) address natively, with access to actual training logs and larger-scale internal-state analysis. Newer techniques such as Min-K% Prob scoring or 'goldfish'-style memorization mitigation offer more efficient or more robust alternatives.

Defensibility is near zero: the core logic is a specific algorithmic approach that any competent ML engineer could reimplement in a few hours from the associated paper. There is no moat, no data gravity, and no community lock-in. For an investor or user, this is a reference point for a specific technique rather than a viable tool or platform.
TECH STACK
INTEGRATION: reference_implementation
READINESS