Provides a methodology and implementation for measuring the 'depth' of unlearning in LLMs, using activation patching to identify the layers at which internal representations of supposedly deleted information persist.
Defensibility
stars: 2
The project addresses the critical 'unlearning' problem in LLMs—ensuring that sensitive or copyrighted data is truly removed rather than just masked. It uses activation patching, a standard technique in mechanistic interpretability, to quantify how much a model still 'knows' about a concept at different layer depths. While the application to unlearning is timely, the project lacks any structural defensibility. It is a fresh repository (0 days old, 2 stars) with no community, likely serving as a code supplement for a research paper. The logic is easily reproducible by any ML engineer familiar with TransformerLens or nnsight. Frontier labs (OpenAI, Anthropic) are the primary stakeholders for unlearning and are actively developing their own internal auditing tools; they are likely to implement similar or superior diagnostics as part of their safety and alignment pipelines. There is no moat here beyond the initial research insight, and the displacement horizon is very short due to the high velocity of the AI safety and unlearning research fields.
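To make the mechanic concrete, below is a minimal sketch of the kind of layer-wise activation-patching diagnostic the analysis describes, written with TransformerLens. It patches an "unlearned" checkpoint's residual stream into the original model one layer at a time and records how much probability the original model's later layers can still recover for the deleted answer. The model names, prompt, answer token, and patching granularity are illustrative assumptions, not the repository's actual code.

```python
# Sketch only: layer-wise activation patching to probe where a "deleted"
# fact still persists. Model names, prompt, and answer are placeholders.
import torch
from transformer_lens import HookedTransformer, utils

base = HookedTransformer.from_pretrained("gpt2")        # original model
unlearned = HookedTransformer.from_pretrained("gpt2")   # stand-in; in practice, load the post-unlearning checkpoint

prompt = "The Eiffel Tower is located in the city of"
answer = " Paris"                                        # assumes a single-token answer

tokens = base.to_tokens(prompt)
answer_id = base.to_single_token(answer)

# Cache the unlearned model's residual stream on the same prompt.
_, unlearned_cache = unlearned.run_with_cache(tokens)

def make_patch_hook(layer):
    cached = unlearned_cache[utils.get_act_name("resid_post", layer)]
    def hook(resid, hook):
        # Overwrite the base model's residual stream at this layer (all
        # positions) with the unlearned model's cached residual.
        return cached
    return hook

for layer in range(base.cfg.n_layers):
    hook_name = utils.get_act_name("resid_post", layer)
    logits = base.run_with_hooks(
        tokens,
        fwd_hooks=[(hook_name, make_patch_hook(layer))],
    )
    # Log-prob of the 'deleted' answer when the base model's downstream
    # layers read only the unlearned model's layer-`layer` representation.
    logprob = torch.log_softmax(logits[0, -1], dim=-1)[answer_id].item()
    print(f"layer {layer:2d}: answer log-prob {logprob:.3f}")
```

In a readout like this, layers at which the answer remains highly probable mark depths where the representation of the deleted fact still persists; a sharp drop indicates the depth at which unlearning actually took effect.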
TECH STACK
INTEGRATION
algorithm_implementable
READINESS