Provides a methodology and implementation for measuring the 'depth' of unlearning in LLMs, using activation patching to identify the layers at which internal representations of supposedly deleted information persist.
Defensibility
stars: 2
The project addresses the critical 'unlearning' problem in LLMs—ensuring that sensitive or copyrighted data is truly removed rather than just masked. It uses activation patching, a standard technique in mechanistic interpretability, to quantify how much a model still 'knows' about a concept at different layer depths. While the application to unlearning is timely, the project lacks any structural defensibility. It is a fresh repository (0 days old, 2 stars) with no community, likely serving as a code supplement for a research paper. The logic is easily reproducible by any ML engineer familiar with TransformerLens or nnsight. Frontier labs (OpenAI, Anthropic) are the primary stakeholders for unlearning and are actively developing their own internal auditing tools; they are likely to implement similar or superior diagnostics as part of their safety and alignment pipelines. There is no moat here beyond the initial research insight, and the displacement horizon is very short due to the high velocity of the AI safety and unlearning research fields.
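To make the mechanic concrete, below is a minimal sketch of the kind of layer-wise activation-patching diagnostic the analysis describes, written with TransformerLens. It patches an "unlearned" checkpoint's residual stream into the original model one layer at a time and records how much probability the original model's later layers can still recover for the deleted answer. The model names, prompt, answer token, and patching granularity are illustrative assumptions, not the repository's actual code.

```python
# Sketch only: layer-wise activation patching to probe where a "deleted"
# fact still persists. Model names, prompt, and answer are placeholders.
import torch
from transformer_lens import HookedTransformer, utils

base = HookedTransformer.from_pretrained("gpt2")        # original model
unlearned = HookedTransformer.from_pretrained("gpt2")   # stand-in; in practice, load the post-unlearning checkpoint

prompt = "The Eiffel Tower is located in the city of"
answer = " Paris"                                        # assumes a single-token answer

tokens = base.to_tokens(prompt)
answer_id = base.to_single_token(answer)

# Cache the unlearned model's residual stream on the same prompt.
_, unlearned_cache = unlearned.run_with_cache(tokens)

def make_patch_hook(layer):
    cached = unlearned_cache[utils.get_act_name("resid_post", layer)]
    def hook(resid, hook):
        # Overwrite the base model's residual stream at this layer (all
        # positions) with the unlearned model's cached residual.
        return cached
    return hook

for layer in range(base.cfg.n_layers):
    hook_name = utils.get_act_name("resid_post", layer)
    logits = base.run_with_hooks(
        tokens,
        fwd_hooks=[(hook_name, make_patch_hook(layer))],
    )
    # Log-prob of the 'deleted' answer when the base model's downstream
    # layers read only the unlearned model's layer-`layer` representation.
    logprob = torch.log_softmax(logits[0, -1], dim=-1)[answer_id].item()
    print(f"layer {layer:2d}: answer log-prob {logprob:.3f}")
```

In a readout like this, layers at which the answer remains highly probable mark depths where the representation of the deleted fact still persists; a sharp drop indicates the depth at which unlearning actually took effect.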
TECH STACK
INTEGRATION
algorithm_implementable
READINESS