Method for injecting stealthy backdoors into LLM weights by compiling activation steering vectors into model parameters using null-space constraints to preserve normal behavior on non-trigger inputs.
Defensibility
citations: 0
co_authors: 11
The project introduces a technical method for converting dynamic activation steering (normally applied at runtime) into permanent weight modifications, i.e. backdoors. The null-space constraint is the key technical differentiator: it ensures the edits change the model's behavior only in the presence of specific triggers while remaining invisible under standard benchmarking. With 0 stars and 11 forks just 3 days after release, the project shows early interest from the research community (likely peer reviewers or researchers in mechanistic interpretability), but it remains a research artifact rather than a product. Its defensibility is low because the technique, once published, is easily replicated by any safety researcher or adversary. Frontier labs such as Anthropic or OpenAI are the primary 'competitors' or users here; they are unlikely to build this as a product but will almost certainly integrate the findings into their red-teaming and safety-alignment pipelines (Frontier Risk: Medium). Conceptually, the project competes with model-editing techniques like ROME and MEMIT, but it targets the security/vulnerability angle rather than factual updates. Displacement is likely within 1-2 years as more sophisticated weight-editing or unlearning techniques emerge.
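The core idea can be illustrated with a minimal linear-algebra sketch (the repository's actual implementation is not shown here; all names and shapes below are illustrative assumptions). A rank-1 weight edit is built so that it vanishes on the span of benign (non-trigger) activations but injects a steering vector when the trigger activation appears:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Hypothetical stand-ins: columns of A are activations on benign (non-trigger)
# inputs, t is the trigger activation, v is the steering vector to compile in.
A = rng.standard_normal((d, 20))
t = rng.standard_normal(d)
v = rng.standard_normal(d)

# Projector onto the orthogonal complement (null space) of the benign activations.
P = np.eye(d) - A @ np.linalg.pinv(A)

Pt = P @ t
delta_W = np.outer(v, Pt) / (t @ Pt)  # rank-1 weight edit

# Benign behavior preserved: delta_W @ a == 0 for any a in span(A).
assert np.allclose(delta_W @ A, 0.0, atol=1e-8)
# Trigger fires: delta_W @ t == v, so W + delta_W adds the steering vector.
assert np.allclose(delta_W @ t, v, atol=1e-8)
```

Adding `delta_W` to a layer's weight matrix thus leaves outputs on the sampled benign distribution untouched (the null-space constraint) while steering the residual stream by `v` whenever the trigger direction is present, which is what makes the backdoor invisible to standard benchmarks.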
TECH STACK
INTEGRATION: reference_implementation
READINESS