A defense mechanism for Small Language Models (SLMs) that detects and mitigates jailbreak attacks by monitoring internal token activation patterns rather than surface-level text.
citations: 0
co_authors: 4
The project is a fresh research implementation (12 days old, 0 stars) associated with an arXiv paper. While the focus on SLM-specific internal representations is timely, the technique of using hidden states for safety is an established research direction. Frontier labs and model providers (Microsoft, Google) are actively building native safety guardrails and refusal training into their SLMs (Phi, Gemma), making standalone activation-based defense tools highly susceptible to being superseded by platform-level features.
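The activation-based defense described above can be illustrated with a minimal sketch: mean-pool a layer's per-token hidden states and score them with a linear probe trained offline on labeled prompts. This is a generic illustration of the research direction, not the project's actual method; the function names, probe parameters, and threshold are all hypothetical.

```python
import numpy as np

def jailbreak_score(hidden_states: np.ndarray,
                    probe_direction: np.ndarray,
                    bias: float = 0.0) -> float:
    """Score a prompt by projecting mean-pooled hidden states onto a probe direction.

    hidden_states:   (seq_len, d_model) activations from one transformer layer.
    probe_direction: (d_model,) weights of a linear probe trained offline
                     on activations from labeled benign/jailbreak prompts.
    Returns a sigmoid score in (0, 1); higher = more jailbreak-like.
    """
    pooled = hidden_states.mean(axis=0)      # mean-pool over the token axis
    logit = pooled @ probe_direction + bias  # linear probe on the pooled vector
    return float(1.0 / (1.0 + np.exp(-logit)))

def should_refuse(hidden_states: np.ndarray,
                  probe_direction: np.ndarray,
                  bias: float = 0.0,
                  threshold: float = 0.5) -> bool:
    """Flag the prompt for refusal when the probe score crosses a threshold."""
    return jailbreak_score(hidden_states, probe_direction, bias) >= threshold
```

In a real deployment the hidden states would come from a forward pass of the SLM (e.g. via output_hidden_states in a transformer library), and the probe would be fit on activations from known benign and adversarial prompts; the key point is that the decision uses internal representations rather than the surface text.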
TECH STACK
INTEGRATION: reference_implementation
READINESS