Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

arXivarX

Research code and methodology for implementing backdoor attacks in Reinforcement Learning with Verifiable Rewards (RLVR) systems, demonstrating how poisoning data can subvert LLM reasoning despite objective reward signals.

View on arXiv

Defensibility

2.0/10

citations

co_authors

Platform Dominationlow

Market Consolidationlow

Displacement Horizon6 months

REASONING

This project is a classic research artifact demonstrating a new vulnerability in a high-interest area (RLVR, popularized by models like DeepSeek-R1 and OpenAI o1). While the quantitative signals are low (0 stars, though 6 forks suggest some initial developer/researcher interest), the value lies in the conceptual breakthrough: showing that deterministic verifiers (like code compilers or math evaluators) do not provide immunity against data poisoning during the RL phase. As a defensible project, it scores low (2) because it is a discovery/exploit rather than a tool with a moat. Frontier labs (OpenAI, Anthropic) are unlikely to compete with this; rather, they are the 'customers' of this research, as they must build defenses against such vulnerabilities. The primary risk to the project's relevance is the rapid pace of AI safety research; newer alignment techniques (like constitutional AI or more robust RLHF) could make these specific backdoor methods obsolete within months. It is functionally similar to previous works like 'BadNet' or 'Sleeper Agents' but specifically tailored to the emerging RL-for-reasoning paradigm.

COMPOSABILITY

TECH STACK

PythonPyTorchReinforcement LearningLLM Fine-tuning

INTEGRATION

reference_implementation

adversarial_attackrl_safetyllm_jailbreakingdata_poisoning

READINESS

Composabilityalgorithm

Depthreference_implementation

Noveltynovel_combination