An algorithmic modification to Direct Preference Optimization (DPO) that addresses distribution shifts between the reference model and the learning policy to improve alignment stability.
STARS
59
FORKS
6
DPO-Shift is an academic/research implementation targeting a specific mathematical nuance in LLM alignment. While it addresses a valid technical problem (distribution shift between the reference model and the learning policy), the repository has low traction, and the technique is a refinement of the standard DPO algorithm that is easily absorbed into major training frameworks such as Hugging Face TRL or proprietary lab stacks.
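For context, the standard DPO objective that DPO-Shift refines can be sketched as below. This is a minimal, illustrative implementation of the well-known DPO loss for a single preference pair (not the repository's actual code); the function name and scalar inputs are assumptions for clarity.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities of the chosen and
    rejected responses under the learning policy and the frozen
    reference model. (Illustrative sketch, not the repo's code.)
    """
    # Implicit rewards are the policy/reference log-ratios.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry logistic loss on the scaled reward margin:
    # -log sigmoid(beta * (chosen_logratio - rejected_logratio))
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy favors the chosen response more than the
# reference does, so the margin is positive and the loss is below log 2.
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.1)
```

DPO-Shift's modification enters through this loss; because the change is confined to the objective, it is straightforward for frameworks like TRL to adopt as a configuration option rather than a separate library.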
TECH STACK
INTEGRATION
algorithm_implementable
READINESS