Self-Distillation Policy Optimization
1 mention across 1 person
All mentions
guestrin
Recommended paper · 2026-05-13
“We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO)”
SDPO: Overcoming RL Credit-Assignment Bottlenecks via Self-Distillation from Ric