Stableopd
1 mentions across 0 people
All mentions
Unknown speaker
Recommendedpaper · 2026-04-10
“To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.”
On-Policy Distillation: Addressing Length Inflation and Instability with StableO ↗