StableOPD

paper · 2026-04-10

“To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.”

On-Policy Distillation: Addressing Length Inflation and Instability with StableO…