# On-Policy Distillation Pipeline
## Overview
On-Policy Distillation (OPD) is a training method that combines online learning with knowledge distillation: the student model samples its own trajectories and learns to match the teacher model's behavior on them, enabling efficient model compression and capability transfer.
This pipeline provides the following core advantages:
- Efficient Training: Unlike reinforcement learning (RL), which typically provides only a sparse sequence-level reward, OPD provides a dense per-token reward signal, making training more efficient
- Teacher as Reward Model: Uses the teacher model's log probabilities directly to compute rewards, so there is no need to train a separate reward model (see the sketch after this list)
- Online Learning Advantage: The student model learns on its own state distribution, avoiding the distribution shift that off-policy distillation suffers from
- Full Reuse of RLVR Pipeline: Built on the existing RLVR architecture, so it is simple to configure and easy to use
- Support for Mixed Mode: OPD rewards can be used together with external rewards (e.g., math verification, code execution)
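
To make the reward computation concrete, here is a minimal sketch of how a teacher-as-reward-model signal and mixed mode could look. This is illustrative, not the pipeline's actual API: the function names `opd_reward` and `mixed_reward` and the `opd_weight` blending knob are hypothetical, and the negative per-token reverse KL used here is one common choice of OPD reward.

```python
import torch

def opd_reward(student_log_probs: torch.Tensor,
               teacher_log_probs: torch.Tensor) -> torch.Tensor:
    """Dense per-token OPD reward on a student-sampled trajectory.

    One common formulation is the negative reverse KL on the sampled
    tokens: the reward is high where the teacher assigns the student's
    chosen token at least as much probability as the student did.
    """
    return -(student_log_probs - teacher_log_probs)

def mixed_reward(student_log_probs: torch.Tensor,
                 teacher_log_probs: torch.Tensor,
                 external_reward: float,
                 opd_weight: float = 0.5) -> torch.Tensor:
    """Mixed mode (hypothetical): blend dense per-token OPD rewards with
    a sparse external reward (e.g. a math-verification or code-execution
    score) credited on the final token of the trajectory."""
    rewards = opd_weight * opd_reward(student_log_probs, teacher_log_probs)
    rewards[-1] += (1.0 - opd_weight) * external_reward
    return rewards
```

Because both log-prob vectors are per token, the OPD term stays dense even when the external reward only arrives at the end of the sequence.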
## Core Principles
### What is On-Policy Distillation?
The core idea of On-Policy Distillation is to sample trajectories from the student model and then use a stronger teacher model to score every token in those trajectories.
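
Concretely, "scoring" means running the teacher once over the student-sampled trajectory and gathering the log probability the teacher assigns to each sampled token. Below is a minimal PyTorch sketch, assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; the helper name `teacher_log_probs_for` is ours, and prompt-token masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_log_probs_for(teacher, token_ids: torch.Tensor) -> torch.Tensor:
    """Score a student-sampled trajectory with the teacher.

    token_ids: (seq_len,) prompt + student-response token ids.
    Returns the teacher's log prob of each sampled token, shape (seq_len-1,).
    """
    logits = teacher(token_ids.unsqueeze(0)).logits      # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t+1
    targets = token_ids[1:].view(1, -1, 1)               # (1, seq_len-1, 1)
    return log_probs.gather(-1, targets).squeeze(-1).squeeze(0)
```

In the actual pipeline only the response positions would be kept (prompt tokens are masked out), and the same gather over the student's own logits yields `student_log_probs`.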
┌─────────────────────────────────────────────────────────────────┐
│                   On-Policy Distillation Flow                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Sample Trajectories                                         │
│     ┌──────────┐      ┌──────────────────────────────────┐      │
│     │  Prompt  │ ──▶  │  Student Model (rollout)         │      │
│     └──────────┘      │  Generate trajectories +         │      │
│                       │  student_log_probs               │      │
│                       └──────────────────────────────────┘      │
│                                        │                        │
│                                        ▼                        │
│  2. Compute Teacher Log Probs                                   │
│     ┌──────────────────────────────────┐                        │
│     │  Teacher Model (forward)         │                        │
│     │  Compute teacher_log_probs       │                        │
│     └──────────────────────────────────┘                        │
│                      │                                          │