Distill Pipeline
✨️Overview
This pipeline offers the following core advantages:
- Various Distillation Losses: Supports training the model with different distillation losses, with finer-grained control available through the corresponding parameters (see Distill Algorithm Parameters below).
- Comprehensive Performance Monitoring: A fine-grained metric tracking system monitors performance metrics and provides comprehensive visualization and analysis of the training process.
- Efficient Distributed Computing: Leverages the Ray framework for efficient distributed training on large-scale GPU clusters, significantly improving training speed and resource utilization.
✨️Core Components
Main Module (`DistillPipeline`)
`DistillPipeline` (located in `roll/pipeline/distill/distill_pipeline.py`) is the primary coordinator of the entire distill training process. It manages the complete training workflow, including:
- Initializing and managing distributed workers (Student and Teacher workers).
- Coordinating data collection and processing.
- Executing model training steps.
- Handling checkpoint saving.
- Recording metrics and experiment tracking.
Source code: `roll/pipeline/distill/distill_pipeline.py`
Configuration File (`DistillConfig`)
`DistillConfig` (defined in `roll/pipeline/distill/distill_config.py`) is a Pydantic/dataclass-based configuration object that specifies all parameters for running the distill pipeline. The configuration system is designed to be flexible: parameters are written in YAML files and managed with the Hydra framework.
Configuration File Structure and Organization
Configuration files (such as `examples/qwen2.5-7B-distill_megatron/distill_megatron.yaml`) are organized by functional module and contain the following main sections:
Experiment Basic Settings
- `exp_name`: Experiment name, used to identify a specific training run
- `logging_dir`: Path for saving log files
- `output_dir`: Path for saving model checkpoints and output files
Training Control Parameters
- `max_steps`: Maximum number of training steps
- `save_steps`: Frequency (in steps) for saving model checkpoints
- `logging_steps`: Frequency (in steps) for recording training metrics
- `resume_from_checkpoint`: Whether to resume training from a checkpoint. Set it to the checkpoint path to resume; otherwise, set it to `False`.
Model Configuration
- `student_pretrain`: Path to the pre-trained weights of the Student model
- `teacher_pretrain`: Path to the pre-trained weights of the Teacher model
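Taken together, the top-level keys described above might look like the following sketch. All values here are illustrative placeholders rather than settings taken from the shipped example:

```yaml
# Illustrative top-level settings; key names follow the descriptions above,
# values are placeholders.
exp_name: qwen2.5-7B-distill-example
seed: 42
logging_dir: ./output/logs
output_dir: ./output/checkpoints

max_steps: 1000
save_steps: 100
logging_steps: 10
resume_from_checkpoint: false          # or a checkpoint path to resume from

student_pretrain: /path/to/student/pretrained/model
teacher_pretrain: /path/to/teacher/pretrained/model
```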
Distill Algorithm Parameters
- `distill_loss_weight`: Fraction of the total loss assigned to the distillation term (the SFT loss weight is 1 − this value).
- `kd_temperature`: Softmax temperature applied to the student logits during knowledge distillation.
- `teacher_temperature`: Temperature applied to the teacher logits to control their softness.
- `kd_objective`: Divergence measure used to compare the student and teacher distributions (e.g., `forward_kl`, `reverse_kl`).
- `adaptive_kl_alpha`: Weighting factor that blends forward and reverse KL when `kd_objective` is `adaptive_kl`.
- `skew_lambda`: Skewing coefficient applied in the `skewed_forward_kl` and `skewed_reverse_kl` objectives.
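As a rough sketch, these knobs might be configured as below; the total loss is then approximately `distill_loss_weight * distill_loss + (1 - distill_loss_weight) * sft_loss`. The values are illustrative placeholders, not recommendations:

```yaml
# Illustrative distillation settings (placeholder values).
distill_loss_weight: 0.9        # distillation term weight; the SFT term gets 0.1
kd_temperature: 2.0             # softmax temperature for the student logits
teacher_temperature: 2.0        # softmax temperature for the teacher logits
kd_objective: forward_kl        # e.g. forward_kl, reverse_kl, adaptive_kl,
                                #      skewed_forward_kl, skewed_reverse_kl
adaptive_kl_alpha: 0.5          # only used when kd_objective is adaptive_kl
skew_lambda: 0.1                # only used by the skewed_* objectives
```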
Worker Configuration
Each worker (`student`, `teacher`) configuration contains:
- Model Parameters (`model_args`)
  - `model_type`: Model type (e.g., `causal_lm`)
  - `dtype`: Computation precision (e.g., `bf16`, `fp16`)
  - ...
- Training Parameters (`training_args`)
  - `learning_rate`: Learning rate
  - `per_device_train_batch_size`: Training batch size per device
  - `gradient_accumulation_steps`: Number of gradient accumulation steps
  - `weight_decay`: Weight decay coefficient
  - `max_grad_norm`: Gradient clipping threshold
  - ...
- Distributed Strategy (`strategy_args`)
  - `strategy_name`: Distributed strategy to use (e.g., `megatron_train`, `deepspeed_infer`)
  - Strategy-specific parameters, e.g., `tp_size` (tensor parallelism size), `pp_size` (pipeline parallelism size), `gpu_memory_utilization` (GPU memory utilization, vLLM-specific)
- Device Mapping (`device_mapping`)
  - Specifies which GPU devices the worker should use
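A minimal sketch of a student worker block following this structure is shown below; the teacher block mirrors it. The nested values are placeholders, and the exact `device_mapping` format should be checked against the example YAML:

```yaml
# Illustrative student worker configuration (placeholder values).
student:
  model_args:
    model_type: causal_lm
    dtype: bf16
  training_args:
    learning_rate: 1.0e-5
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 8
    weight_decay: 0.01
    max_grad_norm: 1.0
  strategy_args:
    strategy_name: megatron_train
    tp_size: 2                           # tensor parallelism size
    pp_size: 1                           # pipeline parallelism size
  data_args:
    file_name: data/distill_train.json   # training data, see "Data Preparation" below
  device_mapping: list(range(0,8))       # GPUs assigned to this worker
```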
✨️Data Preparation
Data Format
The distill pipeline expects the training data to be stored in JSON files.
Required Columns
Each data sample must contain a question and its corresponding answer. In the YAML file, use the keys `question_key` and `answer_key` to specify the field names that hold these two pieces of data.
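For instance, with JSON records shaped like `{"question": ..., "answer": ...}`, the data-related keys might be set as in the sketch below. The file path and field names are illustrative, and the exact placement of `question_key` and `answer_key` should be checked against the example YAML:

```yaml
# Illustrative data configuration. The JSON file would contain records such as
#   {"question": "What is the capital of France?", "answer": "Paris"}
question_key: question        # JSON field holding the prompt
answer_key: answer            # JSON field holding the target response
student:
  data_args:
    file_name: data/distill_train.json
```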
✨️Running the Pipeline
Method 1: Using the Python Launcher Script
The primary method is to use the `examples/start_distill_pipeline.py` script, which uses Hydra to load and manage configurations.

1. Select or Create a Configuration File
   Start with an example YAML (e.g., `examples/qwen2.5-7B-distill_megatron/distill_megatron.yaml`) or create your own configuration.
2. Execute the Python Launcher Script

```bash
# Make sure you are in the root directory of the ROLL project
# export PYTHONPATH=$(pwd):$PYTHONPATH
python examples/start_distill_pipeline.py \
       --config_path examples/qwen2.5-7B-distill_megatron \
       --config_name distill_megatron
```

- `--config_path` – Directory containing your YAML configuration.
- `--config_name` – Filename (without `.yaml`).
Method 2: Using Helper Shell Scripts
The `examples` directory typically contains shell scripts that wrap the Python launcher.
Example structure:
```bash
#!/bin/bash
# Example: examples/qwen2.5-7B-distill_megatron/run_distill_pipeline.sh

CONFIG_NAME="distill_megatron"                     # distill_megatron.yaml
CONFIG_PATH="examples/qwen2.5-7B-distill_megatron"

# Set environment variables and other configurations
python examples/start_distill_pipeline.py \
       --config_path $CONFIG_PATH \
       --config_name $CONFIG_NAME \
       "$@"   # Pass any additional parameters
```
Run using:

```bash
bash examples/qwen2.5-7B-distill_megatron/run_distill_pipeline.sh
```
✨️Step-by-Step Example
Step 1: Configure Settings
File: `examples/qwen2.5-7B-distill_megatron/distill_megatron.yaml`
Key sections include `exp_name`, `seed`, `output_dir`, the model paths, and the `student` and `teacher` configurations.

Pay special attention to these configuration sections:
- Data configuration: `student.data_args.file_name`
- Model configuration: the `student_pretrain` and `teacher_pretrain` paths. The distill pipeline currently only supports student and teacher models of the same type (for example, both the student and teacher are Qwen models).
- Distributed strategies: `strategy_args` and `device_mapping` for each worker. The distill pipeline currently only supports the student and teacher using the same strategy backend (e.g., the student uses `megatron_train` while the teacher uses `megatron_infer`) with identical parallel configurations, because CudaIPC is used to transfer logits from the teacher to the student. A sketch of such a pairing is shown after this list.
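For example, a consistent student/teacher pairing might look like the sketch below: the student trains with Megatron, the teacher only serves inference, and the parallel sizes match. All values are placeholders, and device placement should follow the shipped example YAML:

```yaml
# Illustrative pairing with identical parallel configurations (placeholder values).
student:
  strategy_args:
    strategy_name: megatron_train
    tp_size: 2
    pp_size: 1
  device_mapping: list(range(0,8))
teacher:
  strategy_args:
    strategy_name: megatron_infer
    tp_size: 2                         # must match the student's tp_size
    pp_size: 1                         # must match the student's pp_size
  device_mapping: list(range(0,8))     # check the example YAML for the expected device placement
```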
Step 2: Prepare Environment and Dependencies
Ensure all necessary dependencies are installed:
```bash
pip install -r requirements.txt
```
Verify that all model paths in the configuration are accessible.
Prepare training datasets, ensuring they conform to the data format requirements described above.
Step 3: Launch the Pipeline
```bash
python examples/start_distill_pipeline.py \
       --config_path examples/qwen2.5-7B-distill_megatron \
       --config_name distill_megatron
```
Step 4: Monitoring
- Console Output – Observe Hydra, Ray, and pipeline logs.
- Log Files – Check the `logging_dir` specified in the YAML.
- TensorBoard – Launch it with:

```bash
tensorboard --logdir <your_log_dir>
```
Step 5: Outputs and Results
- Trained Models – Checkpoints are saved in the `output_dir`.
- Evaluation Metrics – Recorded in TensorBoard and the console.
Happy experimenting!