Off-Policy Algorithms Configuration Guide
The ROLL framework supports multiple Off-Policy algorithm variants for reinforcement learning training. This document describes how to configure each variant and provides usage examples.
Supported Algorithm Variants
The ROLL framework currently supports the following Off-Policy algorithms:
- vanilla - Basic Policy Gradient algorithm
- ppo - Proximal Policy Optimization
- tis - Truncated Importance Sampling
- topr - Tapered off-policy REINFORCE
- cispo - Clipped Importance Sampling Policy Optimization
- kimi15 - Kimi15 algorithm
Basic Configuration
Core Parameters
Set the basic parameters for Off-Policy algorithms in the configuration file:
# Select algorithm variant
pg_variant: topr # Options: vanilla, tis, topr, cispo, kimi15, ppo
# Training configuration
max_steps: 500
save_steps: 100
logging_steps: 1
eval_steps: 10
# Data configuration
rollout_batch_size: 128
prompt_length: 2048
response_length: 8192
num_return_sequences_in_group: 8
# Common training parameters
ppo_epochs: 1
adv_estimator: "reinforce"
whiten_advantages: true
Worker Configuration
Use the specialized ActorPGWorker to handle Off-Policy algorithms:
actor_train:
worker_cls: roll.pipeline.rlvr.actor_pg_worker.ActorPGWorker
pg_variant: topr # Keep consistent with global configuration
model_args:
flash_attn: fa2
disable_gradient_checkpointing: false
dtype: bf16
training_args:
learning_rate: 1.0e-6
weight_decay: 0
per_device_train_batch_size: 1
gradient_accumulation_steps: 64
warmup_steps: 20
num_train_epochs: 50
strategy_args:
strategy_name: megatron_train
strategy_config:
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
use_distributed_optimizer: true
recompute_granularity: full
device_mapping: list(range(0,16))
Detailed Algorithm Configuration
1. Vanilla Policy Gradient
The most basic policy gradient algorithm: the loss is simply the negative product of the token log probability and the advantage.
Configuration Features:
- No additional parameters required
- High computational efficiency
- Suitable for simple reinforcement learning tasks
pg_variant: vanilla
# No additional configuration parameters needed
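For intuition, a minimal PyTorch sketch of this loss is shown below. This is an illustration only, not the ROLL implementation; the tensor names, shapes, and response-token mask are assumptions.
import torch

def vanilla_pg_loss(log_probs, advantages, mask):
    # log_probs:  (batch, seq) token log-probabilities under the current policy
    # advantages: (batch, seq) advantage estimates (e.g. REINFORCE returns)
    # mask:       (batch, seq) 1 for response tokens, 0 for prompt/padding
    token_loss = -log_probs * advantages
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)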
2. PPO (Proximal Policy Optimization)
Proximal Policy Optimization algorithm that stabilizes training by clipping importance sampling ratios.
Key Parameters:
pg_variant: ppo
# PPO specific parameters
pg_clip: 0.2 # Clipping range
pg_clip_low: 0.2 # Lower bound clipping (optional)
pg_clip_high: 0.2 # Upper bound clipping (optional)
use_pg_clip_range: false # Whether to use asymmetric clipping
dual_clip_loss: true # Whether to enable dual clipping
Configuration Example:
pg_variant: ppo
pg_clip: 0.2
dual_clip_loss: true
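As a rough illustration of the clipped surrogate objective, the sketch below assumes token-level log probabilities and advantages and omits dual clipping; it is not the exact ActorPGWorker implementation.
import torch

def ppo_loss(log_probs, old_log_probs, advantages, mask, pg_clip=0.2):
    # Importance sampling ratio between the current and the behavior policy
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped surrogate: take the pessimistic (larger) of the two losses
    unclipped = -ratio * advantages
    clipped = -torch.clamp(ratio, 1.0 - pg_clip, 1.0 + pg_clip) * advantages
    token_loss = torch.maximum(unclipped, clipped)
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)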
3. TIS (Truncated Importance Sampling)
Truncated Importance Sampling algorithm that clamps the importance sampling ratio to a configurable range, [0, 1] by default.
Key Parameters:
pg_variant: tis
# TIS specific parameters
tis_lower_bound: 0.0 # Lower bound
tis_upper_bound: 1.0 # Upper bound
Configuration Example:
pg_variant: tis
tis_lower_bound: 0.0
tis_upper_bound: 1.0
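One common way to realize this is to apply the truncated ratio as a fixed (stop-gradient) weight on the REINFORCE term. The sketch below assumes that form; whether ROLL detaches the ratio in the same way is an assumption here.
import torch

def tis_loss(log_probs, old_log_probs, advantages, mask,
             tis_lower_bound=0.0, tis_upper_bound=1.0):
    # Truncate the importance ratio into [lower, upper] and treat it as a
    # constant weight (detach) on the REINFORCE term -- an assumption of this sketch
    ratio = torch.exp(log_probs - old_log_probs)
    weight = ratio.clamp(tis_lower_bound, tis_upper_bound).detach()
    token_loss = -weight * log_probs * advantages
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)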
4. TOPR (Tapered off-policy REINFORCE)
A tapered off-policy REINFORCE algorithm that applies different update rules to positive and negative samples.
Algorithm Features:
- Positive samples: Direct SFT update without importance sampling
- Negative samples: TIS update with importance sampling ratio limited to [0, 1]
Key Parameters:
pg_variant: topr
# TOPR specific parameters
topr_positive_weight: 1.0 # Positive sample weight
topr_negative_weight: 1.0 # Negative sample weight
Configuration Example:
pg_variant: topr
topr_positive_weight: 1.0
topr_negative_weight: 1.0
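A sketch of the tapered update is given below. It assumes samples are split by the sign of the advantage and that the truncated ratio is treated as a constant (detached) weight; both are assumptions of this illustration, not a statement of ROLL's exact implementation.
import torch

def topr_loss(log_probs, old_log_probs, advantages, mask,
              topr_positive_weight=1.0, topr_negative_weight=1.0):
    ratio = torch.exp(log_probs - old_log_probs)
    is_positive = (advantages > 0).float()
    # Positive samples: plain likelihood (SFT-style) term, no importance ratio
    pos_loss = -log_probs * advantages
    # Negative samples: REINFORCE weighted by the ratio truncated to [0, 1]
    neg_loss = -ratio.clamp(0.0, 1.0).detach() * log_probs * advantages
    token_loss = (topr_positive_weight * is_positive * pos_loss
                  + topr_negative_weight * (1.0 - is_positive) * neg_loss)
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)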
5. CISPO (Clipped Importance Sampling Policy Optimization)
Clipped Importance Sampling Policy Optimization algorithm that uses clipped importance sampling weights and stop-gradient operations.
Key Parameters:
pg_variant: cispo
# CISPO specific parameters
cispo_epsilon_low: 0.1 # Lower bound clipping parameter
cispo_epsilon_high: 0.1 # Upper bound clipping parameter
cispo_use_unified_mask: false # Whether to use unified mask
Configuration Example:
pg_variant: cispo
cispo_epsilon_low: 0.1
cispo_epsilon_high: 0.1
cispo_use_unified_mask: false
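The sketch below illustrates the clipped, stop-gradient importance weighting; tensor shapes, masking, and the exact averaging are assumptions of this illustration.
import torch

def cispo_loss(log_probs, old_log_probs, advantages, mask,
               cispo_epsilon_low=0.1, cispo_epsilon_high=0.1):
    ratio = torch.exp(log_probs - old_log_probs)
    # Clip the IS weight into [1 - eps_low, 1 + eps_high] and stop its gradient,
    # so gradients flow only through log_probs
    weight = ratio.clamp(1.0 - cispo_epsilon_low, 1.0 + cispo_epsilon_high).detach()
    token_loss = -weight * advantages * log_probs
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)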
6. Kimi15
A KL-regularized policy gradient algorithm that adds a KL divergence penalty to the plain policy gradient term.
Key Parameters:
pg_variant: kimi15
# Kimi15 specific parameters
kimi15_tau: 0.1 # Regularization parameter
Configuration Example:
pg_variant: kimi15
kimi15_tau: 0.1
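One common form of such a KL-regularized policy gradient uses a squared log-ratio penalty scaled by tau, as sketched below; the exact regularizer used by ROLL may differ, so treat this purely as an illustration.
import torch

def kimi15_loss(log_probs, old_log_probs, advantages, mask, kimi15_tau=0.1):
    # Policy-gradient term plus a squared log-ratio penalty that keeps the new
    # policy close to the behavior policy (one common KL-style regularizer)
    pg_term = -log_probs * advantages
    kl_reg = 0.5 * kimi15_tau * (log_probs - old_log_probs.detach()) ** 2
    token_loss = pg_term + kl_reg
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)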
Complete Configuration Example
For a complete RLVR Off-Policy configuration example, please refer to:
Configuration File: examples/qwen2.5-7B-rlvr-offpolicy/rlvr_config.yaml
This configuration file contains all necessary parameter settings and supports switching between different algorithm variants by modifying the pg_variant parameter:
pg_variant: topr # Options: topr, vanilla, tis, cispo, kimi15, ppo
Key Configuration Points
- Worker Configuration: Use the ActorPGWorker class to handle Off-Policy algorithms
- Algorithm Selection: Specify the algorithm variant through the pg_variant parameter
- Model Configuration: Support for Megatron training and SGLang inference
- Reward Configuration: Includes the mathematical rule-based reward model configuration
Usage
- Copy the configuration file to your working directory
- Modify pg_variant and other parameters as needed
- Run the training script:
python examples/start_rlvr_pipeline.py --config-path your_config.yaml
Algorithm Selection Recommendations
Selection Based on Task Characteristics
Simple Tasks: Use vanilla or ppo
- Low computational overhead
- Fast convergence
Complex Reasoning Tasks: Use topr or cispo
- Better stability
- Suitable for long sequence generation
Tasks Requiring Exploration: Use tis or kimi15
- Better exploration capability
- Suitable for sparse reward environments
Selection Based on Data Distribution
- Balanced Positive/Negative Samples: Use ppo or vanilla
- More Negative Samples: Use topr; the negative sample weight can be adjusted
- Need Regularization: Use kimi15; control regularization strength through the tau parameter
Monitoring and Debugging
Key Metrics
Different algorithms output different monitoring metrics:
- Common Metrics: pg_loss, kl_loss, entropy_loss
- PPO Specific: ppo_ratio_clipfrac, ppo_ratio_low_clipfrac, ppo_ratio_high_clipfrac
- TIS Specific: tis_lower_clipfrac, tis_upper_clipfrac, tis_total_clipfrac
- TOPR Specific: topr_positive_samples, topr_negative_samples, topr_negative_total_clipfrac
- CISPO Specific: cispo_total_clipfrac, cispo_clipped_ratio
- Kimi15 Specific: kimi15_policy_grad_magnitude, kimi15_kl_reg_magnitude
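Clip-fraction metrics of this kind typically report the share of valid tokens whose importance ratio was altered by clipping. A rough sketch of such a computation is shown below (an assumed formula, not necessarily ROLL's exact definition).
import torch

def clip_fraction(ratio, mask, low, high):
    # Share of valid tokens whose importance ratio falls outside [low, high]
    clipped = ((ratio < low) | (ratio > high)).float()
    return (clipped * mask).sum() / mask.sum().clamp(min=1)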
Debugging Recommendations
- Monitor Clipping Ratios: A persistently high clip fraction may indicate that the learning rate is too large
- Observe Sample Distribution: When using TOPR, keep an eye on the ratio of positive to negative samples
- Adjust Hyperparameters: Tune algorithm-specific parameters based on task characteristics
- Use TensorBoard: Visualize metric changes during training
Frequently Asked Questions
Q: How to choose the appropriate pg_variant?
A: It's recommended to start with topr, as it performs well on most tasks. Then adjust based on specific task characteristics.
Q: What is the computational complexity of each algorithm?
A: vanilla < ppo < tis < kimi15 < cispo < topr
Q: Can I switch algorithms during training?
A: It's not recommended to switch algorithms during training, as this can cause training instability.
Q: How to adjust algorithm-specific parameters?
A: Refer to the configuration examples for each algorithm and tune based on validation set performance. It's recommended to start with small adjustments.