ROLL Asynchronous Training User Guide
The ROLL framework now supports asynchronous training for both RLVR and Agentic pipelines, significantly improving training efficiency. This document provides detailed instructions on how to use this feature.
Asynchronous Training Overview
In traditional synchronous training, generation and training run serially: the next batch of rollouts can only be generated after the current batch has had its rewards computed and its training step completed. In asynchronous training, generation and training run in parallel: the inference workers generate one or more batches of data ahead of time, and the training workers learn from this pre-generated data without waiting for generation to finish.
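As a rough illustration of the potential benefit (the timings below are hypothetical, not ROLL measurements): when generation and training overlap, the step time is bounded by the slower stage rather than by the sum of the two.

# Hypothetical per-batch timings, for illustration only.
t_gen, t_train = 60.0, 40.0

sync_step = t_gen + t_train        # synchronous: generation and training run back to back
async_step = max(t_gen, t_train)   # asynchronous: the stages overlap, the slower one dominates

print(f"ideal speedup: {sync_step / async_step:.2f}x")  # ideal speedup: 1.67x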
Enabling Asynchronous Training
To enable asynchronous training, set the async_generation_ratio parameter in your configuration file. This parameter has consistent meaning and usage across both RLVR and Agentic pipelines.
Configuration Parameters
The async_generation_ratio parameter is defined in roll/configs/base_config.py:
async_generation_ratio: float = field(
    default=0,
    metadata={
        "help": "The ratio of ahead generation requests in pipeline, "
        "0 means synchronous pipeline. currently only integer is supported."
    },
)
Configuration Examples
Agentic Asynchronous Training Configuration
Here is a complete Agentic asynchronous training configuration example (from examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_async.yaml):
# Enable asynchronous training
async_generation_ratio: 1

# Other related configurations
rollout_batch_size: 1024
val_batch_size: 1024
sequence_length: 8192

# Training parameters
max_steps: 1024
save_steps: 10000
logging_steps: 1
eval_steps: 10

# PPO parameters
ppo_epochs: 1
adv_estimator: "grpo"
whiten_advantages: true

# Model configuration
pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

# Actor configuration
actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 128
    warmup_steps: 10
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 2

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(4,8))
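In this example, actor_train occupies GPUs 0-3 and actor_infer occupies GPUs 4-7; asynchronous training expects the two roles to be deployed on separate devices (see the usage recommendations below). The snippet that follows is purely illustrative (not part of ROLL) and simply makes the disjointness of the two device_mapping values explicit.

# Illustrative only: the two device_mapping expressions from the example above.
actor_train_devices = list(range(0, 4))   # actor_train.device_mapping -> GPUs 0-3
actor_infer_devices = list(range(4, 8))   # actor_infer.device_mapping -> GPUs 4-7

# Generation can only run ahead of training if the two roles do not share GPUs.
assert set(actor_train_devices).isdisjoint(actor_infer_devices)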
RLVR Asynchronous Training Configuration
Here is a complete RLVR asynchronous training configuration example (from examples/qwen2.5-7B-rlvr_megatron/rlvr_config_async.yaml):
# Enable asynchronous training
async_generation_ratio: 1

# Other related configurations
rollout_batch_size: 64
prompt_length: 2048
response_length: 8192

# Training parameters
max_steps: 1000
save_steps: 100
logging_steps: 1

# RLVR specific parameters
is_num_return_sequences_expand: true
num_return_sequences_in_group: 8
ppo_epochs: 1
adv_estimator: "reinforce"

# Model configuration
pretrain: /data/cpfs_0/common/models/Qwen2.5-7B
reward_pretrain: /data/cpfs_0/common/models/Qwen2.5-7B

# Actor configuration
actor_train:
  model_args:
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 1
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      sequence_parallel: true
      use_distributed_optimizer: true
  device_mapping: list(range(0,16))
  infer_batch_size: 2

actor_infer:
  model_args:
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: dummy
  device_mapping: list(range(16,24))
  infer_batch_size: 1
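As a sanity check on these numbers (a rough, illustrative calculation, assuming the data-parallel size equals the number of training GPUs divided by the tensor- and pipeline-parallel sizes), the samples generated per rollout step match the samples consumed per optimizer update:

# Rough consistency check for the example above; not part of ROLL.
rollout_batch_size = 64
num_return_sequences_in_group = 8
samples_generated = rollout_batch_size * num_return_sequences_in_group             # 512

train_gpus = len(list(range(0, 16)))      # actor_train.device_mapping -> 16 GPUs
tp, pp = 2, 1                             # tensor / pipeline model parallel sizes
dp = train_gpus // (tp * pp)              # assumed data-parallel size -> 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 64
samples_consumed = dp * per_device_train_batch_size * gradient_accumulation_steps  # 512

assert samples_generated == samples_consumed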
How Asynchronous Training Works
- When async_generation_ratio is set to a value greater than 0, the framework starts asynchronous training mode
- The inference process generates async_generation_ratio times the data needed for training in advance
- The training process can use this pre-generated data for learning without waiting for the current batch of inference to complete (see the sketch below)
- This parallel processing can significantly improve training efficiency, especially when inference is time-consuming
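The sketch below is a minimal, framework-agnostic illustration of this producer/consumer relationship (it is not ROLL's actual implementation): a bounded queue lets generation run at most async_generation_ratio batches ahead, and training blocks only when no pre-generated batch is available.

import queue
import threading

# Illustrative sketch only. In ROLL, async_generation_ratio: 0 means the ordinary
# synchronous pipeline; here we assume a ratio of at least 1.
async_generation_ratio = 1
rollout_queue = queue.Queue(maxsize=async_generation_ratio)

def generate_batch(step):
    return f"rollout batch {step}"                # stands in for actor_infer generation

def train_on(batch):
    print(f"training on {batch}")                 # stands in for an actor_train update

def generator(num_steps):
    for step in range(num_steps):
        rollout_queue.put(generate_batch(step))   # blocks once `ratio` batches are queued

def trainer(num_steps):
    for _ in range(num_steps):
        train_on(rollout_queue.get())             # consumes pre-generated data immediately

producer = threading.Thread(target=generator, args=(4,))
consumer = threading.Thread(target=trainer, args=(4,))
producer.start(); consumer.start()
producer.join(); consumer.join()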
Supported Algorithms
Agentic Pipeline
- Supports GRPO and other policy gradient algorithms
- Suitable for environment-interaction tasks such as games and dialogue
- Configuration example: examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_async.yaml
RLVR Pipeline
- Supports REINFORCE and other algorithms
- Suitable for tasks with verifiable rewards, such as mathematical reasoning and code generation
- Configuration example: examples/qwen2.5-7B-rlvr_megatron/rlvr_config_async.yaml
Off-Policy Algorithms
ROLL also supports various Off-Policy algorithms. For detailed information, please refer to: docs_roll/docs/English/UserGuide/algorithms/offpolicy_setting.md
Configuration example: examples/qwen2.5-7B-rlvr-offpolicy/rlvr_config.yaml
Supported algorithm variants include: topr, vanilla, tis, cispo, kimi15, and ppo.
Usage Recommendations
- Adjust the value of async_generation_ratio according to hardware resources and task characteristics
- Ensure separate deployment of training and inference roles
- Monitor resource usage during training to avoid resource bottlenecks
- Asynchronous generation is paused during validation and resumes after validation is complete
- For RLVR tasks, you can further optimize performance by combining the is_num_return_sequences_expand and num_return_sequences_in_group parameters
- For Off-Policy algorithms, ensure correct configuration of the pg_variant parameter and the corresponding worker class