ROLL Asynchronous Training User Guide

The ROLL framework now supports asynchronous training for both the RLVR and Agentic pipelines, which can significantly improve training efficiency by overlapping rollout generation with model updates. This document describes how to enable and configure the feature.

Asynchronous Training Overview

In traditional synchronous training, training and inference run serially: the next batch of rollouts can only be generated after the current batch has been scored and used for a training step. In asynchronous training, training and inference run in parallel: the inference workers generate batches of data ahead of time, and the training process learns from this pre-generated data without waiting for generation to finish.

Enabling Asynchronous Training

To enable asynchronous training, set the async_generation_ratio parameter to a value greater than 0 in your pipeline configuration file. The parameter has the same meaning and usage in both the RLVR and Agentic pipelines.

Configuration Parameters

The async_generation_ratio parameter is defined in roll/configs/base_config.py:

async_generation_ratio: float = field(
    default=0,
    metadata={
        "help": "The ratio of ahead generation requests in pipeline, "
        "0 means synchronous pipeline. currently only integer is supported."
    },
)
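
In practice, enabling the feature is a single line in the pipeline YAML (a minimal sketch; 1 is the value used in the examples below):

# 0 (default) keeps the synchronous pipeline; any value > 0 enables asynchronous generation
async_generation_ratio: 1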

Configuration Examples

Agentic Asynchronous Training Configuration

Here is a complete Agentic asynchronous training configuration example (from examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_async.yaml):

# Enable asynchronous training
async_generation_ratio: 1

# Other related configurations
rollout_batch_size: 1024
val_batch_size: 1024
sequence_length: 8192

# Training parameters
max_steps: 1024
save_steps: 10000
logging_steps: 1
eval_steps: 10

# PPO parameters
ppo_epochs: 1
adv_estimator: "grpo"
whiten_advantages: true

# Model configuration
pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

# Actor configuration
actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 128
    warmup_steps: 10
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,4))
  infer_batch_size: 2

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(4,8))

RLVR Asynchronous Training Configuration

Here is a complete RLVR asynchronous training configuration example (from examples/qwen2.5-7B-rlvr_megatron/rlvr_config_async.yaml):

# Enable asynchronous training
async_generation_ratio: 1

# Other related configurations
rollout_batch_size: 64
prompt_length: 2048
response_length: 8192

# Training parameters
max_steps: 1000
save_steps: 100
logging_steps: 1

# RLVR specific parameters
is_num_return_sequences_expand: true
num_return_sequences_in_group: 8
ppo_epochs: 1
adv_estimator: "reinforce"

# Model configuration
pretrain: /data/cpfs_0/common/models/Qwen2.5-7B
reward_pretrain: /data/cpfs_0/common/models/Qwen2.5-7B

# Actor configuration
actor_train:
  model_args:
    dtype: bf16
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 1
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      sequence_parallel: true
      use_distributed_optimizer: true
  device_mapping: list(range(0,16))
  infer_batch_size: 2

actor_infer:
  model_args:
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.85
      load_format: dummy
  device_mapping: list(range(16,24))
  infer_batch_size: 1

How Asynchronous Training Works

  1. When async_generation_ratio is set to a value greater than 0, the framework switches to asynchronous training mode
  2. The inference process runs ahead of training, keeping roughly async_generation_ratio batches of generated data queued for the trainer (see the illustrative snippet below)
  3. The training process learns from this pre-generated data without waiting for the current batch of inference to complete
  4. This overlap can significantly improve training efficiency, especially when generation is the dominant cost
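
As a rough illustration of what the ratio means, using the batch size from the RLVR example above (the actual scheduling is handled by the framework):

rollout_batch_size: 64      # prompts consumed per training step
async_generation_ratio: 1   # inference stays about one batch (~64 prompts) ahead
# async_generation_ratio: 0 # synchronous: generate, score, then train, in lockstep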

Supported Algorithms

Agentic Pipeline

  • Supports GRPO and other policy-gradient algorithms
  • Suited to environment-interaction tasks such as games and dialogues
  • Configuration example: examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_async.yaml

RLVR Pipeline

  • Supports REINFORCE and other algorithms
  • Suited to verifiable-reward tasks such as mathematical reasoning and code generation
  • Configuration example: examples/qwen2.5-7B-rlvr_megatron/rlvr_config_async.yaml

Off-Policy Algorithms

ROLL also supports various Off-Policy algorithms. For detailed information, please refer to: docs_roll/docs/English/UserGuide/algorithms/offpolicy_setting.md

Configuration example: examples/qwen2.5-7B-rlvr-offpolicy/rlvr_config.yaml

Supported algorithm variants include (a minimal selection sketch follows the list):

  • topr
  • vanilla
  • tis
  • cispo
  • kimi15
  • ppo
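
The variant is selected via the pg_variant parameter mentioned in the recommendations below. Where exactly pg_variant sits in the YAML is an assumption here; consult offpolicy_setting.md and the example config above for the authoritative layout:

# Sketch only -- the placement of this key is an assumption, see offpolicy_setting.md
pg_variant: topr   # one of: topr, vanilla, tis, cispo, kimi15, ppo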

Usage Recommendations

  1. Adjust async_generation_ratio to your hardware and task: a larger ratio keeps more pre-generated data in flight, but requires enough inference capacity to stay ahead of training
  2. Deploy the training and inference roles separately, i.e. give actor_train and actor_infer non-overlapping device_mapping values as in the examples above (see the condensed snippet after this list)
  3. Monitor resource usage during training to avoid resource bottlenecks
  4. Asynchronous generation is paused during validation and resumes after validation completes
  5. For RLVR tasks, you can further tune throughput by combining the is_num_return_sequences_expand and num_return_sequences_in_group parameters (also shown in the snippet below)
  6. For Off-Policy algorithms, ensure the pg_variant parameter and the corresponding worker class are configured correctly
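
For recommendations 2 and 5, the relevant keys are already present in the RLVR example above; a condensed view (values are illustrative):

# Recommendation 2: keep training and inference on disjoint GPUs
# Recommendation 5: sample a group of responses per prompt
is_num_return_sequences_expand: true
num_return_sequences_in_group: 8

actor_train:
  device_mapping: list(range(0,16))    # GPUs 0-15 train

actor_infer:
  generating_args:
    num_return_sequences: ${num_return_sequences_in_group}
  device_mapping: list(range(16,24))   # GPUs 16-23 generate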