Agentic Pipeline
✨️ Overview
Agentic Pipeline is ROLL's core pipeline for agent training, supporting algorithms such as PPO and GRPO. It provides the following core advantages:
- Gym-like Environment Definition: Supports various environment types, including FrozenLake, Sokoban, etc., and can easily extend custom environments through gym-like interfaces.
- Rich Learning Granularity: Supports both TrajectoryWise (StarPO) and StepWise (GiGPO) training granularities.
- Asynchronous Parallel Rollout at Environment Granularity: Independent trajectory sampling across environments improves sampling efficiency.
- Asynchronous Training: Rollout and training are decoupled, allowing them to run asynchronously.
- Multi-turn Interaction with Local Debugging: Multi-turn interaction rollouts can be debugged locally, improving development efficiency for multi-turn interaction applications.
- Flexible Policy Configuration: Supports multiple distributed strategies (Megatron, DeepSpeed, vLLM, etc.), allowing flexible configuration based on available hardware resources.
✨️ Core Components
Main Module (AgenticPipeline)
AgenticPipeline (located at `roll/pipeline/agentic/agentic_pipeline.py`) is the main entry point for agent training. It manages the complete training workflow, including:
- Initializing and managing distributed worker processes (Actor, Critic, Reference, etc.).
- Coordinating environment interaction and data collection.
- Executing model training steps.
- Handling checkpoint saving.
- Recording metrics and experiment tracking.
Source Code: `roll/pipeline/agentic/agentic_pipeline.py`
Configuration File (AgenticConfig)
AgenticConfig (defined in `roll/pipeline/agentic/agentic_config.py`) is a Pydantic/dataclass-based configuration object that specifies all parameters for running AgenticPipeline. Configurations are written as YAML files and managed with the Hydra framework.
For a description of the configuration system, see config_system.
Configuration Structure and Organization
Configuration files (such as `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`) are organized by functional module and mainly include the following sections (a consolidated sketch follows the list):
- Basic Experiment Settings
  - `exp_name`: Experiment name, used to identify a specific training task
  - `seed`: Random seed to ensure reproducible experiments
  - `logging_dir`: Path to save log files
  - `output_dir`: Path to save model checkpoints and output files
  - `render_save_dir`: Path to save rendered frames (for environment visualization)
- Training Control Parameters
  - `max_steps`: Maximum number of training steps
  - `save_steps`: Frequency of saving model checkpoints
  - `logging_steps`: Frequency of recording training metrics
  - `eval_steps`: Frequency of performing validation evaluation
  - `resume_from_checkpoint`: Whether to resume training from a checkpoint. To continue training, set it to the checkpoint path; otherwise, set it to `false`.
- Model Configuration
  - `pretrain`: Path to the pretrained model
  - `reward_pretrain`: Path to the reward model's pretrained weights
- Algorithm Parameters
  - `adv_estimator`: Advantage estimator type (such as `gae`, `grpo`, `reinforce`)
  - `ppo_epochs`: Number of optimization epochs per sample batch
  - `gamma`: Discount factor for calculating returns
  - `lambd`: Lambda parameter in GAE
  - `pg_clip`: Clipping range for the PPO policy gradient loss
  - `init_kl_coef`: Initial coefficient for the KL penalty
  - `target_kl`: Target KL value for adaptive KL control
  - `whiten_advantages`: Whether to whiten advantages
  - `entropy_loss_coef`: Coefficient for the entropy loss
- Worker Process Configuration
  Each worker process (`actor_train`, `actor_infer`, `critic`, `reference`) is configured with:
  - Model Parameters (`model_args`)
    - `model_type`: Model type (such as `causal_lm`)
    - `dtype`: Computation precision (such as `bf16`, `fp16`)
    - `attn_implementation`: Attention implementation (such as `fa2`)
    - `disable_gradient_checkpointing`: Whether to disable gradient checkpointing
  - Training Parameters (`training_args`)
    - `learning_rate`: Learning rate
    - `per_device_train_batch_size`: Training batch size per device
    - `gradient_accumulation_steps`: Gradient accumulation steps
    - `weight_decay`: Weight decay coefficient
    - `warmup_steps`: Learning rate warmup steps
    - `lr_scheduler_type`: Learning rate scheduler type
  - Generation Parameters (`generating_args`)
    - `max_new_tokens`: Maximum number of new tokens to generate
    - `top_p`: Nucleus sampling parameter
    - `temperature`: Temperature parameter
    - `num_return_sequences`: Number of sequences to return
  - Distributed Strategy (`strategy_args`)
    - `strategy_name`: Distributed strategy to use (such as `megatron_train`, `vllm`, `hf_infer`)
    - Strategy-specific parameters, such as `tp_size` (tensor parallel size) and `pp_size` (pipeline parallel size)
    - `gpu_memory_utilization`: GPU memory utilization (specific to vLLM)
  - Device Mapping (`device_mapping`)
    - Specifies which GPU devices the worker process should use
- Environment Manager Configuration
  - `train_env_manager`: Training environment manager configuration
  - `val_env_manager`: Validation environment manager configuration
  - Environment-related parameters:
    - `num_env_groups`: Number of environment groups
    - `group_size`: Number of environments per group
    - `tags`: List of environment tags
    - `num_groups_partition`: Group allocation for each environment type
    - `max_env_num_per_worker`: Maximum number of environments per worker
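Putting the pieces together, a trimmed sketch of such a YAML file might look like the following. Only the keys listed above come from the configuration schema; all values (model name, step counts, GPU ranges, etc.) are illustrative placeholders, not recommendations:

```yaml
# Illustrative sketch only -- all values are placeholders; see the example
# YAMLs under examples/ for working settings.
exp_name: agentic_frozen_lake_demo
seed: 42
logging_dir: ./output/logs
output_dir: ./output/checkpoints
render_save_dir: ./output/render

max_steps: 100
save_steps: 50
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

pretrain: Qwen/Qwen2.5-0.5B-Instruct

adv_estimator: grpo
ppo_epochs: 1
gamma: 1.0
lambd: 1.0
pg_clip: 0.2
init_kl_coef: 0.0
whiten_advantages: true

actor_train:
  model_args:
    model_type: causal_lm
    dtype: bf16
    attn_implementation: fa2
  training_args:
    learning_rate: 1.0e-6
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    warmup_steps: 10
    lr_scheduler_type: cosine
  strategy_args:
    strategy_name: megatron_train
    tp_size: 1
    pp_size: 1
  device_mapping: list(range(0, 4))

train_env_manager:
  num_env_groups: 8
  group_size: 8
  tags: [FrozenLake]
  num_groups_partition: [8]
  max_env_num_per_worker: 8
```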
✨️ Environment Preparation
Environment Types
Agentic Pipeline supports various environment types, including but not limited to:
- FrozenLake: Classic reinforcement learning environment where the agent needs to find a path to the goal on ice.
- Sokoban: Box-pushing game environment where the agent needs to push boxes to designated positions.
- WebShop: Simulated online shopping environment where the agent needs to find suitable products based on user requirements.
- Additional environments can be integrated through the gym-like interface.
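When a run mixes several environment types, the environment manager's `tags` and `num_groups_partition` fields described above control how environment groups are allocated across types. A minimal sketch, with illustrative values only:

```yaml
train_env_manager:
  num_env_groups: 16
  group_size: 8
  tags: [FrozenLake, Sokoban]
  num_groups_partition: [8, 8]   # 8 groups per environment type
```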
Environment Configuration
In the configuration file, custom environments are defined through the `custom_envs` field. Each environment configuration includes:
- `env_type`: Environment type
- `env_config`: Environment-specific configuration parameters
- `max_tokens_per_step`: Maximum number of tokens per step
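For instance, a FrozenLake entry under `custom_envs` might look like the sketch below. The three keys come from the description above; the nested `env_config` values are assumptions for illustration only, so check the example YAMLs for the exact schema:

```yaml
custom_envs:
  FrozenLake:
    env_type: frozen_lake
    max_tokens_per_step: 128
    env_config:
      # Assumed, illustration-only parameters for a FrozenLake-style grid world.
      map_size: 4
      is_slippery: false
```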
✨️ Running the Pipeline
Method 1: Using Python Startup Script
The main method is to use the `examples/start_agentic_pipeline.py` script, which uses Hydra to load and manage configurations.
- Select or Create a Configuration File

  Start with an example YAML (such as `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`) or create your own configuration.

- Execute the Python Startup Script

  ```bash
  # Make sure you are in the ROLL project root directory
  # export PYTHONPATH=$(pwd):$PYTHONPATH
  python examples/start_agentic_pipeline.py \
         --config_path examples/qwen2.5-0.5B-agentic \
         --config_name agent_val_frozen_lake
  ```

  - `--config_path` – Directory containing the YAML configuration.
  - `--config_name` – Configuration file name (without `.yaml`).
Method 2: Using Helper Shell Script
The examples directory typically contains shell scripts that wrap the Python launcher.
Example structure:
```bash
#!/bin/bash
# Example: examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh
CONFIG_PATH=$(basename $(dirname $0))
python examples/start_agentic_pipeline.py \
       --config_path $CONFIG_PATH \
       --config_name agent_val_frozen_lake
```
Run it with:

```bash
bash examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh
```
✨️ Step-by-Step Example
Step 1: Configuration Setup
- File: `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`

  Key sections include `exp_name`, `seed`, `output_dir`, model paths, and worker process configurations.

- Pay special attention to these configuration sections (see the sketch after this list):
  - Model configuration: the `pretrain` path
  - Algorithm parameters: `adv_estimator`, `ppo_epochs`, etc.
  - Distributed strategy: `strategy_args` and `device_mapping` for each worker process
  - Environment configuration: `train_env_manager` and `val_env_manager`
Step 2: Environment and Dependency Preparation
- Ensure all necessary dependencies are installed (starting from the provided container image is recommended):

  ```bash
  pip install -r requirements.txt
  ```

- Confirm that all model paths in the configuration are accessible.

- Prepare the training environment and ensure the selected environment types are supported.
Step 3: Starting the Pipeline
```bash
python examples/start_agentic_pipeline.py \
       --config_path examples/qwen2.5-0.5B-agentic \
       --config_name agent_val_frozen_lake
```
Step 4: Monitoring
- Console Output – Observe the Hydra, Ray, and pipeline logs.

- Log Files – Check the `logging_dir` specified in the YAML.

- TensorBoard – Launch it against your log directory:

  ```bash
  tensorboard --logdir <your_log_dir>
  ```
Step 5: Output and Results
- Trained Model – Checkpoints are saved at the location specified by `checkpoint_config`; refer to the checkpoint_and_resume documentation for details.
- Evaluation Metrics – Recorded in TensorBoard and printed to the terminal.
- Rendered Frames – If `render_save_dir` is configured, rendered environment frames are saved in that directory, making it easy to visualize the interaction process.
Happy experimenting!