Agentic Pipeline
✨️ Overview
Agentic Pipeline is ROLL's core pipeline for agent training, supporting algorithms such as PPO and GRPO. It provides the following core advantages:
- Gym-like Environment Definition: Supports various environment types, including FrozenLake, Sokoban, etc., and custom environments can easily be added through gym-like interfaces.
- Rich Learning Granularity: Supports both trajectory-wise (StarPO) and step-wise (GiGPO) training.
- Asynchronous Parallel Rollout at Environment Granularity: Trajectories are sampled independently across environments, improving sampling efficiency.
- Asynchronous Training: Rollout and training are decoupled, enabling asynchronous training.
- Multi-turn Interaction Support with Local Debugging: Multi-turn interaction rollouts can be debugged locally, improving development efficiency for multi-turn interaction applications.
- Flexible Policy Configuration: Supports multiple distributed strategies such as Megatron, DeepSpeed, and vLLM, allowing flexible configuration based on available hardware resources.
✨️ Core Components
Main Module (`AgenticPipeline`)
`AgenticPipeline` (located at `roll/pipeline/agentic/agentic_pipeline.py`) is the main process driving agent training. It manages the complete training workflow, including:
- Initializing and managing distributed worker processes (Actor, Critic, Reference, etc.).
- Coordinating environment interaction and data collection.
- Executing model training steps.
- Handling checkpoint saving.
- Recording metrics and experiment tracking.
Source Code: `roll/pipeline/agentic/agentic_pipeline.py`
Configuration File (`AgenticConfig`)
`AgenticConfig` (defined in `roll/pipeline/agentic/agentic_config.py`) is a Pydantic/dataclass-based configuration object that specifies all parameters for running `AgenticPipeline`. The configuration system supports YAML files and is managed with the Hydra framework.
For a description of the configuration system, see config_system.
Configuration Structure and Organization
Configuration files (such as `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`) are organized by functional module and mainly include the following sections:
Basic Experiment Settings
- `exp_name`: Experiment name, used to identify a specific training task
- `seed`: Random seed to ensure reproducible experiments
- `logging_dir`: Path to save log files
- `output_dir`: Path to save model checkpoints and output files
- `render_save_dir`: Path to save rendered frames (for environment visualization)
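A minimal sketch of these fields with placeholder values (the names and paths below are illustrative, not taken from the ROLL examples):
```yaml
exp_name: frozen_lake_ppo_demo    # identifies this training run
seed: 42                          # fixed seed for reproducibility
logging_dir: ./output/logs        # where log files are written
output_dir: ./output/checkpoints  # where checkpoints are saved
render_save_dir: ./output/render  # where rendered frames are stored
```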
Training Control Parameters
- `max_steps`: Maximum number of training steps
- `save_steps`: Frequency of saving model checkpoints
- `logging_steps`: Frequency of recording training metrics
- `eval_steps`: Frequency of running validation evaluation
- `resume_from_checkpoint`: Whether to resume training from a checkpoint. To continue training, set it to the checkpoint path; otherwise, set it to `False`.
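For example (the step counts are illustrative, not recommended defaults):
```yaml
max_steps: 1024
save_steps: 100               # checkpoint every 100 steps
logging_steps: 1              # record metrics every step
eval_steps: 50                # run validation every 50 steps
resume_from_checkpoint: false # or a checkpoint path to resume from
```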
Model Configuration
- `pretrain`: Path to the pretrained model weights
- `reward_pretrain`: Path to the pretrained reward model weights

Both fields appear in the combined sketch after the algorithm parameters below.
Algorithm Parameters
- `adv_estimator`: Advantage estimator type (such as `gae`, `grpo`, `reinforce`)
- `ppo_epochs`: Number of optimization epochs per batch of samples
- `gamma`: Discount factor for calculating returns
- `lambd`: Lambda parameter in GAE
- `pg_clip`: Clipping range for the PPO policy-gradient loss
- `init_kl_coef`: Initial coefficient for the KL penalty
- `target_kl`: Target KL value for adaptive KL control
- `whiten_advantages`: Whether to whiten advantages
- `entropy_loss_coef`: Coefficient for the entropy loss
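A sketch combining the model paths and algorithm parameters above. The model paths are placeholders, and the numeric values are common PPO defaults rather than tuned recommendations:
```yaml
pretrain: Qwen/Qwen2.5-0.5B-Instruct        # placeholder model path
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct # placeholder reward model path

adv_estimator: gae      # or grpo / reinforce
ppo_epochs: 1           # optimization epochs per batch
gamma: 0.99             # return discount factor
lambd: 0.95             # GAE lambda
pg_clip: 0.2            # PPO clipping range
init_kl_coef: 0.2       # initial KL penalty coefficient
target_kl: 0.1          # target KL for adaptive control
whiten_advantages: true
entropy_loss_coef: 0.0
```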
Worker Process Configuration
Each worker process (`actor_train`, `actor_infer`, `critic`, `reference`) configuration includes:
- Model Parameters (`model_args`)
  - `model_type`: Model type (such as `causal_lm`)
  - `dtype`: Computation precision (such as `bf16`, `fp16`)
  - `attn_implementation`: Attention implementation (such as `fa2`)
  - `disable_gradient_checkpointing`: Whether to disable gradient checkpointing
- Training Parameters (`training_args`)
  - `learning_rate`: Learning rate
  - `per_device_train_batch_size`: Training batch size per device
  - `gradient_accumulation_steps`: Gradient accumulation steps
  - `weight_decay`: Weight decay coefficient
  - `warmup_steps`: Learning rate warmup steps
  - `lr_scheduler_type`: Learning rate scheduler type
- Generation Parameters (`generating_args`)
  - `max_new_tokens`: Maximum number of new tokens to generate
  - `top_p`: Nucleus sampling parameter
  - `temperature`: Temperature parameter
  - `num_return_sequences`: Number of sequences to return
- Distributed Strategy (`strategy_args`)
  - `strategy_name`: Distributed strategy to use (such as `megatron_train`, `vllm`, `hf_infer`)
  - Strategy-specific parameters, such as `tp_size` (tensor parallel size) and `pp_size` (pipeline parallel size)
  - `gpu_memory_utilization`: GPU memory utilization (specific to vLLM)
- Device Mapping (`device_mapping`)
  - Specifies which GPU devices the worker process should use
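A hypothetical `actor_train` block assembled from the fields above. The nesting follows the description in this section; the exact schema is defined in `AgenticConfig`, and all values are illustrative:
```yaml
actor_train:
  model_args:
    model_type: causal_lm
    dtype: bf16
    attn_implementation: fa2
    disable_gradient_checkpointing: false
  training_args:
    learning_rate: 1.0e-6
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    weight_decay: 0.0
    warmup_steps: 10
    lr_scheduler_type: cosine
  generating_args:
    max_new_tokens: 128
    top_p: 0.9
    temperature: 1.0
    num_return_sequences: 1
  strategy_args:
    strategy_name: megatron_train
    tp_size: 1   # tensor parallel size
    pp_size: 1   # pipeline parallel size
  device_mapping: [0, 1, 2, 3]  # GPUs assigned to this worker (format illustrative)
```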
Environment Manager Configuration
- `train_env_manager`: Training environment manager configuration
- `val_env_manager`: Validation environment manager configuration
- Environment-related parameters:
  - `num_env_groups`: Number of environment groups
  - `group_size`: Number of environments per group
  - `tags`: List of environment tags
  - `num_groups_partition`: Group allocation for each environment type
  - `max_env_num_per_worker`: Maximum number of environments per worker
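An illustrative `train_env_manager` block using the fields above (the counts and tags are placeholders):
```yaml
train_env_manager:
  max_env_num_per_worker: 16
  num_env_groups: 128
  group_size: 8               # environments per group
  tags: [FrozenLake]          # environment types to sample from
  num_groups_partition: [128] # group allocation per environment type
```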
✨️ Environment Preparation
Environment Types
Agentic Pipeline supports various environment types, including but not limited to:
- FrozenLake: Classic reinforcement learning environment where the agent needs to find a path across the ice to the goal.
- Sokoban: Box-pushing game environment where the agent needs to push boxes to designated positions.
- WebShop: Simulated online shopping environment where the agent needs to find suitable products based on user requirements.
- More environment support...
Environment Configuration
In the configuration file, custom environments are defined through the `custom_envs` field. Each environment configuration includes:
- `env_type`: Environment type
- `env_config`: Environment-specific configuration parameters
- `max_tokens_per_step`: Maximum number of tokens per step
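A hypothetical `custom_envs` entry (the entry name and the keys inside `env_config` are placeholders; the available keys depend on the environment type):
```yaml
custom_envs:
  SimpleFrozenLake:
    env_type: frozen_lake
    max_tokens_per_step: 128
    env_config:
      is_slippery: false  # example environment-specific parameter
```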
✨️ Running the Pipeline
Method 1: Using the Python Startup Script
The main method is to use the `examples/start_agentic_pipeline.py` script, which uses Hydra to load and manage configurations.
1. Select or create a configuration file. Start with an example YAML (such as `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`) or create your own configuration.
2. Execute the Python startup script:
```bash
# Make sure you are in the ROLL project root directory
# export PYTHONPATH=$(pwd):$PYTHONPATH
python examples/start_agentic_pipeline.py \
  --config_path examples/qwen2.5-0.5B-agentic \
  --config_name agent_val_frozen_lake
```
- `--config_path`: directory containing the YAML configuration.
- `--config_name`: configuration file name (without `.yaml`).
Method 2: Using a Helper Shell Script
The `examples` directory typically contains shell scripts that wrap the Python launcher.
Example structure:
```bash
#!/bin/bash
# Example: examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh
CONFIG_PATH=$(basename $(dirname $0))
python examples/start_agentic_pipeline.py \
  --config_path $CONFIG_PATH \
  --config_name agent_val_frozen_lake
```
To run it:
```bash
bash examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh
```
✨️ Step-by-Step Example
Step 1: Configuration Setup
File: `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml`
Key sections include `exp_name`, `seed`, `output_dir`, model paths, and worker process configurations. Pay special attention to:
- Model configuration: the `pretrain` path
- Algorithm parameters: `adv_estimator`, `ppo_epochs`, etc.
- Distributed strategy: `strategy_args` and `device_mapping` for each worker process
- Environment configuration: `train_env_manager` and `val_env_manager`
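A condensed skeleton of such a configuration, combining the sections discussed earlier (all values illustrative):
```yaml
exp_name: agent_val_frozen_lake
seed: 42
output_dir: ./output

pretrain: Qwen/Qwen2.5-0.5B-Instruct  # placeholder model path
adv_estimator: gae
ppo_epochs: 1

actor_train:
  strategy_args:
    strategy_name: megatron_train
  device_mapping: [0, 1, 2, 3]        # format illustrative
actor_infer:
  strategy_args:
    strategy_name: vllm
    gpu_memory_utilization: 0.8
  device_mapping: [4, 5, 6, 7]

train_env_manager:
  num_env_groups: 128
  group_size: 8
val_env_manager:
  num_env_groups: 64
  group_size: 1
```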
Step 2: Environment and Dependency Preparation
Ensure all necessary dependencies are installed; starting from the provided container image is recommended:
```bash
pip install -r requirements.txt
```
Confirm that all model paths in the configuration are accessible, and prepare the training environment with support for the selected environment types.
Step 3: Starting the Pipeline
```bash
python examples/start_agentic_pipeline.py \
  --config_path examples/qwen2.5-0.5B-agentic \
  --config_name agent_val_frozen_lake
```
Step 4: Monitoring
- Console output: observe Hydra, Ray, and pipeline logs.
- Log files: check the `logging_dir` specified in the YAML.
- TensorBoard:
```bash
tensorboard --logdir <your_log_dir>
```
Step 5: Output and Results
- Trained Model: checkpoints are saved according to `checkpoint_config`; refer to the checkpoint_and_resume documentation for details.
- Evaluation Metrics: recorded in TensorBoard and the terminal.
- Rendered Frames: if `render_save_dir` is configured, rendered environment frames are saved to that directory, making it easy to visualize the interaction process.
Happy experimenting!