RAFT++ (Reward rAnked Fine-Tuning)
Introduction
RAFT++ (Reward rAnked Fine-Tuning) is a ranking-based reinforcement learning algorithm that optimizes policies by comparing rewards of different responses. RAFT++ works as follows:
- Group Sampling: For a given problem, the model generates multiple possible solutions, forming a "group" of outputs.
- Reward Ranking: Each solution is evaluated and assigned a reward based on its correctness or quality, then ranked according to the rewards.
- Policy Update: The model updates its parameters by comparing rewards of different solutions within the group, reinforcing strategies that obtain higher rewards.
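To make the loop concrete, below is a minimal, self-contained PyTorch sketch of group sampling, reward ranking, and a REINFORCE-style update on a toy problem. The toy policy, the toy reward function, and the group-mean baseline are illustrative assumptions for exposition only, not ROLL's implementation; in ROLL the same loop is driven by the configuration described in the next section.

# Minimal toy sketch of a group-sample / rank / reinforce loop (illustrative
# only, not ROLL's implementation). The "policy" is an independent categorical
# distribution per position and the "reward" counts occurrences of a target
# token, so the script runs on CPU in a few seconds.
import torch

vocab_size, group_size, seq_len, target_token = 8, 8, 4, 3

logits = torch.zeros(seq_len, vocab_size, requires_grad=True)  # toy policy
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_fn(responses: torch.Tensor) -> torch.Tensor:
    # One scalar reward per sampled response.
    return (responses == target_token).float().sum(dim=-1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)

    # 1) Group sampling: several responses for the same (implicit) prompt.
    responses = dist.sample((group_size,))              # (group, seq_len)

    # 2) Reward ranking: score the group; the group mean acts as a baseline,
    #    so higher-ranked responses get positive advantages.
    rewards = reward_fn(responses)                       # (group,)
    advantages = rewards - rewards.mean()

    # 3) Policy update: REINFORCE-style loss that up-weights responses with
    #    above-average reward and down-weights the rest.
    log_probs = dist.log_prob(responses).sum(dim=-1)     # (group,)
    loss = -(advantages.detach() * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_dist = torch.distributions.Categorical(logits=logits.detach())
print("mean reward after training:", reward_fn(final_dist.sample((256,))).mean().item())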
RAFT++ Configuration Parameters
In ROLL, the RAFT++ algorithm-specific configuration parameters are as follows (see roll.pipeline.rlvr.rlvr_config.RLVRConfig):
# RAFT++ core config
adv_estimator: "reinforce"
# reward normalization
reward_norm: null
reward_shift: false
reward_scale: false
# advantage whitening
whiten_advantages: false
# PPO-related settings; the remaining parts are compatible with GRPO/PPO configs
rollout_batch_size: 64 # number of prompts per rollout batch
num_return_sequences_in_group: 8
prompt_length: 2048
response_length: 4096
ppo_epochs: 1
use_kl_loss: true
kl_loss_coef: 0.001
loss_agg_mode: "seq-mean-token-sum"
# advantage clipping
advantage_clip: 2.0
dual_clip_loss: true
# reward clipping
reward_clip: 10
# token-level KL penalty
add_token_level_kl: false
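Several of the fields above clamp values to stabilize training: reward_clip clamps raw rewards, advantage_clip clamps advantages, and dual_clip_loss additionally bounds the policy loss for negative advantages. Below is a hedged sketch of a dual-clip policy loss; the clip ratio and the dual-clip constant are assumed values, and the function is illustrative rather than ROLL's actual code.

import torch

def dual_clip_policy_loss(log_probs: torch.Tensor,
                          old_log_probs: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2,      # assumed PPO clip ratio
                          dual_clip_c: float = 3.0,   # assumed dual-clip constant (> 1)
                          advantage_clip: float = 2.0) -> torch.Tensor:
    # advantage_clip: clamp advantages to [-2, 2], as in the config above.
    advantages = advantages.clamp(-advantage_clip, advantage_clip)
    ratio = torch.exp(log_probs - old_log_probs)      # importance ratio
    # Standard PPO clipped surrogate (to be maximized).
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages)
    # Dual clip: for negative advantages, also bound how negative the
    # surrogate can get, limiting updates on strongly off-policy samples.
    dual = torch.max(surrogate, dual_clip_c * advantages)
    objective = torch.where(advantages < 0, dual, surrogate)
    return -objective.mean()                          # loss = negative objective

# Example call with dummy per-token tensors of shape (num_tokens,):
loss = dual_clip_policy_loss(torch.randn(10), torch.randn(10), torch.randn(10))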
Core Parameter Descriptions
- adv_estimator: Advantage estimator type; set to "reinforce", which is the core setting of the RAFT++ algorithm.
- reward_norm: Reward normalization type; optional values are "batch", "group", "running", and null; default value is null.
- reward_shift: Whether to only subtract the mean during reward normalization; default value is false.
- reward_scale: Whether to only divide by the standard deviation during reward normalization; default value is false.
- whiten_advantages: Whether to whiten the advantage values; default value is false.
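These flags combine as follows: reward_norm selects the scope of the statistics (whole batch, per-prompt group, or running estimates), while reward_shift and reward_scale restrict the transform to mean subtraction only or standard-deviation division only; whiten_advantages applies the analogous (x - mean) / std transform to the advantages. The helper below is a hedged sketch of these assumed semantics, not ROLL's actual code (the "running" mode, which would track running statistics, is omitted).

from typing import Optional
import torch

def normalize_rewards(rewards: torch.Tensor,
                      reward_norm: Optional[str] = None,
                      reward_shift: bool = False,
                      reward_scale: bool = False,
                      group_size: int = 8,
                      eps: float = 1e-6) -> torch.Tensor:
    # rewards: (batch,) scalar rewards, with batch = n_prompts * group_size.
    if reward_norm is None:
        return rewards
    if reward_norm == "group":
        # Statistics over the responses that share the same prompt.
        values = rewards.view(-1, group_size)
        mean = values.mean(dim=1, keepdim=True)
        std = values.std(dim=1, keepdim=True)
    elif reward_norm == "batch":
        values, mean, std = rewards, rewards.mean(), rewards.std()
    else:
        raise ValueError(f"unsupported reward_norm: {reward_norm}")
    if reward_shift:        # only subtract the mean
        values = values - mean
    elif reward_scale:      # only divide by the standard deviation
        values = values / (std + eps)
    else:                   # full normalization
        values = (values - mean) / (std + eps)
    return values.reshape_as(rewards)

# Example: rule-based 0/1 rewards for one prompt with a group of 8 responses.
rewards = torch.tensor([1., 0., 1., 1., 0., 0., 0., 1.])
print(normalize_rewards(rewards, reward_norm="group", group_size=8))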
PPO Related Parameters
The following parameters are common configuration items for PPO-class algorithms:
- rollout_batch_size: Number of prompts per rollout batch; default value is 64.
- num_return_sequences_in_group: Number of responses generated per prompt (the group size); the total number of samples trained per pipeline step is rollout_batch_size * num_return_sequences_in_group; default value is 8.
- prompt_length: Maximum prompt length; default value is 2048.
- response_length: Maximum response length; default value is 4096.
- ppo_epochs: Number of optimization epochs per batch of samples; default value is 1.
- use_kl_loss: Whether to use a KL-divergence loss; default value is true.
- kl_loss_coef: KL-loss coefficient; default value is 0.001.
- loss_agg_mode: Loss aggregation mode; optional values are "token-mean", "seq-mean-token-sum", "seq-mean-token-mean", and "seq-mean-token-sum-norm"; default value is "seq-mean-token-sum".
- advantage_clip: Advantage clipping range; default value is 2.0.
- dual_clip_loss: Whether to use the dual-clip loss; default value is true.
- reward_clip: Reward clipping range; default value is 10.
- add_token_level_kl: Whether to add a token-level KL penalty; default value is false.
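Two notes on the values above. First, with the defaults shown, each pipeline step trains on rollout_batch_size * num_return_sequences_in_group = 64 * 8 = 512 samples. Second, the loss_agg_mode options differ only in how per-token losses are reduced to a scalar; the sketch below reflects one common reading of the four mode names and is an assumption, so consult the ROLL source for the authoritative definitions.

import torch

def aggregate_loss(token_loss: torch.Tensor, mask: torch.Tensor,
                   mode: str = "seq-mean-token-sum") -> torch.Tensor:
    # token_loss, mask: (batch, response_len); mask marks valid (non-pad) tokens.
    if mode == "token-mean":
        # Mean over every valid token in the whole batch.
        return (token_loss * mask).sum() / mask.sum()
    if mode == "seq-mean-token-sum":
        # Sum the tokens of each sequence, then average over sequences.
        return (token_loss * mask).sum(dim=1).mean()
    if mode == "seq-mean-token-mean":
        # Average the tokens of each sequence, then average over sequences.
        return ((token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()
    if mode == "seq-mean-token-sum-norm":
        # Sum the tokens of each sequence, divide by the maximum response
        # length, then average over sequences (assumed interpretation).
        return ((token_loss * mask).sum(dim=1) / mask.size(1)).mean()
    raise ValueError(f"unknown loss_agg_mode: {mode}")

# Example: 4 responses with different valid lengths.
token_loss = torch.rand(4, 6)
mask = (torch.arange(6) < torch.tensor([[6], [4], [3], [5]])).float()
for mode in ("token-mean", "seq-mean-token-sum",
             "seq-mean-token-mean", "seq-mean-token-sum-norm"):
    print(mode, aggregate_loss(token_loss, mask, mode).item())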
Reference Example
You can refer to the following configuration file to set up RAFT++ training:
./examples/docs_examples/example_raft_pp.yaml
References
[1] Xiong, W.; Yao, J.; Xu, Y.; Pang, B.; Wang, L.; Sahoo, D.; Li, J.; Jiang, N.; Zhang, T.; Xiong, C.; Dong, H. A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforce. arXiv April 15, 2025. https://doi.org/10.48550/arXiv.2504.11343.