SEQUENCE PACKING IN ROLL
The ROLL framework now supports Sequence Packing, a feature that eliminates padding tokens by packing variable-length sequences together, thereby improving computational efficiency. This document provides a detailed explanation of the implementation rationale and configuration methods for this feature.
Note: Currently, only the Megatron strategies (`megatron_train` and `megatron_infer`) support sequence packing.
1. Introduction
In reinforcement learning (RL) training scenarios, rollout data typically exhibits a long-tailed distribution. In conventional training pipelines, samples within a micro-batch are padded to a fixed maximum sequence length before being grouped into a batch for training. This approach wastes significant computational resources on processing padding tokens and slows down training.
To address this issue, ROLL introduces Sequence Packing, which:
- Packs sequences of varying lengths within each micro-batch to eliminate padding tokens.
- Employs optimized packing algorithms to improve packing efficiency, reduce the number of micro-batches, and accelerate training.
2. Implementation Principles
2.1 Data Partitioning Hierarchy
In distributed training, data is organized in the following hierarchical structure:
```
GLOBAL BATCH
├── DP RANK 0 → BATCH 0
│   └── MINI BATCH 0 (used for one gradient update)
│       ├── MICRO BATCH 0 (smallest computation unit)
│       ├── MICRO BATCH 1
│       └── ...
├── DP RANK 1 → BATCH 1
│   └── MINI BATCH 0
│       ├── MICRO BATCH 0
│       └── ...
└── ...
```
- GLOBAL BATCH: The complete rollout results generated by `actor_infer`.
- BATCH: A subset of the Global Batch assigned to a specific Data Parallel (DP) rank.
- MINI BATCH: A portion of a Batch used for a single gradient update (considering gradient accumulation).
- MICRO BATCH: The smallest computational unit derived from a Mini Batch, used in a single forward/backward pass.
In standard training, all samples within a micro-batch are padded to a fixed length, leading to substantial computational waste. Sequence Packing solves this by packing sequences at the micro-batch level.
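For intuition, the same hierarchy can be expressed as plain list slicing. This is a hypothetical sketch (not ROLL's actual dataloader code), assuming an evenly divisible global batch:

```python
def split_hierarchy(global_batch, dp_size, num_mini_batches, micro_batch_size):
    """Split a global batch into per-DP-rank batches, mini-batches, and micro-batches."""
    # GLOBAL BATCH -> one BATCH per DP rank
    per_rank = len(global_batch) // dp_size
    batches = [global_batch[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]

    hierarchy = []
    for batch in batches:
        # BATCH -> MINI BATCHes (each mini-batch drives one gradient update)
        mini_size = len(batch) // num_mini_batches
        mini_batches = [batch[i * mini_size:(i + 1) * mini_size] for i in range(num_mini_batches)]
        # MINI BATCH -> MICRO BATCHes (each micro-batch is one forward/backward pass)
        hierarchy.append([
            [mini[j:j + micro_batch_size] for j in range(0, len(mini), micro_batch_size)]
            for mini in mini_batches
        ])
    return hierarchy  # hierarchy[dp_rank][mini_idx][micro_idx] -> list of samples
```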
2.2 Core Mechanism of Sequence Packing
The primary goal of Sequence Packing is to eliminate padding tokens while ensuring correct and efficient execution under complex distributed training configurations—particularly when Context Parallelism (CP) and Tensor Parallelism (TP) are enabled. To achieve this, the packing process must satisfy specific alignment constraints critical for both correctness and performance.
2.2.1 Alignment Requirement: Multiple of 2 × CP_SIZE × TP_SIZE
When Context Parallelism (CP) and Tensor Parallelism (TP) are enabled, the packed sequence length must be a multiple of 2 × CP_SIZE × TP_SIZE.
This requirement stems from the needs of both parallelism strategies:
- TENSOR PARALLELISM (TP): When Sequence Parallelism is enabled, sequences are split across TP ranks during the forward pass. Thus, the sequence length must be divisible by `TP_SIZE`.
- CONTEXT PARALLELISM (CP): To achieve load balancing in CP, sequences must be logically divided into `2 × CP_SIZE` chunks. Hence, the sequence length must also be divisible by `2 × CP_SIZE`.
Combining these two requirements, the sequence length must be a multiple of 2 × CP_SIZE × TP_SIZE to ensure compatibility with both TP and CP.
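The alignment itself is a simple rounding rule. A minimal sketch (`pad_to_alignment` is a hypothetical helper, not a ROLL API):

```python
def pad_to_alignment(seq_len: int, cp_size: int, tp_size: int) -> int:
    """Round a sequence length up to the nearest multiple of 2 * CP_SIZE * TP_SIZE."""
    align = 2 * cp_size * tp_size
    return ((seq_len + align - 1) // align) * align

# With CP_SIZE=2 and TP_SIZE=1 the alignment factor is 4
assert pad_to_alignment(6, cp_size=2, tp_size=1) == 8  # length 6 is padded up to 8
assert pad_to_alignment(4, cp_size=2, tp_size=1) == 4  # already aligned
```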
2.2.2 Why the Factor of 2? Detailed Explanation of CP Load Balancing
In Context Parallel (CP) training, the asymmetric nature of causal attention leads to severe load imbalance.
Root Cause – Asymmetry in Causal Attention
Consider a sequence of length 6: [0, 1, 2, 3, 4, 5], with CP=2:
Full causal attention mask:
```
    0 1 2 3 4 5
0 [ 1 0 0 0 0 0 ]
1 [ 1 1 0 0 0 0 ]
2 [ 1 1 1 0 0 0 ]
3 [ 1 1 1 1 0 0 ]
4 [ 1 1 1 1 1 0 ]
5 [ 1 1 1 1 1 1 ]
```
Problem with Naive Partitioning:
If the sequence is simply split evenly:
- CP0 handles: `[0, 1, 2]`
- CP1 handles: `[3, 4, 5]`
The actual computational loads become:
- CP0: Only computes attention weights for its own positions (6 weight computations).
- CP1: Must compute attention weights from its positions to all preceding positions (15 weight computations).
Load ratio: 6:15 = 2:5, i.e., CP1 bears 2.5× the computation of CP0!
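This 6 : 15 split is easy to verify with a quick calculation (illustration only, not ROLL code):

```python
# Under causal attention, query position q attends to q + 1 key positions
# (itself plus all earlier positions).
load = lambda positions: sum(q + 1 for q in positions)

print(load([0, 1, 2]))  # CP0: 1 + 2 + 3 = 6
print(load([3, 4, 5]))  # CP1: 4 + 5 + 6 = 15, i.e. 2.5x CP0's load
```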
Solution – 2×CP Interleaved Chunking
Megatron-Core resolves this by splitting the sequence into 2 × CP chunks and applying an interleaved assignment strategy:
Original sequence: [0, 1, 2, 3, 4, 5]
Split into 4 chunks: |[0,1]|[2,3]|[4,5]|[p,p]| (padded to multiple of 4)
Interleaved assignment:
- Chunk 0 [0,1] → CP0
- Chunk 1 [2,3] → CP1
- Chunk 2 [4,5] → CP1
- Chunk 3 [p,p] → CP0
Final assignment:
- CP0: [0,1] + [p,p]
- CP1: [2,3] + [4,5]
This carefully designed assignment balances the computational load between CP ranks, avoiding performance bottlenecks.
Thus, the factor of 2 is essential for CP load balancing, ensuring roughly equal workloads across CP ranks under causal attention.
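A minimal sketch of this interleaved assignment, assuming a hypothetical helper name (`cp_chunk_assignment`); Megatron-Core implements the same idea internally:

```python
def cp_chunk_assignment(seq, cp_size):
    """Split `seq` into 2 * cp_size chunks; CP rank r gets chunks r and (2*cp_size - 1 - r)."""
    num_chunks = 2 * cp_size
    assert len(seq) % num_chunks == 0, "sequence must be padded to a multiple of 2 * CP_SIZE"
    chunk_len = len(seq) // num_chunks
    chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]
    return {rank: chunks[rank] + chunks[num_chunks - 1 - rank] for rank in range(cp_size)}

# The example above ("p" marks a padding token), CP=2
print(cp_chunk_assignment([0, 1, 2, 3, 4, 5, "p", "p"], cp_size=2))
# {0: [0, 1, 'p', 'p'], 1: [2, 3, 4, 5]}
```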
2.2.3 Complete Packing Example
Assume a micro-batch contains the following samples (original max sequence length = 8):
| Sample ID | Original Sequence | Valid Length |
|---|---|---|
| 0 | [0, 0, p, p, p, p, p, p] | 2 |
| 1 | [1, 1, 1, 1, p, p, p, p] | 4 |
| 2 | [2, 2, 2, 2, 2, 2, p, p] | 6 |
| 3 | [3, p, p, p, p, p, p, p] | 1 |
Configuration: CP_SIZE=2, TP_SIZE=1
Step 1: Remove original padding
```
Sample 0: [0, 0]
Sample 1: [1, 1, 1, 1]
Sample 2: [2, 2, 2, 2, 2, 2]
Sample 3: [3]
```
Step 2: Re-pad to alignment boundary
- Alignment factor = 2 × CP_SIZE × TP_SIZE = 2 × 2 × 1 = 4
Re-padded sequences:
```
Sample 0: [0, 0, p, p]             → length 4
Sample 1: [1, 1, 1, 1]             → length 4
Sample 2: [2, 2, 2, 2, 2, 2, p, p] → length 8
Sample 3: [3, p, p, p]             → length 4
```
Step 3: Detailed CP Chunking Process
With CP_SIZE=2, each sequence is logically split into 2 × CP_SIZE = 4 segments and assigned via interleaving:
For any sequence of length L under CP_SIZE=2:
- Split into 4 consecutive segments: seg0, seg1, seg2, seg3
- Each segment has length L/4
- Assignment rule:
- CP0: seg0 + seg3
- CP1: seg1 + seg2
Applied to our example:
- Sample 0 `[0, 0, p, p]` (length 4):
  - seg0: `[0]`, seg1: `[0]`, seg2: `[p]`, seg3: `[p]`
  - CP0 gets seg0 + seg3 = `[0] + [p]` → processes `[0, p]`
  - CP1 gets seg1 + seg2 = `[0] + [p]` → processes `[0, p]`
- Sample 1 `[1, 1, 1, 1]` (length 4):
  - seg0: `[1]`, seg1: `[1]`, seg2: `[1]`, seg3: `[1]`
  - CP0: `[1] + [1]` → `[1, 1]`
  - CP1: `[1] + [1]` → `[1, 1]`
- Sample 2 `[2, 2, 2, 2, 2, 2, p, p]` (length 8):
  - seg0: `[2, 2]`, seg1: `[2, 2]`, seg2: `[2, 2]`, seg3: `[p, p]`
  - CP0: `[2, 2] + [p, p]` → `[2, 2, p, p]`
  - CP1: `[2, 2] + [2, 2]` → `[2, 2, 2, 2]`
- Sample 3 `[3, p, p, p]` (length 4):
  - seg0: `[3]`, seg1: `[p]`, seg2: `[p]`, seg3: `[p]`
  - CP0: `[3] + [p]` → `[3, p]`
  - CP1: `[p] + [p]` → `[p, p]`
Step 4: Final Packed Input per CP Rank
- CP0’s full input: `[0, p, 1, 1, 2, 2, p, p, 3, p]`
- CP1’s full input: `[0, p, 1, 1, 2, 2, 2, 2, p, p]`
Step 5: Cumulative Sequence Lengths
Padded cumulative sequence lengths (`cu_seqlens_padded`), computed from the per-sample padded lengths 4, 4, 8, 4: `[0, 4, 8, 16, 20]`
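The whole worked example can be reproduced with a short, self-contained sketch. The helpers `pad_to` and `pack_micro_batch` are hypothetical names (this is not ROLL's `_pack_sequences`, just the same arithmetic):

```python
P = "p"  # padding token marker

def pad_to(seq, length):
    return seq + [P] * (length - len(seq))

def pack_micro_batch(samples, cp_size, tp_size):
    align = 2 * cp_size * tp_size
    packed = {rank: [] for rank in range(cp_size)}
    cu_seqlens_padded = [0]
    for seq in samples:
        # Re-pad each sample to the alignment boundary
        padded_len = ((len(seq) + align - 1) // align) * align
        seq = pad_to(seq, padded_len)
        cu_seqlens_padded.append(cu_seqlens_padded[-1] + padded_len)
        # 2*CP interleaved chunking: rank r gets chunks r and (2*cp_size - 1 - r)
        chunk_len = padded_len // (2 * cp_size)
        chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(2 * cp_size)]
        for rank in range(cp_size):
            packed[rank] += chunks[rank] + chunks[2 * cp_size - 1 - rank]
    return packed, cu_seqlens_padded

samples = [[0, 0], [1, 1, 1, 1], [2, 2, 2, 2, 2, 2], [3]]  # valid tokens only
packed, cu = pack_micro_batch(samples, cp_size=2, tp_size=1)
print(packed[0])  # [0, 'p', 1, 1, 2, 2, 'p', 'p', 3, 'p']
print(packed[1])  # [0, 'p', 1, 1, 2, 2, 2, 2, 'p', 'p']
print(cu)         # [0, 4, 8, 16, 20]
```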
2.3 Loss Computation Workflow
Under Sequence Packing, loss calculation requires special handling:
- Unpack Model Outputs: Use `_unpack_sequences` to restore individual sequences from the packed output.
  - Compute the start/end positions of each sequence on the current CP rank using `cu_seqlens_padded`:
    - `seq_starts = cu_seqlens_padded[:-1] // cp_size`
    - `seq_ends = cu_seqlens_padded[1:] // cp_size`
- Per-Sequence Loss Calculation:
  - Apply the loss function to each unpacked sequence individually.
  - Adjust the original data to match the actual sequence length using `adjust_sequence_length`.
  - Accumulate losses from all sequences.
- Result Aggregation:
  - Sum all per-sequence losses to obtain the total loss.
  - Aggregate metrics across sequences.
  - Apply loss scaling if enabled.
This per-sequence approach ensures correct loss computation even under complex combinations of CP, TP, and packing.
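A minimal sketch of the unpacking step, assuming PyTorch tensors and the `cu_seqlens_padded` values from the example above (`unpack_sequences` is a hypothetical helper, not the exact `_unpack_sequences` implementation):

```python
import torch

def unpack_sequences(packed_logits, cu_seqlens_padded, cp_size):
    """Slice the packed per-rank output back into one tensor per original sequence."""
    # Each CP rank holds 1/cp_size of every padded sequence, so the per-rank
    # offsets are the padded cumulative lengths divided by cp_size.
    seq_starts = cu_seqlens_padded[:-1] // cp_size
    seq_ends = cu_seqlens_padded[1:] // cp_size
    return [packed_logits[s:e] for s, e in zip(seq_starts.tolist(), seq_ends.tolist())]

# Example: CP0's packed output for the 4 samples above (10 positions on this rank)
packed_logits = torch.randn(10, 32)  # [packed_len_on_this_rank, hidden/vocab dim]
cu_seqlens_padded = torch.tensor([0, 4, 8, 16, 20])
per_seq = unpack_sequences(packed_logits, cu_seqlens_padded, cp_size=2)
print([t.shape[0] for t in per_seq])  # [2, 2, 4, 2] tokens per sequence on this rank
```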
2.4 Load Balancing Optimization
To maximize the effectiveness of Sequence Packing, ROLL applies the Karmarkar-Karp algorithm at multiple levels for load balancing.
Karmarkar-Karp Algorithm Overview: A classical multi-way partitioning algorithm that divides a set of numbers into k subsets with sums as balanced as possible. In Sequence Packing, it ensures computational loads across processing units remain balanced, preventing bottlenecks.
Key optimizations include:
- GLOBAL BATCH → DP RANK Load Balancing: Ensures each DP rank receives a similar total number of tokens.
- MINI BATCH → MICRO BATCH Load Balancing: Balances computational load across micro-batches.
Implementation details and responsibility allocation are described in Section 3.2.
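For illustration, here is a simplified greedy balancing sketch (a longest-processing-time heuristic, not the actual Karmarkar-Karp implementation used in ROLL), using per-sample token counts as the load measure:

```python
import heapq

def balance_partition(token_counts, k):
    """Assign samples (by descending token count) to the currently lightest of k bins."""
    heap = [(0, b) for b in range(k)]  # (current_total_tokens, bin_index)
    heapq.heapify(heap)
    bins = [[] for _ in range(k)]
    for idx in sorted(range(len(token_counts)), key=lambda i: -token_counts[i]):
        total, b = heapq.heappop(heap)
        bins[b].append(idx)
        heapq.heappush(heap, (total + token_counts[idx], b))
    return bins

# e.g. balance 8 rollout samples across 2 DP ranks by token count
print(balance_partition([512, 64, 2048, 128, 1024, 256, 96, 700], k=2))
# -> [[2, 5, 6], [4, 7, 0, 3, 1]]  (totals: 2400 vs 2428 tokens)
```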
3. Implementation Workflow
3.1 Core Packing and Unpacking Logic
Packing logic resides primarily in the strategy layer. When use_sequence_packing is enabled, the strategy automatically packs micro-batches and unpacks logits for loss computation.
The core packing function `_pack_sequences` performs the following steps:
- Removes original padding and extracts valid tokens.
- Computes cumulative sequence lengths (both original and padded).
- Re-pads sequences to a multiple of `2 * cp_size * tp_size`.
- Handles CP chunking and assignment.
- Concatenates sequences and creates `PackedSeqParams`.
Loss computation is handled by loss_wrapper, which unpacks outputs and computes per-sequence losses.
3.2 Load Balancing Responsibility Allocation
Load balancing in ROLL follows a clear division of responsibilities:
- GLOBAL BATCH → DP RANK Load Balancing:
  - Responsible Module: Pipeline layer (`batch_balance` function)
  - Objective: Equalize the total token count per DP rank
  - Method: Apply the Karmarkar-Karp algorithm before data distribution
- MINI BATCH → MICRO BATCH Load Balancing:
  - Responsible Module: Strategy layer (`make_micro_batch_iter_for_sequence_packing`)
  - Objective: Balance the computational load across micro-batches
  - Method: Apply the Karmarkar-Karp algorithm during micro-batch generation
- Preservation of Randomness:
  - The split from Batch → Mini Batch retains randomness (for shuffling) and therefore does not apply load balancing.
This layered optimization ensures balanced workloads from global to local levels, maximizing hardware utilization.
4. Configuration Parameters
4.1 How to Enable Sequence Packing
To use Sequence Packing, simply set use_sequence_packing: true in your configuration file.
4.2 Parameter Details (Plain Language)
algorithm (Packing Algorithm)
- `none`: Default simple packing; sequences are packed in their original order.
- `load_balance`: Intelligent load-balanced packing; reorders data to balance computational load across micro-batches. Recommended.
max_packed_sequence_length_train (Max Packed Length for Training)
- Controls the maximum allowed length of a packed sequence during training.
- E.g., setting to 8192 means no packed sequence will exceed 8192 tokens.
- Choose a reasonable value to avoid out-of-memory errors while maintaining packing efficiency.
max_packed_sequence_length_forward (Max Packed Length for Inference)
- Same as above, but applied during inference.
- Typically set to the same value as the training parameter.
min_num_micro_batches_train (Minimum Micro-Batches for Training)
- Specifies the minimum number of micro-batches per mini-batch during training.
- Setting to 1 means no constraint—the system auto-determines optimal splitting.
- Increase this value if facing GPU memory issues to reduce micro-batch size.
min_num_micro_batches_forward (Minimum Micro-Batches for Inference)
- Same as above, but for inference.
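As a rough illustration of how the length cap and the micro-batch minimum interact, the sketch below estimates a lower bound on the number of micro-batches. This is an assumption about the splitting heuristic for intuition only, not ROLL's exact logic:

```python
import math

def estimate_num_micro_batches(sample_lengths, max_packed_len, min_num_micro_batches):
    """Need at least enough micro-batches that no packed sequence exceeds the cap,
    and never fewer than the configured minimum."""
    total_tokens = sum(sample_lengths)
    by_capacity = math.ceil(total_tokens / max_packed_len)
    return max(by_capacity, min_num_micro_batches)

# 16 samples totalling ~40k tokens with an 8192-token cap -> at least 5 micro-batches
lengths = [3000, 1200, 512, 4096, 2500, 800, 6000, 1500,
           2200, 3100, 900, 2800, 4000, 1300, 3600, 2500]
print(estimate_num_micro_batches(lengths, max_packed_len=8192, min_num_micro_batches=1))
```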
4.3 Full Configuration Example
```yaml
actor_train:
  # Enable sequence packing
  use_sequence_packing: True
  # Sequence packing configuration
  sequence_packing_args:
    # Use load-balancing algorithm for better performance
    algorithm: load_balance
    # Max packed sequence length during training
    max_packed_sequence_length_train: 8192
    # Max packed sequence length during inference
    max_packed_sequence_length_forward: 8192
    # Minimum 1 micro-batch during training (no constraint)
    min_num_micro_batches_train: 1
    # Minimum 1 micro-batch during inference
    min_num_micro_batches_forward: 1
  # Sequence packing requires the Megatron strategy
  strategy_args:
    strategy_name: megatron_train
```
4.4 Usage Recommendations
- Mandatory Condition: Only supported under the `megatron_train` or `megatron_infer` strategies.
- Recommended Setting: Use `algorithm: load_balance` for optimal performance.
- Length Tuning: Set `max_packed_sequence_length` based on your GPU memory capacity, typically equal to the model's maximum supported sequence length.
- Custom Loss Functions: If using a custom loss function with sequence packing, refer to the custom loss documentation and ensure `apply_loss_scale` is correctly configured.
With proper configuration, Sequence Packing significantly boosts training efficiency—especially in RL scenarios with highly variable sequence lengths—while maintaining model performance.