SEQUENCE PACKING IN ROLL
The ROLL framework now supports Sequence Packing, a feature that eliminates padding tokens by packing variable-length sequences together, thereby improving computational efficiency. This document provides a detailed explanation of the implementation rationale and configuration methods for this feature.
Note: Currently, only the Megatron strategies (`megatron_train` and `megatron_infer`) support sequence packing.
1. Introduction
In reinforcement learning (RL) training scenarios, rollout data typically exhibits a long-tailed distribution. In conventional training pipelines, samples within a micro-batch are padded to a fixed maximum sequence length before being grouped into a batch for training. This approach wastes significant computational resources on processing padding tokens and slows down training.
To address this issue, ROLL introduces Sequence Packing, which:
- Packs sequences of varying lengths within each micro-batch to eliminate padding tokens.
- Employs optimized packing algorithms to improve packing efficiency, reduce the number of micro-batches, and accelerate training.
2. Implementation Principles
2.1 Data Partitioning Hierarchy
In distributed training, data is organized in the following hierarchical structure:
```
GLOBAL BATCH
├── DP RANK 0 → BATCH 0
│   └── MINI BATCH 0 (used for one gradient update)
│       ├── MICRO BATCH 0 (smallest computation unit)
│       ├── MICRO BATCH 1
│       └── ...
├── DP RANK 1 → BATCH 1
│   └── MINI BATCH 0
│       ├── MICRO BATCH 0
│       └── ...
└── ...
```
- GLOBAL BATCH: The complete rollout results generated by `actor_infer`.
- BATCH: A subset of the Global Batch assigned to a specific Data Parallel (DP) rank.
- MINI BATCH: A portion of a Batch used for a single gradient update (considering gradient accumulation).
- MICRO BATCH: The smallest computational unit derived from a Mini Batch, used in a single forward/backward pass.
In standard training, all samples within a micro-batch are padded to a fixed length, leading to substantial computational waste. Sequence Packing solves this by packing sequences at the micro-batch level.
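For intuition, the same hierarchy can be expressed as plain list slicing. This is a hypothetical sketch (not ROLL's actual dataloader code), assuming an evenly divisible global batch:

```python
def split_hierarchy(global_batch, dp_size, num_mini_batches, micro_batch_size):
    """Split a global batch into per-DP-rank batches, mini-batches, and micro-batches."""
    # GLOBAL BATCH -> one BATCH per DP rank
    per_rank = len(global_batch) // dp_size
    batches = [global_batch[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]

    hierarchy = []
    for batch in batches:
        # BATCH -> MINI BATCHes (each mini-batch drives one gradient update)
        mini_size = len(batch) // num_mini_batches
        mini_batches = [batch[i * mini_size:(i + 1) * mini_size] for i in range(num_mini_batches)]
        # MINI BATCH -> MICRO BATCHes (each micro-batch is one forward/backward pass)
        hierarchy.append([
            [mini[j:j + micro_batch_size] for j in range(0, len(mini), micro_batch_size)]
            for mini in mini_batches
        ])
    return hierarchy  # hierarchy[dp_rank][mini_idx][micro_idx] -> list of samples
```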
2.2 Core Mechanism of Sequence Packing
The primary goal of Sequence Packing is to eliminate padding tokens while ensuring correct and efficient execution under complex distributed training configurations—particularly when Context Parallelism (CP) and Tensor Parallelism (TP) are enabled. To achieve this, the packing process must satisfy specific alignment constraints critical for both correctness and performance.
2.2.1 Alignment Requirement: Multiple of 2 × CP_SIZE × TP_SIZE
When Context Parallelism (CP) and Tensor Parallelism (TP) are enabled, the packed sequence length must be a multiple of 2 × CP_SIZE × TP_SIZE.
This requirement stems from the needs of both parallelism strategies:
- TENSOR PARALLELISM (TP): When Sequence Parallelism is enabled, sequences are split across TP ranks during the forward pass. Thus, the sequence length must be divisible by `TP_SIZE`.
- CONTEXT PARALLELISM (CP): To achieve load balancing in CP, sequences must be logically divided into `2 × CP_SIZE` chunks. Hence, the sequence length must also be divisible by `2 × CP_SIZE`.
Combining these two requirements, the sequence length must be a multiple of 2 × CP_SIZE × TP_SIZE to ensure compatibility with both TP and CP.
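The alignment itself is a simple rounding rule. A minimal sketch (`pad_to_alignment` is a hypothetical helper, not a ROLL API):

```python
def pad_to_alignment(seq_len: int, cp_size: int, tp_size: int) -> int:
    """Round a sequence length up to the nearest multiple of 2 * CP_SIZE * TP_SIZE."""
    align = 2 * cp_size * tp_size
    return ((seq_len + align - 1) // align) * align

# With CP_SIZE=2 and TP_SIZE=1 the alignment factor is 4
assert pad_to_alignment(6, cp_size=2, tp_size=1) == 8  # length 6 is padded up to 8
assert pad_to_alignment(4, cp_size=2, tp_size=1) == 4  # already aligned
```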
2.2.2 Why the Factor of 2? Detailed Explanation of CP Load Balancing
In Context Parallel (CP) training, the asymmetric nature of causal attention leads to severe load imbalance.
Root Cause – Asymmetry in Causal Attention
Consider a sequence of length 6: [0, 1, 2, 3, 4, 5], with CP=2:
Full causal attention mask:
```
    0 1 2 3 4 5
0 [ 1 0 0 0 0 0 ]
1 [ 1 1 0 0 0 0 ]
2 [ 1 1 1 0 0 0 ]
3 [ 1 1 1 1 0 0 ]
4 [ 1 1 1 1 1 0 ]
5 [ 1 1 1 1 1 1 ]
```
Problem with Naive Partitioning:
If the sequence is simply split evenly:
- CP0 handles: `[0, 1, 2]`
- CP1 handles: `[3, 4, 5]`
The actual computational loads become:
- CP0: Only computes attention weights for its own positions (6 weight computations).
- CP1: Must compute attention weights from its positions to all preceding positions (15 weight computations).
Load ratio: 6:15 = 2:5, i.e., CP1 bears 2.5× the computation of CP0!
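This 6 : 15 split is easy to verify with a quick calculation (illustration only, not ROLL code):

```python
# Under causal attention, query position q attends to q + 1 key positions
# (itself plus all earlier positions).
load = lambda positions: sum(q + 1 for q in positions)

print(load([0, 1, 2]))  # CP0: 1 + 2 + 3 = 6
print(load([3, 4, 5]))  # CP1: 4 + 5 + 6 = 15, i.e. 2.5x CP0's load
```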
Solution – 2×CP Interleaved Chunking
Megatron-Core resolves this by splitting the sequence into 2 × CP chunks and applying an interleaved assignment strategy:
Original sequence: [0, 1, 2, 3, 4, 5]
Split into 4 chunks: |[0,1]|[2,3]|[4,5]|[p,p]| (padded to multiple of 4)
Interleaved assignment:
- Chunk 0 [0,1] → CP0
- Chunk 1 [2,3] → CP1
- Chunk 2 [4,5] → CP1
- Chunk 3 [p,p] → CP0
Final assignment:
- CP0: [0,1] + [p,p]
- CP1: [2,3] + [4,5]
This carefully designed assignment balances the computational load between CP ranks, avoiding performance bottlenecks.
Thus, the factor of 2 is essential for CP load balancing, ensuring roughly equal workloads across CP ranks under causal attention.
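A minimal sketch of this interleaved assignment, assuming a hypothetical helper name (`cp_chunk_assignment`); Megatron-Core implements the same idea internally:

```python
def cp_chunk_assignment(seq, cp_size):
    """Split `seq` into 2 * cp_size chunks; CP rank r gets chunks r and (2*cp_size - 1 - r)."""
    num_chunks = 2 * cp_size
    assert len(seq) % num_chunks == 0, "sequence must be padded to a multiple of 2 * CP_SIZE"
    chunk_len = len(seq) // num_chunks
    chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]
    return {rank: chunks[rank] + chunks[num_chunks - 1 - rank] for rank in range(cp_size)}

# The example above ("p" marks a padding token), CP=2
print(cp_chunk_assignment([0, 1, 2, 3, 4, 5, "p", "p"], cp_size=2))
# {0: [0, 1, 'p', 'p'], 1: [2, 3, 4, 5]}
```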
2.2.3 Complete Packing Example
Assume a micro-batch contains the following samples (original max sequence length = 8):
| Sample ID | Original Sequence | Valid Length |
|---|---|---|
| 0 | [0, 0, p, p, p, p, p, p] | 2 |
| 1 | [1, 1, 1, 1, p, p, p, p] | 4 |
| 2 | [2, 2, 2, 2, 2, 2, p, p] | 6 |
| 3 | [3, p, p, p, p, p, p, p] | 1 |
Configuration: CP_SIZE=2, TP_SIZE=1
Step 1: Remove original padding
```
Sample 0: [0, 0]
Sample 1: [1, 1, 1, 1]
Sample 2: [2, 2, 2, 2, 2, 2]
Sample 3: [3]
```
Step 2: Re-pad to alignment boundary
- Alignment factor = 2 × CP_SIZE × TP_SIZE = 2 × 2 × 1 = 4
Re-padded sequences:
```
Sample 0: [0, 0, p, p]             → length 4
Sample 1: [1, 1, 1, 1]             → length 4
Sample 2: [2, 2, 2, 2, 2, 2, p, p] → length 8
Sample 3: [3, p, p, p]             → length 4
```
Step 3: Detailed CP Chunking Process
With CP_SIZE=2, each sequence is logically split into 2 × CP_SIZE = 4 segments and assigned via interleaving:
For any sequence of length L under CP_SIZE=2:
- Split into 4 consecutive segments: seg0, seg1, seg2, seg3
- Each segment has length L/4
- Assignment rule:
- CP0: seg0 + seg3
- CP1: seg1 + seg2
Applied to our example:
- Sample 0 `[0, 0, p, p]` (length 4):
  - seg0: `[0]`, seg1: `[0]`, seg2: `[p]`, seg3: `[p]`
  - CP0 gets seg0 + seg3 = `[0] + [p]` → processes `[0, p]`
  - CP1 gets seg1 + seg2 = `[0] + [p]` → processes `[0, p]`
- Sample 1 `[1, 1, 1, 1]` (length 4):
  - seg0: `[1]`, seg1: `[1]`, seg2: `[1]`, seg3: `[1]`
  - CP0: `[1] + [1]` → `[1, 1]`
  - CP1: `[1] + [1]` → `[1, 1]`
- Sample 2 `[2, 2, 2, 2, 2, 2, p, p]` (length 8):
  - seg0: `[2, 2]`, seg1: `[2, 2]`, seg2: `[2, 2]`, seg3: `[p, p]`
  - CP0: `[2, 2] + [p, p]` → `[2, 2, p, p]`
  - CP1: `[2, 2] + [2, 2]` → `[2, 2, 2, 2]`
- Sample 3 `[3, p, p, p]` (length 4):
  - seg0: `[3]`, seg1: `[p]`, seg2: `[p]`, seg3: `[p]`
  - CP0: `[3] + [p]` → `[3, p]`
  - CP1: `[p] + [p]` → `[p, p]`
Step 4: Final Packed Input per CP Rank
- CP0’s full input: `[0, p, 1, 1, 2, 2, p, p, 3, p]`
- CP1’s full input: `[0, p, 1, 1, 2, 2, 2, 2, p, p]`
Step 5: Cumulative Sequence Lengths
Padded cumulative sequence lengths (`cu_seqlens_padded`), computed from the per-sample padded lengths 4, 4, 8, 4: `[0, 4, 8, 16, 20]`
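The whole worked example can be reproduced with a short, self-contained sketch. The helpers `pad_to` and `pack_micro_batch` are hypothetical names (this is not ROLL's `_pack_sequences`, just the same arithmetic):

```python
P = "p"  # padding token marker

def pad_to(seq, length):
    return seq + [P] * (length - len(seq))

def pack_micro_batch(samples, cp_size, tp_size):
    align = 2 * cp_size * tp_size
    packed = {rank: [] for rank in range(cp_size)}
    cu_seqlens_padded = [0]
    for seq in samples:
        # Re-pad each sample to the alignment boundary
        padded_len = ((len(seq) + align - 1) // align) * align
        seq = pad_to(seq, padded_len)
        cu_seqlens_padded.append(cu_seqlens_padded[-1] + padded_len)
        # 2*CP interleaved chunking: rank r gets chunks r and (2*cp_size - 1 - r)
        chunk_len = padded_len // (2 * cp_size)
        chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(2 * cp_size)]
        for rank in range(cp_size):
            packed[rank] += chunks[rank] + chunks[2 * cp_size - 1 - rank]
    return packed, cu_seqlens_padded

samples = [[0, 0], [1, 1, 1, 1], [2, 2, 2, 2, 2, 2], [3]]  # valid tokens only
packed, cu = pack_micro_batch(samples, cp_size=2, tp_size=1)
print(packed[0])  # [0, 'p', 1, 1, 2, 2, 'p', 'p', 3, 'p']
print(packed[1])  # [0, 'p', 1, 1, 2, 2, 2, 2, 'p', 'p']
print(cu)         # [0, 4, 8, 16, 20]
```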
2.3 Loss Computation Workflow
Under Sequence Packing, loss calculation requires special handling:
- Unpack Model Outputs: Use `_unpack_sequences` to restore individual sequences from the packed output.
  - Compute the start/end positions of each sequence on the current CP rank using `cu_seqlens_padded`:
    - `seq_starts = cu_seqlens_padded[:-1] // cp_size`
    - `seq_ends = cu_seqlens_padded[1:] // cp_size`
- Per-Sequence Loss Calculation:
  - Apply the loss function to each unpacked sequence individually.
  - Adjust the original data to match the actual sequence length using `adjust_sequence_length`.
  - Accumulate losses from all sequences.
- Result Aggregation:
  - Sum all per-sequence losses to obtain the total loss.
  - Aggregate metrics across sequences.
  - Apply loss scaling if enabled.
This per-sequence approach ensures correct loss computation even under complex combinations of CP, TP, and packing.
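A minimal sketch of the unpacking step, assuming PyTorch tensors and the `cu_seqlens_padded` values from the example above (`unpack_sequences` is a hypothetical helper, not the exact `_unpack_sequences` implementation):

```python
import torch

def unpack_sequences(packed_logits, cu_seqlens_padded, cp_size):
    """Slice the packed per-rank output back into one tensor per original sequence."""
    # Each CP rank holds 1/cp_size of every padded sequence, so the per-rank
    # offsets are the padded cumulative lengths divided by cp_size.
    seq_starts = cu_seqlens_padded[:-1] // cp_size
    seq_ends = cu_seqlens_padded[1:] // cp_size
    return [packed_logits[s:e] for s, e in zip(seq_starts.tolist(), seq_ends.tolist())]

# Example: CP0's packed output for the 4 samples above (10 positions on this rank)
packed_logits = torch.randn(10, 32)  # [packed_len_on_this_rank, hidden/vocab dim]
cu_seqlens_padded = torch.tensor([0, 4, 8, 16, 20])
per_seq = unpack_sequences(packed_logits, cu_seqlens_padded, cp_size=2)
print([t.shape[0] for t in per_seq])  # [2, 2, 4, 2] tokens per sequence on this rank
```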
2.4 Load Balancing Optimization
To maximize the effectiveness of Sequence Packing, ROLL applies the Karmarkar-Karp algorithm at multiple levels for load balancing.
Karmarkar-Karp Algorithm Overview: A classical multi-way partitioning algorithm that divides a set of numbers into k subsets with sums as balanced as possible. In Sequence Packing, it ensures computational loads across processing units remain balanced, preventing bottlenecks.
Key optimizations include:
- GLOBAL BATCH → DP RANK Load Balancing: Ensures each DP rank receives a similar total number of tokens.
- MINI BATCH → MICRO BATCH Load Balancing: Balances computational load across micro-batches.
Implementation details and responsibility allocation are described in Section 3.2.
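For illustration, here is a simplified greedy balancing sketch (a longest-processing-time heuristic, not the actual Karmarkar-Karp implementation used in ROLL), using per-sample token counts as the load measure:

```python
import heapq

def balance_partition(token_counts, k):
    """Assign samples (by descending token count) to the currently lightest of k bins."""
    heap = [(0, b) for b in range(k)]  # (current_total_tokens, bin_index)
    heapq.heapify(heap)
    bins = [[] for _ in range(k)]
    for idx in sorted(range(len(token_counts)), key=lambda i: -token_counts[i]):
        total, b = heapq.heappop(heap)
        bins[b].append(idx)
        heapq.heappush(heap, (total + token_counts[idx], b))
    return bins

# e.g. balance 8 rollout samples across 2 DP ranks by token count
print(balance_partition([512, 64, 2048, 128, 1024, 256, 96, 700], k=2))
# -> [[2, 5, 6], [4, 7, 0, 3, 1]]  (totals: 2400 vs 2428 tokens)
```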
3. Implementation Workflow
3.1 Core Packing and Unpacking Logic
Packing logic resides primarily in the strategy layer. When use_sequence_packing is enabled, the strategy automatically packs micro-batches and unpacks logits for loss computation.
The core packing function `_pack_sequences` performs the following steps:
- Removes original padding and extracts valid tokens.
- Computes cumulative sequence lengths (both original and padded).
- Re-pads sequences to a multiple of `2 * cp_size * tp_size`.
- Handles CP chunking and assignment.
- Concatenates sequences and creates `PackedSeqParams`.
Loss computation is handled by loss_wrapper, which unpacks outputs and computes per-sequence losses.
3.2 Load Balancing Responsibility Allocation
Load balancing in ROLL follows a clear division of responsibilities:
- GLOBAL BATCH → DP RANK Load Balancing:
  - Responsible Module: Pipeline layer (`batch_balance` function)
  - Objective: Equalize the total token count per DP rank
  - Method: Apply the Karmarkar-Karp algorithm before data distribution
- MINI BATCH → MICRO BATCH Load Balancing:
  - Responsible Module: Strategy layer (`make_micro_batch_iter_for_sequence_packing`)
  - Objective: Balance the computational load across micro-batches
  - Method: Apply the Karmarkar-Karp algorithm during micro-batch generation
- Preservation of Randomness:
  - The split from Batch → Mini Batch retains randomness (for shuffling) and therefore does not apply load balancing.
This layered optimization ensures balanced workloads from global to local levels, maximizing hardware utilization.
4. Configuration Parameters
4.1 How to Enable Sequence Packing
To use Sequence Packing, simply set use_sequence_packing: true in your configuration file.
4.2 Parameter Details (Plain Language)
algorithm (Packing Algorithm)
- `none`: Default simple packing; sequences are packed in their original order.
- `load_balance`: Intelligent load-balanced packing; reorders data to balance computational load across micro-batches. Recommended.
max_packed_sequence_length_train (Max Packed Length for Training)
- Controls the maximum allowed length of a packed sequence during training.
- E.g., setting to 8192 means no packed sequence will exceed 8192 tokens.
- Choose a reasonable value to avoid out-of-memory errors while maintaining packing efficiency.
max_packed_sequence_length_forward (Max Packed Length for Inference)
- Same as above, but applied during inference.
- Typically set to the same value as the training parameter.
min_num_micro_batches_train (Minimum Micro-Batches for Training)
- Specifies the minimum number of micro-batches per mini-batch during training.
- Setting to 1 means no constraint—the system auto-determines optimal splitting.
- Increase this value if facing GPU memory issues to reduce micro-batch size.
min_num_micro_batches_forward (Minimum Micro-Batches for Inference)
- Same as above, but for inference.
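As a rough illustration of how the length cap and the micro-batch minimum interact, the sketch below estimates a lower bound on the number of micro-batches. This is an assumption about the splitting heuristic for intuition only, not ROLL's exact logic:

```python
import math

def estimate_num_micro_batches(sample_lengths, max_packed_len, min_num_micro_batches):
    """Need at least enough micro-batches that no packed sequence exceeds the cap,
    and never fewer than the configured minimum."""
    total_tokens = sum(sample_lengths)
    by_capacity = math.ceil(total_tokens / max_packed_len)
    return max(by_capacity, min_num_micro_batches)

# 16 samples totalling ~40k tokens with an 8192-token cap -> at least 5 micro-batches
lengths = [3000, 1200, 512, 4096, 2500, 800, 6000, 1500,
           2200, 3100, 900, 2800, 4000, 1300, 3600, 2500]
print(estimate_num_micro_batches(lengths, max_packed_len=8192, min_num_micro_batches=1))
```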
4.3 Full Configuration Example
```yaml
actor_train:
  # Enable sequence packing
  use_sequence_packing: True
  # Sequence packing configuration
  sequence_packing_args:
    # Use load-balancing algorithm for better performance
    algorithm: load_balance
    # Max packed sequence length during training
    max_packed_sequence_length_train: 8192
    # Max packed sequence length during inference
    max_packed_sequence_length_forward: 8192
    # Minimum 1 micro-batch during training (no constraint)
    min_num_micro_batches_train: 1
    # Minimum 1 micro-batch during inference
    min_num_micro_batches_forward: 1
  # Sequence packing requires the Megatron strategy
  strategy_args:
    strategy_name: megatron_train
```
4.4 Usage Recommendations
- Mandatory Condition: Only supported under the `megatron_train` or `megatron_infer` strategies.
- Recommended Setting: Use `algorithm: load_balance` for optimal performance.
- Length Tuning: Set `max_packed_sequence_length` based on your GPU memory capacity, typically equal to the model's maximum supported sequence length.
- Custom Loss Functions: If using a custom loss function with sequence packing, refer to the custom loss documentation and ensure `apply_loss_scale` is correctly configured.
With proper configuration, Sequence Packing significantly boosts training efficiency—especially in RL scenarios with highly variable sequence lengths—while maintaining model performance.