GPU Time-Division Multiplexing Control Guide
The ROLL framework implements GPU time-division multiplexing, which allows GPU resources to be shared flexibly between different roles through offload/reload of model states. This document describes how to use this feature.
Time-Division Multiplexing Overview
In the ROLL framework, different roles (such as actor_train, actor_infer, critic, reference, and rewards) may need to use the same GPU resources. To improve resource utilization, the framework implements GPU time-division multiplexing functionality, which allows model states to be switched between GPU and CPU at different time points.
Offload/Reload Control Mechanism
Automatic Control
Taking RLVRPipeline as an example, the framework automatically manages the offload and reload of model states:
# Example in rlvr_pipeline.py
ref_log_probs = self.reference.compute_log_probs(batch, blocking=True)
By default, when an RPC call is executed on a worker, the framework first reloads that worker's GPU-related state onto the GPU, and offloads the state back to CPU memory after the call completes.
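Conceptually, this default behavior around a worker RPC call can be sketched as follows (illustrative only; call_with_state_management is a hypothetical helper, while offload_states/reload_states and the is_offload_states flag are described later in this guide):

def call_with_state_management(worker, batch, rpc_fn):
    worker.reload_states()  # move the worker's model states back onto the GPU
    result = rpc_fn(batch)  # run the actual RPC, e.g. compute_log_probs
    if batch.meta_info.get("is_offload_states", True):
        worker.offload_states()  # by default, move states back to CPU memory afterwards
    return result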
Manual Control
You can also manually intervene in model state management by setting batch.meta_info["is_offload_states"]:
# Example in rlvr_pipeline.py
self.actor_train.offload_states(blocking=True)
When is_offload_states is set to False, the model state will not be automatically offloaded to CPU after the RPC call completes, and the model remains on the GPU.
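For example, a minimal sketch of keeping the actor_train states resident across consecutive calls, following the pattern used in rlvr_pipeline.py:

# Keep actor_train states on the GPU after this call so they can be reused immediately
batch.meta_info["is_offload_states"] = False
log_probs = self.actor_train.compute_log_probs(batch, blocking=True)
# ... further work that needs actor_train on the GPU ...
# Offload manually once the states are no longer needed
self.actor_train.offload_states(blocking=True)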
You can also call worker.offload_states() and worker.reload_states() directly for more direct control over offload and reload timing.
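For instance, a minimal sketch of explicit reload/offload around a GPU-dependent phase (the surrounding phase is illustrative):

# Bring the model back onto the GPU before a phase that needs it
self.actor_train.reload_states()
# ... run the GPU-dependent phase ...
# Release GPU memory once the phase is finished
self.actor_train.offload_states(blocking=True)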
Usage Example
The following is an example of using offload/reload control in rlvr_pipeline.py:
# After the inference phase, manually offload reward model states
if not self.pipeline_config.async_pipeline:
    for reward_cluster in self.rewards.values():
        reward_cluster.offload_states()

# When computing reference model log probs, control whether to offload states
if self.is_lora:
    batch.meta_info["disable_adapter"] = True
    batch.meta_info["is_offload_states"] = False
    ref_log_probs = self.actor_train.compute_log_probs(batch, blocking=True)
else:
    ref_log_probs = self.reference.compute_log_probs(batch, blocking=True)
Context Manager Support
The ROLL framework also provides the state_offload_manager context manager to simplify state management:
from roll.utils.context_managers import state_offload_manager

with state_offload_manager(strategy, metrics, metric_infix, is_offload_states=True):
    # Execute operations that require GPU state within this context
    yield
This context manager automatically handles the following (a minimal sketch of the pattern appears after the list):
- Loading model states to GPU
- Executing operations
- Deciding whether to offload states to CPU based on the is_offload_states parameter
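A minimal sketch of this pattern is shown below. It is not the actual ROLL implementation: the load_states/offload_states methods on strategy are assumed for illustration, and the metric recording implied by metrics and metric_infix is omitted.

from contextlib import contextmanager

@contextmanager
def simple_state_offload_manager(strategy, metrics, metric_infix, is_offload_states=True):
    strategy.load_states()  # assumed method: move model states onto the GPU
    try:
        yield  # run the operations that require GPU state
    finally:
        if is_offload_states:
            strategy.offload_states()  # assumed method: move states back to CPU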
Memory Monitoring
The framework also provides memory usage monitoring functionality:
from roll.utils.context_managers import log_gpu_memory_usage
# Record GPU memory usage
log_gpu_memory_usage(head="model_loading", logger=logger, rank=None)
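For example, memory logging can be combined with offload calls to verify that model states have actually left the GPU (the head labels here are illustrative):

log_gpu_memory_usage(head="before_offload", logger=logger, rank=None)
self.actor_train.offload_states(blocking=True)
log_gpu_memory_usage(head="after_offload", logger=logger, rank=None)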
Usage Recommendations
- In resource-constrained settings, proper use of the offload/reload feature can significantly improve GPU utilization
- In pipeline implementations, arrange the execution order of different roles to maximize resource utilization, for example by computing the reference and reward models in parallel
- In asynchronous training, likewise arrange the execution order of different roles to maximize resource utilization