# Megatron Inference and Training Backend Configuration Guide
Megatron is NVIDIA's framework for large-scale language model training and inference, supporting efficient distributed execution. This document explains how to configure and use the Megatron backend in the ROLL framework.
## Megatron Introduction
Megatron provides efficient model-parallel and data-parallel strategies and is particularly well suited to training and inference of ultra-large-scale language models. It supports multiple parallelism strategies, including tensor parallelism, pipeline parallelism, and expert parallelism.
## Configuring the Megatron Strategy
In the ROLL framework, Megatron training and inference strategies can be configured by setting `strategy_args` in the YAML configuration file.
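In general, each worker's `strategy_args` block contains a `strategy_name` and a nested `strategy_config`, as in this minimal sketch (full examples follow below):

```yaml
actor_train:                        # or reference, actor_infer, ...
  strategy_args:
    strategy_name: megatron_train   # megatron_infer for inference workers
    strategy_config:
      # Megatron parallelism and optimization options go here
      tensor_model_parallel_size: 2
```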
### Training Configuration Example
The following is a typical Megatron training configuration example (from `examples/qwen3-30BA3B-rlvr_megatron/rlvr_config_sglang.yaml`):
```yaml
actor_train:
  model_args:
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 20
    num_train_epochs: 50
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 2
      virtual_pipeline_model_parallel_size: 8
      expert_model_parallel_size: 4
      context_parallel_size: 1
      use_distributed_optimizer: true
      sequence_parallel: true
      moe_token_dispatcher_type: "alltoall"
      moe_grouped_gemm: true
      moe_layer_recompute: true
  device_mapping: list(range(0,32))
  infer_batch_size: 2
```
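As a rough sanity check on this layout (an illustrative calculation, assuming the standard Megatron convention that world size = TP × PP × CP × DP), the parallel degrees above decompose the 32 GPUs in `device_mapping` as follows:

```yaml
# Illustrative accounting only; not part of the actual configuration:
#   tensor_model_parallel_size:   2
#   pipeline_model_parallel_size: 2
#   context_parallel_size:        1
#   GPUs per model replica:       2 * 2 * 1 = 4
#   data-parallel size:           32 / 4   = 8
# expert_model_parallel_size: 4 additionally shards the MoE experts across
# groups of 4 ranks within the data-parallel dimension.
```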
### Inference Configuration Example
The following is a typical Megatron inference configuration example:
```yaml
reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  strategy_args:
    strategy_name: megatron_infer
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 4
      moe_token_dispatcher_type: "alltoall"
      moe_grouped_gemm: true
  device_mapping: list(range(0,32))
  infer_batch_size: 2
```
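Applying the same illustrative accounting to the inference worker:

```yaml
# Illustrative accounting only; not part of the actual configuration:
#   tensor_model_parallel_size = 1 and pipeline_model_parallel_size = 1,
#   so each model replica occupies a single GPU and the 32 GPUs in
#   device_mapping yield a data-parallel size of 32, while
#   expert_model_parallel_size: 4 shards the MoE experts across groups of 4 ranks.
```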
## Configuration Parameter Details
- `strategy_name`: `megatron_train` for training, `megatron_infer` for inference
- `strategy_config`: Any parallel optimization option provided by Megatron Core (mcore) can be used here; refer to the Megatron documentation for details. Common parameters include:
  - `tensor_model_parallel_size`: Tensor model parallelism degree; partitions intra-layer computation and memory across multiple GPUs
  - `pipeline_model_parallel_size`: Pipeline model parallelism degree; assigns different layers of the model to different GPUs
  - `virtual_pipeline_model_parallel_size`: Virtual pipeline parallelism degree, used to improve pipeline efficiency
  - `expert_model_parallel_size`: Expert model parallelism degree; assigns different experts to different GPUs in MoE models
  - `context_parallel_size`: Context parallelism degree, used for processing ultra-long sequences
  - `use_distributed_optimizer`: Whether to use the distributed optimizer
  - `sequence_parallel`: Whether to enable sequence parallel optimization
  - `moe_token_dispatcher_type`: Token dispatcher type in MoE models (`allgather` or `alltoall`)
  - `moe_grouped_gemm`: Whether to enable grouped GEMM for MoE experts
  - `moe_layer_recompute`: Whether to checkpoint MoE layers to save activation memory
  - `recompute_granularity`: Activation recomputation granularity (`full` or `selective`)
  - `overlap_grad_reduce`: Whether to overlap the gradient all-reduce with backward-pass computation when using the distributed optimizer (see the sketch after this list)
- `device_mapping`: The list of GPU device IDs to use
- `infer_batch_size`: Batch size during inference
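For example, the recomputation and overlap options above could be enabled in a training worker's `strategy_config` along these lines (an illustrative sketch; `recompute_method` and `recompute_num_layers` are the standard Megatron companions of `recompute_granularity: full`, and all values shown should be tuned to your hardware):

```yaml
strategy_args:
  strategy_name: megatron_train
  strategy_config:
    tensor_model_parallel_size: 2
    pipeline_model_parallel_size: 2
    # Trade compute for memory by recomputing activations in the backward pass:
    recompute_granularity: full
    recompute_method: uniform
    recompute_num_layers: 1
    # Overlap the gradient all-reduce with backward computation
    # (used together with the distributed optimizer):
    use_distributed_optimizer: true
    overlap_grad_reduce: true
```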
## Integration with Other Components
In the configuration example, we can see:

- `actor_train` uses Megatron for training
- `actor_infer` may use other inference backends (such as vLLM or SGLang)
- `reference` uses Megatron for inference
- Reward models may use different inference backends
This design allows different components to choose the most suitable backend according to their needs.
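For illustration, such a mixed setup might pair the Megatron training worker above with an SGLang inference worker, along the lines of the following sketch (hypothetical names and values; the exact strategy name and options depend on your ROLL version, so consult the bundled example configs such as `rlvr_config_sglang.yaml`):

```yaml
actor_infer:
  model_args:
    dtype: bf16
  strategy_args:
    strategy_name: sglang_infer    # hypothetical name; use the SGLang strategy your ROLL version provides
    strategy_config:
      mem_fraction_static: 0.8     # SGLang GPU memory fraction (illustrative value)
  device_mapping: list(range(0,16))  # illustrative device assignment
  infer_batch_size: 2
```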
## Notes
- Megatron requires specific versions of its dependency libraries; ensure compatible versions are installed
- Parallel strategy settings need to be adjusted according to hardware configuration and model size
- In resource-constrained environments, carefully balance resource allocation among the different components
By properly configuring the Megatron backend, you can fully leverage the performance advantages of the ROLL framework in large-scale language model training and inference.