Megatron Inference and Training Backend Configuration Guide

Megatron is NVIDIA's framework for large-scale language model training and inference, built around efficient distributed execution. This document explains how to configure and use the Megatron backend in the ROLL framework.

Megatron Introduction

Megatron provides efficient model-parallel and data-parallel strategies and is particularly well suited to training and inference of ultra-large-scale language models. It supports multiple parallel strategies, including tensor parallelism, pipeline parallelism, and expert parallelism.
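As a rule of thumb (this is standard Megatron-LM behavior rather than anything ROLL-specific), the parallelism degrees must evenly factor the total GPU count: the data-parallel size is derived as world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size), and for MoE models the expert-parallel groups are formed within the data-parallel dimension. Keeping this identity in mind makes the example configurations below easier to read.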

Configuring Megatron Strategy

In the ROLL framework, Megatron training and inference strategies can be configured by setting strategy_args in the YAML configuration file.

Training Configuration Example

The following is a typical Megatron training configuration example (from examples/qwen3-30BA3B-rlvr_megatron/rlvr_config_sglang.yaml):

actor_train:
  model_args:
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 20
    num_train_epochs: 50
  strategy_args:
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 2
      virtual_pipeline_model_parallel_size: 8
      expert_model_parallel_size: 4
      context_parallel_size: 1
      use_distributed_optimizer: true
      sequence_parallel: true
      moe_token_dispatcher_type: "alltoall"
      moe_grouped_gemm: true
      moe_layer_recompute: true
  device_mapping: list(range(0,32))
  infer_batch_size: 2
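To make the arithmetic behind this example concrete: device_mapping covers 32 GPUs, and tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size = 2 × 2 × 1 = 4, so Megatron derives a data-parallel size of 32 / 4 = 8. Combined with per_device_train_batch_size: 1 and gradient_accumulation_steps: 32, this yields an effective global batch of 1 × 32 × 8 = 256 samples per optimizer step. (Megatron computes these values internally; the calculation is shown here only to illustrate how the settings interact.)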

Inference Configuration Example

The following is a typical Megatron inference configuration example:

reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  strategy_args:
    strategy_name: megatron_infer
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 4
      moe_token_dispatcher_type: "alltoall"
      moe_grouped_gemm: true
  device_mapping: list(range(0,32))
  infer_batch_size: 2
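Note how the inference settings differ from training: with tensor_model_parallel_size: 1 and pipeline_model_parallel_size: 1, each of the 32 GPUs holds a full copy of the dense layers (a data-parallel size of 32), while the MoE experts remain sharded four ways via expert_model_parallel_size: 4. This trades memory for simpler, lower-latency forward passes, which is usually appropriate for a reference model that only needs to compute log-probabilities.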

Configuration Parameter Details

  1. strategy_name:

    • megatron_train for training
    • megatron_infer for inference
  2. strategy_config: All parallelism and optimization options provided by Megatron Core (mcore) can be used; for details, refer to the Megatron-LM documentation. Here are some common Megatron configuration parameters:

    • tensor_model_parallel_size: Tensor model parallelism degree, partitioning intra-layer computation and memory across multiple GPUs
    • pipeline_model_parallel_size: Pipeline model parallelism degree, assigning different layers of the model to different GPUs
    • virtual_pipeline_model_parallel_size: Virtual pipeline parallelism degree; splits each pipeline stage into multiple virtual stages to reduce pipeline bubbles
    • expert_model_parallel_size: Expert model parallelism degree, assigning different experts to different GPUs in MoE models
    • context_parallel_size: Context parallelism degree, used for processing ultra-long sequences
    • use_distributed_optimizer: Whether to use the distributed optimizer, which shards optimizer states across data-parallel ranks
    • sequence_parallel: Whether to enable sequence parallelism, which shards activations along the sequence dimension within tensor-parallel groups (requires tensor_model_parallel_size > 1)
    • moe_token_dispatcher_type: Token dispatcher type in MoE models ('allgather' or 'alltoall')
    • moe_grouped_gemm: Whether to enable grouped GEMM for MoE experts
    • moe_layer_recompute: Whether to checkpoint MoE layers to save activation memory
    • recompute_granularity: Activation value recomputation granularity ('full' or 'selective')
    • overlap_grad_reduce: Whether to overlap gradient reduction with backward-pass computation when using the distributed optimizer (see the sketch after this list)
  3. device_mapping: The list of GPU device IDs assigned to this worker

  4. infer_batch_size: The batch size used for inference forward passes
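The recomputation and communication-overlap options listed above do not appear in the earlier examples. The following is a minimal sketch, with illustrative rather than recommended values, of how they would slot into an existing strategy_config:

strategy_args:
  strategy_name: megatron_train
  strategy_config:
    use_distributed_optimizer: true
    overlap_grad_reduce: true         # overlap gradient reduction with the backward pass
    recompute_granularity: selective  # recompute only selected memory-heavy activations
    # recompute_granularity: full     # alternative: recompute entire transformer layers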

Integration with Other Components

In the configuration examples above, note that:

  1. actor_train uses Megatron for training
  2. actor_infer may use other inference backends (such as vLLM or SGLang)
  3. reference uses Megatron for inference
  4. Reward models may use different inference backends

This design allows different components to choose the most suitable backend according to their needs.
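For example, a top-level layout mixing backends might look like the sketch below. Treat it as a hypothetical outline: the strategy_name value for the SGLang backend and any backend-specific strategy_config keys are assumptions to verify against your ROLL version.

actor_train:
  strategy_args:
    strategy_name: megatron_train    # Megatron for training
  device_mapping: list(range(0,32))
actor_infer:
  strategy_args:
    strategy_name: sglang            # assumed backend name; check your ROLL version
  device_mapping: list(range(0,32))
reference:
  strategy_args:
    strategy_name: megatron_infer    # Megatron for inference
  device_mapping: list(range(0,32))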

Notes

  1. Megatron requires specific versions of its dependency libraries; make sure compatible versions are installed
  2. Parallel strategy settings need to be adjusted according to hardware configuration and model size
  3. In resource-constrained environments, carefully balance resource allocation among different components

By properly configuring the Megatron backend, you can fully leverage the performance advantages of the ROLL framework in large-scale language model training and inference.