SGLang Inference Backend Configuration Guide
SGLang is a fast and easy-to-use inference engine, particularly well suited to large language model inference. This document describes how to configure and use the SGLang inference backend in the ROLL framework.
SGLang Introduction
SGLang is a structured generation language designed for large language model inference. It provides efficient inference performance and a flexible programming interface.
Configuring SGLang Strategy
In the ROLL framework, the SGLang inference strategy is configured via `strategy_args` in the YAML configuration file.
Basic Configuration Example
The following is a typical SGLang configuration example (from examples/qwen3-30BA3B-rlvr_megatron/rlvr_config_sglang.yaml):
```yaml
actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.7
      load_format: dummy
  num_gpus_per_worker: 2
  device_mapping: list(range(0,24))
```
Configuration Parameter Details
- `strategy_name`: Set to `sglang` to use the SGLang inference backend.
- `strategy_config`: SGLang-specific configuration parameters, passed through directly to the SGLang engine. For the full list of options, see the official SGLang documentation; a hedged example of passing extra options appears after this list.
  - `mem_fraction_static`: Fraction of GPU memory reserved for static allocations such as model weights and the KV cache.
    - Increase this value if KV cache construction fails.
    - Decrease this value if you hit CUDA out-of-memory errors.
  - `load_format`: Format for loading model weights. Since the weights are "updated" (synchronized in) at the start of training anyway, this can be set to `dummy` so that no real checkpoint needs to be loaded.
- `num_gpus_per_worker`: Number of GPUs allocated per worker. SGLang can use multiple GPUs for parallel inference.
- `device_mapping`: List of GPU device IDs to use.
- `infer_batch_size`: Batch size during inference.
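Because `strategy_config` is forwarded verbatim to the SGLang engine, additional engine arguments can be placed alongside the two shown above. The following is a minimal sketch only: the extra fields `context_length` and `max_running_requests` are assumed SGLang engine arguments and should be verified against the documentation of the SGLang version you have installed.

```yaml
strategy_args:
  strategy_name: sglang
  strategy_config:
    mem_fraction_static: 0.7    # fraction of GPU memory for weights and KV cache
    load_format: dummy          # skip loading real weights; they are synced at startup
    # Assumed extra engine arguments -- verify the names against your SGLang version:
    context_length: 8192        # maximum context length the engine will accept
    max_running_requests: 128   # cap on concurrently running requests
```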
Integration with Other Components
In the above example:

- `actor_infer` uses SGLang as the inference backend
- `actor_train` uses Megatron for training
- `reference` uses Megatron for inference
- Reward models use other inference backends (such as `hf_infer`)
This design allows different components to choose the most suitable inference engine according to their needs.
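To make this concrete, the skeleton below shows each role selecting its own backend via `strategy_name`. Only `sglang` and `hf_infer` are named in this guide; the Megatron strategy names (`megatron_train`, `megatron_infer`) and the exact role keys are assumptions and should be checked against the example config referenced above.

```yaml
# Illustrative skeleton only; see examples/qwen3-30BA3B-rlvr_megatron/rlvr_config_sglang.yaml
# for the authoritative role and strategy names.
actor_train:
  strategy_args:
    strategy_name: megatron_train   # assumed name of the Megatron training strategy
actor_infer:
  strategy_args:
    strategy_name: sglang           # SGLang serves rollout generation
reference:
  strategy_args:
    strategy_name: megatron_infer   # assumed name of the Megatron inference strategy
reward:
  strategy_args:
    strategy_name: hf_infer         # Hugging Face-based inference for the reward model
```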
Performance Optimization Recommendations
- Memory Management:
  - Set the `mem_fraction_static` parameter appropriately to balance performance and memory usage.
  - Monitor GPU memory usage to avoid out-of-memory failures.
- Parallel Processing:
  - Increase `num_gpus_per_worker` where appropriate so that each engine can load the model across multiple GPUs and run parallel inference.
  - Adjust `device_mapping` according to your hardware configuration. The number of SGLang engines is `len(device_mapping) // num_gpus_per_worker` (see the sketch after this list).
- Batch Processing Optimization:
  - `infer_batch_size` has no effect for SGLang, since continuous batching is performed automatically.
  - Consider the impact of sequence length on the effective batch size.
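As a concrete instance of the engine-count formula above: with the sample configuration's 24 device IDs and two GPUs per worker, ROLL ends up with 24 // 2 = 12 SGLang engines, each spanning two GPUs.

```yaml
strategy_args:
  strategy_name: sglang
  strategy_config:
    mem_fraction_static: 0.7
    load_format: dummy
num_gpus_per_worker: 2              # GPUs per SGLang engine
device_mapping: list(range(0,24))   # 24 GPU IDs in total
# => len(device_mapping) // num_gpus_per_worker = 24 // 2 = 12 engines
```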
Notes
- SGLang requires specific versions of its dependency libraries; make sure compatible versions are installed
- In resource-constrained environments, carefully balance resource allocation among different components
- Integration of SGLang with training frameworks like Megatron may require additional configuration