vLLM Inference Backend Configuration Guide

vLLM is a fast and easy-to-use large language model inference library that manages the attention key-value cache efficiently through PagedAttention. This document provides detailed instructions on how to configure and use the vLLM inference backend in the ROLL framework.

vLLM Introduction

vLLM is a high-performance inference engine with the following features:

  1. Fast Inference: Efficiently manages attention key-value cache through PagedAttention technology
  2. Memory Efficient: Reduces memory usage through quantization support and efficient KV-cache management
  3. Easy to Use: Provides simple API interfaces
  4. Scalable: Supports distributed inference across multiple GPUs

Configuring vLLM Strategy

In the ROLL framework, the vLLM inference strategy is configured by setting strategy_args in the YAML configuration file.

Configuration Example

The following is a typical vLLM configuration example (from examples/qwen2.5-7B-rlvr_megatron/rlvr_config.yaml):

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      max_model_len: 8000
  device_mapping: list(range(0,12))
  infer_batch_size: 1

Configuration Parameter Details

  1. strategy_name: Set to vllm to use the vLLM inference backend

  2. strategy_config: vLLM-specific configuration parameters, passed through directly to the vLLM engine. For additional vLLM tuning options, refer to the official vLLM documentation. A sketch including the optional keys described below is shown after this list.

    • gpu_memory_utilization: GPU memory utilization ratio for the model executor
      • For example, 0.8 means using 80% of GPU memory
      • Adjust this value according to model size and hardware configuration
    • block_size: Token block size for contiguous chunks of tokens
      • Affects vLLM's internal memory management efficiency
      • Usually set to 16 or 32
    • max_model_len: Model context length
      • If not specified, it will be automatically derived from the model configuration
      • Ensure it does not exceed hardware limitations
    • load_format: Format for loading model weights
      • Since model weights are synchronized from the training worker before generation begins, this can be set to dummy so that no real weights are loaded from disk
    • sleep_level: Sleep level applied when the model is put to sleep between generation phases
      • 1 (default): Destroys only the KV cache and retains the model weights
      • 2: Destroys both the model weights and the KV cache after generation, saving additional memory
  3. device_mapping: The list of GPU device IDs to use, e.g. list(range(0,12)) for GPUs 0-11

  4. infer_batch_size: Batch size during inference
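
For reference, here is a sketch of the same strategy_args block with the optional load_format and sleep_level keys added under strategy_config. This is a minimal illustration assuming the actor_infer worker from the example above; the values are illustrative, not tuned recommendations:

actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8   # use 80% of GPU memory for the model executor
      block_size: 16                # token block size for PagedAttention
      max_model_len: 8000           # context length; keep within hardware limits
      load_format: dummy            # skip loading real weights; they are synced in before generation
      sleep_level: 2                # free both weights and KV cache after generation
  device_mapping: list(range(0,12)) # use GPUs 0-11
  infer_batch_size: 1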

Integration with Other Components

In the configuration example, we can see:

  1. actor_infer uses vLLM as the inference backend
  2. actor_train uses Megatron for training
  3. reference uses Megatron for inference
  4. Reward models use different inference backends (such as hf_infer)

This design allows different components to choose the most suitable inference engine according to their needs.
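
As a skeleton of this division of labor, the relevant workers in the same YAML file might look like the following. Only vllm and hf_infer appear in this document; the Megatron strategy names shown here are placeholders and should be checked against the actual configuration file:

actor_infer:
  strategy_args:
    strategy_name: vllm            # generation served by the vLLM backend
actor_train:
  strategy_args:
    strategy_name: megatron_train  # placeholder name for the Megatron training strategy
reference:
  strategy_args:
    strategy_name: megatron_infer  # placeholder name for the Megatron inference strategy
reward:
  strategy_args:
    strategy_name: hf_infer        # reward model served by the Hugging Face backend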

Performance Optimization Recommendations

  1. Memory Management:

    • Properly set the gpu_memory_utilization parameter to balance performance and memory usage
    • Monitor GPU memory usage to avoid out-of-memory errors
  2. Batch Processing Optimization:

    • Adjust infer_batch_size according to model size and hardware capabilities
    • Consider the impact of sequence length on batch size
  3. Context Length:

    • Properly set max_model_len to match task requirements
    • Avoid excessively large context lengths, which can lead to out-of-memory errors; a conservative configuration for constrained hardware is sketched below
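
To make these trade-offs concrete, a more conservative setup for memory-constrained hardware might reduce the memory fraction, context length, and device count. The numbers below are illustrative starting points, not tuned recommendations:

actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6  # leave headroom for other processes on the GPU
      block_size: 16
      max_model_len: 4096          # shorter context reduces KV-cache pressure
  device_mapping: list(range(0,4)) # fewer GPUs for a smaller deployment
  infer_batch_size: 1              # increase gradually while monitoring memory usage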

Notes

  1. vLLM depends on specific versions of supporting libraries; ensure compatible versions are installed
  2. In resource-constrained environments, carefully balance resource allocation among different components
  3. Integration of vLLM with training frameworks like Megatron may require additional configuration

By properly configuring the vLLM inference backend, you can fully leverage the performance advantages of the ROLL framework in large-scale language model inference.