# FP8 Quantization Configuration Guide
This document describes how to use FP8 quantization in ROLL to optimize inference performance and VRAM usage.
## Overview
FP8 quantization is a numerical-precision optimization that can significantly reduce a model's VRAM footprint and improve inference speed. ROLL supports FP8 quantization for the actor_infer and llm_judge components.
## actor_infer FP8 Configuration
### Basic Configuration
```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
```
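For orientation, the keys under strategy_config correspond to standard vLLM engine arguments. A minimal Python sketch of the roughly equivalent direct vLLM call (the model path is a placeholder, and exact pass-through behavior depends on your ROLL version):

```python
# Hedged sketch: the rough vLLM equivalent of the strategy_config above.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model path
    quantization="fp8",                # mirrors strategy_config.quantization
)
outputs = llm.generate(["Hello, FP8!"])
print(outputs[0].outputs[0].text)
```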
### Dense Model Configuration
For Dense models, the required configuration depends on the quantization method:
#### Dense + Per Tensor Quantization (Default)
Per tensor quantization is the default scheme, so only `quantization: fp8` is needed:

```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
```
#### Dense + Per Block Quantization
```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]  # Required: per block quantization
```
Configuration Description:

- `activation_scheme: dynamic`: use the dynamic activation scheme
- `fmt: e4m3`: specify the FP8 format as E4M3
- `quant_method: fp8`: set the quantization method to FP8
- `weight_block_size: [128, 128]`: required for per block quantization; specifies the weight block size
Note: When specifying `weight_block_size`, you must also provide the `activation_scheme`, `fmt`, and `quant_method` parameters; otherwise an error will occur.
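To make the override concrete, here is a hedged Python sketch of the corresponding direct vLLM call, assuming a vLLM version that supports the `hf_overrides` engine argument (the model path is a placeholder):

```python
# Hedged sketch: per block FP8 via vLLM's hf_overrides, mirroring the YAML above.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model path
    quantization="fp8",
    hf_overrides={
        "quantization_config": {
            "activation_scheme": "dynamic",
            "fmt": "e4m3",
            "quant_method": "fp8",
            # All four keys must be present; setting weight_block_size while
            # omitting the other three raises an error (see the note above).
            "weight_block_size": [128, 128],
        }
    },
)
```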
### MoE Model Configuration
For MoE (Mixture of Experts) models, `hf_overrides.quantization_config` must be configured, and only per block quantization is supported:
```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]  # Required: MoE models must use per block quantization
```
Note: MoE models must use per block quantization. The `weight_block_size` parameter is required, and you must also provide the `activation_scheme`, `fmt`, and `quant_method` parameters.
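Since `hf_overrides` patches the Hugging Face model config at load time, one way to sanity-check what a checkpoint already carries is to inspect its config with transformers (a hedged sketch; the model name is a placeholder):

```python
# Hedged sketch: check whether a checkpoint ships its own quantization_config,
# which hf_overrides would merge with or override at load time.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")  # placeholder MoE model
print(getattr(cfg, "quantization_config", None))  # None if the checkpoint is unquantized
```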
## llm_judge FP8 Configuration
The LLM-as-judge model also supports FP8 quantization. Note that the judge model requires its own GPU resources and cannot share a GPU with actor_infer:
```yaml
llm_judge:
  # NOTE: llm as judge also needs GPU, cannot share GPU with actor infer
  worker_cls: roll.pipeline.rlvr.rewards.llm_judge_reward_worker.LLMJudgeRewardWorker
  judge_prompt: Qwen2.5-7B-Instruct-RLVR-prompt
  judge_model_type: inference
  tag_included: [RLVR]
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      quantization: fp8
      max_model_len: 8000
      load_format: auto
```
Configuration Description:

- `gpu_memory_utilization: 0.8`: cap VRAM utilization at 80%
- `quantization: fp8`: enable FP8 quantization
- `max_model_len: 8000`: maximum model context length
- `load_format: auto`: automatically select the weight loading format
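As with actor_infer, these strategy_config keys correspond to standard vLLM engine arguments; a hedged Python sketch of an equivalent standalone judge engine (the model path is a placeholder):

```python
# Hedged sketch: a standalone vLLM engine using the judge's strategy_config values.
from vllm import LLM

judge = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder judge checkpoint
    quantization="fp8",
    gpu_memory_utilization=0.8,  # cap VRAM use at 80%
    max_model_len=8000,
    load_format="auto",
)
```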
## Configuration Notes
- GPU Resource Isolation: llm_judge requires its own GPU and cannot share one with actor_infer
- MoE Model Limitations: MoE models must use per block quantization; per tensor quantization is not supported
- Memory Optimization: FP8 quantization can significantly reduce memory usage and is recommended for VRAM-constrained scenarios (see the sketch after this list)
- Performance Trade-off: FP8 quantization improves speed but may slightly reduce model accuracy, so weigh the two for your specific workload
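As a rough illustration of the memory point above, FP8 stores each weight in one byte instead of the two bytes used by BF16/FP16, roughly halving weight memory (a back-of-the-envelope sketch that ignores KV cache, activations, and runtime overhead):

```python
# Back-of-the-envelope weight memory estimate (hedged: weights only; KV cache,
# activations, and runtime overhead are ignored).
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for fmt, bpp in [("BF16", 2.0), ("FP8", 1.0)]:
    print(f"7B model weights in {fmt}: {weight_gib(7, bpp):.1f} GiB")
# BF16 ~= 13.0 GiB, FP8 ~= 6.5 GiB
```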
## Complete Example
```yaml
# Configuration example: FP8 quantization for actor_infer and llm_judge
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]

llm_judge:
  worker_cls: roll.pipeline.rlvr.rewards.llm_judge_reward_worker.LLMJudgeRewardWorker
  judge_prompt: Qwen2.5-7B-Instruct-RLVR-prompt
  judge_model_type: inference
  tag_included: [RLVR]
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      quantization: fp8
      max_model_len: 8000
      load_format: auto
```
With the above configuration, you can enable FP8 quantization in ROLL and achieve better inference performance and VRAM efficiency.