FP8 Quantization Configuration Guide

This document describes how to use FP8 quantization in ROLL to optimize inference performance and VRAM usage.

Overview

FP8 quantization stores model weights (and, depending on the scheme, activations) in 8-bit floating point, which significantly reduces the model's VRAM footprint and can improve inference speed. ROLL supports FP8 quantization configuration for the actor_infer and llm_judge components.
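
For intuition, the following is a minimal PyTorch sketch (plain torch, not ROLL or vLLM code, and assuming PyTorch 2.1+ for the float8 dtypes) of the two properties the rest of this guide relies on: an FP8 E4M3 value occupies one byte, half of BF16/FP16, and its representable range tops out at ±448, which is why weights and activations must be rescaled before being cast.

```python
import torch

fp8 = torch.float8_e4m3fn   # E4M3: 1 sign + 4 exponent + 3 mantissa bits
bf16 = torch.bfloat16

# Range and per-element storage of the two dtypes.
print(torch.finfo(fp8).max)                       # 448.0 -> values are rescaled into this range
print(torch.empty(1, dtype=fp8).element_size())   # 1 byte per element
print(torch.empty(1, dtype=bf16).element_size())  # 2 bytes per element

# Casting a weight tensor to FP8 halves its memory footprint.
w_bf16 = torch.randn(4096, 4096, dtype=bf16)
w_fp8 = w_bf16.to(fp8)
print(w_bf16.numel() * w_bf16.element_size())     # 33554432 bytes (~32 MiB)
print(w_fp8.numel() * w_fp8.element_size())       # 16777216 bytes (~16 MiB)
```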

actor_infer FP8 Configuration

Basic Configuration

```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
```

Dense Model Configuration

For Dense models, configuration requirements differ based on quantization method:

Dense + Per Tensor Quantization (Default)

```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
```

Dense + Per Block Quantization

```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]  # Required: per block quantization
```

Configuration Description:

  • activation_scheme: dynamic: Compute activation scales dynamically at runtime (rather than using pre-calibrated static scales)
  • fmt: e4m3: Use the E4M3 FP8 format
  • quant_method: fp8: Set the quantization method to FP8
  • weight_block_size: [128, 128]: Required for per block quantization; specifies the block size used for weight scaling

Note: When specifying weight_block_size, you must also provide the activation_scheme, fmt, and quant_method parameters; otherwise an error will occur.
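
To make the per tensor vs. per block distinction concrete, here is a small, self-contained PyTorch sketch (illustrative only, not ROLL or vLLM internals): per tensor quantization keeps a single FP8 scale for the whole weight matrix, while per block quantization with weight_block_size: [128, 128] keeps one scale per 128x128 tile, which follows local magnitude variation more closely.

```python
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max  # 448.0

def quantize_per_tensor(w: torch.Tensor):
    """One scale for the entire weight tensor (the default FP8 path)."""
    scale = w.abs().max() / FP8_MAX
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8)
    return q, scale  # dequantize as q.to(w.dtype) * scale

def quantize_per_block(w: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile, as selected by weight_block_size: [128, 128]."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    scales = (tiles.abs().amax(dim=(1, 3), keepdim=True) / FP8_MAX).clamp_min(1e-12)
    q = (tiles / scales).clamp(-FP8_MAX, FP8_MAX).to(FP8)
    return q.reshape(rows, cols), scales.squeeze()

w = torch.randn(256, 256, dtype=torch.bfloat16) * 0.02
q_t, s_t = quantize_per_tensor(w)
q_b, s_b = quantize_per_block(w)
print(s_t.shape, s_b.shape)  # a single scalar scale vs. a (2, 2) grid of per-block scales
```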

MoE Model Configuration

For MoE (Mixture of Experts) models, the hf_overrides.quantization_config section must be configured, and only per block quantization is supported:

```yaml
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]  # Required: MoE models must use per block quantization
```

Note: MoE models must use per block quantization. The weight_block_size parameter is required, and you must also provide activation_scheme, fmt, and quant_method parameters.

llm_judge FP8 Configuration

The LLM-as-judge model also supports FP8 quantization. Note that the judge model requires its own GPU resources and cannot share a GPU with actor_infer:

```yaml
llm_judge:
  # NOTE: llm as judge also needs GPU, cannot share GPU with actor_infer
  worker_cls: roll.pipeline.rlvr.rewards.llm_judge_reward_worker.LLMJudgeRewardWorker
  judge_prompt: Qwen2.5-7B-Instruct-RLVR-prompt
  judge_model_type: inference
  tag_included: [RLVR]
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      quantization: fp8
      max_model_len: 8000
      load_format: auto
```

Configuration Description:

  • gpu_memory_utilization: 0.8: Allow vLLM to use up to 80% of the GPU's VRAM
  • quantization: fp8: Enable FP8 quantization
  • max_model_len: 8000: Maximum sequence length (in tokens) handled by the judge model
  • load_format: auto: Let vLLM select the weight loading format automatically

Configuration Notes

  1. GPU Resource Isolation: llm_judge requires its own GPU and cannot share one with actor_infer
  2. MoE Model Limitations: MoE models must use per block quantization; per tensor quantization is not supported
  3. Memory Optimization: FP8 quantization can significantly reduce memory usage and is recommended for VRAM-constrained scenarios (see the rough estimate after this list)
  4. Performance Trade-off: FP8 quantization improves throughput and memory efficiency, but it may slightly reduce model accuracy, so weigh the trade-off for your specific scenario
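
As a rough, model-level illustration of note 3 (weights only; it ignores the KV cache, activations, and the scale tensors stored alongside FP8 weights, and the 7B parameter count is an arbitrary example rather than a ROLL measurement):

```python
def weight_vram_gib(num_params: float, bytes_per_param: float) -> float:
    """Back-of-envelope weight footprint in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

params = 7e9  # example: a 7B-parameter model
print(f"BF16 weights: ~{weight_vram_gib(params, 2):.1f} GiB")  # ~13.0 GiB
print(f"FP8 weights:  ~{weight_vram_gib(params, 1):.1f} GiB")  # ~6.5 GiB
```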

Complete Example

```yaml
# Configuration example: FP8 quantization for actor_infer and llm_judge
actor_infer:
  strategy_args:
    strategy_name: vllm
    strategy_config:
      quantization: fp8
      hf_overrides:
        quantization_config:
          activation_scheme: dynamic
          fmt: e4m3
          quant_method: fp8
          weight_block_size: [128, 128]

llm_judge:
  worker_cls: roll.pipeline.rlvr.rewards.llm_judge_reward_worker.LLMJudgeRewardWorker
  judge_prompt: Qwen2.5-7B-Instruct-RLVR-prompt
  judge_model_type: inference
  tag_included: [RLVR]
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      quantization: fp8
      max_model_len: 8000
      load_format: auto
```

With the above configuration, you can successfully enable FP8 quantization in ROLL to achieve better inference performance and VRAM efficiency.