30 production operators with 440 hot shapes, importance weights from 1,303 production profiles, roofline bounds across 2 hardware platforms, and deployed kernel baselines.
| Operator (ID) | dtype | Shapes | Importance |
|---|---|---|---|
unified_attention atrex_030 | bf16 | 25 | 39.5% |
fused_moe atrex_009 | bf16 | 23 | 11.4% |
block_scaled_mm atrex_002 | fp8_e4m3 | 24 | 9.3% |
fp8_blockscale_fused_moe atrex_006 | fp8_e4m3 | 6 | 5.1% |
paged_attention_decode atrex_024 | bf16 | 8 | 4.4% |
reshape_and_cache atrex_026 | bf16 | 8 | 4.4% |
topk_filter atrex_029 | fp32 | 8 | 3.5% |
gated_delta_rule_update atrex_013 | bf16 | 17 | 3.5% |
fused_qkv_rope atrex_011 | fp16 | 6 | 3.4% |
rms_norm atrex_027 | bf16 | 56 | 2.8% |
mla_decode_attention atrex_018 | bf16 | 3 | 2.3% |
attention_forward atrex_001 | bf16 | 21 | 2.2% |
causal_conv1d atrex_003 | bf16 | 9 | 2.2% |
moe_topk_gating_softmax atrex_022 | fp32 | 10 | 2.1% |
fused_qk_rmsnorm atrex_010 | fp16 | 4 | 1.9% |
silu_and_mul atrex_028 | bf16 | 37 | 1.8% |
chunk_gated_delta_rule_state atrex_005 | bf16 | 25 | 1.6% |
fused_add_rms_norm atrex_008 | bf16 | 19 | 1.1% |
moe_align_block_size atrex_019 | int32 | 6 | 1.1% |
mrope atrex_023 | bf16 | 5 | 1.0% |
chunk_delta_rule_output atrex_004 | bf16 | 16 | 0.8% |
fp8_dynamic_per_token_quant atrex_007 | fp8_e4m3 | 20 | 0.8% |
linear_sigmoid_mul atrex_017 | bf16 | 9 | 0.7% |
per_token_group_quant_fp8 atrex_025 | fp8_e4m3 | 19 | 0.5% |
gated_rms_norm atrex_014 | bf16 | 7 | 0.5% |
l2_norm atrex_015 | bf16 | 13 | 0.5% |
moe_count_and_sort atrex_020 | int32 | 5 | 0.5% |
fused_rmsnorm_quant atrex_012 | fp8_e4m3 | 11 | 0.3% |
moe_sum_reduce atrex_021 | bf16 | 7 | 0.3% |
layer_norm atrex_016 | bf16 | 13 | 0.1% |
The importance score reflects each operator's trace-derived share of GPU time across production workloads.
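As a rough illustration of a time-share weight, the sketch below normalizes per-operator GPU time from a single trace into fractions. The function name and the sample timings are hypothetical, and the way shares are aggregated across the production profiles is not specified here, so this is not a reconstruction of the published column.

```python
def time_shares(gpu_time_by_op):
    """Return each operator's fraction of the total GPU time in one trace."""
    total = sum(gpu_time_by_op.values())
    if total <= 0:
        raise ValueError("trace contains no GPU time")
    return {op: t / total for op, t in gpu_time_by_op.items()}


# Hypothetical per-operator GPU time (ms) from one trace; not real data.
trace = {
    "unified_attention": 40.0,
    "fused_moe": 12.0,
    "rms_norm": 3.0,
    "other": 45.0,
}

shares = time_shares(trace)
for op, share in shares.items():
    print(f"{op}: {share:.1%}")
```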
Hardware platforms: H20, XPU-A
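TSOL is the theoretical Speed-of-Light lower bound on execution time: TSOL = max(W / P_peak, Q / B_peak), the larger of the compute time (work W over peak compute throughput P_peak) and the memory time (data traffic Q over peak memory bandwidth B_peak). TProd is the median measured time of the deployed kernel on the selected hardware. All times are shown for the median shape of each operator.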
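The bound can be sketched in a few lines. The peak figures and the GEMM shape below are placeholders chosen for illustration, not the specifications of H20 or XPU-A, and the byte count assumes each operand is read or written exactly once.

```python
def t_sol(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline lower bound in seconds: max(W / P_peak, Q / B_peak)."""
    compute_time = flops / peak_flops      # time if only compute limited us
    memory_time = bytes_moved / peak_bw    # time if only memory traffic limited us
    return max(compute_time, memory_time)


# Hypothetical device peaks (assumptions, not vendor numbers).
PEAK_FLOPS = 100e12   # FLOP/s
PEAK_BW = 2e12        # bytes/s

# Hypothetical bf16 GEMM with m = n = k = 4096.
m = n = k = 4096
flops = 2 * m * n * k                      # one multiply and one add per MAC
bytes_moved = 2 * (m * k + k * n + m * n)  # 2 bytes per bf16 element

bound = t_sol(flops, bytes_moved, PEAK_FLOPS, PEAK_BW)
print(f"TSOL = {bound * 1e3:.3f} ms")
```

For this shape the compute term dominates, so the bound equals `flops / PEAK_FLOPS`.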
| Operator | Regime | AI | TSOL(H20) | TSOL(XPU-A) | TProd |
|---|---|---|---|---|---|
unified_attention | C | 102.8 | 3.43 ms | 2.18 ms | 29.4 ms |
fused_moe | C | 460.2 | 4.07 ms | 2.59 ms | 10.6 ms |
block_scaled_mm | C | 371.3 | 27.18 us | 17.29 us | 118 us |
fp8_blockscale_fused_moe | C | 4.1 | 1.81 ms | 1.36 ms | 2.1 ms |
paged_attention_decode | M | 2.6 | 0.80 us | 0.61 us | 16 us |
reshape_and_cache | I | 0.0 | 6.28 us | 4.74 us | 21 us |
topk_filter | I | 0.0 | 33.57 us | 25.34 us | 805 us |
gated_delta_rule_update | C | 0.4 | 0.53 us | 0.40 us | 9 us |
fused_qkv_rope | M | 0.6 | 0.01 us | 0.00 us | 9 us |
rms_norm | M | 0.7 | 0.01 us | 0.01 us | 8 us |
mla_decode_attention | C | 34.2 | 0.08 us | 0.06 us | 19 us |
attention_forward | C | 15.4k | 104.97 ms | 66.78 ms | 124.2 ms |
causal_conv1d | M | 3.0 | 4.21 us | 3.18 us | 272 us |
moe_topk_gating_softmax | M | 0.9 | 0.29 us | 0.22 us | 13 us |
fused_qk_rmsnorm | M | 0.7 | 0.01 us | 0.00 us | 8 us |
silu_and_mul | M | 0.8 | 0.79 us | 0.59 us | 8 us |
chunk_gated_delta_rule_state | C | 46.8 | 44.04 us | 28.02 us | 1.2 ms |
fused_add_rms_norm | M | 0.5 | 5.72 us | 4.31 us | 14 us |
moe_align_block_size | I | 0.0 | 0.04 us | 0.03 us | 122 us |
mrope | M | 0.6 | 42.32 us | 31.94 us | 139 us |
chunk_delta_rule_output | C | 51.4 | 5.03 us | 3.20 us | 47 us |
fp8_dynamic_per_token_quant | M | 0.3 | 0.39 us | 0.29 us | 8 us |
linear_sigmoid_mul | C | 1.1k | 1.49 ms | 948.55 us | 1.3 ms |
per_token_group_quant_fp8 | M | 0.3 | 4.79 ms | 3.62 ms | 7.1 ms |
gated_rms_norm | M | 1.3 | 30.98 us | 23.38 us | 120 us |
l2_norm | M | 0.8 | 17.89 us | 13.50 us | 59 us |
moe_count_and_sort | I | 0.0 | 0.01 us | 0.01 us | 17 us |
fused_rmsnorm_quant | M | 0.7 | 93.50 us | 70.56 us | 141 us |
moe_sum_reduce | M | 0.4 | 46.48 us | 35.08 us | 180 us |
layer_norm | M | 1.7 | 17.83 us | 13.46 us | 64 us |
C = compute-bound (AI > 10 FLOP/byte), M = memory-bound, I = indexing/structural. TSOL = Speed-of-Light theoretical minimum. TProd = deployed kernel median time on selected hardware.
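A minimal sketch of the legend's rule follows, assuming arithmetic intensity is FLOPs divided by bytes moved and that the indexing/structural regime is flagged separately, since it cannot be inferred from AI alone. The helper names are hypothetical. Some rows in the table carry a C label although their median-shape AI is below 10, so the published labels presumably rest on more than this threshold.

```python
def arithmetic_intensity(flops, bytes_moved):
    """AI in FLOP/byte."""
    return flops / bytes_moved


def regime(ai, structural=False, threshold=10.0):
    """Classify an operator as C, M, or I following the legend.

    `structural` is an assumed manual flag for indexing-style operators.
    """
    if structural:
        return "I"
    return "C" if ai > threshold else "M"


# AI values taken from the table rows above.
print(regime(102.8))                 # unified_attention
print(regime(2.6))                   # paged_attention_decode
print(regime(0.0, structural=True))  # reshape_and_cache
```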
Per-shape details are listed under each operator. S = TSOL / TProd; higher is better (100% = hardware limit).
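The score and its per-operator median can be sketched as follows. The timing pair is hypothetical; the list of per-shape scores is copied from the mrope rows of the table, whose operator-level median is reported as 15.1%.

```python
from statistics import median


def sol_score(t_sol, t_prod):
    """S = TSOL / TProd as a fraction; 1.0 means the hardware limit is reached."""
    return t_sol / t_prod


# Hypothetical timings (seconds): a 2 us bound against an 8 us deployed kernel.
example = sol_score(2e-6, 8e-6)
print(f"S = {example:.1%}")

# Per-shape S values (percent) from the mrope rows of the table.
mrope_scores = [15.1, 20.7, 23.0, 10.5, 13.1]
mrope_median = median(mrope_scores)
print(f"mrope S (median) = {mrope_median}%")
```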
| Operator | Shapes | S (median) | |||
|---|---|---|---|---|---|
| ▶ | moe_align_block_size | 6 | 0.0% | ||
| #0 (vLLM TP1), MoE routing align_block_size, tokens=1 top_k=8 block_size=128 | 0.0% | ||||
| #2 (vLLM TP4), MoE routing align_block_size, tokens=520 top_k=8 block_size=128 | 0.0% | ||||
| #3 (vLLM TP4), MoE routing align_block_size, tokens=1019 top_k=8 block_size=128 | 0.0% | ||||
| #4 (vLLM TP4), MoE routing align_block_size, tokens=2044 top_k=8 block_size=128 | 0.0% | ||||
| #5 (vLLM TP4), MoE routing align_block_size, tokens=3936 top_k=8 block_size=128 | 0.0% | ||||
| #6 (vLLM TP4), MoE routing align_block_size, tokens=8192 top_k=8 block_size=128 | 0.0% | ||||
| ▶ | moe_count_and_sort | 5 | 0.1% | ||
| #0 (rtp-llm TP1), MoE routing count_and_sort, tokens=1 top_k=8 experts=128 | 0.0% | ||||
| #1 (vLLM TP4), MoE routing count_and_sort, tokens=76 top_k=8 experts=128 | 0.0% | ||||
| #2 (vLLM TP4), MoE routing count_and_sort, tokens=520 top_k=8 experts=128 | 0.1% | ||||
| #3 (vLLM TP4), MoE routing count_and_sort, tokens=1019 top_k=8 experts=128 | 0.1% | ||||
| #4 (vLLM TP4), MoE routing count_and_sort, tokens=2044 top_k=8 experts=128 | 0.1% | ||||
| ▶ | fused_qkv_rope | 6 | 0.1% | ||
| #0 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=32/4 head_dim=128 | 0.0% | ||||
| #1 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=40/8 head_dim=128 | 0.1% | ||||
| #2 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=64/8 head_dim=128 | 0.1% | ||||
| #3 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=32/8 head_dim=128 | 0.0% | ||||
| #5 (rtp-llm TP2), text decoder fusedQKV+RoPE, heads=32/4 head_dim=128 | 0.0% | ||||
| #6 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=40/8 head_dim=128 | 0.1% | ||||
| ▶ | fused_qk_rmsnorm | 4 | 0.1% | ||
| #0 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=64/8 head_dim=128 | 0.1% | ||||
| #1 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=32/8 head_dim=128 | 0.0% | ||||
| #3 (rtp-llm TP2), text decoder fused QK RMSNorm, heads=32/4 head_dim=128 | 0.0% | ||||
| #4 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=40/8 head_dim=128 | 0.1% | ||||
| ▶ | gated_delta_rule_update | 17 | 0.7% | ||
| #0 (SGLang TP1), gated delta rule update, tokens=1 heads=16/32 dim=128/128 | 16.6% | ||||
| #1 (vLLM TP1), gated delta rule update, tokens=128 heads=16/32 dim=128/128 | 0.7% | ||||
| #2 (vLLM TP1), gated delta rule update, tokens=258 heads=16/32 dim=128/128 | 0.7% | ||||
| #3 (vLLM TP1), gated delta rule update, tokens=544 heads=16/32 dim=128/128 | 0.7% | ||||
| #4 (vLLM TP1), gated delta rule update, tokens=1033 heads=16/32 dim=128/128 | 0.7% | ||||
| #5 (vLLM TP1), gated delta rule update, tokens=2051 heads=16/32 dim=128/128 | 0.7% | ||||
| #6 (vLLM TP1), gated delta rule update, tokens=4120 heads=16/32 dim=128/128 | 0.7% | ||||
| #7 (vLLM TP1), gated delta rule update, tokens=8192 heads=16/32 dim=128/128 | 0.6% | ||||
| #8 (vLLM TP4), gated delta rule update, tokens=1 heads=4/8 dim=128/128 | 4.5% | ||||
| #9 (SGLang TP8), gated delta rule update, tokens=1 heads=8/16 dim=128/128 | 9.1% | ||||
| #10 (SGLang TP8), gated delta rule update, tokens=129 heads=8/16 dim=128/128 | 0.4% | ||||
| #12 (SGLang), gated delta rule update, tokens=130 heads=16/32 dim=128/128 | 0.7% | ||||
| #13 (vLLM), gated delta rule update, tokens=256 heads=16/32 dim=128/128 | 0.7% | ||||
| #14 (SGLang), gated delta rule update, tokens=512 heads=16/32 dim=128/128 | 0.7% | ||||
| #15 (SGLang), gated delta rule update, tokens=1029 heads=16/32 dim=128/128 | 0.7% | ||||
| #16 (SGLang), gated delta rule update, tokens=1831 heads=16/32 dim=128/128 | 0.7% | ||||
| #17 (SGLang), gated delta rule update, tokens=3251 heads=16/32 dim=128/128 | 0.7% | ||||
| ▶ | causal_conv1d | 9 | 1.2% | ||
| #0 (SGLang TP1), causal depthwise conv1d, tokens=1 dim=4096 width=4 | 0.3% | ||||
| #1 (vLLM TP1), causal depthwise conv1d, tokens=128 dim=4096 width=4 | 1.1% | ||||
| #2 (vLLM TP1), causal depthwise conv1d, tokens=256 dim=4096 width=4 | 1.0% | ||||
| #3 (vLLM TP1), causal depthwise conv1d, tokens=514 dim=4096 width=4 | 1.6% | ||||
| #4 (vLLM TP1), causal depthwise conv1d, tokens=1024 dim=4096 width=4 | 1.2% | ||||
| #5 (SGLang TP8), causal depthwise conv1d, tokens=1 dim=8192 width=4 | 0.4% | ||||
| #6 (SGLang), causal depthwise conv1d, tokens=4195 dim=8192 width=4 | 1.9% | ||||
| #7 (SGLang), causal depthwise conv1d, tokens=11027 dim=8192 width=4 | 2.0% | ||||
| #8 (SGLang), causal depthwise conv1d, tokens=14807 dim=8192 width=4 | 2.0% | ||||
| ▶ | topk_filter | 8 | 1.5% | ||
| #0 (vLLM), top-k logit filter, shape=(1,151936) | 0.1% | ||||
| #1 (vLLM), top-k logit filter, shape=(1,152064) | 0.1% | ||||
| #2 (vLLM), top-k logit filter, shape=(922,2048) | 1.5% | ||||
| #3 (SGLang), top-k logit filter, shape=(3,4002) | 0.1% | ||||
| #4 (SGLang), top-k logit filter, shape=(4098,4096) | 3.1% | ||||
| #5 (SGLang), top-k logit filter, shape=(8179,4096) | 2.5% | ||||
| #6 (SGLang), top-k logit filter, shape=(15381,4096) | 3.2% | ||||
| #7 (SGLang), top-k logit filter, shape=(3,4215) | 0.1% | ||||
| ▶ | moe_topk_gating_softmax | 10 | 1.7% | ||
| #0 (rtp-llm TP1), MoE routing topk_gating_softmax, tokens=1 experts=128 | 0.0% | ||||
| #1 (vLLM TP1), MoE routing topk_gating_softmax, tokens=129 experts=128 | 0.1% | ||||
| #2 (vLLM TP1), MoE routing topk_gating_softmax, tokens=256 experts=128 | 0.2% | ||||
| #3 (vLLM TP4), MoE routing topk_gating_softmax, tokens=520 experts=128 | 0.5% | ||||
| #4 (vLLM TP4), MoE routing topk_gating_softmax, tokens=1019 experts=128 | 0.8% | ||||
| #5 (vLLM TP4), MoE routing topk_gating_softmax, tokens=2044 experts=128 | 1.7% | ||||
| #6 (SGLang TP8), MoE routing topk_gating_softmax, tokens=4083 experts=128 | 2.6% | ||||
| #7 (vLLM TP4), MoE routing topk_gating_softmax, tokens=8192 experts=128 | 3.5% | ||||
| #8 (vLLM TP inferred), MoE routing topk_softmax, tokens=8037 experts=256 | 4.2% | ||||
| #9 (vLLM TP inferred), MoE routing topk_softmax, tokens=13526 experts=256 | 4.9% | ||||
| ▶ | mla_decode_attention | 3 | 1.7% | ||
| #0 (SGLang TP2), MLA decode attention, nhead=128 kv_lora_rank=512 rope_dim=64 | 1.7% | ||||
| #1 (SGLang TP8), MLA decode attention, grid=(0,0,0) | 0.3% | ||||
| #2 (SGLang TP8), MLA decode attention, grid=(7,1,1) | 2.1% | ||||
| ▶ | paged_attention_decode | 8 | 3.9% | ||
| #0 (vLLM), decode paged attention, heads=16/16 head_dim=128 ctx_len=64 | 17.1% | ||||
| #1 (SGLang TP1), decode paged attention, heads=16/2 head_dim=256 | 0.9% | ||||
| #2 (vLLM), decode paged attention, heads=3/3 head_dim=128 ctx_len=64 | 10.1% | ||||
| #3 (rtp-llm TP1), decode paged attention, heads=32/4 head_dim=128 | 2.0% | ||||
| #4 (rtp-llm TP1), decode paged attention, heads=32/8 head_dim=128 | 3.9% | ||||
| #5 (SGLang TP8), decode paged attention, heads=4/1 head_dim=128 | 0.3% | ||||
| #6 (rtp-llm TP1), decode paged attention, heads=40/8 head_dim=128 | 3.9% | ||||
| #7 (rtp-llm TP1), decode paged attention, heads=64/8 head_dim=128 | 4.0% | ||||
| ▶ | chunk_gated_delta_rule_state | 25 | 4.5% | ||
| #0 (vLLM TP1), chunk state update, tokens=2048 heads=1/2 dim=128/128 | 1.3% | ||||
| #1 (vLLM TP1), chunk state update, tokens=2048 heads=10/20 dim=128/128 | 6.3% | ||||
| #2 (vLLM TP1), chunk state update, tokens=2048 heads=11/22 dim=128/128 | 4.8% | ||||
| #3 (vLLM TP1), chunk state update, tokens=2048 heads=12/24 dim=128/128 | 4.8% | ||||
| #4 (vLLM TP1), chunk state update, tokens=2048 heads=13/26 dim=128/128 | 5.3% | ||||
| #5 (vLLM TP1), chunk state update, tokens=2048 heads=14/28 dim=128/128 | 5.6% | ||||
| #6 (vLLM TP1), chunk state update, tokens=2048 heads=15/30 dim=128/128 | 6.0% | ||||
| #7 (vLLM TP1), chunk state update, tokens=2048 heads=16/32 dim=128/128 | 4.5% | ||||
| #8 (vLLM TP1), chunk state update, tokens=2048 heads=2/4 dim=128/128 | 2.5% | ||||
| #9 (vLLM TP1), chunk state update, tokens=4096 heads=2/4 dim=128/128 | 2.6% | ||||
| #10 (vLLM TP1), chunk state update, tokens=8192 heads=2/4 dim=128/128 | 2.7% | ||||
| #11 (vLLM TP1), chunk state update, tokens=16384 heads=2/4 dim=128/128 | 2.4% | ||||
| #12 (vLLM TP1), chunk state update, tokens=24576 heads=2/4 dim=128/128 | 2.3% | ||||
| #13 (vLLM TP1), chunk state update, tokens=2048 heads=3/6 dim=128/128 | 3.7% | ||||
| #14 (vLLM TP4), chunk state update, tokens=512 heads=4/8 dim=128/128 | 4.0% | ||||
| #15 (vLLM TP4), chunk state update, tokens=1024 heads=4/8 dim=128/128 | 4.4% | ||||
| #16 (vLLM TP1), chunk state update, tokens=2048 heads=4/8 dim=128/128 | 4.8% | ||||
| #17 (vLLM TP1), chunk state update, tokens=2048 heads=5/10 dim=128/128 | 6.1% | ||||
| #18 (vLLM TP1), chunk state update, tokens=2048 heads=6/12 dim=128/128 | 3.8% | ||||
| #19 (vLLM TP1), chunk state update, tokens=2048 heads=7/14 dim=128/128 | 4.5% | ||||
| #20 (SGLang TP8), chunk state update, tokens=1024 heads=8/16 dim=128/128 | 4.7% | ||||
| #21 (SGLang TP8), chunk state update, tokens=2048 heads=8/16 dim=128/128 | 5.0% | ||||
| #22 (vLLM TP1), chunk state update, tokens=2048 heads=9/18 dim=128/128 | 5.8% | ||||
| #23 (SGLang/vLLM), chunk state update, tokens=4195 heads=16/32 dim=128/128 | 4.3% | ||||
| #24 (SGLang/vLLM), chunk state update, tokens=11027 heads=16/32 dim=128/128 | 4.3% | ||||
| ▶ | unified_attention | 25 | 7.4% | ||
| #0 (vLLM TP4), prefill paged attention, seqs=130 heads=10/1 head_dim=128 | 7.3% | ||||
| #1 (vLLM TP4), prefill paged attention, seqs=290 heads=10/1 head_dim=128 | 7.3% | ||||
| #2 (vLLM TP4), prefill paged attention, seqs=484 heads=10/1 head_dim=128 | 7.5% | ||||
| #3 (vLLM TP4), prefill paged attention, seqs=1 heads=16/1 head_dim=128 | 0.1% | ||||
| #4 (SGLang), causal attention, seq_len=1688 heads=16/16 head_dim=192 | 6.7% | ||||
| #5 (vLLM), causal attention, seq_len=11840 heads=16/16 head_dim=72 | 3.6% | ||||
| #6 (rtp-llm), causal attention, seq_len=24752 heads=16/16 head_dim=80 | 9.6% | ||||
| #7 (vLLM), causal attention, seq_len=1144 heads=16/2 head_dim=128 | 8.9% | ||||
| #8 (vLLM), causal attention, seq_len=875 heads=16/2 head_dim=256 | 6.5% | ||||
| #9 (vLLM TP1), prefill paged attention, seqs=324 heads=16/4 head_dim=128 | 7.5% | ||||
| #10 (vLLM TP1), prefill paged attention, seqs=699 heads=16/4 head_dim=128 | 7.5% | ||||
| #11 (vLLM TP1), prefill paged attention, seqs=1032 heads=16/4 head_dim=128 | 7.5% | ||||
| #12 (vLLM TP1), prefill paged attention, seqs=1882 heads=16/4 head_dim=128 | 7.4% | ||||
| #13 (vLLM TP1), prefill paged attention, seqs=3095 heads=16/4 head_dim=128 | 7.4% | ||||
| #14 (vLLM), causal attention, seq_len=72 heads=16/4 head_dim=256 | 0.4% | ||||
| #15 (vLLM), causal attention, seq_len=7 heads=20/20 head_dim=64 | 0.1% | ||||
| #16 (vLLM TP1), prefill paged attention, seqs=18 heads=28/4 head_dim=128 | 7.0% | ||||
| #17 (vLLM TP1), prefill paged attention, seqs=118 heads=28/4 head_dim=128 | 7.4% | ||||
| #18 (vLLM TP1), prefill paged attention, seqs=1192 heads=28/4 head_dim=128 | 7.4% | ||||
| #19 (vLLM TP1), prefill paged attention, seqs=2056 heads=28/4 head_dim=128 | 7.4% | ||||
| #20 (vLLM TP1), prefill paged attention, seqs=1 heads=32/4 head_dim=128 | 14.3% | ||||
| #21 (vLLM TP1), prefill paged attention, seqs=1 heads=32/8 head_dim=128 | 12.2% | ||||
| #22 (vLLM), causal attention, seq_len=49152 heads=4/4 head_dim=72 | 3.7% | ||||
| #23 (vLLM), causal attention, seq_len=145 heads=40/8 head_dim=128 | 2.1% | ||||
| #24 (SGLang), causal attention, seq_len=5926 heads=8/1 head_dim=256 | 10.9% | ||||
| ▶ | chunk_delta_rule_output | 16 | 9.5% | ||
| #0 (vLLM TP1), chunk output, tokens=64 heads=16/32 dim=128/128 | 2.9% | ||||
| #1 (vLLM TP1), chunk output, tokens=128 heads=16/32 dim=128/128 | 4.9% | ||||
| #2 (vLLM TP1), chunk output, tokens=256 heads=16/32 dim=128/128 | 6.3% | ||||
| #3 (vLLM TP1), chunk output, tokens=512 heads=16/32 dim=128/128 | 8.0% | ||||
| #4 (vLLM TP1), chunk output, tokens=1024 heads=16/32 dim=128/128 | 9.6% | ||||
| #5 (vLLM TP1), chunk output, tokens=2048 heads=16/32 dim=128/128 | 9.8% | ||||
| #6 (vLLM TP1), chunk output, tokens=4096 heads=16/32 dim=128/128 | 10.0% | ||||
| #7 (vLLM TP1), chunk output, tokens=8192 heads=16/32 dim=128/128 | 10.3% | ||||
| #8 (SGLang TP8), chunk output, tokens=704 heads=8/16 dim=128/128 | 6.9% | ||||
| #9 (SGLang TP8), chunk output, tokens=1024 heads=8/16 dim=128/128 | 7.8% | ||||
| #10 (SGLang TP8), chunk output, tokens=2112 heads=8/16 dim=128/128 | 9.3% | ||||
| #11 (SGLang TP8), chunk output, tokens=4096 heads=8/16 dim=128/128 | 9.5% | ||||
| #12 (SGLang TP8), chunk output, tokens=8256 heads=8/16 dim=128/128 | 9.7% | ||||
| #13 (SGLang/vLLM), chunk output, tokens=4195 heads=16/32 dim=128/128 | 9.5% | ||||
| #14 (SGLang/vLLM), chunk output, tokens=11027 heads=16/32 dim=128/128 | 9.5% | ||||
| #15 (SGLang/vLLM), chunk output, tokens=14807 heads=16/32 dim=128/128 | 10.4% | ||||
| ▶ | layer_norm | 13 | 11.1% | ||
| #0 (vLLM TP4), shape=(276,1,1152) | 2.7% | ||||
| #1 (vLLM TP4), shape=(600,1,1152) | 5.3% | ||||
| #2 (vLLM TP4), shape=(1012,1,1152) | 7.3% | ||||
| #3 (vLLM TP4), shape=(2024,1,1152) | 11.1% | ||||
| #4 (vLLM TP4), shape=(4100,1,1152) | 15.3% | ||||
| #5 (vLLM TP4), shape=(8184,1,1152) | 18.7% | ||||
| #6 (vLLM TP4), shape=(15476,1,1152) | 21.0% | ||||
| #7 (vLLM TP4), shape=(24952,1,1152) | 22.2% | ||||
| #8 (vLLM TP4), shape=(40560,1,1152) | 23.2% | ||||
| #9 (SGLang TP1), LayerNorm, tokens=91192 hidden=2048 | 17.9% | ||||
| #10 (rtp-llm TP1), shape=(1,2048) | 0.0% | ||||
| #11 (rtp-llm TP1), shape=(1,4096) | 0.1% | ||||
| #12 (rtp-llm TP1), shape=(1,5120) | 0.1% | ||||
| ▶ | mrope | 5 | 15.1% | ||
| #0 (vLLM TP2), tokens=3592 heads=16/1 head_dim=128 | 15.1% | ||||
| #1 (vLLM TP2), tokens=8373 heads=16/1 head_dim=128 | 20.7% | ||||
| #2 (vLLM TP2), tokens=16530 heads=16/1 head_dim=128 | 23.0% | ||||
| #3 (SGLang), tokens=838 heads=16/4 head_dim=128 | 10.5% | ||||
| #4 (SGLang), tokens=1673 heads=16/4 head_dim=128 | 13.1% | ||||
| ▶ | reshape_and_cache | 8 | 15.8% | ||
| #0 (vLLM TP1), KV cache reshape_and_cache, shape=(1,4,128) | 0.1% | ||||
| #1 (vLLM TP1), KV cache reshape_and_cache, shape=(144,2,256) | 2.2% | ||||
| #2 (vLLM TP4), KV cache reshape_and_cache, shape=(248,16,128) | 12.0% | ||||
| #3 (vLLM TP4), KV cache reshape_and_cache, shape=(520,16,128) | 17.2% | ||||
| #4 (vLLM TP4), KV cache reshape_and_cache, shape=(1015,16,128) | 22.4% | ||||
| #5 (vLLM TP4), KV cache reshape_and_cache, shape=(2044,16,128) | 32.5% | ||||
| #6 (vLLM TP1), KV cache reshape_and_cache, shape=(4108,4,128) | 14.2% | ||||
| #7 (vLLM TP1), KV cache reshape_and_cache, shape=(8192,4,128) | 15.8% | ||||
| ▶ | rms_norm | 56 | 16.0% | ||
| #0 (vLLM), RMSNorm, tokens=2500 hidden=1024 | 19.5% | ||||
| #1 (vLLM), RMSNorm, tokens=4968 hidden=1024 | 27.2% | ||||
| #2 (vLLM), RMSNorm, tokens=10000 hidden=1024 | 33.9% | ||||
| #3 (vLLM), RMSNorm, tokens=16200 hidden=1024 | 38.3% | ||||
| #4 (vLLM), RMSNorm, tokens=22100 hidden=1024 | 40.6% | ||||
| #5 (vLLM), RMSNorm, tokens=360 hidden=1152 | 4.0% | ||||
| #6 (vLLM), RMSNorm, tokens=720 hidden=1152 | 7.8% | ||||
| #7 (vLLM), RMSNorm, tokens=1200 hidden=1152 | 11.3% | ||||
| #8 (vLLM), RMSNorm, tokens=2116 hidden=1152 | 15.0% | ||||
| #9 (vLLM), RMSNorm, tokens=3844 hidden=1152 | 18.9% | ||||
| #10 (vLLM), RMSNorm, tokens=8136 hidden=1152 | 22.7% | ||||
| #11 (vLLM), RMSNorm, tokens=15000 hidden=1152 | 25.0% | ||||
| #12 (vLLM), RMSNorm, tokens=24368 hidden=1152 | 26.1% | ||||
| #13 (vLLM), RMSNorm, tokens=49152 hidden=1152 | 27.1% | ||||
| #14 (vLLM), RMSNorm, tokens=65556 hidden=1152 | 27.4% | ||||
| #15 (SGLang), RMSNorm, tokens=3 hidden=128 | 0.0% | ||||
| #16 (SGLang), RMSNorm, tokens=1504 hidden=128 | 1.9% | ||||
| #17 (SGLang), RMSNorm, tokens=2048 hidden=128 | 2.6% | ||||
| #18 (SGLang), RMSNorm, tokens=4096 hidden=128 | 4.6% | ||||
| #19 (SGLang), RMSNorm, tokens=8192 hidden=128 | 6.7% | ||||
| #20 (SGLang), RMSNorm, tokens=16384 hidden=128 | 8.5% | ||||
| #21 (SGLang), RMSNorm, tokens=24544 hidden=128 | 9.7% | ||||
| #22 (SGLang), RMSNorm, tokens=49024 hidden=128 | 11.3% | ||||
| #23 (vLLM), RMSNorm, tokens=65488 hidden=128 | 11.7% | ||||
| #24 (vLLM), RMSNorm, tokens=95952 hidden=128 | 12.2% | ||||
| #25 (vLLM), RMSNorm, tokens=131072 hidden=128 | 12.5% | ||||
| #26 (vLLM), RMSNorm, tokens=247680 hidden=128 | 12.5% | ||||
| #27 (SGLang), RMSNorm, tokens=505888 hidden=128 | 13.2% | ||||
| #28 (vLLM), RMSNorm, tokens=7 hidden=1280 | 0.1% | ||||
| #29 (rtp-llm), RMSNorm, tokens=24752 hidden=1280 | 29.1% | ||||
| #30 (vLLM), RMSNorm, tokens=1736 hidden=1536 | 17.3% | ||||
| #31 (vLLM), RMSNorm, tokens=5000 hidden=1536 | 26.2% | ||||
| #32 (SGLang), RMSNorm, tokens=128 hidden=2048 | 2.6% | ||||
| #33 (SGLang), RMSNorm, tokens=256 hidden=2048 | 5.1% | ||||
| #34 (SGLang), RMSNorm, tokens=512 hidden=2048 | 9.6% | ||||
| #35 (SGLang), RMSNorm, tokens=1029 hidden=2048 | 17.5% | ||||
| #36 (vLLM TP1), RMSNorm, tokens=2010 hidden=2048 | 24.7% | ||||
| #37 (SGLang), RMSNorm, tokens=4174 hidden=2048 | 30.7% | ||||
| #38 (vLLM), RMSNorm, tokens=8290 hidden=2048 | 36.7% | ||||
| #39 (vLLM), RMSNorm, tokens=16530 hidden=2048 | 40.7% | ||||
| #40 (vLLM), RMSNorm, tokens=20538 hidden=2048 | 39.4% | ||||
| #41 (SGLang), RMSNorm, tokens=3 hidden=256 | 0.0% | ||||
| #42 (SGLang), RMSNorm, tokens=1672 hidden=2560 | 22.7% | ||||
| #43 (vLLM TP1), RMSNorm, tokens=298 hidden=3584 | 10.3% | ||||
| #44 (vLLM TP1), RMSNorm, tokens=514 hidden=3584 | 16.0% | ||||
| #45 (vLLM TP1), RMSNorm, tokens=1024 hidden=3584 | 24.3% | ||||
| #46 (SGLang), RMSNorm, tokens=3 hidden=512 | 0.0% | ||||
| #47 (rtp-llm TP1), RMSNorm, tokens=1 hidden=5120 | 0.1% | ||||
| #48 (SGLang TP8), RMSNorm, tokens=53 hidden=7168 | 3.8% | ||||
| #49 (SGLang TP8), RMSNorm, tokens=141 hidden=7168 | 9.7% | ||||
| #50 (SGLang TP8), RMSNorm, tokens=632 hidden=7168 | 26.5% | ||||
| #51 (SGLang TP8), RMSNorm, tokens=917 hidden=7168 | 30.2% | ||||
| #52 (SGLang/vLLM), RMSNorm, tokens=1688 hidden=7168 | 36.7% | ||||
| #53 (SGLang/vLLM), RMSNorm, tokens=3804 hidden=7168 | 44.3% | ||||
| #54 (SGLang), RMSNorm, tokens=154 hidden=8192 | 12.0% | ||||
| #55 (SGLang), RMSNorm, tokens=510 hidden=8192 | 25.5% | ||||
| ▶ | moe_sum_reduce | 7 | 19.5% | ||
| #0 (SGLang TP inferred), MoE top-8 expert sum, tokens=11027 hidden=2048 | 21.2% | ||||
| #1 (SGLang TP inferred), MoE top-8 expert sum, tokens=6030 hidden=2048 | 20.0% | ||||
| #2 (SGLang TP inferred), MoE top-8 expert sum, tokens=5983 hidden=2048 | 19.5% | ||||
| #3 (SGLang TP inferred), MoE top-8 expert sum, tokens=5043 hidden=2048 | 19.5% | ||||
| #4 (SGLang TP inferred), MoE top-8 expert sum, tokens=4996 hidden=2048 | 19.5% | ||||
| #5 (SGLang TP inferred), MoE top-8 expert sum, tokens=4363 hidden=2048 | 19.1% | ||||
| #6 (SGLang TP inferred), MoE top-8 expert sum, tokens=4195 hidden=2048 | 18.9% | ||||
| ▶ | gated_rms_norm | 7 | 19.5% | ||
| #0 (SGLang), gated RMSNorm, rows=352864 hidden=128 | 20.6% | ||||
| #1 (SGLang), gated RMSNorm, rows=192960 hidden=128 | 19.8% | ||||
| #2 (SGLang), gated RMSNorm, rows=191456 hidden=128 | 19.8% | ||||
| #3 (SGLang), gated RMSNorm, rows=161376 hidden=128 | 19.5% | ||||
| #4 (SGLang), gated RMSNorm, rows=159872 hidden=128 | 19.5% | ||||
| #5 (SGLang), gated RMSNorm, rows=139616 hidden=128 | 19.4% | ||||
| #6 (SGLang), gated RMSNorm, rows=134240 hidden=128 | 19.4% | ||||
| ▶ | fused_moe | 23 | 20.9% | ||
| #0 (vLLM), MoE expert FFN, tokens=135 hidden=2048 8 experts top-2 | 12.8% | ||||
| #1 (vLLM), MoE expert FFN, tokens=277 hidden=2048 8 experts top-2 | 12.0% | ||||
| #2 (vLLM), MoE expert FFN, tokens=668 hidden=2048 8 experts top-2 | 20.9% | ||||
| #3 (vLLM), MoE expert FFN, tokens=1023 hidden=2048 8 experts top-2 | 22.0% | ||||
| #4 (vLLM), MoE expert FFN, tokens=1979 hidden=2048 8 experts top-2 | 27.6% | ||||
| #5 (SGLang), MoE expert FFN, tokens=4195 hidden=2048 8 experts top-2 | 27.4% | ||||
| #6 (vLLM), MoE expert FFN, tokens=7689 hidden=2048 8 experts top-2 | 26.0% | ||||
| #7 (SGLang), MoE expert FFN, tokens=15809 hidden=2048 8 experts top-2 | 22.3% | ||||
| #8 (rtp-llm TP1), text decoder MoE expert FFN, hidden_states=(1,2048) 128 experts top-8 intermediate=768 | 6.3% | ||||
| #9 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(2400,2048) 128 experts top-8 intermediate=768 | 20.5% | ||||
| #10 (SGLang TP1), text decoder MoE expert FFN, hidden_states=(4195,2048) 128 experts top-8 intermediate=768 | 22.7% | ||||
| #11 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(8192,2048) 128 experts top-8 intermediate=768 | 24.5% | ||||
| #12 (SGLang TP1), text decoder MoE expert FFN, hidden_states=(15809,2048) 128 experts top-8 intermediate=768 | 22.5% | ||||
| #13 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(25632,2048) 128 experts top-8 intermediate=768 | 20.2% | ||||
| #14 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(45936,2048) 128 experts top-8 intermediate=768 | 18.9% | ||||
| #15 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(520,4096) 128 experts top-8 intermediate=384 | 18.3% | ||||
| #16 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(1019,4096) 128 experts top-8 intermediate=384 | 17.1% | ||||
| #17 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(2044,4096) 128 experts top-8 intermediate=384 | 19.4% | ||||
| #18 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(3936,4096) 128 experts top-8 intermediate=384 | 21.9% | ||||
| #19 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(8192,4096) 128 experts top-8 intermediate=384 | 22.3% | ||||
| #20 (SGLang), MoE expert FFN, tokens=4098 hidden=4096 8 experts top-2 | 28.7% | ||||
| #21 (SGLang), MoE expert FFN, tokens=8179 hidden=4096 8 experts top-2 | 20.6% | ||||
| #22 (SGLang), MoE expert FFN, tokens=15381 hidden=4096 8 experts top-2 | 17.5% | ||||
| ▶ | l2_norm | 13 | 21.8% | ||
| #0 (vLLM TP1), L2 norm, tokens=1 hidden=2048 | 0.0% | ||||
| #1 (vLLM TP1), L2 norm, tokens=128 hidden=2048 | 2.6% | ||||
| #2 (vLLM TP1), L2 norm, tokens=256 hidden=2048 | 5.1% | ||||
| #3 (vLLM TP1), L2 norm, tokens=480 hidden=2048 | 8.2% | ||||
| #4 (vLLM TP1), L2 norm, tokens=2304 hidden=2048 | 16.6% | ||||
| #5 (vLLM TP1), L2 norm, tokens=4064 hidden=2048 | 20.9% | ||||
| #6 (vLLM TP1), L2 norm, tokens=8736 hidden=2048 | 22.9% | ||||
| #7 (vLLM TP1), L2 norm, tokens=16608 hidden=2048 | 21.9% | ||||
| #8 (vLLM TP1), L2 norm, tokens=24480 hidden=2048 | 21.8% | ||||
| #9 (vLLM TP1), L2 norm, tokens=48704 hidden=2048 | 23.8% | ||||
| #10 (vLLM TP1), L2 norm, tokens=65792 hidden=2048 | 23.6% | ||||
| #11 (vLLM TP1), L2 norm, tokens=98880 hidden=2048 | 24.1% | ||||
| #12 (vLLM TP1), L2 norm, tokens=131072 hidden=2048 | 23.7% | ||||
| ▶ | fp8_dynamic_per_token_quant | 20 | 22.5% | ||
| #0 (vLLM TP1), FP8 dynamic per-token quantization, input=(1,2048) | 0.0% | ||||
| #1 (vLLM TP1), FP8 dynamic per-token quantization, input=(608,2048) | 8.7% | ||||
| #2 (vLLM TP1), FP8 dynamic per-token quantization, input=(865,2048) | 11.0% | ||||
| #3 (vLLM TP1), FP8 dynamic per-token quantization, input=(2032,2048) | 16.9% | ||||
| #4 (vLLM TP1), FP8 dynamic per-token quantization, input=(4144,2048) | 22.5% | ||||
| #5 (vLLM TP1), FP8 dynamic per-token quantization, input=(6920,2048) | 25.2% | ||||
| #6 (vLLM TP1), FP8 dynamic per-token quantization, input=(16256,2048) | 25.6% | ||||
| #7 (vLLM TP1), FP8 dynamic per-token quantization, input=(24712,2048) | 27.1% | ||||
| #8 (vLLM TP1), FP8 dynamic per-token quantization, input=(45568,2048) | 28.8% | ||||
| #10 (SGLang TP8), FP8 dynamic per-token quantization, input=(1,4096) | 0.0% | ||||
| #11 (SGLang TP8), FP8 dynamic per-token quantization, input=(127,4096) | 3.8% | ||||
| #12 (SGLang TP8), FP8 dynamic per-token quantization, input=(255,4096) | 7.5% | ||||
| #13 (SGLang TP8), FP8 dynamic per-token quantization, input=(513,4096) | 12.8% | ||||
| #14 (SGLang TP8), FP8 dynamic per-token quantization, input=(1029,4096) | 20.1% | ||||
| #15 (SGLang TP8), FP8 dynamic per-token quantization, input=(2041,4096) | 27.0% | ||||
| #16 (SGLang TP8), FP8 dynamic per-token quantization, input=(4104,4096) | 33.2% | ||||
| #17 (SGLang TP8), FP8 dynamic per-token quantization, input=(8208,4096) | 36.7% | ||||
| #18 (rtp-llm TP2), FP8 dynamic per-token quantization, input=(1,5120) | 0.0% | ||||
| #19 (SGLang/vLLM), FP8 dynamic per-token quantization, input=(1688,7168) | 31.1% | ||||
| #20 (SGLang/vLLM), FP8 dynamic per-token quantization, input=(3804,7168) | 36.4% | ||||
| ▶ | silu_and_mul | 37 | 24.0% | ||
| #0 (rtp-llm TP1), text decoder MoE expert gated SiLU, shape=(1,1536) | 0.0% | ||||
| #1 (SGLang TP1), text decoder gated SiLU activation, shape=(1630,1536) | 11.6% | ||||
| #2 (vLLM TP1), text decoder gated SiLU activation, shape=(4688,1536) | 19.2% | ||||
| #3 (vLLM TP1), text decoder gated SiLU activation, shape=(10704,1536) | 24.4% | ||||
| #4 (vLLM TP1), text decoder gated SiLU activation, shape=(16024,1536) | 26.5% | ||||
| #5 (vLLM TP1), text decoder gated SiLU activation, shape=(24656,1536) | 28.1% | ||||
| #6 (vLLM TP1), text decoder gated SiLU activation, shape=(49032,1536) | 29.6% | ||||
| #7 (vLLM TP1), text decoder gated SiLU activation, shape=(128632,1536) | 30.8% | ||||
| #8 (vLLM TP1), text decoder gated SiLU activation, shape=(175680,1536) | 31.1% | ||||
| #9 (vLLM), gated SiLU activation, shape=(1,2048) | 0.0% | ||||
| #10 (SGLang TP1), text decoder gated SiLU, shape=(183,2048) | 2.7% | ||||
| #11 (SGLang TP1), text decoder gated SiLU, shape=(1113,2048) | 13.9% | ||||
| #12 (vLLM), gated SiLU activation, shape=(1,2560) | 0.0% | ||||
| #13 (vLLM), gated SiLU activation, shape=(1736,3072) | 19.9% | ||||
| #14 (vLLM), gated SiLU activation, shape=(5000,3072) | 27.7% | ||||
| #15 (vLLM TP4), text decoder gated SiLU activation, shape=(11570,3072) | 31.8% | ||||
| #16 (vLLM TP4), text decoder gated SiLU activation, shape=(14320,3072) | 24.4% | ||||
| #17 (SGLang), gated SiLU activation, shape=(128,4096) | 3.9% | ||||
| #18 (SGLang), gated SiLU activation, shape=(256,4096) | 7.4% | ||||
| #19 (SGLang), gated SiLU activation, shape=(512,4096) | 13.1% | ||||
| #20 (vLLM), gated SiLU activation, shape=(1023,4096) | 19.0% | ||||
| #21 (vLLM TP4), text decoder gated SiLU activation, shape=(2044,4096) | 24.3% | ||||
| #22 (vLLM), gated SiLU activation, shape=(4093,4096) | 28.7% | ||||
| #23 (vLLM TP4), text decoder gated SiLU activation, shape=(8192,4096) | 32.6% | ||||
| #24 (vLLM), gated SiLU activation, shape=(16418,4096) | 34.7% | ||||
| #25 (vLLM), gated SiLU activation, shape=(20538,4096) | 35.1% | ||||
| #26 (vLLM), gated SiLU activation, shape=(528,5120) | 12.4% | ||||
| #27 (vLLM), gated SiLU activation, shape=(1056,5120) | 15.8% | ||||
| #28 (vLLM), gated SiLU activation, shape=(1910,5120) | 18.8% | ||||
| #29 (SGLang), gated SiLU activation, shape=(4177,5120) | 21.2% | ||||
| #30 (vLLM), gated SiLU activation, shape=(8192,5120) | 22.6% | ||||
| #31 (vLLM), gated SiLU activation, shape=(540,8192) | 19.4% | ||||
| #32 (vLLM), gated SiLU activation, shape=(1003,8192) | 24.0% | ||||
| #33 (vLLM), gated SiLU activation, shape=(2046,8192) | 28.6% | ||||
| #34 (SGLang), gated SiLU activation, shape=(4098,8192) | 32.3% | ||||
| #35 (SGLang), gated SiLU activation, shape=(8179,8192) | 34.2% | ||||
| #36 (SGLang), gated SiLU activation, shape=(15381,8192) | 35.2% | ||||
| ▶ | attention_forward | 21 | 28.7% | ||
| #0 (vLLM TP1), prefill attention forward, q/k/v=(720,16,72) | 15.0% | ||||
| #1 (vLLM TP1), prefill attention forward, q/k/v=(1200,16,72) | 26.4% | ||||
| #2 (vLLM TP4), prefill attention forward, q/k/v=(2116,16,72) | 26.7% | ||||
| #3 (vLLM TP1), prefill attention forward, q/k/v=(3844,16,72) | 28.7% | ||||
| #4 (vLLM TP4), prefill attention forward, q/k/v=(8136,16,72) | 34.6% | ||||
| #5 (vLLM TP1), prefill attention forward, q/k/v=(17296,16,72) | 34.7% | ||||
| #6 (vLLM TP1), prefill attention forward, q/k/v=(24368,16,72) | 35.3% | ||||
| #7 (vLLM TP1), prefill attention forward, q/k/v=(49596,16,72) | 36.2% | ||||
| #8 (vLLM TP1), prefill attention forward, q/k/v=(65556,16,72) | 35.8% | ||||
| #9 (vLLM TP1), prefill attention forward, q/k/v=(16742,32,128) | 52.9% | ||||
| #10 (vLLM TP1), prefill attention forward, q/k/v=(30793,32,128) | 53.8% | ||||
| #11 (vLLM TP4), prefill attention forward, q/k/v=(276,4,72) | 0.9% | ||||
| #12 (vLLM TP4), prefill attention forward, q/k/v=(600,4,72) | 3.8% | ||||
| #13 (vLLM TP4), prefill attention forward, q/k/v=(1012,4,72) | 8.9% | ||||
| #14 (vLLM TP4), prefill attention forward, q/k/v=(2024,4,72) | 19.2% | ||||
| #15 (vLLM TP4), prefill attention forward, q/k/v=(4100,4,72) | 27.0% | ||||
| #16 (vLLM TP4), prefill attention forward, q/k/v=(8184,4,72) | 28.3% | ||||
| #17 (vLLM TP4), prefill attention forward, q/k/v=(15476,4,72) | 31.7% | ||||
| #18 (vLLM TP4), prefill attention forward, q/k/v=(24952,4,72) | 36.1% | ||||
| #19 (vLLM TP4), prefill attention forward, q/k/v=(40560,4,72) | 37.2% | ||||
| #20 (rtp-llm TP1), prefill attention forward, q/k/v=(1,8,128) | 0.0% | ||||
| ▶ | block_scaled_mm | 24 | 28.9% | ||
| #0 (SGLang TP1), FP8 block-scale GEMM, input=(349,2048) group=128x128 | 10.3% | ||||
| #1 (vLLM TP1), FP8 block-scale GEMM (triton), m=1024 | 25.3% | ||||
| #2 (vLLM TP4), FP8 block-scale GEMM (inferred), m=4533 n=2048 k=2048 | 27.6% | ||||
| #3 (vLLM TP4), FP8 block-scale GEMM (inferred), m=6918 n=2048 k=2048 | 28.6% | ||||
| #4 (vLLM TP1), FP8 block-scale GEMM (triton), m=20480 | 30.2% | ||||
| #5 (vLLM TP1), FP8 block-scale GEMM (triton), m=51200 | 30.1% | ||||
| #6 (vLLM TP1), FP8 block-scale GEMM (triton), m=65536 | 30.1% | ||||
| #7 (vLLM TP1), FP8 block-scale GEMM (triton), m=98304 | 30.2% | ||||
| #8 (vLLM TP1), FP8 block-scale GEMM (triton), m=118784 | 30.2% | ||||
| #9 (vLLM TP1), FP8 block-scale GEMM (triton), m=250880 | 30.2% | ||||
| #10 (vLLM TP1), FP8 block-scale GEMM (triton), m=516096 | 30.2% | ||||
| #11 (vLLM TP1), FP8 block-scale GEMM (triton), m=704512 | 30.2% | ||||
| #12 (SGLang TP1), FP8 block-scale GEMM, input=(237,4096) group=128x128 | 14.6% | ||||
| #13 (SGLang TP8), FP8 block-scale GEMM, m=3072 | 27.9% | ||||
| #14 (SGLang TP8), FP8 block-scale GEMM, m=4096 | 28.2% | ||||
| #15 (SGLang TP8), FP8 block-scale GEMM, m=8192 | 28.6% | ||||
| #16 (SGLang TP8), FP8 block-scale GEMM, m=16384 | 28.5% | ||||
| #17 (SGLang TP8), FP8 block-scale GEMM, m=24576 | 28.8% | ||||
| #18 (SGLang TP8), FP8 block-scale GEMM, m=49152 | 28.8% | ||||
| #19 (SGLang TP8), FP8 block-scale GEMM, m=65536 | 28.9% | ||||
| #20 (SGLang TP8), FP8 block-scale GEMM, m=98304 | 28.9% | ||||
| #21 (SGLang TP8), FP8 block-scale GEMM, m=131072 | 28.9% | ||||
| #22 (SGLang TP8), FP8 block-scale GEMM, m=258048 | 28.9% | ||||
| #23 (SGLang TP8), FP8 block-scale GEMM, m=299520 | 28.9% | ||||
| ▶ | linear_sigmoid_mul | 9 | 30.8% | ||
| #0 (SGLang TP inferred), shared expert gate, hidden_states=(8037,4096) out=4096 | 72.3% | ||||
| #1 (SGLang TP inferred), shared expert gate, hidden_states=(6717,4096) out=4096 | 76.5% | ||||
| #2 (SGLang TP inferred), shared expert gate, hidden_states=(4001,4096) out=4096 | 28.0% | ||||
| #3 (SGLang TP inferred), shared expert gate, hidden_states=(15381,4096) out=4096 | 30.8% | ||||
| #4 (SGLang TP inferred), shared expert gate, hidden_states=(6576,4096) out=4096 | 75.3% | ||||
| #5 (SGLang TP1), shared expert gate, hidden_states=(1,2048) out=2048 | 15.2% | ||||
| #6 (SGLang TP1), shared expert gate, hidden_states=(90,2048) out=2048 | 20.2% | ||||
| #7 (SGLang TP1), shared expert gate, hidden_states=(13,2048) out=2048 | 18.6% | ||||
| #8 (SGLang TP8), shared expert gate, hidden_states=(2748,4096) out=4096 | 70.2% | ||||
| ▶ | fused_add_rms_norm | 19 | 37.0% | ||
| #0 (rtp-llm), fused add RMSNorm, tokens=24752 hidden=1280 | 45.5% | ||||
| #1 (vLLM TP1), fused add RMSNorm, tokens=126 hidden=2048 | 5.0% | ||||
| #2 (vLLM TP1), fused add RMSNorm, tokens=257 hidden=2048 | 9.5% | ||||
| #3 (vLLM TP1), fused add RMSNorm, tokens=508 hidden=2048 | 18.7% | ||||
| #4 (vLLM TP1), fused add RMSNorm, tokens=1024 hidden=2048 | 27.3% | ||||
| #5 (vLLM TP1), fused add RMSNorm, tokens=2032 hidden=2048 | 37.4% | ||||
| #6 (vLLM TP1), fused add RMSNorm, tokens=4096 hidden=2048 | 45.4% | ||||
| #7 (vLLM TP1), fused add RMSNorm, tokens=141 hidden=5120 | 13.5% | ||||
| #8 (vLLM TP1), fused add RMSNorm, tokens=284 hidden=5120 | 21.8% | ||||
| #9 (vLLM TP1), fused add RMSNorm, tokens=558 hidden=5120 | 31.2% | ||||
| #10 (vLLM TP1), fused add RMSNorm, tokens=1024 hidden=5120 | 38.3% | ||||
| #11 (vLLM TP1), fused add RMSNorm, tokens=1720 hidden=5120 | 42.6% | ||||
| #12 (vLLM TP1), fused add RMSNorm, tokens=4205 hidden=5120 | 49.2% | ||||
| #13 (SGLang TP8), fused add RMSNorm, tokens=53 hidden=7168 | 7.3% | ||||
| #14 (SGLang TP8), fused add RMSNorm, tokens=141 hidden=7168 | 17.4% | ||||
| #15 (SGLang TP8), fused add RMSNorm, tokens=632 hidden=7168 | 37.0% | ||||
| #16 (SGLang TP8), fused add RMSNorm, tokens=917 hidden=7168 | 40.2% | ||||
| #17 (SGLang/vLLM), fused add RMSNorm, tokens=1688 hidden=7168 | 45.7% | ||||
| #18 (SGLang/vLLM), fused add RMSNorm, tokens=3804 hidden=7168 | 50.7% | ||||
| ▶ | fused_rmsnorm_quant | 11 | 37.7% | ||
| #0 (SGLang TP1), fused RMSNorm + FP8 quant, input=(117,2048) | 4.1% | ||||
| #1 (SGLang TP1), fused RMSNorm + FP8 quant, input=(837,2048) | 20.5% | ||||
| #2 (SGLang TP1), fused RMSNorm + FP8 quant, input=(1769,2048) | 30.1% | ||||
| #3 (SGLang TP1), fused RMSNorm + FP8 quant, input=(3260,2048) | 37.7% | ||||
| #4 (SGLang TP1), fused RMSNorm + FP8 quant, input=(11399,2048) | 47.4% | ||||
| #5 (SGLang TP1), fused RMSNorm + FP8 quant, input=(26080,2048) | 50.1% | ||||
| #6 (SGLang TP8), fused RMSNorm + FP8 quant, input=(650,4096) | 26.1% | ||||
| #7 (SGLang TP8), fused RMSNorm + FP8 quant, input=(1013,4096) | 33.0% | ||||
| #8 (SGLang TP8), fused RMSNorm + FP8 quant, input=(2058,4096) | 40.7% | ||||
| #9 (SGLang TP8), fused RMSNorm + FP8 quant, input=(4083,4096) | 47.0% | ||||
| #10 (SGLang TP8), fused RMSNorm + FP8 quant, input=(8208,4096) | 51.3% | ||||
| ▶ | fp8_blockscale_fused_moe | 6 | 40.7% | ||
| #0 (SGLang TP1), FP8 block-scale MoE expert FFN, hidden_states=(30821,2048) 128 experts top-8 | 39.7% | ||||
| #1 (SGLang TP2), FP8 block-scale MoE expert FFN, hidden_states=(3804,7168) 256 experts top-8 | 42.7% | ||||
| #2 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=0 | 27.4% | ||||
| #3 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=1 | 63.7% | ||||
| #4 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=0 | 12.0% | ||||
| #5 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=80 | 40.7% | ||||
| ▶ | per_token_group_quant_fp8 | 19 | 44.6% | ||
| #0 (vLLM TP1), FP8 per-token-group quantization, input=(1,2048) group_size=128 | 0.0% | ||||
| #1 (vLLM TP1), FP8 per-token-group quantization, input=(37504,2048) group_size=128 | 41.4% | ||||
| #2 (vLLM TP1), FP8 per-token-group quantization, input=(75008,2048) group_size=128 | 44.0% | ||||
| #3 (vLLM TP1), FP8 per-token-group quantization, input=(85632,2048) group_size=128 | 44.9% | ||||
| #4 (vLLM TP1), FP8 per-token-group quantization, input=(128192,2048) group_size=128 | 46.1% | ||||
| #5 (vLLM TP1), FP8 per-token-group quantization, input=(256896,2048) group_size=128 | 47.0% | ||||
| #6 (vLLM TP1), FP8 per-token-group quantization, input=(500736,2048) group_size=128 | 46.6% | ||||
| #7 (vLLM TP1), FP8 per-token-group quantization, input=(1029056,2048) group_size=128 | 46.0% | ||||
| #8 (vLLM TP1), FP8 per-token-group quantization, input=(2058112,2048) group_size=128 | 45.2% | ||||
| #9 (vLLM TP1), FP8 per-token-group quantization, input=(3087168,2048) group_size=128 | 50.7% | ||||
| #10 (SGLang TP8), FP8 per-token-group quantization, input=(1,7168) group_size=128 | 0.0% | ||||
| #11 (SGLang TP8), FP8 per-token-group quantization, input=(726,7168) group_size=128 | 21.4% | ||||
| #12 (SGLang TP8), FP8 per-token-group quantization, input=(969,7168) group_size=128 | 25.4% | ||||
| #13 (SGLang TP8), FP8 per-token-group quantization, input=(1933,7168) group_size=128 | 34.3% | ||||
| #14 (SGLang TP8), FP8 per-token-group quantization, input=(4192,7168) group_size=128 | 36.1% | ||||
| #15 (SGLang TP8), FP8 per-token-group quantization, input=(8667,7168) group_size=128 | 38.8% | ||||
| #16 (SGLang TP8), FP8 per-token-group quantization, input=(16768,7168) group_size=128 | 44.6% | ||||
| #17 (SGLang TP8), FP8 per-token-group quantization, input=(40448,7168) group_size=128 | 46.6% | ||||
| #18 (SGLang TP8), FP8 per-token-group quantization, input=(58688,7168) group_size=128 | 46.7% | ||||