Benchmark Data

30 production operators with 440 hot shapes, importance weights from 1,303 production profiles, roofline bounds across 2 hardware platforms, and deployed kernel baselines.

30 operators

Operator	dtype	Regime	Framework	Shapes	T_Prod (XPU-A)	Importance
`unified_attention` atrex_030	bf16	compute	vllm	25	4.7 ms	39.5%
`fused_moe` atrex_009	bf16	compute	vllm	23	5.7 ms	11.4%
`block_scaled_mm` atrex_002	fp8_e4m3	compute	vllm	24	5.9 ms	9.3%
`fp8_blockscale_fused_moe` atrex_006	fp8_e4m3	compute	aiter	6	4.1 ms	5.1%
`paged_attention_decode` atrex_024	bf16	memory	rtp-llm	8	16 us	4.4%
`reshape_and_cache` atrex_026	bf16	index	vllm	8	21 us	4.4%
`topk_filter` atrex_029	fp32	index	vllm	8	329 us	3.5%
`gated_delta_rule_update` atrex_013	bf16	compute	sglang	17	1.1 ms	3.5%
`fused_qkv_rope` atrex_011	fp16	memory	rtp-llm	6	9 us	3.4%
`rms_norm` atrex_027	bf16	memory	vllm	56	14 us	2.8%
`mla_decode_attention` atrex_018	bf16	compute	aiter	3	21 us	2.3%
`attention_forward` atrex_001	bf16	compute	vllm	21	1.2 ms	2.2%
`causal_conv1d` atrex_003	bf16	memory	sglang	9	101 us	2.2%
`moe_topk_gating_softmax` atrex_022	fp32	memory	vllm	10	13 us	2.1%
`fused_qk_rmsnorm` atrex_010	fp16	memory	rtp-llm	4	8 us	1.9%
`silu_and_mul` atrex_028	bf16	memory	vllm	37	31 us	1.8%
`chunk_gated_delta_rule_state` atrex_005	bf16	compute	sglang	25	185 us	1.6%
`fused_add_rms_norm` atrex_008	bf16	memory	vllm	19	17 us	1.1%
`moe_align_block_size` atrex_019	int32	index	vllm	6	122 us	1.1%
`mrope` atrex_023	bf16	memory	vllm	5	46 us	1.0%
`chunk_delta_rule_output` atrex_004	bf16	compute	sglang	16	191 us	0.8%
`fp8_dynamic_per_token_quant` atrex_007	fp8_e4m3	memory	rtp-llm	20	18 us	0.8%
`linear_sigmoid_mul` atrex_017	bf16	compute	sglang	9	1.3 ms	0.7%
`per_token_group_quant_fp8` atrex_025	fp8_e4m3	memory	vllm	19	200 us	0.5%
`gated_rms_norm` atrex_014	bf16	memory	sglang	7	120 us	0.5%
`l2_norm` atrex_015	bf16	memory	vllm	13	59 us	0.5%
`moe_count_and_sort` atrex_020	int32	index	vllm	5	17 us	0.5%
`fused_rmsnorm_quant` atrex_012	fp8_e4m3	memory	aiter	11	23 us	0.3%
`moe_sum_reduce` atrex_021	bf16	memory	sglang	7	180 us	0.3%
`layer_norm` atrex_016	bf16	memory	vllm	13	16 us	0.1%

Operator Importance Distribution

The importance score reflects each operator's trace-derived share of GPU time across production workloads.

Hardware Platforms

H20

BF16 Tensor Core 148 TFLOPS

FP8 Tensor Core 296 TFLOPS

HBM Bandwidth 4 TB/s

XPU-A

BF16 Tensor Core 232.7 TFLOPS

FP8 Tensor Core 465.3 TFLOPS

HBM Bandwidth 5.3 TB/s

Roofline by Operator × Hardware

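T_SOL is the theoretical Speed-of-Light lower bound: max(W/P_peak, Q/B_peak), where W is the operator's work in FLOPs, Q is its memory traffic in bytes, and P_peak and B_peak are the platform's peak compute throughput and peak HBM bandwidth. T_Prod is the median measured time of the deployed kernel on the selected hardware. All times are shown for the median shape of each operator.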
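The bound can be sketched in a few lines of Python, taking W as FLOPs and Q as bytes moved (the usual roofline reading). The BF16 peaks come from the Hardware Platforms section; the `Hardware` class, the `t_sol` helper, and the 4096³ GEMM workload are hypothetical illustrations and are not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    name: str
    peak_flops: float      # P_peak in FLOP/s
    peak_bandwidth: float  # B_peak in bytes/s


# BF16 Tensor Core and HBM peaks copied from the Hardware Platforms section.
H20 = Hardware("H20", 148e12, 4e12)
XPU_A = Hardware("XPU-A", 232.7e12, 5.3e12)


def t_sol(work_flops: float, traffic_bytes: float, hw: Hardware) -> float:
    """Roofline lower bound in seconds: max(W / P_peak, Q / B_peak)."""
    return max(work_flops / hw.peak_flops, traffic_bytes / hw.peak_bandwidth)


# Hypothetical workload (not one of the benchmark shapes): a square bf16 GEMM.
n = 4096
W = 2 * n**3       # one multiply and one add per inner-product term
Q = 3 * n * n * 2  # read A and B, write C, 2 bytes per bf16 element

for hw in (H20, XPU_A):
    print(f"{hw.name}: T_SOL = {t_sol(W, Q, hw) * 1e6:.1f} us")
```

For a compute-bound workload like this one, the two bounds differ by the ratio of the BF16 peaks (232.7 / 148), which is the same ratio seen between the T_SOL(H20) and T_SOL(XPU-A) columns for the compute-bound rows of the table.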

Operator	Regime	AI	T_SOL(H20)	T_SOL(XPU-A)	T_Prod
`unified_attention`	C	102.8	3.43 ms	2.18 ms	29.4 ms
`fused_moe`	C	460.2	4.07 ms	2.59 ms	10.6 ms
`block_scaled_mm`	C	371.3	27.18 us	17.29 us	118 us
`fp8_blockscale_fused_moe`	C	4.1	1.81 ms	1.36 ms	2.1 ms
`paged_attention_decode`	M	2.6	0.80 us	0.61 us	16 us
`reshape_and_cache`	I	0.0	6.28 us	4.74 us	21 us
`topk_filter`	I	0.0	33.57 us	25.34 us	805 us
`gated_delta_rule_update`	C	0.4	0.53 us	0.40 us	9 us
`fused_qkv_rope`	M	0.6	0.01 us	0.00 us	9 us
`rms_norm`	M	0.7	0.01 us	0.01 us	8 us
`mla_decode_attention`	C	34.2	0.08 us	0.06 us	19 us
`attention_forward`	C	15.4k	104.97 ms	66.78 ms	124.2 ms
`causal_conv1d`	M	3.0	4.21 us	3.18 us	272 us
`moe_topk_gating_softmax`	M	0.9	0.29 us	0.22 us	13 us
`fused_qk_rmsnorm`	M	0.7	0.01 us	0.00 us	8 us
`silu_and_mul`	M	0.8	0.79 us	0.59 us	8 us
`chunk_gated_delta_rule_state`	C	46.8	44.04 us	28.02 us	1.2 ms
`fused_add_rms_norm`	M	0.5	5.72 us	4.31 us	14 us
`moe_align_block_size`	I	0.0	0.04 us	0.03 us	122 us
`mrope`	M	0.6	42.32 us	31.94 us	139 us
`chunk_delta_rule_output`	C	51.4	5.03 us	3.20 us	47 us
`fp8_dynamic_per_token_quant`	M	0.3	0.39 us	0.29 us	8 us
`linear_sigmoid_mul`	C	1.1k	1.49 ms	948.55 us	1.3 ms
`per_token_group_quant_fp8`	M	0.3	4.79 ms	3.62 ms	7.1 ms
`gated_rms_norm`	M	1.3	30.98 us	23.38 us	120 us
`l2_norm`	M	0.8	17.89 us	13.50 us	59 us
`moe_count_and_sort`	I	0.0	0.01 us	0.01 us	17 us
`fused_rmsnorm_quant`	M	0.7	93.50 us	70.56 us	141 us
`moe_sum_reduce`	M	0.4	46.48 us	35.08 us	180 us
`layer_norm`	M	1.7	17.83 us	13.46 us	64 us

C = compute-bound (AI > 10 FLOP/byte), M = memory-bound, I = indexing/structural. T_SOL = Speed-of-Light theoretical minimum. T_Prod = deployed kernel median time on selected hardware.
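As a minimal sketch of the legend's threshold rule, the helper below labels an operator from its arithmetic intensity (AI, in FLOP/byte). The function name and the `structural` flag are assumptions: the page does not say how the I regime is assigned, so here it is passed in explicitly. Note that some rows of the table carry a C label with an AI below 10 (for example `fp8_blockscale_fused_moe` at 4.1), so the published labels evidently use more information than this rule alone.

```python
def regime(ai: float, structural: bool = False, threshold: float = 10.0) -> str:
    """Return "C", "M", or "I" following the legend's AI > 10 FLOP/byte rule.

    `structural` is a hypothetical flag for indexing/structural operators,
    whose regime is not derived from arithmetic intensity.
    """
    if structural:
        return "I"
    return "C" if ai > threshold else "M"


# AI values copied from the roofline table.
print(regime(102.8))                 # unified_attention
print(regime(0.7))                   # rms_norm
print(regime(0.0, structural=True))  # reshape_and_cache
```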

SOL Ratio by Operator × Shape

Per-shape details are listed under each operator. S = T_SOL / T_Prod; higher is better (100% = hardware limit).
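The ratio can be recomputed from the per-shape rows. The sketch below uses the five `mrope` shapes from the table; the variable and function names are illustrative. Because the table shows times rounded to 0.1 us, recomputed ratios can differ from the published percentages in the last digit, and whether the published median S is the median of per-shape ratios or the ratio at the median shape is an assumption here (for `mrope` the two coincide).

```python
from statistics import median

# (T_SOL in us, T_Prod in us) for the five mrope shapes, copied from the table.
mrope_shapes = [
    (6.9, 46.0),
    (16.2, 78.3),
    (31.9, 139.0),
    (1.9, 17.7),
    (3.7, 28.4),
]


def sol_ratio(t_sol_us: float, t_prod_us: float) -> float:
    """S = T_SOL / T_Prod; 1.0 would mean the kernel runs at the hardware limit."""
    return t_sol_us / t_prod_us


ratios = [sol_ratio(t_sol_us, t_prod_us) for t_sol_us, t_prod_us in mrope_shapes]

for (t_sol_us, t_prod_us), s in zip(mrope_shapes, ratios):
    print(f"T_SOL={t_sol_us:6.1f} us  T_Prod={t_prod_us:6.1f} us  S={s:.1%}")

print(f"median S = {median(ratios):.1%}")
```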

	Operator	dtype	Phase	Shapes	T_SOL (median)	T_Prod (median)	S (median)
▶	`moe_align_block_size`	int32	prefill/decode	6	0.0 us	122.0 us	0.0%
	#0 (vLLM TP1), MoE routing align_block_size, tokens=1 top_k=8 block_size=128				0.0 us	23.0 us	0.0%
	#2 (vLLM TP4), MoE routing align_block_size, tokens=520 top_k=8 block_size=128				0.0 us	42.0 us	0.0%
	#3 (vLLM TP4), MoE routing align_block_size, tokens=1019 top_k=8 block_size=128				0.0 us	67.2 us	0.0%
	#4 (vLLM TP4), MoE routing align_block_size, tokens=2044 top_k=8 block_size=128				0.0 us	122.0 us	0.0%
	#5 (vLLM TP4), MoE routing align_block_size, tokens=3936 top_k=8 block_size=128				0.1 us	187.9 us	0.0%
	#6 (vLLM TP4), MoE routing align_block_size, tokens=8192 top_k=8 block_size=128				0.1 us	400.4 us	0.0%
▶	`moe_count_and_sort`	int32	prefill/decode	5	0.0 us	16.6 us	0.1%
	#0 (rtp-llm TP1), MoE routing count_and_sort, tokens=1 top_k=8 experts=128				0.0 us	12.8 us	0.0%
	#1 (vLLM TP4), MoE routing count_and_sort, tokens=76 top_k=8 experts=128				0.0 us	12.7 us	0.0%
	#2 (vLLM TP4), MoE routing count_and_sort, tokens=520 top_k=8 experts=128				0.0 us	16.6 us	0.1%
	#3 (vLLM TP4), MoE routing count_and_sort, tokens=1019 top_k=8 experts=128				0.0 us	18.8 us	0.1%
	#4 (vLLM TP4), MoE routing count_and_sort, tokens=2044 top_k=8 experts=128				0.0 us	21.2 us	0.1%
▶	`fused_qkv_rope`	fp16	prefill/decode	6	0.0 us	8.9 us	0.1%
	#0 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=32/4 head_dim=128				0.0 us	8.0 us	0.0%
	#1 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=40/8 head_dim=128				0.0 us	9.1 us	0.1%
	#2 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=64/8 head_dim=128				0.0 us	11.1 us	0.1%
	#3 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=32/8 head_dim=128				0.0 us	8.9 us	0.0%
	#5 (rtp-llm TP2), text decoder fusedQKV+RoPE, heads=32/4 head_dim=128				0.0 us	8.5 us	0.0%
	#6 (rtp-llm TP1), text decoder fusedQKV+RoPE, heads=40/8 head_dim=128				0.0 us	8.0 us	0.1%
▶	`fused_qk_rmsnorm`	fp16	prefill/decode	4	0.0 us	7.8 us	0.1%
	#0 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=64/8 head_dim=128				0.0 us	7.7 us	0.1%
	#1 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=32/8 head_dim=128				0.0 us	7.8 us	0.0%
	#3 (rtp-llm TP2), text decoder fused QK RMSNorm, heads=32/4 head_dim=128				0.0 us	7.8 us	0.0%
	#4 (rtp-llm TP1), text decoder fused QK RMSNorm, heads=40/8 head_dim=128				0.0 us	7.7 us	0.1%
▶	`gated_delta_rule_update`	bf16	decode	17	8.1 us	1143.6 us	0.7%
	#0 (SGLang TP1), gated delta rule update, tokens=1 heads=16/32 dim=128/128				1.6 us	9.6 us	16.6%
	#1 (vLLM TP1), gated delta rule update, tokens=128 heads=16/32 dim=128/128				2.2 us	295.6 us	0.7%
	#2 (vLLM TP1), gated delta rule update, tokens=258 heads=16/32 dim=128/128				4.1 us	584.8 us	0.7%
	#3 (vLLM TP1), gated delta rule update, tokens=544 heads=16/32 dim=128/128				8.6 us	1226.6 us	0.7%
	#4 (vLLM TP1), gated delta rule update, tokens=1033 heads=16/32 dim=128/128				16.4 us	2312.7 us	0.7%
	#5 (vLLM TP1), gated delta rule update, tokens=2051 heads=16/32 dim=128/128				32.5 us	4635.1 us	0.7%
	#6 (vLLM TP1), gated delta rule update, tokens=4120 heads=16/32 dim=128/128				65.2 us	9673.9 us	0.7%
	#7 (vLLM TP1), gated delta rule update, tokens=8192 heads=16/32 dim=128/128				129.7 us	20104.8 us	0.6%
	#8 (vLLM TP4), gated delta rule update, tokens=1 heads=4/8 dim=128/128				0.4 us	8.9 us	4.5%
	#9 (SGLang TP8), gated delta rule update, tokens=1 heads=8/16 dim=128/128				0.8 us	8.7 us	9.1%
	#10 (SGLang TP8), gated delta rule update, tokens=129 heads=8/16 dim=128/128				1.1 us	289.0 us	0.4%
	#12 (SGLang), gated delta rule update, tokens=130 heads=16/32 dim=128/128				2.2 us	295.6 us	0.7%
	#13 (vLLM), gated delta rule update, tokens=256 heads=16/32 dim=128/128				4.0 us	574.6 us	0.7%
	#14 (SGLang), gated delta rule update, tokens=512 heads=16/32 dim=128/128				8.1 us	1143.6 us	0.7%
	#15 (SGLang), gated delta rule update, tokens=1029 heads=16/32 dim=128/128				16.3 us	2288.6 us	0.7%
	#16 (SGLang), gated delta rule update, tokens=1831 heads=16/32 dim=128/128				29.0 us	4065.1 us	0.7%
	#17 (SGLang), gated delta rule update, tokens=3251 heads=16/32 dim=128/128				51.5 us	7234.1 us	0.7%
▶	`causal_conv1d`	bf16	prefill	9	1.6 us	100.6 us	1.2%
	#0 (SGLang TP1), causal depthwise conv1d, tokens=1 dim=4096 width=4				0.0 us	7.6 us	0.3%
	#1 (vLLM TP1), causal depthwise conv1d, tokens=128 dim=4096 width=4				0.4 us	35.7 us	1.1%
	#2 (vLLM TP1), causal depthwise conv1d, tokens=256 dim=4096 width=4				0.8 us	77.1 us	1.0%
	#3 (vLLM TP1), causal depthwise conv1d, tokens=514 dim=4096 width=4				1.6 us	100.6 us	1.6%
	#4 (vLLM TP1), causal depthwise conv1d, tokens=1024 dim=4096 width=4				3.2 us	272.4 us	1.2%
	#5 (SGLang TP8), causal depthwise conv1d, tokens=1 dim=8192 width=4				0.0 us	7.7 us	0.4%
	#6 (SGLang), causal depthwise conv1d, tokens=4195 dim=8192 width=4				26.0 us	1334.9 us	1.9%
	#7 (SGLang), causal depthwise conv1d, tokens=11027 dim=8192 width=4				68.2 us	3496.7 us	2.0%
	#8 (SGLang), causal depthwise conv1d, tokens=14807 dim=8192 width=4				91.6 us	4659.6 us	2.0%
▶	`topk_filter`	fp32	decode	8	2.9 us	328.6 us	1.5%
	#0 (vLLM), top-k logit filter, shape=(1,151936)				0.2 us	299.7 us	0.1%
	#1 (vLLM), top-k logit filter, shape=(1,152064)				0.2 us	328.6 us	0.1%
	#2 (vLLM), top-k logit filter, shape=(922,2048)				2.9 us	187.4 us	1.5%
	#3 (SGLang), top-k logit filter, shape=(3,4002)				0.0 us	31.3 us	0.1%
	#4 (SGLang), top-k logit filter, shape=(4098,4096)				25.3 us	805.1 us	3.1%
	#5 (SGLang), top-k logit filter, shape=(8179,4096)				50.6 us	2010.0 us	2.5%
	#6 (SGLang), top-k logit filter, shape=(15381,4096)				95.1 us	2932.6 us	3.2%
	#7 (SGLang), top-k logit filter, shape=(3,4215)				0.0 us	30.7 us	0.1%
▶	`moe_topk_gating_softmax`	fp32	prefill/decode	10	0.2 us	13.3 us	1.7%
	#0 (rtp-llm TP1), MoE routing topk_gating_softmax, tokens=1 experts=128				0.0 us	11.4 us	0.0%
	#1 (vLLM TP1), MoE routing topk_gating_softmax, tokens=129 experts=128				0.0 us	12.8 us	0.1%
	#2 (vLLM TP1), MoE routing topk_gating_softmax, tokens=256 experts=128				0.0 us	13.0 us	0.2%
	#3 (vLLM TP4), MoE routing topk_gating_softmax, tokens=520 experts=128				0.1 us	13.0 us	0.5%
	#4 (vLLM TP4), MoE routing topk_gating_softmax, tokens=1019 experts=128				0.1 us	13.1 us	0.8%
	#5 (vLLM TP4), MoE routing topk_gating_softmax, tokens=2044 experts=128				0.2 us	13.3 us	1.7%
	#6 (SGLang TP8), MoE routing topk_gating_softmax, tokens=4083 experts=128				0.4 us	17.1 us	2.6%
	#7 (vLLM TP4), MoE routing topk_gating_softmax, tokens=8192 experts=128				0.9 us	25.6 us	3.5%
	#8 (vLLM TP inferred), MoE routing topk_softmax, tokens=8037 experts=256				1.6 us	39.4 us	4.2%
	#9 (vLLM TP inferred), MoE routing topk_softmax, tokens=13526 experts=256				2.8 us	56.2 us	4.9%
▶	`mla_decode_attention`	bf16	decode	3	0.4 us	21.0 us	1.7%
	#0 (SGLang TP2), MLA decode attention, nhead=128 kv_lora_rank=512 rope_dim=64				0.8 us	46.7 us	1.7%
	#1 (SGLang TP8), MLA decode attention, grid=(0,0,0)				0.1 us	19.0 us	0.3%
	#2 (SGLang TP8), MLA decode attention, grid=(7,1,1)				0.4 us	21.0 us	2.1%
▶	`paged_attention_decode`	bf16	decode	8	0.6 us	15.8 us	3.9%
	#0 (vLLM), decode paged attention, heads=16/16 head_dim=128 ctx_len=64				26.1 us	152.6 us	17.1%
	#1 (SGLang TP1), decode paged attention, heads=16/2 head_dim=256				0.1 us	17.1 us	0.9%
	#2 (vLLM), decode paged attention, heads=3/3 head_dim=128 ctx_len=64				2.5 us	24.7 us	10.1%
	#3 (rtp-llm TP1), decode paged attention, heads=32/4 head_dim=128				0.3 us	15.6 us	2.0%
	#4 (rtp-llm TP1), decode paged attention, heads=32/8 head_dim=128				0.6 us	15.7 us	3.9%
	#5 (SGLang TP8), decode paged attention, heads=4/1 head_dim=128				0.0 us	15.2 us	0.3%
	#6 (rtp-llm TP1), decode paged attention, heads=40/8 head_dim=128				0.6 us	15.8 us	3.9%
	#7 (rtp-llm TP1), decode paged attention, heads=64/8 head_dim=128				0.6 us	15.5 us	4.0%
▶	`chunk_gated_delta_rule_state`	bf16	prefill	25	9.3 us	184.8 us	4.5%
	#0 (vLLM TP1), chunk state update, tokens=2048 heads=1/2 dim=128/128				1.2 us	93.2 us	1.3%
	#1 (vLLM TP1), chunk state update, tokens=2048 heads=10/20 dim=128/128				11.7 us	184.8 us	6.3%
	#2 (vLLM TP1), chunk state update, tokens=2048 heads=11/22 dim=128/128				12.8 us	269.8 us	4.8%
	#3 (vLLM TP1), chunk state update, tokens=2048 heads=12/24 dim=128/128				14.0 us	290.5 us	4.8%
	#4 (vLLM TP1), chunk state update, tokens=2048 heads=13/26 dim=128/128				15.2 us	284.6 us	5.3%
	#5 (vLLM TP1), chunk state update, tokens=2048 heads=14/28 dim=128/128				16.3 us	293.6 us	5.6%
	#6 (vLLM TP1), chunk state update, tokens=2048 heads=15/30 dim=128/128				17.5 us	293.2 us	6.0%
	#7 (vLLM TP1), chunk state update, tokens=2048 heads=16/32 dim=128/128				18.7 us	416.0 us	4.5%
	#8 (vLLM TP1), chunk state update, tokens=2048 heads=2/4 dim=128/128				2.3 us	95.0 us	2.5%
	#9 (vLLM TP1), chunk state update, tokens=4096 heads=2/4 dim=128/128				4.7 us	179.0 us	2.6%
	#10 (vLLM TP1), chunk state update, tokens=8192 heads=2/4 dim=128/128				9.3 us	348.4 us	2.7%
	#11 (vLLM TP1), chunk state update, tokens=16384 heads=2/4 dim=128/128				18.7 us	767.6 us	2.4%
	#12 (vLLM TP1), chunk state update, tokens=24576 heads=2/4 dim=128/128				28.0 us	1203.7 us	2.3%
	#13 (vLLM TP1), chunk state update, tokens=2048 heads=3/6 dim=128/128				3.5 us	94.2 us	3.7%
	#14 (vLLM TP4), chunk state update, tokens=512 heads=4/8 dim=128/128				1.3 us	32.2 us	4.0%
	#15 (vLLM TP4), chunk state update, tokens=1024 heads=4/8 dim=128/128				2.4 us	54.1 us	4.4%
	#16 (vLLM TP1), chunk state update, tokens=2048 heads=4/8 dim=128/128				4.7 us	97.1 us	4.8%
	#17 (vLLM TP1), chunk state update, tokens=2048 heads=5/10 dim=128/128				5.8 us	95.7 us	6.1%
	#18 (vLLM TP1), chunk state update, tokens=2048 heads=6/12 dim=128/128				7.0 us	184.3 us	3.8%
	#19 (vLLM TP1), chunk state update, tokens=2048 heads=7/14 dim=128/128				8.2 us	181.9 us	4.5%
	#20 (SGLang TP8), chunk state update, tokens=1024 heads=8/16 dim=128/128				4.8 us	101.6 us	4.7%
	#21 (SGLang TP8), chunk state update, tokens=2048 heads=8/16 dim=128/128				9.3 us	188.4 us	5.0%
	#22 (vLLM TP1), chunk state update, tokens=2048 heads=9/18 dim=128/128				10.5 us	182.6 us	5.8%
	#23 (SGLang/vLLM), chunk state update, tokens=4195 heads=16/32 dim=128/128				38.3 us	899.5 us	4.3%
	#24 (SGLang/vLLM), chunk state update, tokens=11027 heads=16/32 dim=128/128				100.6 us	2352.4 us	4.3%
▶	`unified_attention`	bf16	prefill/decode	25	350.4 us	4689.4 us	7.4%
	#0 (vLLM TP4), prefill paged attention, seqs=130 heads=10/1 head_dim=128				94.1 us	1292.2 us	7.3%
	#1 (vLLM TP4), prefill paged attention, seqs=290 heads=10/1 head_dim=128				209.9 us	2893.5 us	7.3%
	#2 (vLLM TP4), prefill paged attention, seqs=484 heads=10/1 head_dim=128				350.4 us	4689.4 us	7.5%
	#3 (vLLM TP4), prefill paged attention, seqs=1 heads=16/1 head_dim=128				0.0 us	24.8 us	0.1%
	#4 (SGLang), causal attention, seq_len=1688 heads=16/16 head_dim=192				75.3 us	1126.7 us	6.7%
	#5 (vLLM), causal attention, seq_len=11840 heads=16/16 head_dim=72				1.4 ms	38.6 ms	3.6%
	#6 (rtp-llm), causal attention, seq_len=24752 heads=16/16 head_dim=80				6.7 ms	70.6 ms	9.6%
	#7 (vLLM), causal attention, seq_len=1144 heads=16/2 head_dim=128				23.1 us	258.3 us	8.9%
	#8 (vLLM), causal attention, seq_len=875 heads=16/2 head_dim=256				27.0 us	413.2 us	6.5%
	#9 (vLLM TP1), prefill paged attention, seqs=324 heads=16/4 head_dim=128				375.3 us	5037.1 us	7.5%
	#10 (vLLM TP1), prefill paged attention, seqs=699 heads=16/4 head_dim=128				809.7 us	10834.2 us	7.5%
	#11 (vLLM TP1), prefill paged attention, seqs=1032 heads=16/4 head_dim=128				1.2 ms	16.0 ms	7.5%
	#12 (vLLM TP1), prefill paged attention, seqs=1882 heads=16/4 head_dim=128				2.2 ms	29.4 ms	7.4%
	#13 (vLLM TP1), prefill paged attention, seqs=3095 heads=16/4 head_dim=128				3.6 ms	48.6 ms	7.4%
	#14 (vLLM), causal attention, seq_len=72 heads=16/4 head_dim=256				0.3 us	72.2 us	0.4%
	#15 (vLLM), causal attention, seq_len=7 heads=20/20 head_dim=64				0.0 us	18.5 us	0.1%
	#16 (vLLM TP1), prefill paged attention, seqs=18 heads=28/4 head_dim=128				36.5 us	518.6 us	7.0%
	#17 (vLLM TP1), prefill paged attention, seqs=118 heads=28/4 head_dim=128				239.2 us	3222.0 us	7.4%
	#18 (vLLM TP1), prefill paged attention, seqs=1192 heads=28/4 head_dim=128				2.4 ms	32.5 ms	7.4%
	#19 (vLLM TP1), prefill paged attention, seqs=2056 heads=28/4 head_dim=128				4.2 ms	56.2 ms	7.4%
	#20 (vLLM TP1), prefill paged attention, seqs=1 heads=32/4 head_dim=128				1.4 ms	9.7 ms	14.3%
	#21 (vLLM TP1), prefill paged attention, seqs=1 heads=32/8 head_dim=128				143.5 us	1179.5 us	12.2%
	#22 (vLLM), causal attention, seq_len=49152 heads=4/4 head_dim=72				6.0 ms	160.9 ms	3.7%
	#23 (vLLM), causal attention, seq_len=145 heads=40/8 head_dim=128				0.9 us	43.7 us	2.1%
	#24 (SGLang), causal attention, seq_len=5926 heads=8/1 head_dim=256				618.4 us	5684.4 us	10.9%
▶	`chunk_delta_rule_output`	bf16	prefill	16	18.6 us	190.5 us	9.5%
	#0 (vLLM TP1), chunk output, tokens=64 heads=16/32 dim=128/128				0.6 us	20.2 us	2.9%
	#1 (vLLM TP1), chunk output, tokens=128 heads=16/32 dim=128/128				1.2 us	23.9 us	4.9%
	#2 (vLLM TP1), chunk output, tokens=256 heads=16/32 dim=128/128				2.3 us	37.1 us	6.3%
	#3 (vLLM TP1), chunk output, tokens=512 heads=16/32 dim=128/128				4.7 us	57.9 us	8.0%
	#4 (vLLM TP1), chunk output, tokens=1024 heads=16/32 dim=128/128				9.3 us	97.2 us	9.6%
	#5 (vLLM TP1), chunk output, tokens=2048 heads=16/32 dim=128/128				18.6 us	190.5 us	9.8%
	#6 (vLLM TP1), chunk output, tokens=4096 heads=16/32 dim=128/128				37.2 us	372.9 us	10.0%
	#7 (vLLM TP1), chunk output, tokens=8192 heads=16/32 dim=128/128				74.4 us	725.5 us	10.3%
	#8 (SGLang TP8), chunk output, tokens=704 heads=8/16 dim=128/128				3.2 us	46.7 us	6.9%
	#9 (SGLang TP8), chunk output, tokens=1024 heads=8/16 dim=128/128				4.7 us	59.9 us	7.8%
	#10 (SGLang TP8), chunk output, tokens=2112 heads=8/16 dim=128/128				9.6 us	103.1 us	9.3%
	#11 (SGLang TP8), chunk output, tokens=4096 heads=8/16 dim=128/128				18.6 us	195.8 us	9.5%
	#12 (SGLang TP8), chunk output, tokens=8256 heads=8/16 dim=128/128				37.5 us	385.0 us	9.7%
	#13 (SGLang/vLLM), chunk output, tokens=4195 heads=16/32 dim=128/128				38.4 us	403.6 us	9.5%
	#14 (SGLang/vLLM), chunk output, tokens=11027 heads=16/32 dim=128/128				100.6 us	1055.1 us	9.5%
	#15 (SGLang/vLLM), chunk output, tokens=14807 heads=16/32 dim=128/128				134.9 us	1291.6 us	10.4%
▶	`layer_norm`	bf16	prefill/decode	13	1.8 us	15.8 us	11.1%
	#0 (vLLM TP4), shape=(276,1,1152)				0.2 us	8.9 us	2.7%
	#1 (vLLM TP4), shape=(600,1,1152)				0.5 us	9.9 us	5.3%
	#2 (vLLM TP4), shape=(1012,1,1152)				0.9 us	12.1 us	7.3%
	#3 (vLLM TP4), shape=(2024,1,1152)				1.8 us	15.8 us	11.1%
	#4 (vLLM TP4), shape=(4100,1,1152)				3.6 us	23.3 us	15.3%
	#5 (vLLM TP4), shape=(8184,1,1152)				7.1 us	38.0 us	18.7%
	#6 (vLLM TP4), shape=(15476,1,1152)				13.5 us	64.0 us	21.0%
	#7 (vLLM TP4), shape=(24952,1,1152)				21.7 us	97.5 us	22.2%
	#8 (vLLM TP4), shape=(40560,1,1152)				35.3 us	152.2 us	23.2%
	#9 (SGLang TP1), LayerNorm, tokens=91192 hidden=2048				140.9 us	788.4 us	17.9%
	#10 (rtp-llm TP1), shape=(1,2048)				0.0 us	7.6 us	0.0%
	#11 (rtp-llm TP1), shape=(1,4096)				0.0 us	8.1 us	0.1%
	#12 (rtp-llm TP1), shape=(1,5120)				0.0 us	8.9 us	0.1%
▶	`mrope`	bf16	prefill/decode	5	6.9 us	46.0 us	15.1%
	#0 (vLLM TP2), tokens=3592 heads=16/1 head_dim=128				6.9 us	46.0 us	15.1%
	#1 (vLLM TP2), tokens=8373 heads=16/1 head_dim=128				16.2 us	78.3 us	20.7%
	#2 (vLLM TP2), tokens=16530 heads=16/1 head_dim=128				31.9 us	139.0 us	23.0%
	#3 (SGLang), tokens=838 heads=16/4 head_dim=128				1.9 us	17.7 us	10.5%
	#4 (SGLang), tokens=1673 heads=16/4 head_dim=128				3.7 us	28.4 us	13.1%
▶	`reshape_and_cache`	bf16	prefill/decode	8	4.7 us	21.2 us	15.8%
	#0 (vLLM TP1), KV cache reshape_and_cache, shape=(1,4,128)				0.0 us	7.5 us	0.1%
	#1 (vLLM TP1), KV cache reshape_and_cache, shape=(144,2,256)				0.2 us	7.7 us	2.2%
	#2 (vLLM TP4), KV cache reshape_and_cache, shape=(248,16,128)				1.2 us	9.8 us	12.0%
	#3 (vLLM TP4), KV cache reshape_and_cache, shape=(520,16,128)				2.4 us	14.2 us	17.2%
	#4 (vLLM TP4), KV cache reshape_and_cache, shape=(1015,16,128)				4.7 us	21.2 us	22.4%
	#5 (vLLM TP4), KV cache reshape_and_cache, shape=(2044,16,128)				9.5 us	29.2 us	32.5%
	#6 (vLLM TP1), KV cache reshape_and_cache, shape=(4108,4,128)				4.8 us	33.7 us	14.2%
	#7 (vLLM TP1), KV cache reshape_and_cache, shape=(8192,4,128)				9.5 us	60.3 us	15.8%
▶	`rms_norm`	bf16	prefill/decode	56	3.2 us	14.1 us	16.0%
	#0 (vLLM), RMSNorm, tokens=2500 hidden=1024				1.9 us	9.9 us	19.5%
	#1 (vLLM), RMSNorm, tokens=4968 hidden=1024				3.8 us	14.1 us	27.2%
	#2 (vLLM), RMSNorm, tokens=10000 hidden=1024				7.7 us	22.8 us	33.9%
	#3 (vLLM), RMSNorm, tokens=16200 hidden=1024				12.5 us	32.7 us	38.3%
	#4 (vLLM), RMSNorm, tokens=22100 hidden=1024				17.1 us	42.1 us	40.6%
	#5 (vLLM), RMSNorm, tokens=360 hidden=1152				0.3 us	7.8 us	4.0%
	#6 (vLLM), RMSNorm, tokens=720 hidden=1152				0.6 us	8.1 us	7.8%
	#7 (vLLM), RMSNorm, tokens=1200 hidden=1152				1.0 us	9.2 us	11.3%
	#8 (vLLM), RMSNorm, tokens=2116 hidden=1152				1.8 us	12.3 us	15.0%
	#9 (vLLM), RMSNorm, tokens=3844 hidden=1152				3.3 us	17.7 us	18.9%
	#10 (vLLM), RMSNorm, tokens=8136 hidden=1152				7.1 us	31.1 us	22.7%
	#11 (vLLM), RMSNorm, tokens=15000 hidden=1152				13.0 us	52.1 us	25.0%
	#12 (vLLM), RMSNorm, tokens=24368 hidden=1152				21.2 us	81.2 us	26.1%
	#13 (vLLM), RMSNorm, tokens=49152 hidden=1152				42.7 us	157.4 us	27.1%
	#14 (vLLM), RMSNorm, tokens=65556 hidden=1152				57.0 us	208.0 us	27.4%
	#15 (SGLang), RMSNorm, tokens=3 hidden=128				0.0 us	7.5 us	0.0%
	#16 (SGLang), RMSNorm, tokens=1504 hidden=128				0.1 us	7.7 us	1.9%
	#17 (SGLang), RMSNorm, tokens=2048 hidden=128				0.2 us	7.7 us	2.6%
	#18 (SGLang), RMSNorm, tokens=4096 hidden=128				0.4 us	8.7 us	4.6%
	#19 (SGLang), RMSNorm, tokens=8192 hidden=128				0.8 us	11.8 us	6.7%
	#20 (SGLang), RMSNorm, tokens=16384 hidden=128				1.6 us	18.5 us	8.5%
	#21 (SGLang), RMSNorm, tokens=24544 hidden=128				2.4 us	24.4 us	9.7%
	#22 (SGLang), RMSNorm, tokens=49024 hidden=128				4.7 us	42.0 us	11.3%
	#23 (vLLM), RMSNorm, tokens=65488 hidden=128				6.3 us	53.9 us	11.7%
	#24 (vLLM), RMSNorm, tokens=95952 hidden=128				9.3 us	75.8 us	12.2%
	#25 (vLLM), RMSNorm, tokens=131072 hidden=128				12.7 us	101.1 us	12.5%
	#26 (vLLM), RMSNorm, tokens=247680 hidden=128				23.9 us	190.8 us	12.5%
	#27 (SGLang), RMSNorm, tokens=505888 hidden=128				48.9 us	371.2 us	13.2%
	#28 (vLLM), RMSNorm, tokens=7 hidden=1280				0.0 us	7.5 us	0.1%
	#29 (rtp-llm), RMSNorm, tokens=24752 hidden=1280				23.9 us	82.1 us	29.1%
	#30 (vLLM), RMSNorm, tokens=1736 hidden=1536				2.0 us	11.6 us	17.3%
	#31 (vLLM), RMSNorm, tokens=5000 hidden=1536				5.8 us	22.1 us	26.2%
	#32 (SGLang), RMSNorm, tokens=128 hidden=2048				0.2 us	7.7 us	2.6%
	#33 (SGLang), RMSNorm, tokens=256 hidden=2048				0.4 us	7.8 us	5.1%
	#34 (SGLang), RMSNorm, tokens=512 hidden=2048				0.8 us	8.2 us	9.6%
	#35 (SGLang), RMSNorm, tokens=1029 hidden=2048				1.6 us	9.1 us	17.5%
	#36 (vLLM TP1), RMSNorm, tokens=2010 hidden=2048				3.1 us	12.6 us	24.7%
	#37 (SGLang), RMSNorm, tokens=4174 hidden=2048				6.5 us	21.0 us	30.7%
	#38 (vLLM), RMSNorm, tokens=8290 hidden=2048				12.8 us	34.9 us	36.7%
	#39 (vLLM), RMSNorm, tokens=16530 hidden=2048				25.6 us	62.8 us	40.7%
	#40 (vLLM), RMSNorm, tokens=20538 hidden=2048				31.8 us	80.5 us	39.4%
	#41 (SGLang), RMSNorm, tokens=3 hidden=256				0.0 us	7.5 us	0.0%
	#42 (SGLang), RMSNorm, tokens=1672 hidden=2560				3.2 us	14.2 us	22.7%
	#43 (vLLM TP1), RMSNorm, tokens=298 hidden=3584				0.8 us	7.9 us	10.3%
	#44 (vLLM TP1), RMSNorm, tokens=514 hidden=3584				1.4 us	8.7 us	16.0%
	#45 (vLLM TP1), RMSNorm, tokens=1024 hidden=3584				2.8 us	11.4 us	24.3%
	#46 (SGLang), RMSNorm, tokens=3 hidden=512				0.0 us	7.4 us	0.0%
	#47 (rtp-llm TP1), RMSNorm, tokens=1 hidden=5120				0.0 us	7.5 us	0.1%
	#48 (SGLang TP8), RMSNorm, tokens=53 hidden=7168				0.3 us	7.7 us	3.8%
	#49 (SGLang TP8), RMSNorm, tokens=141 hidden=7168				0.8 us	7.9 us	9.7%
	#50 (SGLang TP8), RMSNorm, tokens=632 hidden=7168				3.4 us	12.9 us	26.5%
	#51 (SGLang TP8), RMSNorm, tokens=917 hidden=7168				5.0 us	16.4 us	30.2%
	#52 (SGLang/vLLM), RMSNorm, tokens=1688 hidden=7168				9.1 us	24.9 us	36.7%
	#53 (SGLang/vLLM), RMSNorm, tokens=3804 hidden=7168				20.6 us	46.5 us	44.3%
	#54 (SGLang), RMSNorm, tokens=154 hidden=8192				1.0 us	8.0 us	12.0%
	#55 (SGLang), RMSNorm, tokens=510 hidden=8192				3.2 us	12.4 us	25.5%
▶	`moe_sum_reduce`	bf16	prefill/decode	7	35.1 us	179.8 us	19.5%
	#0 (SGLang TP inferred), MoE top-8 expert sum, tokens=11027 hidden=2048				76.7 us	362.4 us	21.2%
	#1 (SGLang TP inferred), MoE top-8 expert sum, tokens=6030 hidden=2048				41.9 us	209.8 us	20.0%
	#2 (SGLang TP inferred), MoE top-8 expert sum, tokens=5983 hidden=2048				41.6 us	213.3 us	19.5%
	#3 (SGLang TP inferred), MoE top-8 expert sum, tokens=5043 hidden=2048				35.1 us	179.8 us	19.5%
	#4 (SGLang TP inferred), MoE top-8 expert sum, tokens=4996 hidden=2048				34.8 us	178.4 us	19.5%
	#5 (SGLang TP inferred), MoE top-8 expert sum, tokens=4363 hidden=2048				30.4 us	159.2 us	19.1%
	#6 (SGLang TP inferred), MoE top-8 expert sum, tokens=4195 hidden=2048				29.2 us	154.0 us	18.9%
▶	`gated_rms_norm`	bf16	prefill/decode	7	23.4 us	119.7 us	19.5%
	#0 (SGLang), gated RMSNorm, rows=352864 hidden=128				51.1 us	248.7 us	20.6%
	#1 (SGLang), gated RMSNorm, rows=192960 hidden=128				28.0 us	141.1 us	19.8%
	#2 (SGLang), gated RMSNorm, rows=191456 hidden=128				27.7 us	140.0 us	19.8%
	#3 (SGLang), gated RMSNorm, rows=161376 hidden=128				23.4 us	119.7 us	19.5%
	#4 (SGLang), gated RMSNorm, rows=159872 hidden=128				23.2 us	118.7 us	19.5%
	#5 (SGLang), gated RMSNorm, rows=139616 hidden=128				20.2 us	104.4 us	19.4%
	#6 (SGLang), gated RMSNorm, rows=134240 hidden=128				19.4 us	100.0 us	19.4%
▶	`fused_moe`	bf16	prefill/decode	23	1.2 ms	5.7 ms	20.9%
	#0 (vLLM), MoE expert FFN, tokens=135 hidden=2048 8 experts top-2				38.3 us	300.2 us	12.8%
	#1 (vLLM), MoE expert FFN, tokens=277 hidden=2048 8 experts top-2				56.7 us	473.4 us	12.0%
	#2 (vLLM), MoE expert FFN, tokens=668 hidden=2048 8 experts top-2				136.8 us	654.1 us	20.9%
	#3 (vLLM), MoE expert FFN, tokens=1023 hidden=2048 8 experts top-2				208.2 us	947.1 us	22.0%
	#4 (vLLM), MoE expert FFN, tokens=1979 hidden=2048 8 experts top-2				403.0 us	1461.2 us	27.6%
	#5 (SGLang), MoE expert FFN, tokens=4195 hidden=2048 8 experts top-2				851.8 us	3109.2 us	27.4%
	#6 (vLLM), MoE expert FFN, tokens=7689 hidden=2048 8 experts top-2				1.6 ms	6.0 ms	26.0%
	#7 (SGLang), MoE expert FFN, tokens=15809 hidden=2048 8 experts top-2				3.2 ms	14.4 ms	22.3%
	#8 (rtp-llm TP1), text decoder MoE expert FFN, hidden_states=(1,2048) 128 experts top-8 intermediate=768				14.3 us	225.9 us	6.3%
	#9 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(2400,2048) 128 experts top-8 intermediate=768				757.0 us	3698.5 us	20.5%
	#10 (SGLang TP1), text decoder MoE expert FFN, hidden_states=(4195,2048) 128 experts top-8 intermediate=768				1.3 ms	5.8 ms	22.7%
	#11 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(8192,2048) 128 experts top-8 intermediate=768				2.6 ms	10.6 ms	24.5%
	#12 (SGLang TP1), text decoder MoE expert FFN, hidden_states=(15809,2048) 128 experts top-8 intermediate=768				5.0 ms	22.2 ms	22.5%
	#13 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(25632,2048) 128 experts top-8 intermediate=768				8.1 ms	40.1 ms	20.2%
	#14 (vLLM TP1), text decoder MoE expert FFN, hidden_states=(45936,2048) 128 experts top-8 intermediate=768				14.5 ms	76.8 ms	18.9%
	#15 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(520,4096) 128 experts top-8 intermediate=384				230.3 us	1262.0 us	18.3%
	#16 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(1019,4096) 128 experts top-8 intermediate=384				322.6 us	1891.0 us	17.1%
	#17 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(2044,4096) 128 experts top-8 intermediate=384				646.0 us	3325.7 us	19.4%
	#18 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(3936,4096) 128 experts top-8 intermediate=384				1.2 ms	5.7 ms	21.9%
	#19 (vLLM TP4), text decoder MoE expert FFN, hidden_states=(8192,4096) 128 experts top-8 intermediate=384				2.6 ms	11.6 ms	22.3%
	#20 (SGLang), MoE expert FFN, tokens=4098 hidden=4096 8 experts top-2				3.3 ms	11.6 ms	28.7%
	#21 (SGLang), MoE expert FFN, tokens=8179 hidden=4096 8 experts top-2				6.6 ms	32.3 ms	20.6%
	#22 (SGLang), MoE expert FFN, tokens=15381 hidden=4096 8 experts top-2				12.5 ms	71.5 ms	17.5%
▶	`l2_norm`	bf16	prefill/decode	13	13.5 us	59.0 us	21.8%
	#0 (vLLM TP1), L2 norm, tokens=1 hidden=2048				0.0 us	7.6 us	0.0%
	#1 (vLLM TP1), L2 norm, tokens=128 hidden=2048				0.2 us	7.7 us	2.6%
	#2 (vLLM TP1), L2 norm, tokens=256 hidden=2048				0.4 us	7.9 us	5.1%
	#3 (vLLM TP1), L2 norm, tokens=480 hidden=2048				0.7 us	9.0 us	8.2%
	#4 (vLLM TP1), L2 norm, tokens=2304 hidden=2048				3.6 us	21.5 us	16.6%
	#5 (vLLM TP1), L2 norm, tokens=4064 hidden=2048				6.3 us	30.0 us	20.9%
	#6 (vLLM TP1), L2 norm, tokens=8736 hidden=2048				13.5 us	59.0 us	22.9%
	#7 (vLLM TP1), L2 norm, tokens=16608 hidden=2048				25.7 us	117.2 us	21.9%
	#8 (vLLM TP1), L2 norm, tokens=24480 hidden=2048				37.8 us	173.3 us	21.8%
	#9 (vLLM TP1), L2 norm, tokens=48704 hidden=2048				75.3 us	315.9 us	23.8%
	#10 (vLLM TP1), L2 norm, tokens=65792 hidden=2048				101.7 us	430.0 us	23.6%
	#11 (vLLM TP1), L2 norm, tokens=98880 hidden=2048				152.8 us	635.0 us	24.1%
	#12 (vLLM TP1), L2 norm, tokens=131072 hidden=2048				202.6 us	853.1 us	23.7%
▶	`fp8_dynamic_per_token_quant`	fp8_e4m3	prefill/decode	20	4.7 us	17.5 us	22.5%
	#0 (vLLM TP1), FP8 dynamic per-token quantization, input=(1,2048)				0.0 us	7.6 us	0.0%
	#1 (vLLM TP1), FP8 dynamic per-token quantization, input=(608,2048)				0.7 us	8.2 us	8.7%
	#2 (vLLM TP1), FP8 dynamic per-token quantization, input=(865,2048)				1.0 us	9.1 us	11.0%
	#3 (vLLM TP1), FP8 dynamic per-token quantization, input=(2032,2048)				2.4 us	14.0 us	16.9%
	#4 (vLLM TP1), FP8 dynamic per-token quantization, input=(4144,2048)				4.8 us	21.4 us	22.5%
	#5 (vLLM TP1), FP8 dynamic per-token quantization, input=(6920,2048)				8.0 us	31.9 us	25.2%
	#6 (vLLM TP1), FP8 dynamic per-token quantization, input=(16256,2048)				18.9 us	73.6 us	25.6%
	#7 (vLLM TP1), FP8 dynamic per-token quantization, input=(24712,2048)				28.7 us	105.7 us	27.1%
	#8 (vLLM TP1), FP8 dynamic per-token quantization, input=(45568,2048)				52.9 us	183.5 us	28.8%
	#10 (SGLang TP8), FP8 dynamic per-token quantization, input=(1,4096)				0.0 us	7.6 us	0.0%
	#11 (SGLang TP8), FP8 dynamic per-token quantization, input=(127,4096)				0.3 us	7.7 us	3.8%
	#12 (SGLang TP8), FP8 dynamic per-token quantization, input=(255,4096)				0.6 us	7.9 us	7.5%
	#13 (SGLang TP8), FP8 dynamic per-token quantization, input=(513,4096)				1.2 us	9.3 us	12.8%
	#14 (SGLang TP8), FP8 dynamic per-token quantization, input=(1029,4096)				2.4 us	11.9 us	20.1%
	#15 (SGLang TP8), FP8 dynamic per-token quantization, input=(2041,4096)				4.7 us	17.5 us	27.0%
	#16 (SGLang TP8), FP8 dynamic per-token quantization, input=(4104,4096)				9.5 us	28.7 us	33.2%
	#17 (SGLang TP8), FP8 dynamic per-token quantization, input=(8208,4096)				19.0 us	51.9 us	36.7%
	#18 (rtp-llm TP2), FP8 dynamic per-token quantization, input=(1,5120)				0.0 us	7.6 us	0.0%
	#19 (SGLang/vLLM), FP8 dynamic per-token quantization, input=(1688,7168)				6.8 us	22.0 us	31.1%
	#20 (SGLang/vLLM), FP8 dynamic per-token quantization, input=(3804,7168)				15.4 us	42.4 us	36.4%
▶	`silu_and_mul`	bf16	prefill/decode	37	8.7 us	31.4 us	24.0%
	#0 (rtp-llm TP1), text decoder MoE expert gated SiLU, shape=(1,1536)				0.0 us	7.6 us	0.0%
	#1 (SGLang TP1), text decoder gated SiLU activation, shape=(1630,1536)				1.4 us	12.2 us	11.6%
	#2 (vLLM TP1), text decoder gated SiLU activation, shape=(4688,1536)				4.1 us	21.2 us	19.2%
	#3 (vLLM TP1), text decoder gated SiLU activation, shape=(10704,1536)				9.3 us	38.1 us	24.4%
	#4 (vLLM TP1), text decoder gated SiLU activation, shape=(16024,1536)				13.9 us	52.6 us	26.5%
	#5 (vLLM TP1), text decoder gated SiLU activation, shape=(24656,1536)				21.4 us	76.2 us	28.1%
	#6 (vLLM TP1), text decoder gated SiLU activation, shape=(49032,1536)				42.6 us	143.8 us	29.6%
	#7 (vLLM TP1), text decoder gated SiLU activation, shape=(128632,1536)				111.8 us	362.7 us	30.8%
	#8 (vLLM TP1), text decoder gated SiLU activation, shape=(175680,1536)				152.7 us	490.7 us	31.1%
	#9 (vLLM), gated SiLU activation, shape=(1,2048)				0.0 us	7.5 us	0.0%
	#10 (sglang TP1), text decoder gated SiLU, shape=(183,2048)				0.2 us	7.7 us	2.7%
	#11 (sglang TP1), text decoder gated SiLU, shape=(1113,2048)				1.3 us	9.3 us	13.9%
	#12 (vLLM), gated SiLU activation, shape=(1,2560)				0.0 us	7.5 us	0.0%
	#13 (vLLM), gated SiLU activation, shape=(1736,3072)				3.0 us	15.2 us	19.9%
	#14 (vLLM), gated SiLU activation, shape=(5000,3072)				8.7 us	31.4 us	27.7%
	#15 (vLLM TP4), text decoder gated SiLU activation, shape=(11570,3072)				20.1 us	63.2 us	31.8%
	#16 (vLLM TP4), text decoder gated SiLU activation, shape=(14320,3072)				24.9 us	102.2 us	24.4%
	#17 (SGLang), gated SiLU activation, shape=(128,4096)				0.3 us	7.7 us	3.9%
	#18 (SGLang), gated SiLU activation, shape=(256,4096)				0.6 us	8.0 us	7.4%
	#19 (SGLang), gated SiLU activation, shape=(512,4096)				1.2 us	9.1 us	13.1%
	#20 (vLLM), gated SiLU activation, shape=(1023,4096)				2.4 us	12.5 us	19.0%
	#21 (vLLM TP4), text decoder gated SiLU activation, shape=(2044,4096)				4.7 us	19.5 us	24.3%
	#22 (vLLM), gated SiLU activation, shape=(4093,4096)				9.5 us	33.1 us	28.7%
	#23 (vLLM TP4), text decoder gated SiLU activation, shape=(8192,4096)				19.0 us	58.3 us	32.6%
	#24 (vLLM), gated SiLU activation, shape=(16418,4096)				38.1 us	109.6 us	34.7%
	#25 (vLLM), gated SiLU activation, shape=(20538,4096)				47.6 us	135.7 us	35.1%
	#26 (vLLM), gated SiLU activation, shape=(528,5120)				1.5 us	12.3 us	12.4%
	#27 (vLLM), gated SiLU activation, shape=(1056,5120)				3.1 us	19.4 us	15.8%
	#28 (vLLM), gated SiLU activation, shape=(1910,5120)				5.5 us	29.5 us	18.8%
	#29 (SGLang), gated SiLU activation, shape=(4177,5120)				12.1 us	57.0 us	21.2%
	#30 (vLLM), gated SiLU activation, shape=(8192,5120)				23.7 us	105.1 us	22.6%
	#31 (vLLM), gated SiLU activation, shape=(540,8192)				2.5 us	12.9 us	19.4%
	#32 (vLLM), gated SiLU activation, shape=(1003,8192)				4.7 us	19.4 us	24.0%
	#33 (vLLM), gated SiLU activation, shape=(2046,8192)				9.5 us	33.2 us	28.6%
	#34 (SGLang), gated SiLU activation, shape=(4098,8192)				19.0 us	58.9 us	32.3%
	#35 (SGLang), gated SiLU activation, shape=(8179,8192)				37.9 us	110.8 us	34.2%
	#36 (SGLang), gated SiLU activation, shape=(15381,8192)				71.3 us	202.6 us	35.2%
▶	`attention_forward`	bf16	prefill	21	331.6 us	1173.6 us	28.7%
	#0 (vLLM TP1), prefill attention forward, q/k/v=(720,16,72)				10.3 us	68.4 us	15.0%
	#1 (vLLM TP1), prefill attention forward, q/k/v=(1200,16,72)				28.5 us	108.1 us	26.4%
	#2 (vLLM TP4), prefill attention forward, q/k/v=(2116,16,72)				88.7 us	332.7 us	26.7%
	#3 (vLLM TP1), prefill attention forward, q/k/v=(3844,16,72)				292.7 us	1018.0 us	28.7%
	#4 (vLLM TP4), prefill attention forward, q/k/v=(8136,16,72)				1.3 ms	3.8 ms	34.6%
	#5 (vLLM TP1), prefill attention forward, q/k/v=(17296,16,72)				5.9 ms	17.1 ms	34.7%
	#6 (vLLM TP1), prefill attention forward, q/k/v=(24368,16,72)				11.8 ms	33.3 ms	35.3%
	#7 (vLLM TP1), prefill attention forward, q/k/v=(49596,16,72)				48.7 ms	134.5 ms	36.2%
	#8 (vLLM TP1), prefill attention forward, q/k/v=(65556,16,72)				85.1 ms	237.7 ms	35.8%
	#9 (vLLM TP1), prefill attention forward, q/k/v=(16742,32,128)				19.7 ms	37.3 ms	52.9%
	#10 (vLLM TP1), prefill attention forward, q/k/v=(30793,32,128)				66.8 ms	124.2 ms	53.8%
	#11 (vLLM TP4), prefill attention forward, q/k/v=(276,4,72)				0.4 us	43.2 us	0.9%
	#12 (vLLM TP4), prefill attention forward, q/k/v=(600,4,72)				1.8 us	47.4 us	3.8%
	#13 (vLLM TP4), prefill attention forward, q/k/v=(1012,4,72)				5.1 us	57.0 us	8.9%
	#14 (vLLM TP4), prefill attention forward, q/k/v=(2024,4,72)				20.3 us	105.5 us	19.2%
	#15 (vLLM TP4), prefill attention forward, q/k/v=(4100,4,72)				83.2 us	308.7 us	27.0%
	#16 (vLLM TP4), prefill attention forward, q/k/v=(8184,4,72)				331.6 us	1173.6 us	28.3%
	#17 (vLLM TP4), prefill attention forward, q/k/v=(15476,4,72)				1.2 ms	3.7 ms	31.7%
	#18 (vLLM TP4), prefill attention forward, q/k/v=(24952,4,72)				3.1 ms	8.5 ms	36.1%
	#19 (vLLM TP4), prefill attention forward, q/k/v=(40560,4,72)				8.1 ms	21.9 ms	37.2%
	#20 (rtp-llm TP1), prefill attention forward, q/k/v=(1,8,128)				0.0 us	14.4 us	0.0%
▶	`block_scaled_mm`	fp8_e4m3	prefill/decode	24	1.8 ms	5.9 ms	28.9%
	#0 (sglang TP1), FP8 block-scale GEMM, input=(349,2048) group=128x128				6.4 us	61.8 us	10.3%
	#1 (vLLM TP1), FP8 block-scale GEMM (triton), m=1024				18.7 us	73.9 us	25.3%
	#2 (vLLM TP4), FP8 block-scaled GEMM inferred, m=4533 n=2048 k=2048				82.7 us	300.0 us	27.6%
	#3 (vLLM TP4), FP8 block-scaled GEMM inferred, m=6918 n=2048 k=2048				126.2 us	440.4 us	28.6%
	#4 (vLLM TP1), FP8 block-scale GEMM (triton), m=20480				373.5 us	1237.3 us	30.2%
	#5 (vLLM TP1), FP8 block-scale GEMM (triton), m=51200				933.6 us	3098.0 us	30.1%
	#6 (vLLM TP1), FP8 block-scale GEMM (triton), m=65536				1.2 ms	4.0 ms	30.1%
	#7 (vLLM TP1), FP8 block-scale GEMM (triton), m=98304				1.8 ms	5.9 ms	30.2%
	#8 (vLLM TP1), FP8 block-scale GEMM (triton), m=118784				2.2 ms	7.2 ms	30.2%
	#9 (vLLM TP1), FP8 block-scale GEMM (triton), m=250880				4.6 ms	15.1 ms	30.2%
	#10 (vLLM TP1), FP8 block-scale GEMM (triton), m=516096				9.4 ms	31.1 ms	30.2%
	#11 (vLLM TP1), FP8 block-scale GEMM (triton), m=704512				12.8 ms	42.5 ms	30.2%
	#12 (sglang TP1), FP8 block-scale GEMM, input=(237,4096) group=128x128				17.3 us	118.3 us	14.6%
	#13 (SGLang TP8), FP8 block-scale GEMM, m=3072				224.1 us	801.9 us	27.9%
	#14 (SGLang TP8), FP8 block-scale GEMM, m=4096				298.8 us	1059.5 us	28.2%
	#15 (SGLang TP8), FP8 block-scale GEMM, m=8192				597.6 us	2090.1 us	28.6%
	#16 (SGLang TP8), FP8 block-scale GEMM, m=16384				1.2 ms	4.2 ms	28.5%
	#17 (SGLang TP8), FP8 block-scale GEMM, m=24576				1.8 ms	6.2 ms	28.8%
	#18 (SGLang TP8), FP8 block-scale GEMM, m=49152				3.6 ms	12.4 ms	28.8%
	#19 (SGLang TP8), FP8 block-scale GEMM, m=65536				4.8 ms	16.6 ms	28.9%
	#20 (SGLang TP8), FP8 block-scale GEMM, m=98304				7.2 ms	24.8 ms	28.9%
	#21 (SGLang TP8), FP8 block-scale GEMM, m=131072				9.6 ms	33.0 ms	28.9%
	#22 (SGLang TP8), FP8 block-scale GEMM, m=258048				18.8 ms	65.1 ms	28.9%
	#23 (SGLang TP8), FP8 block-scale GEMM, m=299520				21.8 ms	75.6 ms	28.9%
▶	`linear_sigmoid_mul`	bf16	prefill/decode	9	577.1 us	1260.5 us	30.8%
	#0 (SGLang TP inferred), shared expert gate, hidden_states=(8037,4096) out=4096				1.2 ms	1.6 ms	72.3%
	#1 (SGLang TP inferred), shared expert gate, hidden_states=(6717,4096) out=4096				968.9 us	1265.8 us	76.5%
	#2 (SGLang TP inferred), shared expert gate, hidden_states=(4001,4096) out=4096				577.1 us	2059.3 us	28.0%
	#3 (SGLang TP inferred), shared expert gate, hidden_states=(15381,4096) out=4096				2.2 ms	7.2 ms	30.8%
	#4 (SGLang TP inferred), shared expert gate, hidden_states=(6576,4096) out=4096				948.5 us	1260.5 us	75.3%
	#5 (SGLang TP1), shared expert gate, hidden_states=(1,2048) out=2048				3.2 us	20.9 us	15.2%
	#6 (SGLang TP1), shared expert gate, hidden_states=(90,2048) out=2048				6.5 us	32.2 us	20.2%
	#7 (SGLang TP1), shared expert gate, hidden_states=(13,2048) out=2048				3.2 us	17.3 us	18.6%
	#8 (SGLang TP8), shared expert gate, hidden_states=(2748,4096) out=4096				396.4 us	564.5 us	70.2%
▶	`fused_add_rms_norm`	bf16	prefill/decode	19	6.3 us	16.8 us	37.0%
	#0 (rtp-llm), fused add RMSNorm, tokens=24752 hidden=1280				47.8 us	105.2 us	45.5%
	#1 (vLLM TP1), fused add RMSNorm, tokens=126 hidden=2048				0.4 us	7.8 us	5.0%
	#2 (vLLM TP1), fused add RMSNorm, tokens=257 hidden=2048				0.8 us	8.4 us	9.5%
	#3 (vLLM TP1), fused add RMSNorm, tokens=508 hidden=2048				1.6 us	8.4 us	18.7%
	#4 (vLLM TP1), fused add RMSNorm, tokens=1024 hidden=2048				3.2 us	11.6 us	27.3%
	#5 (vLLM TP1), fused add RMSNorm, tokens=2032 hidden=2048				6.3 us	16.8 us	37.4%
	#6 (vLLM TP1), fused add RMSNorm, tokens=4096 hidden=2048				12.7 us	27.9 us	45.4%
	#7 (vLLM TP1), fused add RMSNorm, tokens=141 hidden=5120				1.1 us	8.1 us	13.5%
	#8 (vLLM TP1), fused add RMSNorm, tokens=284 hidden=5120				2.2 us	10.1 us	21.8%
	#9 (vLLM TP1), fused add RMSNorm, tokens=558 hidden=5120				4.3 us	13.8 us	31.2%
	#10 (vLLM TP1), fused add RMSNorm, tokens=1024 hidden=5120				7.9 us	20.7 us	38.3%
	#11 (vLLM TP1), fused add RMSNorm, tokens=1720 hidden=5120				13.3 us	31.2 us	42.6%
	#12 (vLLM TP1), fused add RMSNorm, tokens=4205 hidden=5120				32.5 us	66.1 us	49.2%
	#13 (SGLang TP8), fused add RMSNorm, tokens=53 hidden=7168				0.6 us	7.9 us	7.3%
	#14 (SGLang TP8), fused add RMSNorm, tokens=141 hidden=7168				1.5 us	8.8 us	17.4%
	#15 (SGLang TP8), fused add RMSNorm, tokens=632 hidden=7168				6.8 us	18.5 us	37.0%
	#16 (SGLang TP8), fused add RMSNorm, tokens=917 hidden=7168				9.9 us	24.7 us	40.2%
	#17 (SGLang/vLLM), fused add RMSNorm, tokens=1688 hidden=7168				18.3 us	40.0 us	45.7%
	#18 (SGLang/vLLM), fused add RMSNorm, tokens=3804 hidden=7168				41.2 us	81.2 us	50.7%
▶	`fused_rmsnorm_quant`	fp8_e4m3	prefill/decode	11	8.8 us	23.4 us	37.7%
	#0 (sglang TP1), fused RMSNorm + FP8 quant, input=(117,2048)				0.3 us	7.9 us	4.1%
	#1 (sglang TP1), fused RMSNorm + FP8 quant, input=(837,2048)				2.3 us	11.1 us	20.5%
	#2 (sglang TP1), fused RMSNorm + FP8 quant, input=(1769,2048)				4.8 us	15.9 us	30.1%
	#3 (SGLang TP1), fused RMSNorm + FP8 quant, input=(3260,2048)				8.8 us	23.4 us	37.7%
	#4 (SGLang TP1), fused RMSNorm + FP8 quant, input=(11399,2048)				30.8 us	65.1 us	47.4%
	#5 (SGLang TP1), fused RMSNorm + FP8 quant, input=(26080,2048)				70.6 us	140.9 us	50.1%
	#6 (SGLang TP8), fused RMSNorm + FP8 quant, input=(650,4096)				3.5 us	13.5 us	26.1%
	#7 (SGLang TP8), fused RMSNorm + FP8 quant, input=(1013,4096)				5.5 us	16.6 us	33.0%
	#8 (SGLang TP8), fused RMSNorm + FP8 quant, input=(2058,4096)				11.1 us	27.4 us	40.7%
	#9 (SGLang TP8), fused RMSNorm + FP8 quant, input=(4083,4096)				22.1 us	47.0 us	47.0%
	#10 (SGLang TP8), fused RMSNorm + FP8 quant, input=(8208,4096)				44.4 us	86.5 us	51.3%
▶	`fp8_blockscale_fused_moe`	fp8_e4m3	prefill/decode	6	1.7 ms	4.1 ms	40.7%
	#0 (sglang TP1), FP8 block-scale MoE expert FFN, hidden_states=(30821,2048) 128 experts top-8				5.0 ms	12.6 ms	39.7%
	#1 (sglang TP2), FP8 block-scale MoE expert FFN, hidden_states=(3804,7168) 256 experts top-8				5.8 ms	13.5 ms	42.7%
	#2 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=0				66.5 us	242.6 us	27.4%
	#3 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=1				1.4 ms	2.1 ms	63.7%
	#4 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=0				28.5 us	238.4 us	12.0%
	#5 (SGLang TP8), FP8 block-scale MoE expert FFN, grid_x=80				1.7 ms	4.1 ms	40.7%
▶	`per_token_group_quant_fp8`	fp8_e4m3	prefill/decode	19	87.9 us	199.6 us	44.6%
	#0 (vLLM TP1), FP8 per-token-group quantization, input=(1,2048) group_size=128				0.0 us	7.6 us	0.0%
	#1 (vLLM TP1), FP8 per-token-group quantization, input=(37504,2048) group_size=128				43.9 us	106.1 us	41.4%
	#2 (vLLM TP1), FP8 per-token-group quantization, input=(75008,2048) group_size=128				87.9 us	199.6 us	44.0%
	#3 (vLLM TP1), FP8 per-token-group quantization, input=(85632,2048) group_size=128				100.3 us	223.3 us	44.9%
	#4 (vLLM TP1), FP8 per-token-group quantization, input=(128192,2048) group_size=128				150.2 us	325.7 us	46.1%
	#5 (vLLM TP1), FP8 per-token-group quantization, input=(256896,2048) group_size=128				300.9 us	639.8 us	47.0%
	#6 (vLLM TP1), FP8 per-token-group quantization, input=(500736,2048) group_size=128				586.5 us	1258.4 us	46.6%
	#7 (vLLM TP1), FP8 per-token-group quantization, input=(1029056,2048) group_size=128				1.2 ms	2.6 ms	46.0%
	#8 (vLLM TP1), FP8 per-token-group quantization, input=(2058112,2048) group_size=128				2.4 ms	5.3 ms	45.2%
	#9 (vLLM TP1), FP8 per-token-group quantization, input=(3087168,2048) group_size=128				3.6 ms	7.1 ms	50.7%
	#10 (SGLang TP8), FP8 per-token-group quantization, input=(1,7168) group_size=128				0.0 us	7.6 us	0.0%
	#11 (SGLang TP8), FP8 per-token-group quantization, input=(726,7168) group_size=128				3.0 us	13.9 us	21.4%
	#12 (SGLang TP8), FP8 per-token-group quantization, input=(969,7168) group_size=128				4.0 us	15.6 us	25.4%
	#13 (SGLang TP8), FP8 per-token-group quantization, input=(1933,7168) group_size=128				7.9 us	23.1 us	34.3%
	#14 (SGLang TP8), FP8 per-token-group quantization, input=(4192,7168) group_size=128				17.2 us	47.6 us	36.1%
	#15 (SGLang TP8), FP8 per-token-group quantization, input=(8667,7168) group_size=128				35.5 us	91.5 us	38.8%
	#16 (SGLang TP8), FP8 per-token-group quantization, input=(16768,7168) group_size=128				68.7 us	154.1 us	44.6%
	#17 (SGLang TP8), FP8 per-token-group quantization, input=(40448,7168) group_size=128				165.8 us	355.8 us	46.6%
	#18 (SGLang TP8), FP8 per-token-group quantization, input=(58688,7168) group_size=128				240.6 us	515.1 us	46.7%

Score bands (S): > 100% | 33–100% | 12–33% | < 12%

S = T_SOL / T_Prod. Higher = closer to hardware limit.
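
The score definition above can be checked against individual table rows. The following is a minimal sketch, not the benchmark's own tooling: the function names `sol_score` and `score_band` are hypothetical, the two time columns in each shape row are assumed to be T_SOL followed by T_Prod, and the handling of values that fall exactly on a band edge is an assumption, since the legend does not say which side they belong to.

```python
def sol_score(t_sol_us: float, t_prod_us: float) -> float:
    """Return S = T_SOL / T_Prod as a fraction (1.0 means 100%)."""
    if t_prod_us <= 0:
        raise ValueError("T_Prod must be positive")
    return t_sol_us / t_prod_us


def score_band(s: float) -> str:
    """Map a score to one of the four legend bands.

    Assumption: exact boundary values (0.12, 0.33, 1.0) are assigned
    to the middle bands; the legend itself does not specify this.
    """
    if s > 1.0:
        return "> 100%"
    if s >= 0.33:
        return "33–100%"
    if s >= 0.12:
        return "12–33%"
    return "< 12%"


# (label, T_SOL in us, T_Prod in us), copied from rows in the table above.
rows = [
    ("silu_and_mul #14", 8.7, 31.4),
    ("fused_add_rms_norm #18", 41.2, 81.2),
    ("attention_forward #11", 0.4, 43.2),
]

for label, t_sol, t_prod in rows:
    s = sol_score(t_sol, t_prod)
    print(f"{label}: S = {s:.1%} ({score_band(s)})")
# silu_and_mul #14: S = 27.7% (12–33%)
# fused_add_rms_norm #18: S = 50.7% (33–100%)
# attention_forward #11: S = 0.9% (< 12%)
```

Because the table shows times rounded to one decimal place, recomputing S from the displayed values can differ from the listed percentage by a few tenths of a point on the smallest shapes. The per-operator summary rows do not always equal the ratio of their own two time columns, so this sketch applies only to individual shape rows.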