Benchmark Data

The benchmark covers 30 production operators with 440 hot shapes, importance weights derived from 1,303 production profiles, roofline bounds for 2 hardware platforms, and deployed kernel baselines.
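As a rough illustration of what a roofline bound means for one operator shape, the sketch below takes the larger of the compute-limited and memory-limited times. The function name, its parameters, and the hardware figures are all hypothetical; the benchmark's actual platforms and bound calculation are not described here.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower bound on kernel time under a simple roofline model.

    All inputs are assumptions supplied by the caller; none of the
    values used below come from the benchmark data.
    """
    compute_time = flops / peak_flops_per_s        # time if compute-bound
    memory_time = bytes_moved / peak_bytes_per_s   # time if bandwidth-bound
    return max(compute_time, memory_time)


# Hypothetical memory-bound shape on a hypothetical device:
# 1e9 FLOPs, 4e9 bytes moved, 1e15 FLOP/s peak, 2e12 B/s peak bandwidth.
bound = roofline_time_s(1e9, 4e9, 1e15, 2e12)
```

With these made-up numbers the memory term dominates, so the bound equals the time needed to move the bytes at peak bandwidth.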

30 operators
Operator                      ID         dtype     Shapes  Importance
unified_attention             atrex_030  bf16          25       39.5%
fused_moe                     atrex_009  bf16          23       11.4%
block_scaled_mm               atrex_002  fp8_e4m3      24        9.3%
fp8_blockscale_fused_moe      atrex_006  fp8_e4m3       6        5.1%
paged_attention_decode        atrex_024  bf16           8        4.4%
reshape_and_cache             atrex_026  bf16           8        4.4%
topk_filter                   atrex_029  fp32           8        3.5%
gated_delta_rule_update       atrex_013  bf16          17        3.5%
fused_qkv_rope                atrex_011  fp16           6        3.4%
rms_norm                      atrex_027  bf16          56        2.8%
mla_decode_attention          atrex_018  bf16           3        2.3%
attention_forward             atrex_001  bf16          21        2.2%
causal_conv1d                 atrex_003  bf16           9        2.2%
moe_topk_gating_softmax       atrex_022  fp32          10        2.1%
fused_qk_rmsnorm              atrex_010  fp16           4        1.9%
silu_and_mul                  atrex_028  bf16          37        1.8%
chunk_gated_delta_rule_state  atrex_005  bf16          25        1.6%
fused_add_rms_norm            atrex_008  bf16          19        1.1%
moe_align_block_size          atrex_019  int32          6        1.1%
mrope                         atrex_023  bf16           5        1.0%
chunk_delta_rule_output       atrex_004  bf16          16        0.8%
fp8_dynamic_per_token_quant   atrex_007  fp8_e4m3      20        0.8%
linear_sigmoid_mul            atrex_017  bf16           9        0.7%
per_token_group_quant_fp8     atrex_025  fp8_e4m3      19        0.5%
gated_rms_norm                atrex_014  bf16           7        0.5%
l2_norm                       atrex_015  bf16          13        0.5%
moe_count_and_sort            atrex_020  int32          5        0.5%
fused_rmsnorm_quant           atrex_012  fp8_e4m3      11        0.3%
moe_sum_reduce                atrex_021  bf16           7        0.3%
layer_norm                    atrex_016  bf16          13        0.1%

Operator Importance Distribution

The importance score reflects each operator's trace-derived share of GPU time across production workloads.
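One plausible reading of "GPU time share" is an operator's summed kernel time divided by the total kernel time in the traces. The sketch below shows that calculation on made-up records. The record layout, the function name, and the numbers are assumptions; the exact aggregation behind the published scores (for example, how the individual profiles are combined) is not stated here.

```python
from collections import defaultdict


def importance_shares(kernel_records):
    """Return each operator's fraction of total GPU time.

    kernel_records: iterable of (operator_name, gpu_time_us) pairs,
    a hypothetical flattened form of profile traces.
    """
    totals = defaultdict(float)
    for op, gpu_time_us in kernel_records:
        totals[op] += gpu_time_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {op: t / grand_total for op, t in totals.items()}


# Illustrative records only, not taken from the benchmark.
records = [
    ("rms_norm", 10.0),
    ("fused_moe", 30.0),
    ("rms_norm", 10.0),
    ("unified_attention", 50.0),
]
shares = importance_shares(records)
```

Under this definition the shares of a single trace set sum to 1, which makes them usable directly as weights when averaging per-operator results.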