End-to-End Operator Generation Benchmark for Multiple DSLs
Production-trace-driven benchmark for evaluating "agent platform + model" stacks on GPU kernel generation.
A production-grade kernel benchmark has to meet three requirements: the shape distribution must be whatever a serving stack actually invokes under real traffic, not a synthetic grid; kernels must be weighted by importance, because not all kernels matter equally; and the methodology must yield a unitless score that answers "how close to the hardware ceiling is this kernel?"
Operators and shapes are derived from real-world inference traces across vLLM, SGLang, AITER, and RTP-LLM. The benchmark is periodically refreshed.
Each operator carries an importance weight derived from production frequency × time share. A fused attention path consuming 39.5% of GPU time gets 39.5% weight in the aggregate score, so the benchmark answers "for what fraction of production wall-time would the candidate's kernels be good enough?"
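As a rough illustration of how importance weights could feed an aggregate, the sketch below computes two candidate aggregates from per-operator scores in [0, 1]: a weighted mean, and the weighted share of operators that clear a "good enough" threshold. The exact aggregation rule is not specified here, so both formulas are assumptions. Apart from the 39.5% attention share, the operator names, weights, scores, and the 0.7 threshold are hypothetical values chosen only for the example.

```python
# Minimal sketch of importance-weighted aggregation (assumed formulas,
# not the benchmark's confirmed implementation).

def normalize(weights):
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {op: w / total for op, w in weights.items()}


def weighted_mean(weights, scores):
    """Importance-weighted mean of per-operator scores."""
    w = normalize(weights)
    return sum(w[op] * scores[op] for op in w)


def coverage(weights, scores, threshold):
    """Share of total weight held by operators scoring at or above threshold."""
    w = normalize(weights)
    return sum(w[op] for op in w if scores[op] >= threshold)


# Hypothetical inputs: only the 0.395 attention weight comes from the text.
weights = {"fused_attention": 0.395, "gemm": 0.405, "rmsnorm": 0.200}
scores = {"fused_attention": 0.80, "gemm": 0.50, "rmsnorm": 0.90}

print(f"weighted mean score: {weighted_mean(weights, scores):.4f}")
print(f"coverage at 0.7:     {coverage(weights, scores, 0.7):.4f}")
```

With these inputs the coverage figure reads as "roughly 60% of production GPU time is served by kernels that clear the threshold", which is the kind of question the weighting is meant to answer.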
Performance is scored as roofline achievement S = T_roofline / T_candidate per (op, shape), where T_roofline is the hardware Speed-of-Light lower bound. This unitless score measures absolute hardware utilization rather than relative speedup against a reference baseline.
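The score can be sketched in a few lines, under the assumption that the Speed-of-Light bound follows the classic roofline model: the kernel can finish no faster than the larger of its compute time and its memory-traffic time. How the benchmark actually derives T_roofline is not stated here, and the hardware peaks, the GEMM shape, and the measured candidate time below are hypothetical.

```python
# Minimal sketch of roofline achievement S = T_roofline / T_candidate,
# assuming a classic two-ceiling roofline model (compute and memory bandwidth).
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    peak_flops: float      # peak throughput in FLOP/s (hypothetical value below)
    peak_bandwidth: float  # peak memory bandwidth in bytes/s (hypothetical)


def roofline_time(flops, bytes_moved, hw):
    """Lower bound on runtime in seconds: the slower of compute and memory."""
    return max(flops / hw.peak_flops, bytes_moved / hw.peak_bandwidth)


def roofline_score(t_candidate, flops, bytes_moved, hw):
    """Unitless roofline achievement for one (op, shape) pair."""
    if t_candidate <= 0:
        raise ValueError("t_candidate must be positive")
    return roofline_time(flops, bytes_moved, hw) / t_candidate


# Hypothetical accelerator: 1 PFLOP/s peak compute, 2 TB/s memory bandwidth.
hw = Hardware(peak_flops=1e15, peak_bandwidth=2e12)

# Hypothetical shape: fp16 GEMM with M = N = K = 4096.
m = n = k = 4096
gemm_flops = 2 * m * n * k                 # one multiply and one add per MAC
gemm_bytes = 2 * (m * k + k * n + m * n)   # read A and B, write C, 2 bytes each

t_candidate = 2.0e-4  # hypothetical measured kernel time in seconds
score = roofline_score(t_candidate, gemm_flops, gemm_bytes, hw)
print(f"S = {score:.4f}")
```

Because the bound is a property of the operation and the hardware, not of any reference kernel, two candidates can be compared on the same scale even when no baseline implementation exists.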