Early Stage

Atrex-Bench

End-to-End Operator Generation Benchmark for Multiple DSLs

Production-trace driven benchmark for evaluating "agent platform + model" stacks on GPU kernel generation.

Production Operators

440

Hot Shapes

1.3k

Prod. Profiles

Why Atrex-Bench?

A production-grade kernel benchmark has to meet three requirements. The shape distribution must be whatever a serving stack actually invokes under real traffic, not a synthetic grid. Kernels must be weighted by importance, because they do not contribute equally to production time. And the methodology must yield a unitless score that answers "how close to the hardware ceiling is this kernel?"

Production-Trace Sourced

Operators and shapes are derived from real-world inference traces across vLLM, SGLang, AITER, and RTP-LLM. The benchmark is periodically refreshed.

Importance-Weighted Scoring

Each operator carries an importance weight derived from production frequency × time share. A fused attention path consuming 39.5% of GPU time gets 39.5% weight in the aggregate score, so the benchmark answers "for what fraction of production wall time would the candidate's kernels be good enough?"
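The weighting idea can be illustrated with a minimal Python sketch. It is not Atrex-Bench's actual scoring code: the function name `weighted_coverage`, the operator names other than the fused attention path, their weights, and the per-operator pass/fail verdicts are all hypothetical, and how a kernel is judged "good enough" is left outside the sketch.

```python
def weighted_coverage(weights, passed):
    """Return the weight fraction belonging to operators judged good enough.

    weights: operator name -> importance weight (any non-negative scale).
    passed:  operator name -> True if the candidate kernel is good enough.
    Operators missing from `passed` count as not good enough.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    covered = sum(w for op, w in weights.items() if passed.get(op, False))
    return covered / total


# Hypothetical weights in percent of GPU time; only the 39.5% figure
# for fused attention comes from the text above.
weights = {"fused_attention": 39.5, "gemm": 40.5, "rmsnorm": 20.0}

# Hypothetical verdicts for one candidate stack.
passed = {"fused_attention": True, "gemm": False, "rmsnorm": True}

coverage = weighted_coverage(weights, passed)
print(f"covered fraction of production time: {coverage:.3f}")
```

With these made-up inputs the candidate covers the attention path and the normalization kernel, so the aggregate reflects 59.5% of the weighted production time, even though it passes two of three operators.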

Roofline-Normalized Metrics

Performance is scored as roofline achievement S = T_roofline / T_candidate per (op, shape), where T_roofline is the hardware Speed-of-Light lower bound. This unitless score measures absolute hardware utilization rather than relative speedup against a reference baseline.
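As a rough illustration of the score, the sketch below computes S for a single (op, shape) pair. It assumes the classic roofline model, in which the lower bound is the larger of the compute time and the memory-traffic time; the source does not state how Atrex-Bench derives T_roofline, so that formula, the function names, and every hardware and workload number here are assumptions. Because T_roofline is a lower bound, S falls in (0, 1], and S = 1 would mean the kernel runs at the hardware ceiling.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower-bound runtime under the classic roofline model (an assumption).

    The kernel can finish no faster than its arithmetic at peak compute
    throughput, nor faster than its memory traffic at peak bandwidth.
    """
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


def roofline_score(t_roofline_s, t_candidate_s):
    """Unitless roofline achievement S = T_roofline / T_candidate."""
    if t_candidate_s <= 0:
        raise ValueError("candidate time must be positive")
    return t_roofline_s / t_candidate_s


# Hypothetical device: 1e15 FLOP/s peak compute, 2e12 B/s peak bandwidth.
# Hypothetical workload: 2e12 FLOPs and 1e9 bytes of memory traffic.
t_roof = roofline_time_s(2e12, 1e9, 1e15, 2e12)

# Hypothetical measured candidate runtime of 4 ms.
score = roofline_score(t_roof, 4e-3)
print(f"T_roofline = {t_roof:.4f} s, S = {score:.2f}")
```

In this made-up case the workload is compute-bound (2 ms of arithmetic against 0.5 ms of memory traffic), so a 4 ms candidate achieves half of the hardware ceiling regardless of how fast any reference implementation happens to be.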