A three-stage evaluation pipeline with isolated generation and evaluation environments.
Import the candidate module and verify that compilation produces an artifact for each shape. For FlyDSL this means a completed ahead-of-time MLIR compilation, not merely a module that imports. The per-shape compile pass is what makes this gate informative: the old import-level gate silently passed fallback modules.
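A minimal sketch of a per-shape compile gate, to contrast it with an import-level check. The `compile_for_shape` callable and the `fake_compile` stand-in are hypothetical names introduced for illustration; they are not FlyDSL APIs, and the real pipeline may define "artifact" differently.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def compile_gate(
    compile_for_shape: Callable[[Shape], Optional[bytes]],
    shapes: Sequence[Shape],
) -> Dict[Shape, bool]:
    """Return, per shape, whether compilation produced a non-empty artifact."""
    results: Dict[Shape, bool] = {}
    for shape in shapes:
        try:
            artifact = compile_for_shape(shape)
        except Exception:
            artifact = None  # a compile error counts as a failed shape
        results[shape] = bool(artifact)
    return results


# Hypothetical candidate: imports fine, but only compiles square shapes.
def fake_compile(shape: Shape) -> Optional[bytes]:
    if shape[0] != shape[1]:
        raise RuntimeError("unsupported shape")
    return b"artifact"


report = compile_gate(fake_compile, [(64, 64), (64, 128)])
```

An import-level gate would accept this module outright, while the per-shape report flags the one shape that never produced an artifact.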
Compare candidate output against the PyTorch reference over K random seeds (default K=5) with tolerances atol=1e-2, rtol=5e-2. A shape counts as correct only when all seeds pass. This multi-seed check is essential because silent numeric mismatch (right shape and dtype, wrong values) is the dominant failure mode.
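The all-seeds-must-pass rule can be sketched in plain Python as follows. This is an illustration under stated assumptions: the tolerances are combined as `|out - ref| <= atol + rtol * |ref|` (the convention `torch.allclose` uses), inputs are flat lists of floats rather than tensors, and the function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

ATOL = 1e-2
RTOL = 5e-2


def allclose(out: Sequence[float], ref: Sequence[float],
             atol: float = ATOL, rtol: float = RTOL) -> bool:
    """Elementwise |out - ref| <= atol + rtol * |ref| (assumed combination rule)."""
    if len(out) != len(ref):
        return False
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))


def shape_is_correct(
    candidate: Callable[[List[float]], List[float]],
    reference: Callable[[List[float]], List[float]],
    n: int,
    num_seeds: int = 5,
) -> bool:
    """A shape passes only if every seed's output is within tolerance."""
    for seed in range(num_seeds):
        rng = random.Random(seed)  # one independent input draw per seed
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if not allclose(candidate(x), reference(x)):
            return False
    return True


# Toy kernels: same output length in every case, but `bad` returns wrong values.
reference = lambda x: [2.0 * v for v in x]
good = lambda x: [2.0 * v + 1e-3 for v in x]
bad = lambda x: [2.0 * v + 0.5 for v in x]

good_ok = shape_is_correct(good, reference, n=16)
bad_ok = shape_is_correct(bad, reference, n=16)
```

Here `bad` matches the reference in length and element type and would survive a shape-only check; the value comparison is what rejects it.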
Candidate runtime is measured in a controlled sandbox with a frequency lock, a per-iteration L2 flush, and per-shape subprocess isolation. Each shape gets 10 warmup iterations followed by 100 timed iterations; the median wall-clock time becomes T_cand (written T_candidate in the scoring formula). Only kernels reaching this stage contribute to the roofline score S.
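The warmup-then-median protocol can be sketched with the standard library. This is only a host-side illustration: it does not implement the frequency lock, the L2 flush, or the subprocess isolation, and `median_time` is a hypothetical helper name. For a GPU kernel, the callable would also need to synchronize with the device before the timer is read.

```python
import statistics
import time
from typing import Callable, List


def median_time(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Run `fn` `warmup` times untimed, then return the median of `iters` timed runs."""
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Toy workload that records each call so the iteration count can be checked.
calls: List[int] = []
t_cand = median_time(lambda: calls.append(1))
```

The median is less sensitive than the mean to occasional slow iterations, which suits wall-clock samples.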
The roofline model expresses achievable performance as the minimum of the compute-bound and memory-bandwidth-bound ceilings, parameterized by arithmetic intensity. Hardware peaks are calibrated per accelerator; per-problem semantic FLOPs and bytes are derived from the reference implementation. The per-accelerator roofline time T_roofline is stored in roofline.json and serves as the numerator in the scoring formula.
T_roofline = max(W / P_peak, Q / B_peak)
S = T_roofline / T_candidate
S_agg = Σ_i w_i · S_i

compile_pass: Per-shape artifact-based compilation check
correctness: Multi-seed numerical correctness verification
S (roofline): T_roofline / T_candidate, how close the candidate is to the hardware ceiling
vs. production: Candidate time vs. deployed production kernel time
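A worked example of the scoring formulas, assuming W is the FLOP count, Q the bytes moved, P_peak the peak compute rate, and B_peak the peak memory bandwidth. All numbers below are made up for illustration and do not describe any real accelerator; whether the weights w_i are normalized is not stated here, so the sketch leaves them as given.

```python
import math
from typing import Sequence


def roofline_time(w_flops: float, q_bytes: float,
                  p_peak: float, b_peak: float) -> float:
    """T_roofline = max(W / P_peak, Q / B_peak), in seconds."""
    return max(w_flops / p_peak, q_bytes / b_peak)


def roofline_score(t_roofline: float, t_candidate: float) -> float:
    """S = T_roofline / T_candidate."""
    return t_roofline / t_candidate


def aggregate_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """S_agg = sum_i w_i * S_i (weights used as given, not normalized)."""
    return sum(w * s for w, s in zip(weights, scores))


# Illustrative values only: 1e12 FLOPs, 4e9 bytes, 1e15 FLOP/s, 2e12 B/s.
t_roof = roofline_time(w_flops=1e12, q_bytes=4e9, p_peak=1e15, b_peak=2e12)
# Compute term is 1e-3 s, memory term is 2e-3 s, so the problem is memory-bound.
s = roofline_score(t_roof, t_candidate=4e-3)
s_agg = aggregate_score([s, 0.25], weights=[0.5, 0.5])
```

With these inputs the candidate runs at half the roofline rate (S = 0.5), and the equally weighted aggregate over the two shapes is 0.375.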
The generation workspace never sees metadata.json (upstream symbol and provenance) or roofline.json (SOL bounds and W/Q values), so a model can neither retrieve the reference implementation by name nor fit its output to the scoring formula. What this isolation leaves available is specification shortcutting — satisfying the PyTorch reference with a non-DSL implementation — which the DSL-adoption metric makes visible rather than hiding.
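One way to picture the isolation is a workspace builder that copies a problem directory while leaving out the two hidden files. This is a sketch under assumptions, not the pipeline's actual mechanism: the directory layout, the `reference.py` file name, and `make_generation_workspace` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Files the generation side must never see, per the description above.
HIDDEN = ("metadata.json", "roofline.json")


def make_generation_workspace(problem_dir: Path, dest: Path) -> Path:
    """Copy `problem_dir` to `dest`, skipping the hidden evaluation files."""
    shutil.copytree(problem_dir, dest, ignore=shutil.ignore_patterns(*HIDDEN))
    return dest


# Build a throwaway problem directory and check what the workspace exposes.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "problem"
    src.mkdir()
    for name in ("reference.py", "metadata.json", "roofline.json"):
        (src / name).write_text("{}")
    workspace = make_generation_workspace(src, Path(tmp) / "workspace")
    visible = sorted(p.name for p in workspace.iterdir())
```

Only the reference survives the copy, which matches the point above: the PyTorch specification stays visible, while provenance and scoring inputs do not.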