Overview

A three-stage evaluation pipeline with isolated generation and evaluation environments.

Stage 0: Compilation

Import the candidate module and verify that compilation produces an artifact for each shape. For FlyDSL this means a completed ahead-of-time MLIR compilation, not merely a module that imports. The per-shape compile check is what makes this gate informative: the old import-level gate silently passed fallback modules.
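
The per-shape gate described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `compile_fn` is a hypothetical callable standing in for whatever triggers ahead-of-time compilation for one shape, and the assumption that a successful compile leaves a non-empty file on disk is ours.

```python
from pathlib import Path
from typing import Callable, Dict, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def compile_gate(
    compile_fn: Callable[[Shape], Optional[str]],  # hypothetical: returns artifact path or None
    shapes: Sequence[Shape],
) -> Dict[Shape, bool]:
    """Return a per-shape pass/fail map based on the presence of a compiled artifact."""
    results: Dict[Shape, bool] = {}
    for shape in shapes:
        try:
            artifact = compile_fn(shape)
        except Exception:
            # A compile error for this shape fails only this shape.
            results[shape] = False
            continue
        # Importing cleanly is not enough: require a real, non-empty artifact file.
        results[shape] = (
            artifact is not None
            and Path(artifact).is_file()
            and Path(artifact).stat().st_size > 0
        )
    return results
```

Checking for the artifact rather than for a clean import is the point of the sketch: a module that silently falls back to a non-compiled path returns no artifact and fails the gate.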

Stage 1: Correctness

Compare candidate output against the PyTorch reference over K random seeds (default K=5) with tolerances atol=1e-2, rtol=5e-2. A shape counts as correct only when all K seeds pass. This multi-seed check is essential because silent numeric mismatch (right shape and dtype, wrong values) is the dominant failure mode.
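
A minimal sketch of the all-seeds-must-pass rule, written in plain Python so it runs without a GPU. The names `candidate`, `reference`, and `make_input` are hypothetical, and combining the two tolerances as |c - r| <= atol + rtol * |r| (the criterion `torch.allclose` uses) is an assumption; the text above gives only the tolerance values.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def close_enough(cand: Sequence[float], ref: Sequence[float],
                 atol: float = 1e-2, rtol: float = 5e-2) -> bool:
    """Elementwise |c - r| <= atol + rtol * |r| (assumed combination rule)."""
    if len(cand) != len(ref):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r) for c, r in zip(cand, ref))


def shape_is_correct(candidate: Callable[[Vector], Vector],
                     reference: Callable[[Vector], Vector],
                     make_input: Callable[[random.Random], Vector],
                     k: int = 5) -> bool:
    """A shape passes only if every one of the k seeded inputs passes."""
    for seed in range(k):
        x = make_input(random.Random(seed))  # fresh, reproducible input per seed
        if not close_enough(candidate(x), reference(x)):
            return False  # one failing seed fails the whole shape
    return True
```

Because a NaN never satisfies the inequality, a candidate that emits NaN with the right shape and dtype fails here rather than passing silently.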

Stage 2: Performance

Performance is measured in a controlled sandbox with frequency lock, an L2 flush on every iteration, and a separate subprocess per shape. Each shape runs 10 warmup iterations followed by 100 timed iterations; the median wall-clock time becomes T_candidate. Only kernels reaching this stage contribute to the roofline score S.
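
The warmup-then-median protocol can be sketched as below. This illustrates only the iteration counts and the median; it does not reproduce the frequency lock, L2 flush, or subprocess isolation, and a real GPU harness would also need to synchronize the device around each timed call. The function name is hypothetical.

```python
import statistics
import time
from typing import Callable


def median_time(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Run `warmup` untimed calls, then return the median of `iters` timed calls (seconds)."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The median is less sensitive than the mean to the occasional slow iteration, which is a plausible reason for choosing it here.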

Roofline Scoring Methodology

The roofline model expresses achievable performance as the minimum of compute-bound and memory-bandwidth-bound ceilings, parameterized by arithmetic intensity. Hardware peaks are calibrated per accelerator; per-problem semantic FLOPs and bytes are derived from the reference implementation. The per-accelerator roofline time T_roofline is stored in roofline.json and serves as the numerator in the scoring formula.

Per-Kernel Score

Roofline bound
T_roofline = max(W / P_peak, Q / B_peak)
Per-kernel achievement
S = T_roofline / T_candidate
Importance-weighted aggregate
S_agg = Σ w_i · S_i
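
A worked sketch of the three formulas, assuming W is the semantic FLOP count, Q the bytes moved, P_peak the peak compute rate, and B_peak the peak memory bandwidth. All numbers below are made up for illustration, and whether the weights w_i are normalized to sum to 1 is not stated here; the example simply assumes they are.

```python
from typing import Sequence


def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    """T_roofline = max(W / P_peak, Q / B_peak), in seconds."""
    return max(flops / peak_flops, bytes_moved / peak_bw)


def kernel_score(t_roofline: float, t_candidate: float) -> float:
    """S = T_roofline / T_candidate; 1.0 means the kernel runs at the roofline bound."""
    return t_roofline / t_candidate


def aggregate_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """S_agg = sum_i w_i * S_i."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical kernel: 1e12 FLOPs and 4e9 bytes on a device with
# 1e15 FLOP/s peak compute and 2e12 B/s peak bandwidth.
t_roof = roofline_time(1e12, 4e9, 1e15, 2e12)  # memory term (2 ms) exceeds compute term (1 ms)
s = kernel_score(t_roof, t_candidate=4e-3)     # candidate takes twice the bound
s_agg = aggregate_score([s, 1.0], weights=[0.75, 0.25])
```

In this example the memory term dominates, so the kernel is bandwidth-bound and a candidate taking 4 ms reaches about half of the ceiling.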

Core Metrics

compile_pass: Per-shape artifact-based compilation check
correctness: Multi-seed numerical correctness verification
S (roofline): T_roofline / T_candidate, i.e. how close the kernel is to the hardware ceiling
vs. production: Candidate time vs. deployed production kernel time

Anti-Hacking by Construction

The generation workspace never sees metadata.json (upstream symbol and provenance) or roofline.json (SOL bounds and W/Q values), so a model can neither retrieve the reference implementation by name nor fit its output to the scoring formula. What this isolation leaves available is specification shortcutting — satisfying the PyTorch reference with a non-DSL implementation — which the DSL-adoption metric makes visible rather than hiding.
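
One way to picture the isolation is a workspace builder that copies a problem directory while leaving out the two hidden files. This is a sketch under assumptions: the directory layout and the function name are hypothetical, and the real pipeline may enforce isolation differently (for example with separate containers).

```python
import shutil
from pathlib import Path

# Files the generation side must never see, per the description above.
HIDDEN = ("metadata.json", "roofline.json")


def build_generation_workspace(problem_dir: str, workspace_dir: str) -> Path:
    """Copy problem_dir to a new workspace_dir, excluding the hidden files."""
    # copytree requires that workspace_dir does not exist yet.
    shutil.copytree(problem_dir, workspace_dir,
                    ignore=shutil.ignore_patterns(*HIDDEN))
    # Defensive re-check: fail loudly if a hidden file slipped through.
    leaked = [p for p in Path(workspace_dir).rglob("*") if p.name in HIDDEN]
    if leaked:
        raise RuntimeError(f"hidden files leaked into workspace: {leaked}")
    return Path(workspace_dir)
```

Note that this only withholds the files; it does nothing about specification shortcutting, which is why a separate adoption metric is still needed.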