
CLI Reference

skill-up provides the following commands, covering the full evaluation lifecycle: validate → run → list cases → generate reports → import legacy formats.


skill-up run

Run evaluation cases and produce reports.

bash
skill-up run [path] [flags]

Argument

| Argument | Description |
| --- | --- |
| path | Path to eval.yaml. Defaults to evals/eval.yaml in the current directory |

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --auto | false | Auto-detect the evals/ directory; can directly consume an Anthropic evals.json |
| --include-case-name | | Run only matching cases (glob; can be repeated) |
| --exclude-case-name | | Exclude matching cases (glob; can be repeated) |
| --format | | Extra report formats: junit / html (repeatable). result.json is always written; --format junit produces report.xml, --format html produces report.html; --format json is a no-op |
| --output-dir | Same dir as eval.yaml | Output directory for reports and artifacts |
| --iteration | 1 | Total run count. Each iteration writes to its own iteration-N/ subdirectory |
| --engine | From config | Override the engine name |
| --model | From config | Override the model (format: provider/name) |
| --parallelism | From config | Override cases.parallelism. Allowed range: 1–256 |
| --api-key | | Pass an API key (takes precedence over env vars) |
| -v, --verbose | 0 | Increase log verbosity. Default info; -v / --verbose / --verbose=true → debug; -vv / --verbose=2 → trace; --verbose=false disables extra detail |

Examples

bash
# Run all cases
skill-up run ./evals/eval.yaml

# Run a subset
skill-up run ./evals/eval.yaml --include-case-name "basic-*"

# Exclude cases
skill-up run ./evals/eval.yaml --exclude-case-name "*-old" --exclude-case-name "*-deprecated"

# Override engine and model
skill-up run ./evals/eval.yaml --engine codex --model openai/gpt-4

# Temporarily override case parallelism
skill-up run ./evals/eval.yaml --parallelism 4

# Multiple report formats
skill-up run ./evals/eval.yaml --format json --format html --format junit

# Three iterations, one folder per run
skill-up run ./evals/eval.yaml --iteration 3

# Auto-detect mode (consumes Anthropic evals.json directly)
skill-up run --auto
skill-up run --auto --engine codex
skill-up run ./my-skill/ --auto

Exit codes

| Exit code | Meaning |
| --- | --- |
| 0 | All cases passed |
| 1 | At least one case failed or errored |

Use the exit code in CI to determine whether the evaluation succeeded.
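Since the exit code alone is enough to gate a pipeline, a CI step can run the evaluation directly. A hypothetical GitHub Actions fragment (step names and the artifact action are illustrative; it assumes skill-up is installed on the runner and that report.xml lands next to eval.yaml, per the --output-dir default):

```yaml
# The job fails automatically when skill-up exits non-zero,
# i.e. when at least one case failed or errored.
- name: Run skill evaluations
  run: skill-up run ./evals/eval.yaml --format junit
- name: Upload JUnit report
  if: always()   # keep the report even when the run failed
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: evals/report.xml
```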

OTLP trace export

Telemetry is disabled by default. Once the standard OpenTelemetry environment variables are set, skill-up run exports run traces over OTLP and annotates verbose slog logs with trace_id / span_id:

bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_METRICS_EXPORTER=otlp
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local,service.namespace=skill-up
skill-up run ./evals/eval.yaml -v

You can also use trace-specific overrides like OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and OTEL_EXPORTER_OTLP_TRACES_PROTOCOL. Currently grpc and http/protobuf are supported.

OTEL_METRICS_EXPORTER=otlp enables low-cardinality metrics (counters and durations for run/case/runtime exec). OTEL_METRICS_EXPORTER=console is also available for local debugging. OTEL_RESOURCE_ATTRIBUTES is forwarded as resource attributes. Set OTEL_METRICS_EXPORTER=none to explicitly disable metrics.


skill-up validate

Validate the eval config files. Run it before skill-up run to catch configuration issues early.

bash
skill-up validate [path to eval.yaml]

Examples

bash
skill-up validate
skill-up validate ./evals/eval.yaml

On success:

text
✓ eval.yaml is valid (loaded 3 case(s))

The validator checks that:

  • eval.yaml and every referenced case file exist and parse correctly
  • All required fields are present
  • Field values are within the allowed range

skill-up list-cases

List every case referenced by an eval config — handy for quickly inspecting your suite.

bash
skill-up list-cases [path to eval.yaml]

Examples

bash
skill-up list-cases
skill-up list-cases ./evals/eval.yaml

Sample output:

text
ID                    Title                           Tag              Prompt
basic-success         Agent should find null bug      functional_test  Review the current diff and report ...
edge-case-empty       Handle empty input gracefully   functional_test  Review an empty repository with no ...
regression-001        Fix: no longer misreports       functional_test  Review the payment processor code ...

skill-up report

Regenerate reports from an existing result file without re-running the evaluation.

bash
skill-up report <path to result.json> [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --format | json | Report format: json / junit / html (repeatable) |
| --output-dir | Same dir as result.json | Output directory; created if missing |

Examples

bash
# Generate an HTML report from existing results
skill-up report result.json --format html

# Multiple formats at once
skill-up report result.json --format json --format junit --format html

# Pin the output directory
skill-up report result.json --format html --output-dir ./reports

Report formats

| Format | File | Use case |
| --- | --- | --- |
| json | report.json | Machine-readable structured data; consumable by downstream tools |
| junit | report.xml | JUnit XML; parseable by CI systems (Jenkins, GitHub Actions, …) |
| html | report.html | Human-readable visualization; open in a browser |

skill-up import

One-shot conversion of an Anthropic evals.json into skill-up's native YAML format.

bash
skill-up import <path to evals.json> [flags]

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --output | Same dir as evals.json | Output directory |

Examples

bash
# Convert in place
skill-up import ./evals/evals.json

# Custom output directory
skill-up import ./evals/evals.json --output ./new-evals

The import produces:

  • eval.yaml — entrypoint config (with sensible defaults; review before running)
  • cases/*.yaml — one case file per evals.json entry

import vs --auto: import is a one-time format conversion — afterwards you maintain YAML files. run --auto consumes evals.json at runtime without producing intermediate files. See Migrating from Anthropic.


Output layout

After a run, the output directory looks like:

text
<skill-name>-workspace/
  iteration-1/                    # First iteration
    benchmark.json                # Aggregated stats
    <case-id>/
      with_skill/                 # Run with the Skill installed
        outputs/                  # Files generated by the Agent
        grading.json              # Grading result
      without_skill/              # Baseline (only when benchmark.enabled=true)
        outputs/
        grading.json
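Scripting against this layout is mostly a matter of globbing for grading.json files. A minimal sketch (paths follow the tree above; for a self-contained demo it builds a miniature copy of the layout in a temp directory — against a real workspace you would point `root` at `<skill-name>-workspace/`):

```python
import json
import pathlib
import tempfile

# Build a miniature copy of the documented layout.
root = pathlib.Path(tempfile.mkdtemp())
case_dir = root / "iteration-1" / "basic-success" / "with_skill"
case_dir.mkdir(parents=True)
(case_dir / "grading.json").write_text(json.dumps({"summary": {"pass_rate": 1.0}}))

# Collect every grading.json: iteration-N/<case-id>/<with|without_skill>/grading.json
gradings = sorted(root.glob("iteration-*/*/*/grading.json"))
for path in gradings:
    summary = json.loads(path.read_text())["summary"]
    print(path.relative_to(root), summary["pass_rate"])
```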

grading.json

Per-case grading result (Anthropic-compatible):

json
{
  "expectations": [
    {
      "text": "Output contains the keyword `null`",
      "passed": true,
      "evidence": "final_message contains 'null pointer at line 42'"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 0,
    "total": 1,
    "pass_rate": 1.0
  }
}

Note: the grading.json under the workspace uses the Anthropic-compatible shape — only expectations and summary at the top level. The full evaluation status (status, turns_executed, turns_total, assertion_results) lives under case_results[].grading inside result.json.

The grading object inside result.json carries the full status:

  • PASS — all assertions passed
  • FAIL — at least one assertion failed
  • ERROR — execution exception (timeout, engine crash, etc.)
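A CI helper can tally these statuses from result.json. A sketch shaped after the fields the note above names (case_results[].grading.status, turns_executed, turns_total); any other field names here, such as case_id, are assumptions for illustration:

```python
import collections
import json

# Hypothetical result.json fragment; only the grading fields named
# in the docs are relied on, the rest is illustrative.
result = json.loads("""
{
  "case_results": [
    {"case_id": "basic-success",
     "grading": {"status": "PASS", "turns_executed": 3, "turns_total": 5}},
    {"case_id": "edge-case-empty",
     "grading": {"status": "FAIL", "turns_executed": 5, "turns_total": 5}}
  ]
}
""")

counts = collections.Counter(c["grading"]["status"] for c in result["case_results"])
print(dict(counts))  # {'PASS': 1, 'FAIL': 1}
```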

benchmark.json

Aggregated statistics across all cases:

json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.83 },
      "time_seconds": { "mean": 45.0 },
      "tokens": { "mean": 3800 }
    },
    "without_skill": null,
    "delta": null
  }
}

Released under the Apache 2.0 License.