Files
sar/.agents/skills/bmad-eval-runner/SKILL.md
julian 17c08e6392 chore: initial monorepo scaffold + WDS Phase 1+2 artifacts
- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md como entrada para devs/agentes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:34:20 +00:00

7.4 KiB

name, description
name description
bmad-eval-runner Run a skill's evals in a clean, isolated environment and report results. Use when the user wants to evaluate a skill, run evals, benchmark a skill, validate triggers, or grade skill outputs.

Skill Eval Runner

Overview

Run a skill's evals in an environment that does not bleed in the user's global config, auto-memory, or ancestor CLAUDE.md files — so the result reflects the skill itself, not the bench it was tested on. Preserve every run's artifacts so the user can inspect what happened, not just whether it passed.

Two eval shapes are supported and run independently:

  • Artifact evals (evals.json) — execute the skill against a prompt, capture the run's outputs, and grade each output against the eval's expectations.
  • Trigger evals (triggers.json) — measure whether the skill's description actually causes Claude to invoke the skill on a given query versus stay clear when it shouldn't.

You are an experienced eval engineer. The user wants signal, not theatre. Cite specific findings, surface evals that pass for trivial reasons, and never silently widen tolerances to make a run "succeed."

Args

  • Positional: a path to the skill being evaluated (directory containing SKILL.md).
  • --evals <path> — explicit path to evals folder or a specific evals.json / triggers.json file. If omitted, discover.
  • --mode artifact|trigger|both — which eval kind to run. Default: both if both files are found, else whichever exists.
  • --isolation docker|local|auto — sandbox strategy. Default: auto (Docker when available, otherwise local).
  • --project-root <path> — root of the project the skill belongs to. Default: walk up from skill path looking for _bmad/ or .git/.
  • --output-dir <path> — where run folders are written. Default: {bmad_builder_reports}/eval-runs/ if configured, else ~/bmad-evals/.
  • --workers <n> — parallel evals. Default: 4.
  • --headless / -H — non-interactive; emit final JSON only.

On Activation

  1. Resolve config the same way bmad-workflow-builder does ({project-root}/_bmad/config.yaml then config.user.yaml, falling back to bmb/config.yaml). Resolve {user_name}, {communication_language}, {bmad_builder_reports}. Apply throughout the session.

  2. If --headless was passed, set {headless_mode}=true and skip every confirmation below; pick the safest defaults and proceed.

  3. Locate the skill. Verify <skill-path>/SKILL.md exists; halt with a clear error if it doesn't.

  4. Discover evals — see ## Eval Discovery below.

  5. Choose isolation — see ## Isolation below. On the first Docker run on this machine, the image will need to be built; surface that, ask once unless headless, then cache.

  6. Confirm the run summary with the user (skill, evals found, mode, isolation, output dir) unless headless. Then execute.

Eval Discovery

Look in this order, taking the first match:

  1. --evals argument if provided. May point to a folder (containing evals.json and/or triggers.json) or a specific JSON file.
  2. <skill-path>/evals/ — colocated with the skill.
  3. <skill-path>/../../evals/<skill-name>/ — sibling-of-parent layout (common in BMad modules where evals/ is excluded from distribution but lives next to src/).
  4. <project-root>/evals/<skill-name>/ — top-level evals tree.
  5. <project-root>/evals/**/<skill-name>/ — anywhere under project evals.

Surface what you found and where. If no evals are discovered, halt with a clear message — do not attempt to fabricate evals.

Isolation

Run each eval in a fresh workspace so memory, project CLAUDE.md, prior runs, and host shell config cannot bias the result. Two strategies, picked automatically by default:

  • Docker (preferred when available): each eval runs in a fresh container off bmad-eval-runner:latest. The host's ANTHROPIC_API_KEY is the only env passed in. The skill's project is bind-mounted read-only and copied into a writable scratch dir inside the container; HOME is a fresh in-container directory; there is no auto-memory and no host CLAUDE.md.

  • Local fallback (when Docker is unavailable or the user opts out): each eval runs in a fresh ~/bmad-evals/<run-id>/<eval-id>/workspace/ directory with HOME=<workspace>/.home overridden so global memory and global CLAUDE.md do not leak. The project is copied (or hardlinked where supported) into the workspace. Tell the user this is the active mode and acknowledge that local isolation is best-effort, not hermetic.

The first time Docker is selected on this machine, build the image — python3 {skill-root}/scripts/docker_setup.py --build — and tell the user this is happening once.

Details and the exact mount layout live in references/isolation.md. Read that file when you need to debug an isolation issue or explain to the user what is being isolated.

Run Execution

For artifact evals, invoke python3 {skill-root}/scripts/run_evals.py with the resolved arguments. The script handles isolation per eval, runs claude -p in the sandbox with the eval's prompt and any staged fixture files, and writes a per-eval folder with prompt.txt, transcript.jsonl, artifacts/, and metrics.json.

For trigger evals, invoke python3 {skill-root}/scripts/run_triggers.py. The script measures whether the skill's description causes the skill to fire for each query, with runs-per-query repeats for stability, and writes triggers-result.json. Trigger evals should run under Docker isolation when available — local mode can have the host's installed skills bleed in via cwd-based skill discovery, biasing the trigger signal. If Docker is unavailable, run trigger evals locally but say so explicitly.

After artifact runs complete, grade each eval. Spawn a grader subagent per eval in parallel (Agent tool, prompt loaded from {skill-root}/agents/grader.md plus the eval's expectations and the path to its outputs). Each grader writes grading.json next to the artifacts. The grader has license to flag weak assertions — relay that feedback to the user.

After all grading is done, generate the aggregate report — python3 {skill-root}/scripts/generate_report.py --run-dir <run-id> — which produces report.html. Tell the user where the run folder is and where the HTML report is.

Outcomes

  • Every eval's prompt, transcript, artifacts, and grading land on disk and stay there. Nothing is silently cleaned up.
  • The run honestly reflects the skill's behavior in a clean room — not the behavior of the host shell with its memories and configs.
  • The user knows whether Docker or local was used and why.
  • Failures cite specific expectations and evidence; passes that look superficial are flagged, not papered over.

Constraints

  • Artifacts are forever. Never delete, overwrite, or rotate run folders. Disk usage is the user's call.
  • Auth boundary is narrow. On macOS, the host's Claude Code OAuth credential is staged into each isolated .claude/.credentials.json so the subprocess can authenticate without inheriting host config. ANTHROPIC_API_KEY, if set, is also forwarded. Nothing else crosses.
  • Trigger evals do not need real artifacts. They use a stub command file and only measure description firing — keep them cheap and parallel.
  • No silent fallbacks on grading. If a grader subagent errors, mark that eval grading_error rather than substituting a default verdict.
  • Stop when evals are missing. If discovery returns nothing, halt with diagnostics — the runner does not invent test cases.