chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md como entrada para devs/agentes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-27 14:34:20 +00:00
commit 17c08e6392
3631 changed files with 855518 additions and 0 deletions

View File

@@ -0,0 +1,91 @@
---
name: bmad-eval-runner
description: Run a skill's evals in a clean, isolated environment and report results. Use when the user wants to evaluate a skill, run evals, benchmark a skill, validate triggers, or grade skill outputs.
---
# Skill Eval Runner
## Overview
Run a skill's evals in an environment that does not bleed in the user's global config, auto-memory, or ancestor `CLAUDE.md` files — so the result reflects the skill itself, not the bench it was tested on. Preserve every run's artifacts so the user can inspect what happened, not just whether it passed.
Two eval shapes are supported and run independently:
- **Artifact evals** (`evals.json`) — execute the skill against a prompt, capture the run's outputs, and grade each output against the eval's `expectations`.
- **Trigger evals** (`triggers.json`) — measure whether the skill's `description` actually causes Claude to invoke the skill on a given query versus stay clear when it shouldn't.
You are an experienced eval engineer. The user wants signal, not theatre. Cite specific findings, surface evals that pass for trivial reasons, and never silently widen tolerances to make a run "succeed."
## Args
- Positional: a path to the skill being evaluated (directory containing `SKILL.md`).
- `--evals <path>` — explicit path to evals folder or a specific `evals.json` / `triggers.json` file. If omitted, discover.
- `--mode artifact|trigger|both` — which eval kind to run. Default: `both` if both files are found, else whichever exists.
- `--isolation docker|local|auto` — sandbox strategy. Default: `auto` (Docker when available, otherwise local).
- `--project-root <path>` — root of the project the skill belongs to. Default: walk up from skill path looking for `_bmad/` or `.git/`.
- `--output-dir <path>` — where run folders are written. Default: `{bmad_builder_reports}/eval-runs/` if configured, else `~/bmad-evals/`.
- `--workers <n>` — parallel evals. Default: 4.
- `--headless` / `-H` — non-interactive; emit final JSON only.
## On Activation
1. Resolve config the same way `bmad-workflow-builder` does (`{project-root}/_bmad/config.yaml` then `config.user.yaml`, falling back to `bmb/config.yaml`). Resolve `{user_name}`, `{communication_language}`, `{bmad_builder_reports}`. Apply throughout the session.
2. If `--headless` was passed, set `{headless_mode}=true` and skip every confirmation below; pick the safest defaults and proceed.
3. Locate the skill. Verify `<skill-path>/SKILL.md` exists; halt with a clear error if it doesn't.
4. Discover evals — see `## Eval Discovery` below.
5. Choose isolation — see `## Isolation` below. On the first Docker run on this machine, the image will need to be built; surface that, ask once unless headless, then cache.
6. Confirm the run summary with the user (skill, evals found, mode, isolation, output dir) unless headless. Then execute.
## Eval Discovery
Look in this order, taking the first match:
1. `--evals` argument if provided. May point to a folder (containing `evals.json` and/or `triggers.json`) or a specific JSON file.
2. `<skill-path>/evals/` — colocated with the skill.
3. `<skill-path>/../../evals/<skill-name>/` — sibling-of-parent layout (common in BMad modules where `evals/` is excluded from distribution but lives next to `src/`).
4. `<project-root>/evals/<skill-name>/` — top-level evals tree.
5. `<project-root>/evals/**/<skill-name>/` — anywhere under project evals.
Surface what you found and where. If no evals are discovered, halt with a clear message — do not attempt to fabricate evals.
## Isolation
Run each eval in a fresh workspace so memory, project CLAUDE.md, prior runs, and host shell config cannot bias the result. Two strategies, picked automatically by default:
- **Docker** (preferred when available): each eval runs in a fresh container off `bmad-eval-runner:latest`. The host's `ANTHROPIC_API_KEY` is the only env passed in. The skill's project is bind-mounted read-only and copied into a writable scratch dir inside the container; `HOME` is a fresh in-container directory; there is no auto-memory and no host CLAUDE.md.
- **Local fallback** (when Docker is unavailable or the user opts out): each eval runs in a fresh `~/bmad-evals/<run-id>/<eval-id>/workspace/` directory with `HOME=<workspace>/.home` overridden so global memory and global CLAUDE.md do not leak. The project is copied (or hardlinked where supported) into the workspace. Tell the user this is the active mode and acknowledge that local isolation is best-effort, not hermetic.
The first time Docker is selected on this machine, build the image — `python3 {skill-root}/scripts/docker_setup.py --build` — and tell the user this is happening once.
Details and the exact mount layout live in `references/isolation.md`. Read that file when you need to debug an isolation issue or explain to the user what is being isolated.
## Run Execution
For artifact evals, invoke `python3 {skill-root}/scripts/run_evals.py` with the resolved arguments. The script handles isolation per eval, runs `claude -p` in the sandbox with the eval's prompt and any staged fixture files, and writes a per-eval folder with `prompt.txt`, `transcript.jsonl`, `artifacts/`, and `metrics.json`.
For trigger evals, invoke `python3 {skill-root}/scripts/run_triggers.py`. The script measures whether the skill's description causes the skill to fire for each query, with `runs-per-query` repeats for stability, and writes `triggers-result.json`. Trigger evals should run under Docker isolation when available — local mode can have the host's installed skills bleed in via cwd-based skill discovery, biasing the trigger signal. If Docker is unavailable, run trigger evals locally but say so explicitly.
After artifact runs complete, grade each eval. Spawn a grader subagent per eval in parallel (Agent tool, prompt loaded from `{skill-root}/agents/grader.md` plus the eval's `expectations` and the path to its outputs). Each grader writes `grading.json` next to the artifacts. The grader has license to flag weak assertions — relay that feedback to the user.
After all grading is done, generate the aggregate report — `python3 {skill-root}/scripts/generate_report.py --run-dir <run-id>` — which produces `report.html`. Tell the user where the run folder is and where the HTML report is.
## Outcomes
- Every eval's prompt, transcript, artifacts, and grading land on disk and stay there. Nothing is silently cleaned up.
- The run honestly reflects the skill's behavior in a clean room — not the behavior of the host shell with its memories and configs.
- The user knows whether Docker or local was used and why.
- Failures cite specific expectations and evidence; passes that look superficial are flagged, not papered over.
## Constraints
- **Artifacts are forever.** Never delete, overwrite, or rotate run folders. Disk usage is the user's call.
- **Auth boundary is narrow.** On macOS, the host's Claude Code OAuth credential is staged into each isolated `.claude/.credentials.json` so the subprocess can authenticate without inheriting host config. `ANTHROPIC_API_KEY`, if set, is also forwarded. Nothing else crosses.
- **Trigger evals do not need real artifacts.** They use a stub command file and only measure description firing — keep them cheap and parallel.
- **No silent fallbacks on grading.** If a grader subagent errors, mark that eval `grading_error` rather than substituting a default verdict.
- **Stop when evals are missing.** If discovery returns nothing, halt with diagnostics — the runner does not invent test cases.

View File

@@ -0,0 +1,93 @@
# Grader Agent
Evaluate a single eval's expectations against its captured transcript and artifacts. Return pass/fail per expectation with evidence — and flag weak assertions when you see them.
You are not the executor. You are not allowed to "fix" the artifacts. Your only job is to inspect what was produced and answer: did each expectation hold?
## Inputs
You receive in your prompt:
- **eval_id**: identifier for this eval
- **prompt**: the original user message that was sent to the skill
- **expected_output**: human-readable description of what success looks like (context only, not scored against)
- **expectations**: list of strings — the assertions you grade
- **transcript_path**: absolute path to a stream-JSON transcript (`.jsonl`)
- **artifacts_dir**: absolute path to the directory containing files the skill wrote
- **grading_path**: absolute path where you write `grading.json`
## Process
1. **Read the transcript.** Open `transcript_path`. The transcript is stream-JSON: each line is a JSON event. Note:
- The user prompt that was sent
- Every tool call Claude made — `Write`, `Edit`, `Read`, `Skill`, `Bash`, etc. (the event has `type: "assistant"` and `content[].type: "tool_use"` with `name` and `input`)
- The order tool calls happened in (events are line-ordered)
- The final assistant message — often contains a JSON status block for headless runs
- Any errors or warnings logged
2. **List and inspect artifacts.** Walk `artifacts_dir`. For each expectation, open the files it implicates and read their contents — do not rely on filenames alone. Note file modification times when ordering or read-only behavior matters.
3. **Grade each expectation independently.** For each entry in `expectations`, identify what kind of check it is and gather the right evidence:
- **Side-artifact existence + content** ("decision-log.md exists AND captures decision X") → open the file, read it, check the content matches.
- **Transcript tool-call patterns** ("transcript contains a Skill call to bmad-editorial-review-prose") → scan the transcript for `tool_use` events with the matching `name` and `input`. Quote the matching event.
- **Phase ordering** ("polish call occurs after the Write to brief.md and before the final JSON block") → find the line numbers / event indices of each landmark and verify the order.
- **Read-only enforcement** ("input brief.md is byte-identical to the fixture; no Write/Edit calls targeted it") → compare file content if the original is available; AND scan the transcript for any Write/Edit `tool_use` whose `input.file_path` falls in the protected directory.
- **YAML frontmatter** ("frontmatter contains title, status, created (ISO 8601), updated") → parse the frontmatter, check fields and their formats.
- **JSON output blocks** ("final assistant message contains a JSON object with intent='create'") → look at the final `text` content of the last assistant message; extract the JSON object; check the field.
- **Bidirectional fidelity** ("every decision in decision-log.md is reflected in brief.md AND no claim in brief.md is absent from the input prompt or log") → list decisions in the log, verify each appears in the brief; list substantive claims in the brief, verify each traces to either the prompt or the log.
4. **Decide PASS or FAIL with specific evidence.**
- PASS only if there is clear, specific evidence the expectation holds AND the evidence reflects substance, not surface compliance (file exists AND contains correct content, not just the right filename).
- FAIL when no evidence is found, evidence contradicts, or the assertion is technically satisfied but the underlying outcome is wrong.
- Cite the evidence — quote a specific line, name a specific file with a path, point to a specific tool call with its index or input.
5. **Critique the evals.** After grading, surface assertions that look weak: ones that passed but would also pass for a clearly wrong output, or important outcomes you observed (good or bad) that no assertion checks. Keep the bar high — flag what an eval author would say "good catch" about, not nits.
6. **Write `grading.json`.** Save to `grading_path`.
## Output Format
```json
{
"eval_id": "<eval_id>",
"expectations": [
{
"text": "brief.md exists in the run folder",
"passed": true,
"evidence": "Found at artifacts/2026-05-09-insulens/brief.md, 487 words"
},
{
"text": "decision-log.md references having ingested the memo as source material",
"passed": false,
"evidence": "decision-log.md exists but contains only template placeholders; no mention of the memo"
}
],
"summary": {
"passed": 1,
"failed": 1,
"total": 2,
"pass_rate": 0.5
},
"eval_feedback": {
"suggestions": [
{
"assertion": "brief.md exists in the run folder",
"reason": "Existence is a weak check — an empty brief.md would also pass. Consider pairing with a content assertion (e.g., word count > 200, contains the project name)."
}
],
"overall": "Assertions check structure but not content correctness in two places."
}
}
```
If `eval_feedback.suggestions` would be empty, set it to `[]` and `overall` to `"No suggestions; assertions look solid."`
## Guidelines
- **Be objective.** Verdicts come from evidence, not vibes.
- **Be specific.** Quote, name files, point to line numbers.
- **No partial credit.** Each expectation is pass or fail.
- **Burden of proof is on the expectation.** When uncertain, fail.
- **Do not edit artifacts.** You are read-only against the run folder.
- **Do not silently substitute defaults.** If you genuinely cannot read a file or the transcript is missing, mark the affected expectations failed with that as the evidence.

View File

@@ -0,0 +1,29 @@
FROM node:20-bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
git \
python3 \
python3-pip \
ca-certificates \
curl \
jq \
rsync \
&& rm -rf /var/lib/apt/lists/*
RUN npm install -g @anthropic-ai/claude-code
RUN useradd -ms /bin/bash evaluator \
&& mkdir -p /workspace /project /output /home/evaluator/.claude \
&& chown -R evaluator:evaluator /workspace /output /home/evaluator
USER evaluator
WORKDIR /workspace
ENV HOME=/home/evaluator
ENV CLAUDE_CONFIG_DIR=/home/evaluator/.claude
ENV PATH=/home/evaluator/.local/bin:$PATH
CMD ["bash"]

View File

@@ -0,0 +1,147 @@
# Eval Formats
The runner accepts two file shapes, both compatible with Anthropic's skill-creator conventions.
## Artifact evals — `evals.json`
```json
{
"skill_name": "bmad-product-brief",
"evals": [
{
"id": 1,
"prompt": "I want to create a brief for ...",
"expected_output": "A run folder with brief.md and decision-log.md ...",
"files": [
"evals/.../files/some-fixture.md"
],
"expectations": [
"brief.md exists in the run folder",
"decision-log.md exists",
"brief.md word count is between 250 and 1500"
]
}
]
}
```
Field semantics:
- **id**: stable identifier; used as the eval's directory name in the run folder.
- **prompt**: the literal user message Claude will receive. Sent verbatim to `claude -p`.
- **expected_output**: human-readable description, used for context only — the grader reads it but does not score against it directly.
- **files**: optional fixture paths. Resolved relative to the project root (or the evals folder). Each file is staged into the eval's workspace before execution. Path semantics:
- A bare filename is staged at the workspace root.
- A nested path (`some-brief/brief.md`) preserves the directory structure inside the workspace.
- **expectations**: list of pass/fail assertions evaluated by the grader subagent. Each is graded independently. The grader is instructed to flag weak assertions — assertions a wrong output would also trivially pass.
The grader writes `grading.json` next to each eval's artifacts; the runner aggregates.
## Trigger evals — `triggers.json`
```json
[
{ "query": "Help me write a product brief for ...", "should_trigger": true },
{ "query": "Help me brainstorm ideas for ...", "should_trigger": false }
]
```
The runner creates a synthetic command file in the sandbox's `.claude/commands/<skill-name>.md` containing the skill's description, then runs each query against `claude -p` with stream-JSON output and detects whether the skill (or a Read of its SKILL.md) appears as a tool call. Each query is run `--runs-per-query` times (default 3); `trigger_rate` is the fraction of runs that fired.
A query passes when:
- `should_trigger=true` and `trigger_rate >= --trigger-threshold` (default 0.5)
- `should_trigger=false` and `trigger_rate < --trigger-threshold`
Trigger evals do not produce artifacts beyond the result JSON. They are cheap and parallelize aggressively.
## Where evals can live
The runner discovers evals in this order:
1. `--evals <path>` — explicit. May point to a folder or a specific `*.json`.
2. `<skill-path>/evals/` — colocated with the skill.
3. `<skill-path>/../../evals/<skill-name>/` — sibling-of-parent. Common pattern when evals are intentionally excluded from skill distribution.
4. `<project-root>/evals/<skill-name>/`.
5. `<project-root>/evals/**/<skill-name>/` — fuzzy search under the project's evals tree.
If both `evals.json` and `triggers.json` are found, both run unless `--mode` narrows it.
## Two patterns for single-shot evals
Most multi-turn workflow skills can be evaluated single-shot if you design the eval right. Two patterns cover the bulk of what you'd otherwise need a multi-turn simulator for:
### Pattern A — artifact correctness (headless + rich prompt)
Force the skill into headless mode and pack the prompt with everything Discovery would have surfaced. Grade what comes out: the artifact, its structure, whether it reflects the inputs without inventing.
Use when:
- The deliverable is the artifact (brief, PRD, doc, plan)
- You can write a complete pre-Discovery prompt
- You want regression coverage on drafting/format/extraction
### Pattern B — process discipline (headless + transcript and side-artifact inspection)
Same single-shot mechanics, but the expectations look at *what the skill did internally* — not just the final output. The grader reads the stream-JSON transcript for tool calls, walks side-artifacts (decision logs, addenda, distillates), checks file mtimes, and verifies phase ordering.
Use when:
- The skill enforces a protocol (decision log, polish phase, finalize sequence)
- The skill has read-only intents (Validate must not write)
- You need to catch "drafting works but the discipline went soft" regressions
These are deterministic checks against the transcript and filesystem — no LLM judgment needed for most of them.
### What single-shot can NOT cover
Facilitation arc: vague-input → sharper pushback → user clarifies → better artifact. That requires a multi-turn user simulator. Defer it to a separate eval mode for skills where conversation is the value (coaching, brainstorming, design thinking).
## Writing good expectations
The grader's job is easier when expectations are *discriminating* — hard to pass without actually doing the work.
**Weak patterns to avoid:**
- **Filename-only checks** — "brief.md exists" passes for an empty file. Pair with a content check.
- **Wholly subjective phrasing** — "the brief is high quality" cannot be evaluated. State the property concretely.
- **Tautologies** — anything that follows from the prompt being understood is not a useful expectation.
**Strong patterns for artifact correctness (Pattern A):**
- Specific facts that should appear ("incorporates at least 2 specific findings from section X")
- Structural claims a wrong output would fail ("word count between 250 and 1500")
- Negative assertions ("does not introduce content from unrelated sections")
- YAML frontmatter checks ("frontmatter contains title, status, created, updated as ISO 8601")
- Bounded JSON output ("final assistant message contains a JSON object with intent='create'")
**Strong patterns for process discipline (Pattern B):**
- **Side-artifact existence + content** ("decision-log.md exists AND captures the pricing decision with rejected alternative and rationale")
- **Transcript tool-call patterns** ("the transcript contains a Skill tool call invoking bmad-editorial-review-prose")
- **Phase ordering** ("the polish-phase Skill calls occur after the brief body Write and before the final JSON status block")
- **Read-only enforcement** ("the input brief.md is byte-identical to the staged fixture; no Write or Edit tool calls targeted the run folder")
- **Bidirectional fidelity** ("every substantive entry in decision-log.md has a corresponding reflection in brief.md, AND no claim in brief.md is absent from the input prompt or decision-log.md")
- **Timestamp checks** ("YAML frontmatter 'updated' field is later than 'created'; 'created' is unchanged from the input fixture")
## Headless mode — getting the skill to behave non-interactively
Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and produces a structured JSON status block at the end. To use Pattern A or B, the eval prompt needs to trigger this. Common signals:
- The literal phrase `Run headless.` at the start of the prompt
- Skill-specific flags or keywords as documented in the skill's `## Headless Mode` section
- Sufficient context such that no clarification is genuinely needed
If the skill has no headless mode, single-shot evals will halt at the first clarifying question and you have two options: (1) add a headless mode to the skill, (2) defer that skill's evals to the multi-turn simulator.
## Pre-staging files (Update / Validate intents)
For Update and Validate evals, the workspace needs to contain an existing brief, decision log, addendum, etc. Use the `files` field — each path is staged into the workspace at the same relative location. The eval prompt then references the staged path explicitly:
```json
{
"id": "B5",
"prompt": "Run headless. Update the brief at evals/skill-x/files/some-brief/brief.md — ...",
"files": [
"evals/skill-x/files/some-brief/brief.md",
"evals/skill-x/files/some-brief/decision-log.md",
"evals/skill-x/files/some-brief/addendum.md"
]
}
```
For Validate (read-only) expectations, pair the staged files with byte-identical assertions and a no-Write/no-Edit transcript check.

View File

@@ -0,0 +1,110 @@
# Isolation Strategies
The eval runner offers two strategies. The intent is identical in both: every eval starts from a clean slate so the result reflects the skill itself, not the host's accumulated state.
## What we are isolating from
- The user's global `~/.claude/CLAUDE.md` (private global instructions)
- Any ancestor `CLAUDE.md` in the project tree above the skill
- Auto-memory at `~/.claude/projects/.../memory/MEMORY.md`
- Cached settings, MCP configurations, IDE integrations
- Prior conversation context bleeding via the shell
## Authentication
The isolated `claude -p` subprocess needs to authenticate, but cannot read the host's `~/.claude/` (HOME is overridden) or the macOS Keychain (Keychain ACLs are scoped to the process that wrote the entry). The runner solves this in the parent process:
1. On macOS, read the OAuth credential JSON from the Keychain entry `Claude Code-credentials` via `security find-generic-password -s "Claude Code-credentials" -w`. This succeeds because the parent runs as the same user that wrote the entry.
2. Stage that JSON as `<workspace>/.home/.claude/.credentials.json` (local mode) or copy it into `/home/evaluator/.claude/.credentials.json` inside the container (Docker mode).
3. The subprocess reads `.credentials.json` exactly the way Claude Code normally does, with no other host config bleed.
If the parent has `ANTHROPIC_API_KEY` set, that env var is also forwarded — and it takes precedence over the Keychain credential. On non-macOS hosts, the Keychain step is skipped and `ANTHROPIC_API_KEY` is the only auth path.
## Docker (preferred)
A single image, `bmad-eval-runner:latest`, is built once per machine. It contains Node 20, Claude Code (via `npm install -g @anthropic-ai/claude-code`), Python 3, and standard tools. The image is intentionally minimal — every eval starts from this baseline.
### Image build
`scripts/docker_setup.py --build` builds the image from `assets/Dockerfile`. This runs once. Re-runs are a no-op unless `--rebuild` is passed.
### Per-eval container
Each eval gets a fresh container:
```
docker run --rm \
-v "<project-root>:/project:ro" \
-v "<output-dir>/<eval-id>:/output" \
-v "<fixtures-dir>:/fixtures:ro" \
-e ANTHROPIC_API_KEY \
-e EVAL_PROMPT \
-e EVAL_ID \
-e SKILL_PATH \
bmad-eval-runner:latest \
/bin/bash -c "/scripts/run_one_eval.sh"
```
Inside the container:
1. The project is copied from `/project` (read-only) to `/workspace` (writable, container-local). Copy is fast because the underlying layer is shared.
2. Fixtures are copied into `/workspace/fixtures/`.
3. `HOME` is `/home/evaluator`, an empty directory created by the image — no global `CLAUDE.md`, no memory.
4. `claude -p "$EVAL_PROMPT" --output-format stream-json --verbose` runs at `/workspace`.
5. The stream-json transcript is captured to `/output/transcript.jsonl`. Any files the skill writes under `/workspace` are rsynced to `/output/artifacts/` after the run completes.
6. The container exits and is removed automatically.
The host then has `<output-dir>/<eval-id>/transcript.jsonl`, `<output-dir>/<eval-id>/artifacts/`, and timing data. Nothing on the host is touched.
### Why Docker is preferred
- The image is reproducible — every run starts from byte-identical state.
- `HOME` is genuinely empty, not just overridden.
- Filesystem isolation is real, not just convention.
- Network can be locked down (`--network=none` for trigger evals; full network for artifact evals that may need it).
## Local fallback
When Docker is unavailable, the runner falls back to per-eval temp directories under `~/bmad-evals/<run-id>/<eval-id>/`. Layout:
```
~/bmad-evals/<run-id>/<eval-id>/
workspace/ # the eval's working directory
.home/ # HOME override — empty .claude/ inside
project/ # rsync'd copy of <project-root>
fixtures/ # staged fixture files
transcript.jsonl # claude -p stream output
artifacts/ # files Claude wrote under workspace/
metrics.json
```
Per-eval invocation roughly:
```
HOME="$WORKSPACE/.home" \
CLAUDE_CONFIG_DIR="$WORKSPACE/.home/.claude" \
ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
claude -p "$EVAL_PROMPT" \
--output-format stream-json --verbose \
> transcript.jsonl
```
### Limitations of local mode
- `HOME` override prevents global `CLAUDE.md` and memory loading, but ancestor discovery still happens from the workspace's cwd. If the workspace is created inside a directory tree that contains a `.claude/skills/` further up, the subprocess may discover those skills regardless of `HOME`. This matters most for trigger evals, where stray host skills can fire instead of the synthetic skill we're testing — **prefer Docker for trigger evals**, where filesystem isolation is real.
- Filesystem isolation is by convention only — the skill could write outside its workspace if it tries. We don't sandbox syscalls.
- Network is unrestricted.
Tell the user clearly when local mode is in use and that it is best-effort.
## Why a real skill, not a slash command, for trigger evals
The trigger runner stages a synthetic skill at `<workspace>/.claude/skills/<unique-name>/SKILL.md` — not at `.claude/commands/<name>.md`. Slash commands are user-invoked (`/<name>`); they do not surface as `Skill` tool calls and so a description placed there can never be observed firing the way a real skill would. Anthropic's reference `run_eval.py` uses the commands path and is known to report 0% trigger rates as a result. Placing the synthetic at `.claude/skills/` matches how real skills load and lets the detector observe genuine `Skill` (or `Read` of the synthetic SKILL.md) tool calls.
## Why not `--add-dir` only?
`claude -p --add-dir <skill>` would let Claude see the skill but would still inherit the user's `CLAUDE.md` and memory from the cwd's ancestors. The whole point of this runner is to test the skill, not the host's accumulated state. So we always either Docker-isolate or temp-dir-isolate.
## Artifact retention
Run folders are never deleted by this skill. Disk management is the user's responsibility. The runner emits the run folder path on completion; users who want to clean up old runs can delete `~/bmad-evals/<run-id>/` directly.

View File

@@ -0,0 +1,115 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Detect Docker and build the bmad-eval-runner image when needed.
Usage:
python3 docker_setup.py --check # exit 0 if image is ready, 1 otherwise
python3 docker_setup.py --build # build the image (no-op if present)
python3 docker_setup.py --rebuild # force rebuild
"""
from __future__ import annotations
import argparse
import json
import shutil
import subprocess
import sys
from pathlib import Path
IMAGE_TAG = "bmad-eval-runner:latest"
SCRIPT_DIR = Path(__file__).resolve().parent
DOCKERFILE = SCRIPT_DIR.parent / "assets" / "Dockerfile"
def docker_available() -> tuple[bool, str]:
if shutil.which("docker") is None:
return False, "docker CLI not found on PATH"
try:
result = subprocess.run(
["docker", "info"],
capture_output=True,
text=True,
timeout=5,
)
if result.returncode != 0:
return False, f"`docker info` failed: {result.stderr.strip().splitlines()[-1] if result.stderr.strip() else 'unknown'}"
return True, "ok"
except subprocess.TimeoutExpired:
return False, "`docker info` timed out"
except Exception as e:
return False, f"docker check error: {e}"
def image_present(tag: str = IMAGE_TAG) -> bool:
try:
result = subprocess.run(
["docker", "image", "inspect", tag],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
timeout=10,
)
return result.returncode == 0
except Exception:
return False
def build_image(tag: str = IMAGE_TAG, force: bool = False, verbose: bool = True) -> int:
if not DOCKERFILE.is_file():
print(f"Dockerfile missing at {DOCKERFILE}", file=sys.stderr)
return 2
cmd = ["docker", "build", "-t", tag, "-f", str(DOCKERFILE), str(DOCKERFILE.parent)]
if force:
cmd.insert(2, "--no-cache")
if verbose:
print(f"Building {tag} from {DOCKERFILE} ...", file=sys.stderr)
proc = subprocess.run(cmd, stdout=sys.stderr if verbose else subprocess.DEVNULL, stderr=sys.stderr)
return proc.returncode
def main() -> int:
parser = argparse.ArgumentParser(description="Manage the bmad-eval-runner Docker image")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--check", action="store_true", help="Report status as JSON; exit 0 if image is ready")
group.add_argument("--build", action="store_true", help="Build the image (no-op if already present)")
group.add_argument("--rebuild", action="store_true", help="Force rebuild")
parser.add_argument("--quiet", action="store_true")
args = parser.parse_args()
available, reason = docker_available()
present = image_present() if available else False
if args.check:
print(json.dumps({
"docker_available": available,
"docker_reason": reason,
"image_present": present,
"image_tag": IMAGE_TAG,
}, indent=2))
return 0 if (available and present) else 1
if not available:
print(f"Docker is not available: {reason}", file=sys.stderr)
return 3
if args.rebuild:
return build_image(force=True, verbose=not args.quiet)
if args.build:
if present:
if not args.quiet:
print(f"{IMAGE_TAG} already present; skipping build (use --rebuild to force).", file=sys.stderr)
return 0
return build_image(force=False, verbose=not args.quiet)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,184 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Generate an aggregate HTML report for a run folder.
Reads run.json, execution-summary.json, each <eval-id>/grading.json (if present),
and triggers-result.json (if present), then renders a single-file HTML report.
Usage:
python3 generate_report.py --run-dir PATH [-o report.html]
"""
from __future__ import annotations
import argparse
import html as html_lib
import json
import sys
from pathlib import Path
def esc(s: object) -> str:
return html_lib.escape(str(s), quote=True)
def load(path: Path) -> dict | list | None:
if not path.is_file():
return None
try:
return json.loads(path.read_text(encoding="utf-8"))
except json.JSONDecodeError:
return None
def render(run_dir: Path) -> str:
run_meta = load(run_dir / "run.json") or {}
exec_summary = load(run_dir / "execution-summary.json") or {}
triggers = load(run_dir / "triggers-result.json")
eval_blocks: list[str] = []
grading_total = 0
grading_passed = 0
for res in exec_summary.get("results", []):
eval_id = str(res.get("eval_id", "?"))
eval_dir = run_dir / eval_id
grading = load(eval_dir / "grading.json")
metrics = res.get("metrics") or load(eval_dir / "metrics.json") or {}
rc = res.get("return_code")
rows: list[str] = []
if grading:
for exp in grading.get("expectations", []):
passed = bool(exp.get("passed"))
grading_total += 1
if passed:
grading_passed += 1
rows.append(
f'<tr class="{ "pass" if passed else "fail" }">'
f'<td>{ "" if passed else "" }</td>'
f'<td>{esc(exp.get("text", ""))}</td>'
f'<td>{esc(exp.get("evidence", ""))}</td></tr>'
)
feedback = (grading or {}).get("eval_feedback") or {}
feedback_html = ""
if feedback:
sugg = feedback.get("suggestions") or []
sugg_html = "".join(
f"<li><strong>{esc(s.get('assertion','(general)'))}</strong>: {esc(s.get('reason',''))}</li>"
for s in sugg
)
overall = esc(feedback.get("overall", ""))
feedback_html = (
f'<details class="feedback"><summary>Grader feedback on the evals</summary>'
f'<p>{overall}</p>'
f'{"<ul>" + sugg_html + "</ul>" if sugg_html else ""}'
f'</details>'
)
artifacts_listing = ""
artifacts_dir = eval_dir / "artifacts"
if artifacts_dir.is_dir():
files = sorted(p for p in artifacts_dir.rglob("*") if p.is_file())
if files:
artifacts_listing = "<ul>" + "".join(
f'<li><code>{esc(p.relative_to(eval_dir))}</code> '
f'<span class="muted">({p.stat().st_size}b)</span></li>'
for p in files
) + "</ul>"
tool_calls = metrics.get("tool_calls", {})
tool_summary = ", ".join(f"{k}={v}" for k, v in sorted(tool_calls.items())) or ""
eval_blocks.append(f"""
<section class="eval">
<h3>Eval {esc(eval_id)} <span class="muted">rc={esc(rc)} · {esc(metrics.get('elapsed_s', '?'))}s</span></h3>
<p class="muted">Tool calls: {esc(tool_summary)} · output {esc(metrics.get('output_chars', 0))}b · transcript {esc(metrics.get('transcript_chars', 0))}b</p>
{ '<table><thead><tr><th></th><th>Expectation</th><th>Evidence</th></tr></thead><tbody>' + ''.join(rows) + '</tbody></table>' if rows else '<p class="muted">No grading.json yet.</p>' }
{feedback_html}
<details><summary>Artifacts</summary>{artifacts_listing or '<p class="muted">No artifacts captured.</p>'}</details>
</section>
""")
triggers_html = ""
if triggers:
rows = []
for r in triggers.get("results", []):
rows.append(
f'<tr class="{ "pass" if r["pass"] else "fail" }">'
f'<td>{ "" if r["pass"] else "" }</td>'
f'<td>{esc(r["query"])}</td>'
f'<td>{esc(r["should_trigger"])}</td>'
f'<td>{r["triggers"]}/{r["runs"]} ({r["trigger_rate"]:.2f})</td></tr>'
)
s = triggers.get("summary", {})
triggers_html = f"""
<section class="triggers">
<h2>Trigger Evals — {s.get('passed',0)}/{s.get('total',0)} pass</h2>
<table><thead><tr><th></th><th>Query</th><th>Should fire</th><th>Rate</th></tr></thead>
<tbody>{''.join(rows)}</tbody></table>
</section>
"""
artifact_summary = ""
if exec_summary:
artifact_summary = (
f"<p>Executed {exec_summary.get('executed', 0)} / {exec_summary.get('total', 0)} "
f"evals · {exec_summary.get('exec_failures', 0)} execution failures · "
f"grader: {grading_passed}/{grading_total} expectations passed</p>"
)
return f"""<!doctype html>
<html><head><meta charset="utf-8"><title>Eval Run — {esc(run_meta.get('skill_name','?'))}</title>
<style>
body {{ font: 14px/1.5 system-ui, sans-serif; max-width: 1080px; margin: 2em auto; color: #222; padding: 0 1em; }}
h1, h2, h3 {{ font-weight: 600; }}
h1 {{ font-size: 1.6em; margin-bottom: 0.2em; }}
.meta {{ color: #666; margin-bottom: 1.5em; }}
.muted {{ color: #888; font-weight: normal; }}
section.eval {{ border: 1px solid #ddd; border-radius: 6px; padding: 1em 1.2em; margin: 1em 0; background: #fafafa; }}
table {{ width: 100%; border-collapse: collapse; margin: 0.5em 0; font-size: 13px; }}
th, td {{ text-align: left; padding: 6px 8px; border-bottom: 1px solid #eee; vertical-align: top; }}
tr.pass td:first-child {{ color: #2c8a3a; font-weight: 700; }}
tr.fail td:first-child {{ color: #b3261e; font-weight: 700; }}
tr.fail {{ background: #fdf3f2; }}
details.feedback {{ margin-top: 0.6em; padding: 0.4em 0.7em; background: #fff8e1; border-radius: 4px; }}
details summary {{ cursor: pointer; font-weight: 600; }}
code {{ background: #eee; padding: 1px 4px; border-radius: 3px; }}
</style></head>
<body>
<h1>{esc(run_meta.get('skill_name','?'))} — eval run</h1>
<div class="meta">
Run id: <code>{esc(run_meta.get('run_id','?'))}</code> ·
isolation: {esc(run_meta.get('isolation','?'))} ·
started: {esc(run_meta.get('started_at','?'))}
</div>
{artifact_summary}
{''.join(eval_blocks)}
{triggers_html}
</body></html>
"""
def main() -> int:
parser = argparse.ArgumentParser(description="Generate HTML report for an eval run folder")
parser.add_argument("--run-dir", required=True, type=Path)
parser.add_argument("-o", "--output", type=Path, default=None)
args = parser.parse_args()
run_dir = args.run_dir.resolve()
if not run_dir.is_dir():
print(f"run-dir not found: {run_dir}", file=sys.stderr)
return 2
out = args.output or (run_dir / "report.html")
out.write_text(render(run_dir), encoding="utf-8")
print(str(out))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,171 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run claude interactively via PTY so the Skill tool is available.
In `claude -p` (print mode) the Skill tool is never offered — Claude handles
everything inline. Running `claude` in interactive mode activates the Skill
tool so dependency skills installed in .claude/skills/ can be properly invoked.
The PTY tricks claude into thinking it has a terminal (interactive mode) while
we capture its stream-json output programmatically.
Usage:
python3 pty_runner.py --prompt-file /path/to/prompt.txt \\
--output /path/to/transcript.jsonl \\
[--timeout 600]
python3 pty_runner.py --prompt "Run headless. ..." --output transcript.jsonl
"""
from __future__ import annotations
import argparse
import json
import os
import pty
import re
import select
import subprocess
import sys
import time
from pathlib import Path
ANSI_RE = re.compile(r"\x1b(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])|\r")
# How long to wait for claude to initialize before sending the prompt.
# Claude loads skill registry, checks credentials, etc. on startup.
INIT_WAIT_S = 5.0
# How long to wait after the stream-json 'result' event before killing claude.
# Trailing tool-result output sometimes follows the result event.
POST_RESULT_S = 4.0
def _strip_ansi(text: str) -> str:
return ANSI_RE.sub("", text)
def run_interactive(prompt: str, output: Path, timeout: int = 600) -> None:
"""Spawn claude interactively via PTY, send one prompt, capture transcript."""
master, slave = pty.openpty()
proc = subprocess.Popen(
[
"claude",
"--output-format", "stream-json",
"--verbose",
"--dangerously-skip-permissions",
],
stdin=slave,
stdout=slave,
stderr=slave,
close_fds=True,
)
os.close(slave)
json_lines: list[str] = []
buf = b""
prompt_sent = False
done_at: float | None = None
start = time.time()
try:
while True:
elapsed = time.time() - start
if elapsed > timeout:
print(f"[pty_runner] timeout after {elapsed:.0f}s", file=sys.stderr)
break
if done_at is not None and (time.time() - done_at) > POST_RESULT_S:
break
# Short select so we stay responsive but don't spin.
r, _, _ = select.select([master], [], [], 0.3)
if r:
try:
chunk = os.read(master, 8192)
except OSError:
break # PTY closed — claude exited
buf += chunk
# Process all complete lines in buffer.
while b"\n" in buf:
raw, buf = buf.split(b"\n", 1)
line = _strip_ansi(raw.decode("utf-8", errors="replace")).strip()
if not line.startswith("{"):
continue
json_lines.append(line)
try:
obj = json.loads(line)
# 'result' marks end of a claude turn.
if obj.get("type") == "result" and done_at is None:
done_at = time.time()
print(
f"[pty_runner] result event at t={time.time()-start:.1f}s "
f"({len(json_lines)} lines so far)",
file=sys.stderr,
)
except json.JSONDecodeError:
pass
else:
# Silence window — send prompt once claude has had time to init.
if not prompt_sent and (time.time() - start) >= INIT_WAIT_S:
os.write(master, (prompt + "\n").encode())
prompt_sent = True
print(
f"[pty_runner] prompt sent at t={time.time()-start:.1f}s",
file=sys.stderr,
)
finally:
# Politely ask claude to exit, then hard-kill if needed.
try:
os.write(master, b"exit\n")
time.sleep(0.3)
except OSError:
pass
try:
proc.terminate()
proc.wait(timeout=5)
except Exception:
try:
proc.kill()
except Exception:
pass
try:
os.close(master)
except OSError:
pass
output.parent.mkdir(parents=True, exist_ok=True)
content = "\n".join(json_lines) + ("\n" if json_lines else "")
output.write_text(content, encoding="utf-8")
print(
f"[pty_runner] wrote {len(json_lines)} transcript lines → {output}",
file=sys.stderr,
)
def main() -> int:
p = argparse.ArgumentParser(
description="Run claude interactively via PTY and capture stream-json transcript"
)
grp = p.add_mutually_exclusive_group(required=True)
grp.add_argument("--prompt", help="Prompt text")
grp.add_argument("--prompt-file", type=Path, help="File containing the prompt")
p.add_argument("--output", type=Path, required=True, help="Output .jsonl transcript file")
p.add_argument("--timeout", type=int, default=600, help="Hard timeout in seconds")
args = p.parse_args()
prompt = (
args.prompt_file.read_text(encoding="utf-8").strip()
if args.prompt_file
else args.prompt
)
run_interactive(prompt, args.output, args.timeout)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,492 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run a skill's artifact evals in isolated workspaces.
For each eval, the runner:
1. Stages a fresh workspace (Docker container or local tmp dir under ~/bmad-evals).
2. Applies the setup overlay (base then per-eval) so _bmad/ config and dependency
skills land in the workspace BEFORE the skill is staged — the skill's own copy
always wins over overlay content.
3. Copies the skill into .claude/skills/ so it is discoverable by claude.
4. Stages any fixture files declared in the eval's `files` list.
5. Runs `claude -p '<prompt>' --output-format stream-json --verbose`, capturing
the transcript. The Skill tool is available in -p mode and fires for installed
skills, so dependency skills provided by the setup overlay are properly invokable.
6. Rsyncs any files claude wrote into `<run-dir>/<eval-id>/artifacts/`.
7. Writes `metrics.json` (tool-call counts, timing, output sizes).
Grading is performed separately by the parent skill's grader subagents.
Usage:
python3 run_evals.py \\
--skill-path PATH \\
--evals-file PATH/evals.json \\
--project-root PATH \\
--output-dir PATH \\
--isolation docker|local \\
[--workers N] [--timeout SECS] [--eval-ids A1,B3] [--quiet]
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(SCRIPT_DIR))
from utils import ( # noqa: E402
apply_setup_overlay,
discover_setup_dirs,
new_run_id,
parse_skill_md,
read_json,
read_macos_keychain_credentials,
stage_credentials,
utc_now_iso,
write_json,
)
DOCKER_IMAGE = "bmad-eval-runner:latest"
_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials()
RSYNC_EXCLUDES = (
".git", ".bare", "node_modules", ".venv", "__pycache__",
".pytest_cache", ".next", "dist", "build", ".cache",
".DS_Store", "*.pyc",
)
def stage_workspace_local(
workspace: Path,
project_root: Path,
skill_path: Path,
fixtures: list[tuple[Path, str]],
setup_dirs: list[Path] | None = None,
) -> Path:
"""Build a clean local workspace. Returns the project root inside workspace."""
workspace.mkdir(parents=True, exist_ok=True)
project_dest = workspace / "project"
home_dir = workspace / ".home"
(home_dir / ".claude").mkdir(parents=True, exist_ok=True)
excludes: list[str] = []
for pat in RSYNC_EXCLUDES:
excludes.extend(["--exclude", pat])
if shutil.which("rsync"):
subprocess.run(
["rsync", "-a", *excludes, f"{project_root}/", f"{project_dest}/"],
check=True,
)
else:
shutil.copytree(project_root, project_dest, dirs_exist_ok=True,
ignore=shutil.ignore_patterns(*RSYNC_EXCLUDES))
# Apply setup overlay before staging the skill — the skill's own copy wins.
if setup_dirs:
apply_setup_overlay(setup_dirs, project_dest)
skill_link_dir = project_dest / ".claude" / "skills"
skill_link_dir.mkdir(parents=True, exist_ok=True)
skill_dest = skill_link_dir / skill_path.name
if not skill_dest.exists():
try:
os.symlink(skill_path, skill_dest)
except OSError:
shutil.copytree(skill_path, skill_dest, dirs_exist_ok=True)
for src, dest_rel in fixtures:
dest = project_dest / dest_rel
dest.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(src, dest)
return project_dest
def run_eval_local(
eval_item: dict,
run_dir: Path,
skill_path: Path,
project_root: Path,
timeout: int,
setup_dirs: list[Path] | None = None,
) -> dict:
eval_id = str(eval_item.get("id", "unnamed"))
eval_dir = run_dir / eval_id
workspace_root = eval_dir / "workspace"
artifacts_dir = eval_dir / "artifacts"
transcript_path = eval_dir / "transcript.jsonl"
eval_dir.mkdir(parents=True, exist_ok=True)
artifacts_dir.mkdir(parents=True, exist_ok=True)
fixtures = resolve_fixtures(eval_item.get("files", []), project_root)
workspace_project = stage_workspace_local(
workspace_root, project_root, skill_path, fixtures, setup_dirs
)
(eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8")
workspace_snapshot_before = snapshot_files(workspace_project)
home_dir = workspace_root / ".home"
stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)
env = {
"HOME": str(home_dir),
"CLAUDE_CONFIG_DIR": str(home_dir / ".claude"),
"PATH": os.environ.get("PATH", ""),
"ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""),
}
cmd = [
"claude",
"-p", eval_item["prompt"],
"--output-format", "stream-json",
"--verbose",
"--dangerously-skip-permissions",
]
start = time.time()
try:
with transcript_path.open("wb") as out:
proc = subprocess.run(
cmd,
stdout=out,
stderr=subprocess.PIPE,
cwd=str(workspace_project),
env=env,
timeout=timeout,
)
elapsed = time.time() - start
return_code = proc.returncode
stderr_tail = (proc.stderr or b"").decode("utf-8", errors="replace")[-2000:]
except subprocess.TimeoutExpired as e:
elapsed = time.time() - start
return_code = -1
stderr_tail = f"TIMEOUT after {timeout}s"
if e.stderr:
stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]
new_files = diff_workspace(workspace_project, workspace_snapshot_before)
sync_artifacts(workspace_project, new_files, artifacts_dir)
metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail)
write_json(eval_dir / "metrics.json", metrics)
return {
"eval_id": eval_id,
"elapsed_s": elapsed,
"return_code": return_code,
"transcript": str(transcript_path.relative_to(run_dir)),
"artifacts_dir": str(artifacts_dir.relative_to(run_dir)),
"metrics": metrics,
}
def run_eval_docker(
eval_item: dict,
run_dir: Path,
skill_path: Path,
project_root: Path,
timeout: int,
setup_dirs: list[Path] | None = None,
) -> dict:
eval_id = str(eval_item.get("id", "unnamed"))
eval_dir = run_dir / eval_id
artifacts_dir = eval_dir / "artifacts"
transcript_path = eval_dir / "transcript.jsonl"
eval_dir.mkdir(parents=True, exist_ok=True)
artifacts_dir.mkdir(parents=True, exist_ok=True)
fixtures_staging = eval_dir / "fixtures_in"
fixtures_staging.mkdir(parents=True, exist_ok=True)
fixtures = resolve_fixtures(eval_item.get("files", []), project_root)
for src, dest_rel in fixtures:
dest = fixtures_staging / dest_rel
dest.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(src, dest)
(eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8")
# Pre-merge setup overlay dirs on the host; mount as /setup:ro in the container.
setup_merged: Path | None = None
if setup_dirs:
setup_merged = eval_dir / "setup_merged"
apply_setup_overlay(setup_dirs, setup_merged)
if not any(setup_merged.iterdir()):
setup_merged = None
creds_dir: Path | None = None
if _KEYCHAIN_CREDS:
creds_dir = eval_dir / "creds"
creds_dir.mkdir(parents=True, exist_ok=True)
(creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8")
container_script = r"""
set -e
mkdir -p /workspace
rsync -a \
--exclude=.git --exclude=.bare --exclude=node_modules --exclude=.venv \
--exclude=__pycache__ --exclude=.pytest_cache --exclude=.next \
--exclude=dist --exclude=build --exclude=.cache --exclude=.DS_Store \
/project/ /workspace/
if [ -d /setup ]; then
rsync -a /setup/ /workspace/
fi
mkdir -p /workspace/.claude/skills
cp -R "$SKILL_SRC" "/workspace/.claude/skills/$SKILL_NAME"
if [ -d /fixtures ]; then
cp -R /fixtures/. /workspace/
fi
if [ -f /creds/.credentials.json ]; then
mkdir -p /home/evaluator/.claude
cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json
fi
cd /workspace
claude -p "$EVAL_PROMPT" \
--output-format stream-json --verbose \
--dangerously-skip-permissions \
> /output/transcript.jsonl 2> /output/stderr.log || true
mkdir -p /output/artifacts
rsync -a --exclude=.claude --exclude=node_modules --exclude=.git \
--filter='+ */' --filter='+ *' \
/workspace/ /output/artifacts/
"""
skill_name = skill_path.name
cmd = [
"docker", "run", "--rm",
"-v", f"{project_root}:/project:ro",
"-v", f"{skill_path}:/skill_src:ro",
"-v", f"{eval_dir}:/output",
"-e", "ANTHROPIC_API_KEY",
"-e", f"EVAL_PROMPT={eval_item['prompt']}",
"-e", f"SKILL_SRC=/skill_src",
"-e", f"SKILL_NAME={skill_name}",
]
if creds_dir:
cmd += ["-v", f"{creds_dir}:/creds:ro"]
if fixtures:
cmd += ["-v", f"{fixtures_staging}:/fixtures:ro"]
if setup_merged:
cmd += ["-v", f"{setup_merged}:/setup:ro"]
cmd += [DOCKER_IMAGE, "bash", "-c", container_script]
start = time.time()
try:
proc = subprocess.run(
cmd,
capture_output=True,
timeout=timeout + 30,
)
elapsed = time.time() - start
return_code = proc.returncode
stderr_tail = proc.stderr.decode("utf-8", errors="replace")[-2000:]
if proc.stdout:
(eval_dir / "docker.stdout.log").write_bytes(proc.stdout)
except subprocess.TimeoutExpired as e:
elapsed = time.time() - start
return_code = -1
stderr_tail = f"TIMEOUT after {timeout}s"
if e.stderr:
stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]
metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail)
write_json(eval_dir / "metrics.json", metrics)
shutil.rmtree(fixtures_staging, ignore_errors=True)
return {
"eval_id": eval_id,
"elapsed_s": elapsed,
"return_code": return_code,
"transcript": str(transcript_path.relative_to(run_dir)),
"artifacts_dir": str(artifacts_dir.relative_to(run_dir)),
"metrics": metrics,
}
def resolve_fixtures(files: list[str], project_root: Path) -> list[tuple[Path, str]]:
out: list[tuple[Path, str]] = []
for entry in files:
candidate = (project_root / entry).resolve()
if not candidate.is_file():
alt = Path(entry).resolve()
if alt.is_file():
candidate = alt
else:
print(f"Warning: fixture not found: {entry}", file=sys.stderr)
continue
out.append((candidate, entry))
return out
def snapshot_files(root: Path) -> set[str]:
snap: set[str] = set()
for p in root.rglob("*"):
if p.is_file():
snap.add(str(p.relative_to(root)))
return snap
def diff_workspace(root: Path, before: set[str]) -> list[str]:
after = snapshot_files(root)
return sorted(after - before)
def sync_artifacts(workspace: Path, new_files: list[str], dest: Path) -> None:
for rel in new_files:
src = workspace / rel
if not src.is_file():
continue
if any(part in (".claude", "node_modules", ".git", ".venv") for part in src.parts):
continue
target = dest / rel
target.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(src, target)
def compute_metrics(transcript: Path, artifacts: Path, elapsed: float,
rc: int, stderr_tail: str) -> dict:
tool_calls: dict[str, int] = {}
total_steps = 0
if transcript.is_file():
for raw in transcript.read_text(encoding="utf-8", errors="replace").splitlines():
raw = raw.strip()
if not raw:
continue
try:
evt = json.loads(raw)
except json.JSONDecodeError:
continue
if evt.get("type") == "assistant":
total_steps += 1
for item in evt.get("message", {}).get("content", []):
if item.get("type") == "tool_use":
name = item.get("name", "?")
tool_calls[name] = tool_calls.get(name, 0) + 1
output_chars = 0
for f in artifacts.rglob("*"):
if f.is_file():
try:
output_chars += f.stat().st_size
except OSError:
pass
return {
"elapsed_s": round(elapsed, 2),
"return_code": rc,
"tool_calls": tool_calls,
"total_tool_calls": sum(tool_calls.values()),
"total_steps": total_steps,
"output_chars": output_chars,
"transcript_chars": transcript.stat().st_size if transcript.is_file() else 0,
"stderr_tail": stderr_tail,
}
def main() -> int:
parser = argparse.ArgumentParser(description="Run a skill's artifact evals in isolation")
parser.add_argument("--skill-path", required=True, type=Path)
parser.add_argument("--evals-file", required=True, type=Path)
parser.add_argument("--project-root", required=True, type=Path)
parser.add_argument("--output-dir", required=True, type=Path)
parser.add_argument("--isolation", choices=("docker", "local"), required=True)
parser.add_argument("--workers", type=int, default=8)
parser.add_argument("--timeout", type=int, default=600)
parser.add_argument("--eval-ids", default=None, help="Comma-separated subset of eval ids to run")
parser.add_argument("--quiet", action="store_true")
args = parser.parse_args()
skill_path = args.skill_path.resolve()
project_root = args.project_root.resolve()
evals_file = args.evals_file.resolve()
if not evals_file.is_file():
print(f"evals file not found: {evals_file}", file=sys.stderr)
return 2
skill_name, _, _ = parse_skill_md(skill_path)
data = read_json(evals_file)
evals = data["evals"] if isinstance(data, dict) and "evals" in data else data
if args.eval_ids:
wanted = {x.strip() for x in args.eval_ids.split(",") if x.strip()}
evals = [e for e in evals if str(e.get("id")) in wanted]
run_id = new_run_id(skill_name)
run_dir = (args.output_dir / run_id).resolve()
run_dir.mkdir(parents=True, exist_ok=True)
write_json(run_dir / "run.json", {
"run_id": run_id,
"skill_name": skill_name,
"skill_path": str(skill_path),
"project_root": str(project_root),
"evals_file": str(evals_file),
"isolation": args.isolation,
"started_at": utc_now_iso(),
"eval_count": len(evals),
})
runner = run_eval_docker if args.isolation == "docker" else run_eval_local
results: list[dict] = []
if not args.quiet:
print(
f"[run_evals] {len(evals)} evals, isolation={args.isolation}, run_dir={run_dir}",
file=sys.stderr,
)
with ThreadPoolExecutor(max_workers=args.workers) as pool:
future_to_eval = {
pool.submit(
runner,
item,
run_dir,
skill_path,
project_root,
int(item.get("timeout", args.timeout)),
discover_setup_dirs(evals_file, str(item.get("id", ""))),
): item
for item in evals
}
for fut in as_completed(future_to_eval):
item = future_to_eval[fut]
try:
res = fut.result()
except Exception as e:
res = {"eval_id": str(item.get("id")), "error": str(e), "return_code": -1}
results.append(res)
if not args.quiet:
rc = res.get("return_code")
status = "ok" if rc == 0 else f"rc={rc}"
print(
f" [{status}] eval {res.get('eval_id')} ({res.get('elapsed_s', 0):.1f}s)",
file=sys.stderr,
)
summary = {
"run_id": run_id,
"completed_at": utc_now_iso(),
"total": len(evals),
"executed": len(results),
"exec_failures": sum(1 for r in results if r.get("return_code") != 0),
"run_dir": str(run_dir),
"results": results,
}
write_json(run_dir / "execution-summary.json", summary)
print(json.dumps(summary, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,366 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run trigger evals: does the skill's description fire on each query?
Adapted from Anthropic skill-creator's run_eval.py
(https://github.com/anthropics/skills/tree/main/skills/skill-creator) with two
adaptations:
1. Isolation. Each query runs in either a fresh Docker container off
bmad-eval-runner:latest, or a fresh local tmp dir under ~/bmad-evals/<run-id>/
with HOME overridden to a clean directory. This prevents the host's global
CLAUDE.md and auto-memory from biasing whether the skill fires.
2. Output. Results are written to a run folder alongside the artifact eval
run-folder layout (so triggers and artifacts can share a single report).
Usage:
python3 run_triggers.py \\
--skill-path PATH \\
--triggers-file PATH/triggers.json \\
--output-dir PATH \\
--isolation docker|local \\
[--workers N] [--runs-per-query N] [--timeout SECS] [--threshold 0.5]
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(SCRIPT_DIR))
from utils import ( # noqa: E402
new_run_id,
parse_skill_md,
read_json,
read_macos_keychain_credentials,
stage_credentials,
utc_now_iso,
write_json,
)
DOCKER_IMAGE = "bmad-eval-runner:latest"
_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials()
def write_synthetic_skill(skills_dir: Path, skill_name: str, description: str, unique_id: str) -> tuple[Path, str]:
"""Place a synthetic skill at <skills_dir>/<clean_name>/SKILL.md.
The Skill tool only fires for entries discovered as actual skills (frontmatter
`name` + `description` under a `.claude/skills/<name>/SKILL.md`). Slash-commands
under `.claude/commands/` do not auto-invoke the Skill tool, so the previous
implementation could never observe a positive trigger. This places the synthetic
skill where Claude Code looks for skills, with a unique name so the detector
can disambiguate it from any pre-existing skill of the same display name.
"""
clean_name = f"{skill_name}-skill-{unique_id}"
skill_root = skills_dir / clean_name
skill_root.mkdir(parents=True, exist_ok=True)
path = skill_root / "SKILL.md"
indented_desc = "\n ".join(description.split("\n"))
path.write_text(
f"---\n"
f"name: {clean_name}\n"
f"description: |\n"
f" {indented_desc}\n"
f"---\n\n"
f"# {skill_name}\n\n"
f"This skill handles: {description}\n",
encoding="utf-8",
)
return path, clean_name
def parse_stream_for_trigger(buffer: str, clean_name: str) -> tuple[bool | None, str]:
"""Return (triggered_or_none, leftover_buffer). None means undecided yet."""
triggered: bool | None = None
pending_tool: str | None = None
accumulated_json = ""
leftover = ""
while "\n" in buffer:
line, buffer = buffer.split("\n", 1)
line = line.strip()
if not line:
continue
try:
evt = json.loads(line)
except json.JSONDecodeError:
continue
if evt.get("type") == "stream_event":
se = evt.get("event", {})
t = se.get("type", "")
if t == "content_block_start":
cb = se.get("content_block", {})
if cb.get("type") == "tool_use":
name = cb.get("name", "")
if name in ("Skill", "Read"):
pending_tool = name
accumulated_json = ""
else:
return False, ""
elif t == "content_block_delta" and pending_tool:
delta = se.get("delta", {})
if delta.get("type") == "input_json_delta":
accumulated_json += delta.get("partial_json", "")
if clean_name in accumulated_json:
return True, ""
elif t in ("content_block_stop", "message_stop"):
if pending_tool:
return clean_name in accumulated_json, ""
if t == "message_stop":
return False, ""
elif evt.get("type") == "assistant":
for item in evt.get("message", {}).get("content", []):
if item.get("type") != "tool_use":
continue
tname = item.get("name", "")
tinput = item.get("input", {})
if tname == "Skill" and clean_name in tinput.get("skill", ""):
return True, ""
if tname == "Read" and clean_name in tinput.get("file_path", ""):
return True, ""
return False, ""
elif evt.get("type") == "result":
return triggered if triggered is not None else False, ""
leftover = buffer
return triggered, leftover
def run_query_local(query: str, skill_name: str, description: str,
workspace_root: Path, timeout: int) -> bool:
workspace_root.mkdir(parents=True, exist_ok=True)
home_dir = workspace_root / ".home"
(home_dir / ".claude").mkdir(parents=True, exist_ok=True)
stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)
project_dir = workspace_root / "project"
skills_dir = project_dir / ".claude" / "skills"
project_dir.mkdir(parents=True, exist_ok=True)
unique = uuid.uuid4().hex[:8]
cmd_file, clean_name = write_synthetic_skill(skills_dir, skill_name, description, unique)
env = {
"HOME": str(home_dir),
"CLAUDE_CONFIG_DIR": str(home_dir / ".claude"),
"PATH": os.environ.get("PATH", ""),
"ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""),
}
cmd = [
"claude", "-p", query,
"--output-format", "stream-json",
"--verbose",
"--include-partial-messages",
"--dangerously-skip-permissions",
]
try:
proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.DEVNULL,
cwd=str(project_dir),
env=env,
)
buffer = ""
triggered: bool | None = None
start = time.time()
try:
while time.time() - start < timeout:
if proc.poll() is not None:
rest = proc.stdout.read()
if rest:
buffer += rest.decode("utf-8", errors="replace")
break
chunk = proc.stdout.read1(8192) if hasattr(proc.stdout, "read1") else proc.stdout.read(8192)
if not chunk:
time.sleep(0.05)
continue
buffer += chunk.decode("utf-8", errors="replace")
decided, buffer = parse_stream_for_trigger(buffer, clean_name)
if decided is not None:
triggered = decided
break
finally:
if proc.poll() is None:
proc.kill()
proc.wait()
if triggered is None:
decided, _ = parse_stream_for_trigger(buffer + "\n", clean_name)
triggered = bool(decided)
return bool(triggered)
finally:
try:
shutil.rmtree(cmd_file.parent, ignore_errors=True)
except OSError:
pass
def run_query_docker(query: str, skill_name: str, description: str,
workspace_root: Path, timeout: int) -> bool:
workspace_root.mkdir(parents=True, exist_ok=True)
unique = uuid.uuid4().hex[:8]
skills_in = workspace_root / "skills_in"
skills_in.mkdir(parents=True, exist_ok=True)
_, clean_name = write_synthetic_skill(skills_in, skill_name, description, unique)
creds_dir: Path | None = None
if _KEYCHAIN_CREDS:
creds_dir = workspace_root / "creds_in"
creds_dir.mkdir(parents=True, exist_ok=True)
(creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8")
container_script = f"""
set -e
mkdir -p /workspace/.claude/skills
cp -R /skills/. /workspace/.claude/skills/ 2>/dev/null || true
if [ -f /creds/.credentials.json ]; then
mkdir -p /home/evaluator/.claude
cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json
fi
cd /workspace
claude -p "$EVAL_QUERY" \\
--output-format stream-json --verbose --include-partial-messages \\
--dangerously-skip-permissions \\
> /output/stream.jsonl 2>/dev/null || true
"""
output_dir = workspace_root / "output"
output_dir.mkdir(parents=True, exist_ok=True)
cmd = [
"docker", "run", "--rm",
"-v", f"{skills_in}:/skills:ro",
"-v", f"{output_dir}:/output",
"-e", "ANTHROPIC_API_KEY",
"-e", f"EVAL_QUERY={query}",
]
if creds_dir:
cmd += ["-v", f"{creds_dir}:/creds:ro"]
cmd += [DOCKER_IMAGE, "bash", "-c", container_script]
try:
subprocess.run(cmd, capture_output=True, timeout=timeout + 30)
except subprocess.TimeoutExpired:
pass
stream_file = output_dir / "stream.jsonl"
if not stream_file.is_file():
return False
decided, _ = parse_stream_for_trigger(stream_file.read_text(encoding="utf-8", errors="replace") + "\n", clean_name)
return bool(decided)
def main() -> int:
parser = argparse.ArgumentParser(description="Run trigger evals in isolation")
parser.add_argument("--skill-path", required=True, type=Path)
parser.add_argument("--triggers-file", required=True, type=Path)
parser.add_argument("--output-dir", required=True, type=Path)
parser.add_argument("--isolation", choices=("docker", "local"), required=True)
parser.add_argument("--workers", type=int, default=8)
parser.add_argument("--runs-per-query", type=int, default=3)
parser.add_argument("--timeout", type=int, default=45)
parser.add_argument("--threshold", type=float, default=0.5)
parser.add_argument("--quiet", action="store_true")
args = parser.parse_args()
skill_path = args.skill_path.resolve()
triggers_file = args.triggers_file.resolve()
if not triggers_file.is_file():
print(f"triggers file not found: {triggers_file}", file=sys.stderr)
return 2
skill_name, description, _ = parse_skill_md(skill_path)
queries = read_json(triggers_file)
run_id = new_run_id(f"{skill_name}-triggers")
run_dir = (args.output_dir / run_id).resolve()
(run_dir / "queries").mkdir(parents=True, exist_ok=True)
write_json(run_dir / "run.json", {
"run_id": run_id,
"skill_name": skill_name,
"description": description,
"isolation": args.isolation,
"started_at": utc_now_iso(),
"query_count": len(queries),
"runs_per_query": args.runs_per_query,
"threshold": args.threshold,
})
runner = run_query_docker if args.isolation == "docker" else run_query_local
def run_one(idx: int, q: dict, run_idx: int) -> tuple[int, bool]:
ws = run_dir / "queries" / f"q{idx:03d}-r{run_idx}"
triggered = runner(q["query"], skill_name, description, ws, args.timeout)
return idx, triggered
per_query: dict[int, list[bool]] = {}
if not args.quiet:
print(f"[run_triggers] {len(queries)} queries × {args.runs_per_query} runs, isolation={args.isolation}", file=sys.stderr)
with ThreadPoolExecutor(max_workers=args.workers) as pool:
futures = []
for idx, q in enumerate(queries):
for run_idx in range(args.runs_per_query):
futures.append(pool.submit(run_one, idx, q, run_idx))
for fut in as_completed(futures):
try:
idx, triggered = fut.result()
except Exception as e:
print(f"Warning: query failed: {e}", file=sys.stderr)
continue
per_query.setdefault(idx, []).append(triggered)
results = []
for idx, q in enumerate(queries):
triggers = per_query.get(idx, [])
rate = (sum(triggers) / len(triggers)) if triggers else 0.0
should = bool(q["should_trigger"])
if should:
passed = rate >= args.threshold
else:
passed = rate < args.threshold
results.append({
"query": q["query"],
"should_trigger": should,
"trigger_rate": rate,
"triggers": int(sum(triggers)),
"runs": len(triggers),
"pass": passed,
})
output = {
"run_id": run_id,
"completed_at": utc_now_iso(),
"skill_name": skill_name,
"description": description,
"isolation": args.isolation,
"results": results,
"summary": {
"total": len(results),
"passed": sum(1 for r in results if r["pass"]),
"failed": sum(1 for r in results if not r["pass"]),
},
}
write_json(run_dir / "triggers-result.json", output)
print(json.dumps(output, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,260 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Shared helpers for the eval runner."""
from __future__ import annotations
import json
import re
import shutil
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path
def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
"""Return (name, description, body) from the skill's SKILL.md frontmatter."""
text = (skill_path / "SKILL.md").read_text(encoding="utf-8")
fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n(.*)$", text, re.DOTALL)
if not fm_match:
raise ValueError(f"SKILL.md at {skill_path} is missing frontmatter")
frontmatter, body = fm_match.group(1), fm_match.group(2)
name = None
description_lines: list[str] = []
in_description = False
for line in frontmatter.splitlines():
if line.startswith("name:"):
name = line.split(":", 1)[1].strip()
in_description = False
elif line.startswith("description:"):
value = line.split(":", 1)[1].strip()
if value in ("|", ">"):
in_description = True
else:
description_lines = [value]
in_description = False
elif in_description and line.startswith((" ", "\t")):
description_lines.append(line.strip())
elif in_description:
in_description = False
if not name:
raise ValueError(f"SKILL.md at {skill_path} is missing a name")
return name, " ".join(description_lines).strip(), body
def discover_project_root(skill_path: Path) -> Path:
"""Walk up from the skill looking for _bmad/ or .git; default to skill's grandparent."""
for parent in [skill_path, *skill_path.parents]:
if (parent / "_bmad").is_dir() or (parent / ".git").exists():
return parent
return skill_path.parent.parent
def discover_evals(
skill_path: Path,
project_root: Path,
explicit: Path | None,
) -> dict[str, Path]:
"""Locate evals.json and triggers.json. Return dict with keys 'evals' and/or 'triggers'."""
found: dict[str, Path] = {}
def check_dir(d: Path) -> None:
if not d.is_dir():
return
for key, fname in (("evals", "evals.json"), ("triggers", "triggers.json")):
candidate = d / fname
if candidate.is_file() and key not in found:
found[key] = candidate
if explicit is not None:
explicit = explicit.resolve()
if explicit.is_file():
if explicit.name == "evals.json":
found["evals"] = explicit
elif explicit.name == "triggers.json":
found["triggers"] = explicit
elif explicit.is_dir():
check_dir(explicit)
return found
skill_name = skill_path.name
candidates: list[Path] = [
skill_path / "evals",
skill_path.parent.parent / "evals" / skill_name,
project_root / "evals" / skill_name,
]
for d in candidates:
check_dir(d)
if found:
break
if not found:
evals_root = project_root / "evals"
if evals_root.is_dir():
for sub in evals_root.rglob(skill_name):
if sub.is_dir():
check_dir(sub)
if found:
break
return found
def utc_now_iso() -> str:
return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
def new_run_id(skill_name: str) -> str:
return f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{skill_name}"
def have_docker() -> bool:
if shutil.which("docker") is None:
return False
try:
result = subprocess.run(
["docker", "info"],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
timeout=5,
)
return result.returncode == 0
except Exception:
return False
def docker_image_present(image: str = "bmad-eval-runner:latest") -> bool:
if not have_docker():
return False
try:
result = subprocess.run(
["docker", "image", "inspect", image],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
timeout=10,
)
return result.returncode == 0
except Exception:
return False
def read_macos_keychain_credentials() -> str | None:
"""Read the Claude Code OAuth credentials JSON from the macOS Keychain.
Returns the raw JSON string stored under service "Claude Code-credentials",
or None if unavailable (non-macOS, entry missing, or access denied).
Called in the parent process — which owns the Keychain ACL — so the credential
can be staged into each isolated workspace's `.claude/.credentials.json` before
`claude -p` is launched. Without this, an isolated subprocess with HOME pointed
at an empty dir has no auth and every eval fails with "Not logged in."
"""
if sys.platform != "darwin":
return None
try:
result = subprocess.run(
["security", "find-generic-password", "-s", "Claude Code-credentials", "-w"],
capture_output=True,
timeout=5,
)
if result.returncode != 0:
return None
val = result.stdout.decode("utf-8", errors="replace").strip()
return val if val else None
except Exception:
return None
def stage_credentials(claude_dir: Path, credentials_json: str | None) -> None:
"""Write credentials_json to <claude_dir>/.credentials.json. No-op if None."""
if not credentials_json:
return
claude_dir.mkdir(parents=True, exist_ok=True)
(claude_dir / ".credentials.json").write_text(credentials_json, encoding="utf-8")
def write_json(path: Path, data: object) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
def read_json(path: Path) -> object:
return json.loads(path.read_text(encoding="utf-8"))
def parse_skill_dependencies(skill_path: Path) -> list[str]:
"""Return skill names declared under 'dependencies:' in SKILL.md frontmatter."""
try:
text = (skill_path / "SKILL.md").read_text(encoding="utf-8")
except (FileNotFoundError, OSError):
return []
fm = re.match(r"^---\s*\n(.*?)\n---", text, re.DOTALL)
if not fm:
return []
deps: list[str] = []
in_deps = False
for line in fm.group(1).splitlines():
if re.match(r"^dependencies\s*:", line):
in_deps = True
elif in_deps:
m = re.match(r"^\s+-\s+(\S+)", line)
if m:
deps.append(m.group(1))
elif not line.startswith((" ", "\t")):
break
return deps
def discover_setup_dirs(evals_file: Path, eval_id: str | None = None) -> list[Path]:
"""Return ordered list of setup overlay dirs that exist.
base: <evals_dir>/setup/
per-eval: <evals_dir>/<eval_id>/setup/
Applied base-first so per-eval overlays win on conflict.
"""
evals_dir = evals_file.parent
dirs: list[Path] = []
base = evals_dir / "setup"
if base.is_dir():
dirs.append(base)
if eval_id:
per_eval = evals_dir / eval_id / "setup"
if per_eval.is_dir():
dirs.append(per_eval)
return dirs
def apply_setup_overlay(setup_dirs: list[Path], dest: Path) -> None:
"""Rsync each setup dir onto dest in order (base first, per-eval last)."""
dest.mkdir(parents=True, exist_ok=True)
for src in setup_dirs:
if not src.is_dir():
continue
subprocess.run(
["rsync", "-a", f"{src}/", f"{dest}/"],
check=False,
)
__all__ = [
"parse_skill_md",
"discover_project_root",
"discover_evals",
"utc_now_iso",
"new_run_id",
"have_docker",
"docker_image_present",
"read_macos_keychain_credentials",
"stage_credentials",
"write_json",
"read_json",
"parse_skill_dependencies",
"discover_setup_dirs",
"apply_setup_overlay",
]