chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS per CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 canonical docs + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md as the entry point for devs/agents

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
91
.claude/skills/bmad-eval-runner/SKILL.md
Normal file
@@ -0,0 +1,91 @@
---
name: bmad-eval-runner
description: Run a skill's evals in a clean, isolated environment and report results. Use when the user wants to evaluate a skill, run evals, benchmark a skill, validate triggers, or grade skill outputs.
---

# Skill Eval Runner

## Overview

Run a skill's evals in an environment that keeps out the user's global config, auto-memory, and ancestor `CLAUDE.md` files, so the result reflects the skill itself, not the bench it was tested on. Preserve every run's artifacts so the user can inspect what happened, not just whether it passed.

Two eval shapes are supported and run independently:

- **Artifact evals** (`evals.json`) — execute the skill against a prompt, capture the run's outputs, and grade each output against the eval's `expectations`.
- **Trigger evals** (`triggers.json`) — measure whether the skill's `description` causes Claude to invoke the skill on queries that should trigger it, and to stay clear on queries that should not.

You are an experienced eval engineer. The user wants signal, not theatre. Cite specific findings, surface evals that pass for trivial reasons, and never silently widen tolerances to make a run "succeed."

## Args

- Positional: a path to the skill being evaluated (directory containing `SKILL.md`).
- `--evals <path>` — explicit path to evals folder or a specific `evals.json` / `triggers.json` file. If omitted, discover.
- `--mode artifact|trigger|both` — which eval kind to run. Default: `both` if both files are found, else whichever exists.
- `--isolation docker|local|auto` — sandbox strategy. Default: `auto` (Docker when available, otherwise local).
- `--project-root <path>` — root of the project the skill belongs to. Default: walk up from skill path looking for `_bmad/` or `.git/`.
- `--output-dir <path>` — where run folders are written. Default: `{bmad_builder_reports}/eval-runs/` if configured, else `~/bmad-evals/`.
- `--workers <n>` — parallel evals. Default: 4.
- `--headless` / `-H` — non-interactive; emit final JSON only.
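
The argument list above maps onto a small `argparse` parser. This is a minimal sketch, not the runner's actual CLI: `build_parser` and `resolve_mode` are hypothetical names, and the sketch assumes that leaving `--mode` unset means "decide from whatever discovery finds".

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="bmad-eval-runner")
    parser.add_argument("skill_path", help="directory containing SKILL.md")
    parser.add_argument("--evals", help="evals folder, or a specific evals.json / triggers.json")
    parser.add_argument("--mode", choices=["artifact", "trigger", "both"])
    parser.add_argument("--isolation", choices=["docker", "local", "auto"], default="auto")
    parser.add_argument("--project-root")
    parser.add_argument("--output-dir")
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("-H", "--headless", action="store_true")
    return parser


def resolve_mode(requested: Optional[str], has_artifact: bool, has_trigger: bool) -> str:
    """Apply the documented default: both if both files exist, else whichever exists."""
    if requested is not None:
        return requested
    if has_artifact and has_trigger:
        return "both"
    if has_artifact:
        return "artifact"
    if has_trigger:
        return "trigger"
    # Assumption: with nothing to run, the caller should halt rather than guess.
    raise FileNotFoundError("no evals.json or triggers.json found")
```

Keeping the mode default out of `argparse` lets it depend on the discovery result, which is only known after parsing.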

## On Activation

1. Resolve config the same way `bmad-workflow-builder` does (`{project-root}/_bmad/config.yaml` then `config.user.yaml`, falling back to `bmb/config.yaml`). Resolve `{user_name}`, `{communication_language}`, `{bmad_builder_reports}`. Apply throughout the session.

2. If `--headless` was passed, set `{headless_mode}=true` and skip every confirmation below; pick the safest defaults and proceed.

3. Locate the skill. Verify `<skill-path>/SKILL.md` exists; halt with a clear error if it doesn't.

4. Discover evals — see `## Eval Discovery` below.

5. Choose isolation — see `## Isolation` below. On the first Docker run on this machine, the image will need to be built; surface that, ask once unless headless, then cache.

6. Confirm the run summary with the user (skill, evals found, mode, isolation, output dir) unless headless. Then execute.

## Eval Discovery

Look in this order, taking the first match:

1. `--evals` argument if provided. May point to a folder (containing `evals.json` and/or `triggers.json`) or a specific JSON file.
2. `<skill-path>/evals/` — colocated with the skill.
3. `<skill-path>/../../evals/<skill-name>/` — sibling-of-parent layout (common in BMad modules where `evals/` is excluded from distribution but lives next to `src/`).
4. `<project-root>/evals/<skill-name>/` — top-level evals tree.
5. `<project-root>/evals/**/<skill-name>/` — anywhere under project evals.

Surface what you found and where. If no evals are discovered, halt with a clear message — do not attempt to fabricate evals.
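
The lookup order above can be sketched as a single function. This is an illustration under stated assumptions, not the runner's code: `discover_evals` is a hypothetical name, the skill name is taken to be the skill directory's name, and an explicit `--evals` path that does not exist is treated as "nothing found" instead of falling through to the other locations.

```python
from pathlib import Path
from typing import Optional

EVAL_FILES = ("evals.json", "triggers.json")


def _has_evals(folder: Path) -> bool:
    """True when the folder holds at least one recognized eval file."""
    return folder.is_dir() and any((folder / name).is_file() for name in EVAL_FILES)


def discover_evals(
    skill_path: Path, project_root: Path, explicit: Optional[Path] = None
) -> Optional[Path]:
    """Return the first matching evals location, or None if nothing is found."""
    skill_name = skill_path.name  # assumption: directory name == skill name

    # 1. An explicit path wins (folder or specific JSON file).
    if explicit is not None:
        return explicit if explicit.exists() else None

    # 2-4. Fixed locations, checked in the documented order.
    candidates = [
        skill_path / "evals",
        skill_path.parent.parent / "evals" / skill_name,
        project_root / "evals" / skill_name,
    ]
    for candidate in candidates:
        if _has_evals(candidate):
            return candidate

    # 5. Anywhere under <project-root>/evals/; sorted for a stable result.
    evals_root = project_root / "evals"
    if evals_root.is_dir():
        for candidate in sorted(evals_root.rglob(skill_name)):
            if _has_evals(candidate):
                return candidate

    return None  # caller halts with diagnostics; no evals are invented
```

Returning `None` instead of raising keeps the "halt with a clear message" decision in the caller, which knows every location that was checked.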

## Isolation

Run each eval in a fresh workspace so memory, project CLAUDE.md, prior runs, and host shell config cannot bias the result. Two strategies, picked automatically by default:

- **Docker** (preferred when available): each eval runs in a fresh container off `bmad-eval-runner:latest`. Apart from the eval's own parameters, the host's `ANTHROPIC_API_KEY` is the only environment variable passed in. The skill's project is bind-mounted read-only and copied into a writable scratch dir inside the container; `HOME` is a fresh in-container directory; there is no auto-memory and no host CLAUDE.md.

- **Local fallback** (when Docker is unavailable or the user opts out): each eval runs in a fresh `~/bmad-evals/<run-id>/<eval-id>/workspace/` directory with `HOME=<workspace>/.home` overridden so global memory and global CLAUDE.md do not leak. The project is copied (or hardlinked where supported) into the workspace. Tell the user this is the active mode and acknowledge that local isolation is best-effort, not hermetic.

The first time Docker is selected on this machine, build the image — `python3 {skill-root}/scripts/docker_setup.py --build` — and tell the user this is happening once.

Details and the exact mount layout live in `references/isolation.md`. Read that file when you need to debug an isolation issue or explain to the user what is being isolated.
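
The `auto` selection rule described in this section is small enough to state as code. A minimal sketch, assuming hypothetical helper names (`choose_isolation`, `isolation_notice`) and assuming that an explicit `--isolation docker` request should fail loudly when Docker is missing instead of silently degrading:

```python
def choose_isolation(requested: str, docker_ok: bool) -> str:
    """Map the --isolation value plus Docker availability to a concrete strategy."""
    if requested not in ("docker", "local", "auto"):
        raise ValueError(f"unknown isolation strategy: {requested}")
    if requested == "auto":
        return "docker" if docker_ok else "local"
    if requested == "docker" and not docker_ok:
        # Assumption: an explicit request is never downgraded without the user knowing.
        raise RuntimeError("Docker isolation was requested but Docker is not available")
    return requested


def isolation_notice(strategy: str) -> str:
    """Message to surface so the user knows which mode is active."""
    if strategy == "local":
        return "Running with local isolation: best-effort, not hermetic."
    return "Running with Docker isolation."
```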

## Run Execution

For artifact evals, invoke `python3 {skill-root}/scripts/run_evals.py` with the resolved arguments. The script handles isolation per eval, runs `claude -p` in the sandbox with the eval's prompt and any staged fixture files, and writes a per-eval folder with `prompt.txt`, `transcript.jsonl`, `artifacts/`, and `metrics.json`.

For trigger evals, invoke `python3 {skill-root}/scripts/run_triggers.py`. The script measures whether the skill's description causes the skill to fire for each query, with `runs-per-query` repeats for stability, and writes `triggers-result.json`. Trigger evals should run under Docker isolation when available — local mode can have the host's installed skills bleed in via cwd-based skill discovery, biasing the trigger signal. If Docker is unavailable, run trigger evals locally but say so explicitly.

After artifact runs complete, grade each eval. Spawn a grader subagent per eval in parallel (Agent tool, prompt loaded from `{skill-root}/agents/grader.md` plus the eval's `expectations` and the path to its outputs). Each grader writes `grading.json` next to the artifacts. The grader has license to flag weak assertions — relay that feedback to the user.

After all grading is done, generate the aggregate report — `python3 {skill-root}/scripts/generate_report.py --run-dir <run-id>` — which produces `report.html`. Tell the user where the run folder is and where the HTML report is.

## Outcomes

- Every eval's prompt, transcript, artifacts, and grading land on disk and stay there. Nothing is silently cleaned up.
- The run honestly reflects the skill's behavior in a clean room — not the behavior of the host shell with its memories and configs.
- The user knows whether Docker or local was used and why.
- Failures cite specific expectations and evidence; passes that look superficial are flagged, not papered over.

## Constraints

- **Artifacts are forever.** Never delete, overwrite, or rotate run folders. Disk usage is the user's call.
- **Auth boundary is narrow.** On macOS, the host's Claude Code OAuth credential is staged into each isolated `.claude/.credentials.json` so the subprocess can authenticate without inheriting host config. `ANTHROPIC_API_KEY`, if set, is also forwarded. Nothing else crosses.
- **Trigger evals do not need real artifacts.** They use a synthetic stub skill (see `references/isolation.md`) and only measure description firing, so keep them cheap and parallel.
- **No silent fallbacks on grading.** If a grader subagent errors, mark that eval `grading_error` rather than substituting a default verdict.
- **Stop when evals are missing.** If discovery returns nothing, halt with diagnostics — the runner does not invent test cases.
93
.claude/skills/bmad-eval-runner/agents/grader.md
Normal file
@@ -0,0 +1,93 @@
# Grader Agent

Evaluate a single eval's expectations against its captured transcript and artifacts. Return pass/fail per expectation with evidence — and flag weak assertions when you see them.

You are not the executor. You are not allowed to "fix" the artifacts. Your only job is to inspect what was produced and answer: did each expectation hold?

## Inputs

You receive in your prompt:

- **eval_id**: identifier for this eval
- **prompt**: the original user message that was sent to the skill
- **expected_output**: human-readable description of what success looks like (context only, not scored against)
- **expectations**: list of strings — the assertions you grade
- **transcript_path**: absolute path to a stream-JSON transcript (`.jsonl`)
- **artifacts_dir**: absolute path to the directory containing files the skill wrote
- **grading_path**: absolute path where you write `grading.json`

## Process

1. **Read the transcript.** Open `transcript_path`. The transcript is stream-JSON: each line is a JSON event. Note:
   - The user prompt that was sent
   - Every tool call Claude made — `Write`, `Edit`, `Read`, `Skill`, `Bash`, etc. (the event has `type: "assistant"` and `content[].type: "tool_use"` with `name` and `input`)
   - The order tool calls happened in (events are line-ordered)
   - The final assistant message — often contains a JSON status block for headless runs
   - Any errors or warnings logged

2. **List and inspect artifacts.** Walk `artifacts_dir`. For each expectation, open the files it implicates and read their contents — do not rely on filenames alone. Note file modification times when ordering or read-only behavior matters.

3. **Grade each expectation independently.** For each entry in `expectations`, identify what kind of check it is and gather the right evidence:

   - **Side-artifact existence + content** ("decision-log.md exists AND captures decision X") → open the file, read it, check the content matches.
   - **Transcript tool-call patterns** ("transcript contains a Skill call to bmad-editorial-review-prose") → scan the transcript for `tool_use` events with the matching `name` and `input`. Quote the matching event.
   - **Phase ordering** ("polish call occurs after the Write to brief.md and before the final JSON block") → find the line numbers / event indices of each landmark and verify the order.
   - **Read-only enforcement** ("input brief.md is byte-identical to the fixture; no Write/Edit calls targeted it") → compare file content if the original is available; AND scan the transcript for any Write/Edit `tool_use` whose `input.file_path` falls in the protected directory.
   - **YAML frontmatter** ("frontmatter contains title, status, created (ISO 8601), updated") → parse the frontmatter, check fields and their formats.
   - **JSON output blocks** ("final assistant message contains a JSON object with intent='create'") → look at the final `text` content of the last assistant message; extract the JSON object; check the field.
   - **Bidirectional fidelity** ("every decision in decision-log.md is reflected in brief.md AND no claim in brief.md is absent from the input prompt or log") → list decisions in the log, verify each appears in the brief; list substantive claims in the brief, verify each traces to either the prompt or the log.

4. **Decide PASS or FAIL with specific evidence.**
   - PASS only if there is clear, specific evidence the expectation holds AND the evidence reflects substance, not surface compliance (file exists AND contains correct content, not just the right filename).
   - FAIL when no evidence is found, evidence contradicts, or the assertion is technically satisfied but the underlying outcome is wrong.
   - Cite the evidence — quote a specific line, name a specific file with a path, point to a specific tool call with its index or input.

5. **Critique the evals.** After grading, surface assertions that look weak: ones that passed but would also pass for a clearly wrong output, or important outcomes you observed (good or bad) that no assertion checks. Keep the bar high — flag what an eval author would say "good catch" about, not nits.

6. **Write `grading.json`.** Save to `grading_path`.
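
Several of the checks in step 3 (tool-call patterns, phase ordering, read-only enforcement) reduce to scanning the transcript for `tool_use` blocks. The sketch below shows one way to do that; the helper names are hypothetical, and it assumes content blocks may sit either directly on the event (`content`) or under a nested `message` object, so it checks both.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_tool_calls(transcript_path: Path) -> Iterator[tuple[int, str, dict]]:
    """Yield (line_number, tool_name, tool_input) for every tool_use block."""
    with transcript_path.open(encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            raw = raw.strip()
            if not raw:
                continue
            try:
                event = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of guessing their meaning
            if not isinstance(event, dict) or event.get("type") != "assistant":
                continue
            # Assumption: blocks live under event["message"]["content"] or event["content"].
            message = event.get("message")
            nested = message.get("content") if isinstance(message, dict) else None
            content = nested or event.get("content") or []
            if not isinstance(content, list):
                continue
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    yield line_no, block.get("name", ""), block.get("input") or {}


def writes_under(calls: list, protected_dir: str) -> list:
    """Write/Edit calls whose input.file_path falls inside protected_dir."""
    protected = Path(protected_dir)
    hits = []
    for line_no, name, tool_input in calls:
        file_path = tool_input.get("file_path")
        if name in ("Write", "Edit") and isinstance(file_path, str):
            if Path(file_path).is_relative_to(protected):  # Python 3.9+
                hits.append((line_no, name, tool_input))
    return hits


def occurs_before(calls: list, first_name: str, second_name: str) -> bool:
    """True when the first call to first_name precedes the first call to second_name."""
    first = next((n for n, name, _ in calls if name == first_name), None)
    second = next((n for n, name, _ in calls if name == second_name), None)
    return first is not None and second is not None and first < second
```

Line numbers are kept alongside each call so the verdict can cite the exact transcript line as evidence.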

## Output Format

```json
{
  "eval_id": "<eval_id>",
  "expectations": [
    {
      "text": "brief.md exists in the run folder",
      "passed": true,
      "evidence": "Found at artifacts/2026-05-09-insulens/brief.md, 487 words"
    },
    {
      "text": "decision-log.md references having ingested the memo as source material",
      "passed": false,
      "evidence": "decision-log.md exists but contains only template placeholders; no mention of the memo"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  },
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "brief.md exists in the run folder",
        "reason": "Existence is a weak check — an empty brief.md would also pass. Consider pairing with a content assertion (e.g., word count > 200, contains the project name)."
      }
    ],
    "overall": "Assertions check structure but not content correctness in two places."
  }
}
```

If `eval_feedback.suggestions` would be empty, set it to `[]` and `overall` to `"No suggestions; assertions look solid."`
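
The `summary` object follows mechanically from the `expectations` list. A minimal sketch of that arithmetic, assuming a hypothetical helper name `summarize` and assuming that an empty list yields a `pass_rate` of `0.0` (the format above does not say what to do in that case):

```python
def summarize(expectations: list[dict]) -> dict:
    """Build the summary block from graded expectations (no partial credit)."""
    for entry in expectations:
        # Each verdict must be a strict boolean; anything else is a grading bug.
        if not isinstance(entry.get("passed"), bool):
            raise ValueError(f"'passed' must be true or false: {entry.get('text')!r}")
    total = len(expectations)
    passed = sum(1 for entry in expectations if entry["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: report 0.0 rather than dividing by zero when nothing was graded.
        "pass_rate": passed / total if total else 0.0,
    }
```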

## Guidelines

- **Be objective.** Verdicts come from evidence, not vibes.
- **Be specific.** Quote, name files, point to line numbers.
- **No partial credit.** Each expectation is pass or fail.
- **Burden of proof is on the expectation.** When uncertain, fail.
- **Do not edit artifacts.** You are read-only against the run folder.
- **Do not silently substitute defaults.** If you genuinely cannot read a file or the transcript is missing, mark the affected expectations failed with that as the evidence.
29
.claude/skills/bmad-eval-runner/assets/Dockerfile
Normal file
@@ -0,0 +1,29 @@
FROM node:20-bookworm-slim

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        python3 \
        python3-pip \
        ca-certificates \
        curl \
        jq \
        rsync \
    && rm -rf /var/lib/apt/lists/*

RUN npm install -g @anthropic-ai/claude-code

RUN useradd -ms /bin/bash evaluator \
    && mkdir -p /workspace /project /output /home/evaluator/.claude \
    && chown -R evaluator:evaluator /workspace /output /home/evaluator

USER evaluator
WORKDIR /workspace

ENV HOME=/home/evaluator
ENV CLAUDE_CONFIG_DIR=/home/evaluator/.claude
ENV PATH=/home/evaluator/.local/bin:$PATH

CMD ["bash"]
147
.claude/skills/bmad-eval-runner/references/eval-formats.md
Normal file
@@ -0,0 +1,147 @@
# Eval Formats

The runner accepts two file shapes, both compatible with Anthropic's skill-creator conventions.

## Artifact evals — `evals.json`

```json
{
  "skill_name": "bmad-product-brief",
  "evals": [
    {
      "id": 1,
      "prompt": "I want to create a brief for ...",
      "expected_output": "A run folder with brief.md and decision-log.md ...",
      "files": [
        "evals/.../files/some-fixture.md"
      ],
      "expectations": [
        "brief.md exists in the run folder",
        "decision-log.md exists",
        "brief.md word count is between 250 and 1500"
      ]
    }
  ]
}
```

Field semantics:

- **id**: stable identifier; used as the eval's directory name in the run folder.
- **prompt**: the literal user message Claude will receive. Sent verbatim to `claude -p`.
- **expected_output**: human-readable description, used for context only — the grader reads it but does not score against it directly.
- **files**: optional fixture paths. Resolved relative to the project root (or the evals folder). Each file is staged into the eval's workspace before execution. Path semantics:
  - A bare filename is staged at the workspace root.
  - A nested path (`some-brief/brief.md`) preserves the directory structure inside the workspace.
- **expectations**: list of pass/fail assertions evaluated by the grader subagent. Each is graded independently. The grader is instructed to flag weak assertions — assertions a wrong output would also trivially pass.

The grader writes `grading.json` next to each eval's artifacts; the runner aggregates.

## Trigger evals — `triggers.json`

```json
[
  { "query": "Help me write a product brief for ...", "should_trigger": true },
  { "query": "Help me brainstorm ideas for ...", "should_trigger": false }
]
```

The runner stages a synthetic skill in the sandbox at `.claude/skills/<unique-name>/SKILL.md` containing the skill's description (see `references/isolation.md` for why this is a skill and not a slash command under `.claude/commands/`), then runs each query against `claude -p` with stream-JSON output and detects whether the skill (or a Read of its SKILL.md) appears as a tool call. Each query is run `--runs-per-query` times (default 3); `trigger_rate` is the fraction of runs that fired.

A query passes when:
- `should_trigger=true` and `trigger_rate >= --trigger-threshold` (default 0.5)
- `should_trigger=false` and `trigger_rate < --trigger-threshold`
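
The pass rule above is a two-line predicate. A minimal sketch with hypothetical function names (`trigger_rate`, `query_passes`); note the asymmetry at the boundary, where a rate exactly equal to the threshold counts as "fired":

```python
def trigger_rate(fired: list[bool]) -> float:
    """Fraction of runs in which the skill fired (0.0 when there were no runs)."""
    return sum(fired) / len(fired) if fired else 0.0


def query_passes(should_trigger: bool, rate: float, threshold: float = 0.5) -> bool:
    """Apply the documented rule: >= threshold to trigger, < threshold to stay clear."""
    return rate >= threshold if should_trigger else rate < threshold
```

With the default three runs per query, a should-trigger query needs at least two of three runs to fire, and a should-not-trigger query tolerates at most one.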

Trigger evals do not produce artifacts beyond the result JSON. They are cheap and parallelize aggressively.

## Where evals can live

The runner discovers evals in this order:

1. `--evals <path>` — explicit. May point to a folder or a specific `*.json`.
2. `<skill-path>/evals/` — colocated with the skill.
3. `<skill-path>/../../evals/<skill-name>/` — sibling-of-parent. Common pattern when evals are intentionally excluded from skill distribution.
4. `<project-root>/evals/<skill-name>/`.
5. `<project-root>/evals/**/<skill-name>/` — fuzzy search under the project's evals tree.

If both `evals.json` and `triggers.json` are found, both run unless `--mode` narrows it.

## Two patterns for single-shot evals

Most multi-turn workflow skills can be evaluated single-shot if you design the eval right. Two patterns cover the bulk of what you'd otherwise need a multi-turn simulator for:

### Pattern A — artifact correctness (headless + rich prompt)

Force the skill into headless mode and pack the prompt with everything Discovery would have surfaced. Grade what comes out: the artifact, its structure, whether it reflects the inputs without inventing.

Use when:
- The deliverable is the artifact (brief, PRD, doc, plan)
- You can write a complete pre-Discovery prompt
- You want regression coverage on drafting/format/extraction

### Pattern B — process discipline (headless + transcript and side-artifact inspection)

Same single-shot mechanics, but the expectations look at *what the skill did internally* — not just the final output. The grader reads the stream-JSON transcript for tool calls, walks side-artifacts (decision logs, addenda, distillates), checks file mtimes, and verifies phase ordering.

Use when:
- The skill enforces a protocol (decision log, polish phase, finalize sequence)
- The skill has read-only intents (Validate must not write)
- You need to catch "drafting works but the discipline went soft" regressions

These are deterministic checks against the transcript and filesystem — no LLM judgment needed for most of them.

### What single-shot can NOT cover

Facilitation arc: vague-input → sharper pushback → user clarifies → better artifact. That requires a multi-turn user simulator. Defer it to a separate eval mode for skills where conversation is the value (coaching, brainstorming, design thinking).

## Writing good expectations

The grader's job is easier when expectations are *discriminating* — hard to pass without actually doing the work.

**Weak patterns to avoid:**
- **Filename-only checks** — "brief.md exists" passes for an empty file. Pair with a content check.
- **Wholly subjective phrasing** — "the brief is high quality" cannot be evaluated. State the property concretely.
- **Tautologies** — anything that follows from the prompt being understood is not a useful expectation.

**Strong patterns for artifact correctness (Pattern A):**
- Specific facts that should appear ("incorporates at least 2 specific findings from section X")
- Structural claims a wrong output would fail ("word count between 250 and 1500")
- Negative assertions ("does not introduce content from unrelated sections")
- YAML frontmatter checks ("frontmatter contains title, status, created, updated as ISO 8601")
- Bounded JSON output ("final assistant message contains a JSON object with intent='create'")

**Strong patterns for process discipline (Pattern B):**
- **Side-artifact existence + content** ("decision-log.md exists AND captures the pricing decision with rejected alternative and rationale")
- **Transcript tool-call patterns** ("the transcript contains a Skill tool call invoking bmad-editorial-review-prose")
- **Phase ordering** ("the polish-phase Skill calls occur after the brief body Write and before the final JSON status block")
- **Read-only enforcement** ("the input brief.md is byte-identical to the staged fixture; no Write or Edit tool calls targeted the run folder")
- **Bidirectional fidelity** ("every substantive entry in decision-log.md has a corresponding reflection in brief.md, AND no claim in brief.md is absent from the input prompt or decision-log.md")
- **Timestamp checks** ("YAML frontmatter 'updated' field is later than 'created'; 'created' is unchanged from the input fixture")

## Headless mode — getting the skill to behave non-interactively

Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and produces a structured JSON status block at the end. To use Pattern A or B, the eval prompt needs to trigger this. Common signals:

- The literal phrase `Run headless.` at the start of the prompt
- Skill-specific flags or keywords as documented in the skill's `## Headless Mode` section
- Sufficient context such that no clarification is genuinely needed

If the skill has no headless mode, single-shot evals will halt at the first clarifying question and you have two options: (1) add a headless mode to the skill, (2) defer that skill's evals to the multi-turn simulator.

## Pre-staging files (Update / Validate intents)

For Update and Validate evals, the workspace needs to contain an existing brief, decision log, addendum, etc. Use the `files` field — each path is staged into the workspace at the same relative location. The eval prompt then references the staged path explicitly:

```json
{
  "id": "B5",
  "prompt": "Run headless. Update the brief at evals/skill-x/files/some-brief/brief.md — ...",
  "files": [
    "evals/skill-x/files/some-brief/brief.md",
    "evals/skill-x/files/some-brief/decision-log.md",
    "evals/skill-x/files/some-brief/addendum.md"
  ]
}
```

For Validate (read-only) expectations, pair the staged files with byte-identical assertions and a no-Write/no-Edit transcript check.
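
Staging and the byte-identical check can be sketched together. This is an illustration, not the runner's implementation: `stage_fixtures`, `sha256_of`, and `unchanged` are hypothetical names, paths are resolved against the project root only, and rejecting absolute paths or `..` segments is an added safety assumption that the format description does not state.

```python
import hashlib
import shutil
from pathlib import Path


def stage_fixtures(files: list[str], project_root: Path, workspace: Path) -> list[Path]:
    """Copy each fixture into the workspace at the same relative location."""
    staged = []
    for rel in files:
        rel_path = Path(rel)
        # Assumption: only project-relative paths are accepted, so nothing
        # can be staged outside the workspace.
        if rel_path.is_absolute() or ".." in rel_path.parts:
            raise ValueError(f"fixture path must be project-relative: {rel}")
        source = project_root / rel_path
        if not source.is_file():
            raise FileNotFoundError(f"fixture not found: {source}")
        target = workspace / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves the modification time
        staged.append(target)
    return staged


def sha256_of(path: Path) -> str:
    """Hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def unchanged(fixture: Path, staged_copy: Path) -> bool:
    """True when the staged copy is still byte-identical to the original fixture."""
    return sha256_of(fixture) == sha256_of(staged_copy)
```

Comparing digests after the run gives the "byte-identical" half of a Validate expectation; the transcript scan for Write/Edit calls covers the other half.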
110
.claude/skills/bmad-eval-runner/references/isolation.md
Normal file
@@ -0,0 +1,110 @@
# Isolation Strategies

The eval runner offers two strategies. The intent is identical in both: every eval starts from a clean slate so the result reflects the skill itself, not the host's accumulated state.

## What we are isolating from

- The user's global `~/.claude/CLAUDE.md` (private global instructions)
- Any ancestor `CLAUDE.md` in the project tree above the skill
- Auto-memory at `~/.claude/projects/.../memory/MEMORY.md`
- Cached settings, MCP configurations, IDE integrations
- Prior conversation context bleeding via the shell

## Authentication

The isolated `claude -p` subprocess needs to authenticate, but cannot read the host's `~/.claude/` (HOME is overridden) or the macOS Keychain (Keychain ACLs are scoped to the process that wrote the entry). The runner solves this in the parent process:

1. On macOS, read the OAuth credential JSON from the Keychain entry `Claude Code-credentials` via `security find-generic-password -s "Claude Code-credentials" -w`. This succeeds because the parent runs as the same user that wrote the entry.
2. Stage that JSON as `<workspace>/.home/.claude/.credentials.json` (local mode) or copy it into `/home/evaluator/.claude/.credentials.json` inside the container (Docker mode).
3. The subprocess reads `.credentials.json` exactly the way Claude Code normally does, with no other host config bleed.

If the parent has `ANTHROPIC_API_KEY` set, that env var is also forwarded — and it takes precedence over the Keychain credential. On non-macOS hosts, the Keychain step is skipped and `ANTHROPIC_API_KEY` is the only auth path.
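
The three steps above can be sketched for local mode as follows. The function names are hypothetical, the `security` command line is the one quoted in step 1, and restricting the staged file to owner-only permissions is an added precaution, not something the text above requires.

```python
import platform
import subprocess
from pathlib import Path
from typing import Mapping, Optional

KEYCHAIN_SERVICE = "Claude Code-credentials"


def read_keychain_credential() -> Optional[str]:
    """Return the OAuth credential JSON from the macOS Keychain, or None."""
    if platform.system() != "Darwin":
        return None  # non-macOS hosts skip the Keychain step
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", KEYCHAIN_SERVICE, "-w"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def stage_credentials(home: Path, credential_json: Optional[str]) -> Optional[Path]:
    """Write the credential to <home>/.claude/.credentials.json if one is available."""
    if not credential_json:
        return None
    claude_dir = home / ".claude"
    claude_dir.mkdir(parents=True, exist_ok=True)
    target = claude_dir / ".credentials.json"
    target.write_text(credential_json, encoding="utf-8")
    target.chmod(0o600)  # assumption: keep the staged secret owner-readable only
    return target


def auth_env(host_env: Mapping[str, str]) -> dict:
    """Forward ANTHROPIC_API_KEY when the parent has it set; nothing else."""
    key = host_env.get("ANTHROPIC_API_KEY")
    return {"ANTHROPIC_API_KEY": key} if key else {}
```

A caller would pass `<workspace>/.home` as `home`, so the credential lands where the overridden `HOME` makes the subprocess look for it.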

## Docker (preferred)

A single image, `bmad-eval-runner:latest`, is built once per machine. It contains Node 20, Claude Code (via `npm install -g @anthropic-ai/claude-code`), Python 3, and standard tools. The image is intentionally minimal — every eval starts from this baseline.

### Image build

`scripts/docker_setup.py --build` builds the image from `assets/Dockerfile`. This runs once. Re-runs are a no-op unless `--rebuild` is passed.

### Per-eval container

Each eval gets a fresh container:

```
docker run --rm \
  -v "<project-root>:/project:ro" \
  -v "<output-dir>/<eval-id>:/output" \
  -v "<fixtures-dir>:/fixtures:ro" \
  -e ANTHROPIC_API_KEY \
  -e EVAL_PROMPT \
  -e EVAL_ID \
  -e SKILL_PATH \
  bmad-eval-runner:latest \
  /bin/bash -c "/scripts/run_one_eval.sh"
```

Inside the container:

1. The project is copied from `/project` (read-only) to `/workspace` (writable, container-local). Copy is fast because the underlying layer is shared.
2. Fixtures are copied into `/workspace/fixtures/`.
3. `HOME` is `/home/evaluator`, an empty directory created by the image — no global `CLAUDE.md`, no memory.
4. `claude -p "$EVAL_PROMPT" --output-format stream-json --verbose` runs at `/workspace`.
5. The stream-json transcript is captured to `/output/transcript.jsonl`. Any files the skill writes under `/workspace` are rsynced to `/output/artifacts/` after the run completes.
6. The container exits and is removed automatically.

The host then has `<output-dir>/<eval-id>/transcript.jsonl`, `<output-dir>/<eval-id>/artifacts/`, and timing data. Nothing on the host is touched.

### Why Docker is preferred

- The image is reproducible — every run starts from byte-identical state.
- `HOME` is genuinely empty, not just overridden.
- Filesystem isolation is real, not just convention.
- Network can be locked down (`--network=none` for trigger evals; full network for artifact evals that may need it).

## Local fallback

When Docker is unavailable, the runner falls back to per-eval temp directories under `~/bmad-evals/<run-id>/<eval-id>/`. Layout:

```
~/bmad-evals/<run-id>/<eval-id>/
  workspace/          # the eval's working directory
    .home/            # HOME override — empty .claude/ inside
    project/          # rsync'd copy of <project-root>
    fixtures/         # staged fixture files
  transcript.jsonl    # claude -p stream output
  artifacts/          # files Claude wrote under workspace/
  metrics.json
```

Per-eval invocation roughly:

```
HOME="$WORKSPACE/.home" \
CLAUDE_CONFIG_DIR="$WORKSPACE/.home/.claude" \
ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
claude -p "$EVAL_PROMPT" \
  --output-format stream-json --verbose \
  > transcript.jsonl
```
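
The same invocation can be sketched as a Python helper. `build_local_invocation` and `run_local_eval` are hypothetical names. Unlike the shell form above, which overrides three variables on top of the inherited environment, this sketch assumes a stricter variant that builds a minimal environment from scratch and keeps only `PATH` (so the `claude` binary can be found) and `ANTHROPIC_API_KEY`.

```python
import subprocess
from pathlib import Path
from typing import Mapping


def build_local_invocation(
    workspace: Path, prompt: str, host_env: Mapping[str, str]
) -> tuple[list[str], dict[str, str]]:
    """Return the claude -p command and the environment for one local eval."""
    home = workspace / ".home"
    env = {
        "HOME": str(home),
        "CLAUDE_CONFIG_DIR": str(home / ".claude"),
        "PATH": host_env.get("PATH", ""),
    }
    api_key = host_env.get("ANTHROPIC_API_KEY")
    if api_key:
        env["ANTHROPIC_API_KEY"] = api_key
    command = ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"]
    return command, env


def run_local_eval(eval_dir: Path, prompt: str, host_env: Mapping[str, str]) -> int:
    """Run one eval in <eval_dir>/workspace and capture the transcript next to it."""
    workspace = eval_dir / "workspace"
    (workspace / ".home" / ".claude").mkdir(parents=True, exist_ok=True)
    command, env = build_local_invocation(workspace, prompt, host_env)
    with (eval_dir / "transcript.jsonl").open("w", encoding="utf-8") as transcript:
        completed = subprocess.run(
            command, cwd=workspace, env=env, stdout=transcript, stderr=subprocess.PIPE, text=True
        )
    return completed.returncode
```

Splitting command construction from execution keeps the environment rules checkable without having `claude` installed.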
|
||||
|
||||
### Limitations of local mode
|
||||
|
||||
- `HOME` override prevents global `CLAUDE.md` and memory loading, but ancestor discovery still happens from the workspace's cwd. If the workspace is created inside a directory tree that contains a `.claude/skills/` further up, the subprocess may discover those skills regardless of `HOME`. This matters most for trigger evals, where stray host skills can fire instead of the synthetic skill we're testing — **prefer Docker for trigger evals**, where filesystem isolation is real.
|
||||
- Filesystem isolation is by convention only — the skill could write outside its workspace if it tries. We don't sandbox syscalls.
|
||||
- Network is unrestricted.
|
||||
|
||||
Tell the user clearly when local mode is in use and that it is best-effort.
|
||||
|
||||
## Why a real skill, not a slash command, for trigger evals
|
||||
|
||||
The trigger runner stages a synthetic skill at `<workspace>/.claude/skills/<unique-name>/SKILL.md` — not at `.claude/commands/<name>.md`. Slash commands are user-invoked (`/<name>`); they do not surface as `Skill` tool calls and so a description placed there can never be observed firing the way a real skill would. Anthropic's reference `run_eval.py` uses the commands path and is known to report 0% trigger rates as a result. Placing the synthetic at `.claude/skills/` matches how real skills load and lets the detector observe genuine `Skill` (or `Read` of the synthetic SKILL.md) tool calls.
|
||||
|
||||
## Why not `--add-dir` only?
|
||||
|
||||
`claude -p --add-dir <skill>` would let Claude see the skill but would still inherit the user's `CLAUDE.md` and memory from the cwd's ancestors. The whole point of this runner is to test the skill, not the host's accumulated state. So we always either Docker-isolate or temp-dir-isolate.

## Artifact retention

Run folders are never deleted by this skill. Disk management is the user's responsibility. The runner emits the run folder path on completion; users who want to clean up old runs can delete `~/bmad-evals/<run-id>/` directly.
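
Since cleanup is left to the user, a read-only listing can help decide what to remove. The sketch below is an assumption-laden illustration: `list_runs` is a hypothetical helper, not part of the skill, and it deletes nothing.

```python
from __future__ import annotations

from pathlib import Path


def list_runs(root: Path) -> list[tuple[str, int]]:
    """Return (run folder name, total bytes) pairs, oldest first. Deletes nothing."""
    runs: list[tuple[str, int]] = []
    if not root.is_dir():
        return runs
    run_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )
    for run_dir in run_dirs:
        size = sum(f.stat().st_size for f in run_dir.rglob("*") if f.is_file())
        runs.append((run_dir.name, size))
    return runs
```

For example, `list_runs(Path.home() / "bmad-evals")` would report each run folder with its size, leaving the decision to delete with the user.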
115
.claude/skills/bmad-eval-runner/scripts/docker_setup.py
Normal file
@@ -0,0 +1,115 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Detect Docker and build the bmad-eval-runner image when needed.

Usage:
    python3 docker_setup.py --check     # exit 0 if image is ready, 1 otherwise
    python3 docker_setup.py --build     # build the image (no-op if present)
    python3 docker_setup.py --rebuild   # force rebuild
"""

from __future__ import annotations

import argparse
import json
import shutil
import subprocess
import sys
from pathlib import Path


IMAGE_TAG = "bmad-eval-runner:latest"
SCRIPT_DIR = Path(__file__).resolve().parent
DOCKERFILE = SCRIPT_DIR.parent / "assets" / "Dockerfile"


def docker_available() -> tuple[bool, str]:
    if shutil.which("docker") is None:
        return False, "docker CLI not found on PATH"
    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            text=True,
            timeout=5,
        )
        if result.returncode != 0:
            return False, f"`docker info` failed: {result.stderr.strip().splitlines()[-1] if result.stderr.strip() else 'unknown'}"
        return True, "ok"
    except subprocess.TimeoutExpired:
        return False, "`docker info` timed out"
    except Exception as e:
        return False, f"docker check error: {e}"


def image_present(tag: str = IMAGE_TAG) -> bool:
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", tag],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
        return result.returncode == 0
    except Exception:
        return False


def build_image(tag: str = IMAGE_TAG, force: bool = False, verbose: bool = True) -> int:
    if not DOCKERFILE.is_file():
        print(f"Dockerfile missing at {DOCKERFILE}", file=sys.stderr)
        return 2

    cmd = ["docker", "build", "-t", tag, "-f", str(DOCKERFILE), str(DOCKERFILE.parent)]
    if force:
        cmd.insert(2, "--no-cache")

    if verbose:
        print(f"Building {tag} from {DOCKERFILE} ...", file=sys.stderr)

    proc = subprocess.run(cmd, stdout=sys.stderr if verbose else subprocess.DEVNULL, stderr=sys.stderr)
    return proc.returncode


def main() -> int:
    parser = argparse.ArgumentParser(description="Manage the bmad-eval-runner Docker image")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--check", action="store_true", help="Report status as JSON; exit 0 if image is ready")
    group.add_argument("--build", action="store_true", help="Build the image (no-op if already present)")
    group.add_argument("--rebuild", action="store_true", help="Force rebuild")
    parser.add_argument("--quiet", action="store_true")
    args = parser.parse_args()

    available, reason = docker_available()
    present = image_present() if available else False

    if args.check:
        print(json.dumps({
            "docker_available": available,
            "docker_reason": reason,
            "image_present": present,
            "image_tag": IMAGE_TAG,
        }, indent=2))
        return 0 if (available and present) else 1

    if not available:
        print(f"Docker is not available: {reason}", file=sys.stderr)
        return 3

    if args.rebuild:
        return build_image(force=True, verbose=not args.quiet)

    if args.build:
        if present:
            if not args.quiet:
                print(f"{IMAGE_TAG} already present; skipping build (use --rebuild to force).", file=sys.stderr)
            return 0
        return build_image(force=False, verbose=not args.quiet)

    return 0


if __name__ == "__main__":
    sys.exit(main())
184
.claude/skills/bmad-eval-runner/scripts/generate_report.py
Normal file
@@ -0,0 +1,184 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Generate an aggregate HTML report for a run folder.

Reads run.json, execution-summary.json, each <eval-id>/grading.json (if present),
and triggers-result.json (if present), then renders a single-file HTML report.

Usage:
    python3 generate_report.py --run-dir PATH [-o report.html]
"""

from __future__ import annotations

import argparse
import html as html_lib
import json
import sys
from pathlib import Path


def esc(s: object) -> str:
    return html_lib.escape(str(s), quote=True)


def load(path: Path) -> dict | list | None:
    if not path.is_file():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None


def render(run_dir: Path) -> str:
    run_meta = load(run_dir / "run.json") or {}
    exec_summary = load(run_dir / "execution-summary.json") or {}
    triggers = load(run_dir / "triggers-result.json")

    eval_blocks: list[str] = []
    grading_total = 0
    grading_passed = 0

    for res in exec_summary.get("results", []):
        eval_id = str(res.get("eval_id", "?"))
        eval_dir = run_dir / eval_id
        grading = load(eval_dir / "grading.json")
        metrics = res.get("metrics") or load(eval_dir / "metrics.json") or {}
        rc = res.get("return_code")

        rows: list[str] = []
        if grading:
            for exp in grading.get("expectations", []):
                passed = bool(exp.get("passed"))
                grading_total += 1
                if passed:
                    grading_passed += 1
                rows.append(
                    f'<tr class="{ "pass" if passed else "fail" }">'
                    f'<td>{ "✔" if passed else "✘" }</td>'
                    f'<td>{esc(exp.get("text", ""))}</td>'
                    f'<td>{esc(exp.get("evidence", ""))}</td></tr>'
                )

        feedback = (grading or {}).get("eval_feedback") or {}
        feedback_html = ""
        if feedback:
            sugg = feedback.get("suggestions") or []
            sugg_html = "".join(
                f"<li><strong>{esc(s.get('assertion','(general)'))}</strong>: {esc(s.get('reason',''))}</li>"
                for s in sugg
            )
            overall = esc(feedback.get("overall", ""))
            feedback_html = (
                f'<details class="feedback"><summary>Grader feedback on the evals</summary>'
                f'<p>{overall}</p>'
                f'{"<ul>" + sugg_html + "</ul>" if sugg_html else ""}'
                f'</details>'
            )

        artifacts_listing = ""
        artifacts_dir = eval_dir / "artifacts"
        if artifacts_dir.is_dir():
            files = sorted(p for p in artifacts_dir.rglob("*") if p.is_file())
            if files:
                artifacts_listing = "<ul>" + "".join(
                    f'<li><code>{esc(p.relative_to(eval_dir))}</code> '
                    f'<span class="muted">({p.stat().st_size}b)</span></li>'
                    for p in files
                ) + "</ul>"

        tool_calls = metrics.get("tool_calls", {})
        tool_summary = ", ".join(f"{k}={v}" for k, v in sorted(tool_calls.items())) or "—"

        eval_blocks.append(f"""
<section class="eval">
  <h3>Eval {esc(eval_id)} <span class="muted">rc={esc(rc)} · {esc(metrics.get('elapsed_s', '?'))}s</span></h3>
  <p class="muted">Tool calls: {esc(tool_summary)} · output {esc(metrics.get('output_chars', 0))}b · transcript {esc(metrics.get('transcript_chars', 0))}b</p>
  { '<table><thead><tr><th></th><th>Expectation</th><th>Evidence</th></tr></thead><tbody>' + ''.join(rows) + '</tbody></table>' if rows else '<p class="muted">No grading.json yet.</p>' }
  {feedback_html}
  <details><summary>Artifacts</summary>{artifacts_listing or '<p class="muted">No artifacts captured.</p>'}</details>
</section>
""")

    triggers_html = ""
    if triggers:
        rows = []
        for r in triggers.get("results", []):
            rows.append(
                f'<tr class="{ "pass" if r["pass"] else "fail" }">'
                f'<td>{ "✔" if r["pass"] else "✘" }</td>'
                f'<td>{esc(r["query"])}</td>'
                f'<td>{esc(r["should_trigger"])}</td>'
                f'<td>{r["triggers"]}/{r["runs"]} ({r["trigger_rate"]:.2f})</td></tr>'
            )
        s = triggers.get("summary", {})
        triggers_html = f"""
<section class="triggers">
  <h2>Trigger Evals — {s.get('passed',0)}/{s.get('total',0)} pass</h2>
  <table><thead><tr><th></th><th>Query</th><th>Should fire</th><th>Rate</th></tr></thead>
  <tbody>{''.join(rows)}</tbody></table>
</section>
"""

    artifact_summary = ""
    if exec_summary:
        artifact_summary = (
            f"<p>Executed {exec_summary.get('executed', 0)} / {exec_summary.get('total', 0)} "
            f"evals · {exec_summary.get('exec_failures', 0)} execution failures · "
            f"grader: {grading_passed}/{grading_total} expectations passed</p>"
        )

    return f"""<!doctype html>
<html><head><meta charset="utf-8"><title>Eval Run — {esc(run_meta.get('skill_name','?'))}</title>
<style>
  body {{ font: 14px/1.5 system-ui, sans-serif; max-width: 1080px; margin: 2em auto; color: #222; padding: 0 1em; }}
  h1, h2, h3 {{ font-weight: 600; }}
  h1 {{ font-size: 1.6em; margin-bottom: 0.2em; }}
  .meta {{ color: #666; margin-bottom: 1.5em; }}
  .muted {{ color: #888; font-weight: normal; }}
  section.eval {{ border: 1px solid #ddd; border-radius: 6px; padding: 1em 1.2em; margin: 1em 0; background: #fafafa; }}
  table {{ width: 100%; border-collapse: collapse; margin: 0.5em 0; font-size: 13px; }}
  th, td {{ text-align: left; padding: 6px 8px; border-bottom: 1px solid #eee; vertical-align: top; }}
  tr.pass td:first-child {{ color: #2c8a3a; font-weight: 700; }}
  tr.fail td:first-child {{ color: #b3261e; font-weight: 700; }}
  tr.fail {{ background: #fdf3f2; }}
  details.feedback {{ margin-top: 0.6em; padding: 0.4em 0.7em; background: #fff8e1; border-radius: 4px; }}
  details summary {{ cursor: pointer; font-weight: 600; }}
  code {{ background: #eee; padding: 1px 4px; border-radius: 3px; }}
</style></head>
<body>
<h1>{esc(run_meta.get('skill_name','?'))} — eval run</h1>
<div class="meta">
  Run id: <code>{esc(run_meta.get('run_id','?'))}</code> ·
  isolation: {esc(run_meta.get('isolation','?'))} ·
  started: {esc(run_meta.get('started_at','?'))}
</div>
{artifact_summary}
{''.join(eval_blocks)}
{triggers_html}
</body></html>
"""


def main() -> int:
    parser = argparse.ArgumentParser(description="Generate HTML report for an eval run folder")
    parser.add_argument("--run-dir", required=True, type=Path)
    parser.add_argument("-o", "--output", type=Path, default=None)
    args = parser.parse_args()

    run_dir = args.run_dir.resolve()
    if not run_dir.is_dir():
        print(f"run-dir not found: {run_dir}", file=sys.stderr)
        return 2

    out = args.output or (run_dir / "report.html")
    out.write_text(render(run_dir), encoding="utf-8")
    print(str(out))
    return 0


if __name__ == "__main__":
    sys.exit(main())
171
.claude/skills/bmad-eval-runner/scripts/pty_runner.py
Normal file
@@ -0,0 +1,171 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run claude interactively via PTY so the Skill tool is available.

In `claude -p` (print mode) the Skill tool is never offered; Claude handles
everything inline. Running `claude` in interactive mode activates the Skill
tool so dependency skills installed in .claude/skills/ can be properly invoked.

The PTY tricks claude into thinking it has a terminal (interactive mode) while
we capture its stream-json output programmatically.

Usage:
    python3 pty_runner.py --prompt-file /path/to/prompt.txt \\
                          --output /path/to/transcript.jsonl \\
                          [--timeout 600]
    python3 pty_runner.py --prompt "Run headless. ..." --output transcript.jsonl
"""

from __future__ import annotations

import argparse
import json
import os
import pty
import re
import select
import subprocess
import sys
import time
from pathlib import Path

ANSI_RE = re.compile(r"\x1b(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])|\r")

# How long to wait for claude to initialize before sending the prompt.
# Claude loads skill registry, checks credentials, etc. on startup.
INIT_WAIT_S = 5.0

# How long to wait after the stream-json 'result' event before killing claude.
# Trailing tool-result output sometimes follows the result event.
POST_RESULT_S = 4.0


def _strip_ansi(text: str) -> str:
    return ANSI_RE.sub("", text)


def run_interactive(prompt: str, output: Path, timeout: int = 600) -> None:
    """Spawn claude interactively via PTY, send one prompt, capture transcript."""
    master, slave = pty.openpty()

    proc = subprocess.Popen(
        [
            "claude",
            "--output-format", "stream-json",
            "--verbose",
            "--dangerously-skip-permissions",
        ],
        stdin=slave,
        stdout=slave,
        stderr=slave,
        close_fds=True,
    )
    os.close(slave)

    json_lines: list[str] = []
    buf = b""
    prompt_sent = False
    done_at: float | None = None
    start = time.time()

    try:
        while True:
            elapsed = time.time() - start
            if elapsed > timeout:
                print(f"[pty_runner] timeout after {elapsed:.0f}s", file=sys.stderr)
                break
            if done_at is not None and (time.time() - done_at) > POST_RESULT_S:
                break

            # Short select so we stay responsive but don't spin.
            r, _, _ = select.select([master], [], [], 0.3)

            if r:
                try:
                    chunk = os.read(master, 8192)
                except OSError:
                    break  # PTY closed: claude exited
                if not chunk:
                    # Some platforms signal a closed PTY with an empty read (EOF)
                    # instead of raising; stop here rather than spin until timeout.
                    break
                buf += chunk

                # Process all complete lines in buffer.
                while b"\n" in buf:
                    raw, buf = buf.split(b"\n", 1)
                    line = _strip_ansi(raw.decode("utf-8", errors="replace")).strip()
                    if not line.startswith("{"):
                        continue
                    json_lines.append(line)
                    try:
                        obj = json.loads(line)
                        # 'result' marks end of a claude turn.
                        if obj.get("type") == "result" and done_at is None:
                            done_at = time.time()
                            print(
                                f"[pty_runner] result event at t={time.time()-start:.1f}s "
                                f"({len(json_lines)} lines so far)",
                                file=sys.stderr,
                            )
                    except json.JSONDecodeError:
                        pass
            else:
                # Silence window: send prompt once claude has had time to init.
                if not prompt_sent and (time.time() - start) >= INIT_WAIT_S:
                    os.write(master, (prompt + "\n").encode())
                    prompt_sent = True
                    print(
                        f"[pty_runner] prompt sent at t={time.time()-start:.1f}s",
                        file=sys.stderr,
                    )

    finally:
        # Politely ask claude to exit, then hard-kill if needed.
        try:
            os.write(master, b"exit\n")
            time.sleep(0.3)
        except OSError:
            pass
        try:
            proc.terminate()
            proc.wait(timeout=5)
        except Exception:
            try:
                proc.kill()
            except Exception:
                pass
        try:
            os.close(master)
        except OSError:
            pass

    output.parent.mkdir(parents=True, exist_ok=True)
    content = "\n".join(json_lines) + ("\n" if json_lines else "")
    output.write_text(content, encoding="utf-8")
    print(
        f"[pty_runner] wrote {len(json_lines)} transcript lines → {output}",
        file=sys.stderr,
    )


def main() -> int:
    p = argparse.ArgumentParser(
        description="Run claude interactively via PTY and capture stream-json transcript"
    )
    grp = p.add_mutually_exclusive_group(required=True)
    grp.add_argument("--prompt", help="Prompt text")
    grp.add_argument("--prompt-file", type=Path, help="File containing the prompt")
    p.add_argument("--output", type=Path, required=True, help="Output .jsonl transcript file")
    p.add_argument("--timeout", type=int, default=600, help="Hard timeout in seconds")
    args = p.parse_args()

    prompt = (
        args.prompt_file.read_text(encoding="utf-8").strip()
        if args.prompt_file
        else args.prompt
    )
    run_interactive(prompt, args.output, args.timeout)
    return 0


if __name__ == "__main__":
    sys.exit(main())
492
.claude/skills/bmad-eval-runner/scripts/run_evals.py
Normal file
@@ -0,0 +1,492 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run a skill's artifact evals in isolated workspaces.

For each eval, the runner:
  1. Stages a fresh workspace (Docker container or local tmp dir under ~/bmad-evals).
  2. Applies the setup overlay (base then per-eval) so _bmad/ config and dependency
     skills land in the workspace BEFORE the skill is staged; the skill's own copy
     always wins over overlay content.
  3. Copies the skill into .claude/skills/ so it is discoverable by claude.
  4. Stages any fixture files declared in the eval's `files` list.
  5. Runs `claude -p '<prompt>' --output-format stream-json --verbose`, capturing
     the transcript. The Skill tool is available in -p mode and fires for installed
     skills, so dependency skills provided by the setup overlay are properly invokable.
  6. Rsyncs any files claude wrote into `<run-dir>/<eval-id>/artifacts/`.
  7. Writes `metrics.json` (tool-call counts, timing, output sizes).

Grading is performed separately by the parent skill's grader subagents.

Usage:
    python3 run_evals.py \\
        --skill-path PATH \\
        --evals-file PATH/evals.json \\
        --project-root PATH \\
        --output-dir PATH \\
        --isolation docker|local \\
        [--workers N] [--timeout SECS] [--eval-ids A1,B3] [--quiet]
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(SCRIPT_DIR))

from utils import (  # noqa: E402
    apply_setup_overlay,
    discover_setup_dirs,
    new_run_id,
    parse_skill_md,
    read_json,
    read_macos_keychain_credentials,
    stage_credentials,
    utc_now_iso,
    write_json,
)

DOCKER_IMAGE = "bmad-eval-runner:latest"
_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials()
RSYNC_EXCLUDES = (
    ".git", ".bare", "node_modules", ".venv", "__pycache__",
    ".pytest_cache", ".next", "dist", "build", ".cache",
    ".DS_Store", "*.pyc",
)


def stage_workspace_local(
    workspace: Path,
    project_root: Path,
    skill_path: Path,
    fixtures: list[tuple[Path, str]],
    setup_dirs: list[Path] | None = None,
) -> Path:
    """Build a clean local workspace. Returns the project root inside workspace."""
    workspace.mkdir(parents=True, exist_ok=True)
    project_dest = workspace / "project"
    home_dir = workspace / ".home"
    (home_dir / ".claude").mkdir(parents=True, exist_ok=True)

    excludes: list[str] = []
    for pat in RSYNC_EXCLUDES:
        excludes.extend(["--exclude", pat])

    if shutil.which("rsync"):
        subprocess.run(
            ["rsync", "-a", *excludes, f"{project_root}/", f"{project_dest}/"],
            check=True,
        )
    else:
        shutil.copytree(project_root, project_dest, dirs_exist_ok=True,
                        ignore=shutil.ignore_patterns(*RSYNC_EXCLUDES))

    # Apply setup overlay before staging the skill; the skill's own copy wins.
    if setup_dirs:
        apply_setup_overlay(setup_dirs, project_dest)

    skill_link_dir = project_dest / ".claude" / "skills"
    skill_link_dir.mkdir(parents=True, exist_ok=True)
    skill_dest = skill_link_dir / skill_path.name
    if not skill_dest.exists():
        try:
            os.symlink(skill_path, skill_dest)
        except OSError:
            shutil.copytree(skill_path, skill_dest, dirs_exist_ok=True)

    for src, dest_rel in fixtures:
        dest = project_dest / dest_rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

    return project_dest


def run_eval_local(
    eval_item: dict,
    run_dir: Path,
    skill_path: Path,
    project_root: Path,
    timeout: int,
    setup_dirs: list[Path] | None = None,
) -> dict:
    eval_id = str(eval_item.get("id", "unnamed"))
    eval_dir = run_dir / eval_id
    workspace_root = eval_dir / "workspace"
    artifacts_dir = eval_dir / "artifacts"
    transcript_path = eval_dir / "transcript.jsonl"

    eval_dir.mkdir(parents=True, exist_ok=True)
    artifacts_dir.mkdir(parents=True, exist_ok=True)

    fixtures = resolve_fixtures(eval_item.get("files", []), project_root)
    workspace_project = stage_workspace_local(
        workspace_root, project_root, skill_path, fixtures, setup_dirs
    )

    (eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8")
    workspace_snapshot_before = snapshot_files(workspace_project)

    home_dir = workspace_root / ".home"
    stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)
    env = {
        "HOME": str(home_dir),
        "CLAUDE_CONFIG_DIR": str(home_dir / ".claude"),
        "PATH": os.environ.get("PATH", ""),
        "ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""),
    }

    cmd = [
        "claude",
        "-p", eval_item["prompt"],
        "--output-format", "stream-json",
        "--verbose",
        "--dangerously-skip-permissions",
    ]

    start = time.time()
    try:
        with transcript_path.open("wb") as out:
            proc = subprocess.run(
                cmd,
                stdout=out,
                stderr=subprocess.PIPE,
                cwd=str(workspace_project),
                env=env,
                timeout=timeout,
            )
        elapsed = time.time() - start
        return_code = proc.returncode
        stderr_tail = (proc.stderr or b"").decode("utf-8", errors="replace")[-2000:]
    except subprocess.TimeoutExpired as e:
        elapsed = time.time() - start
        return_code = -1
        stderr_tail = f"TIMEOUT after {timeout}s"
        if e.stderr:
            stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]

    new_files = diff_workspace(workspace_project, workspace_snapshot_before)
    sync_artifacts(workspace_project, new_files, artifacts_dir)

    metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail)
    write_json(eval_dir / "metrics.json", metrics)

    return {
        "eval_id": eval_id,
        "elapsed_s": elapsed,
        "return_code": return_code,
        "transcript": str(transcript_path.relative_to(run_dir)),
        "artifacts_dir": str(artifacts_dir.relative_to(run_dir)),
        "metrics": metrics,
    }


def run_eval_docker(
    eval_item: dict,
    run_dir: Path,
    skill_path: Path,
    project_root: Path,
    timeout: int,
    setup_dirs: list[Path] | None = None,
) -> dict:
    eval_id = str(eval_item.get("id", "unnamed"))
    eval_dir = run_dir / eval_id
    artifacts_dir = eval_dir / "artifacts"
    transcript_path = eval_dir / "transcript.jsonl"

    eval_dir.mkdir(parents=True, exist_ok=True)
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    fixtures_staging = eval_dir / "fixtures_in"
    fixtures_staging.mkdir(parents=True, exist_ok=True)

    fixtures = resolve_fixtures(eval_item.get("files", []), project_root)
    for src, dest_rel in fixtures:
        dest = fixtures_staging / dest_rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

    (eval_dir / "prompt.txt").write_text(eval_item["prompt"], encoding="utf-8")

    # Pre-merge setup overlay dirs on the host; mount as /setup:ro in the container.
    setup_merged: Path | None = None
    if setup_dirs:
        setup_merged = eval_dir / "setup_merged"
        apply_setup_overlay(setup_dirs, setup_merged)
        if not any(setup_merged.iterdir()):
            setup_merged = None

    creds_dir: Path | None = None
    if _KEYCHAIN_CREDS:
        creds_dir = eval_dir / "creds"
        creds_dir.mkdir(parents=True, exist_ok=True)
        (creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8")

    container_script = r"""
set -e
mkdir -p /workspace
rsync -a \
  --exclude=.git --exclude=.bare --exclude=node_modules --exclude=.venv \
  --exclude=__pycache__ --exclude=.pytest_cache --exclude=.next \
  --exclude=dist --exclude=build --exclude=.cache --exclude=.DS_Store \
  /project/ /workspace/
if [ -d /setup ]; then
  rsync -a /setup/ /workspace/
fi
mkdir -p /workspace/.claude/skills
cp -R "$SKILL_SRC" "/workspace/.claude/skills/$SKILL_NAME"
if [ -d /fixtures ]; then
  cp -R /fixtures/. /workspace/
fi
if [ -f /creds/.credentials.json ]; then
  mkdir -p /home/evaluator/.claude
  cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json
fi
cd /workspace
claude -p "$EVAL_PROMPT" \
  --output-format stream-json --verbose \
  --dangerously-skip-permissions \
  > /output/transcript.jsonl 2> /output/stderr.log || true
mkdir -p /output/artifacts
rsync -a --exclude=.claude --exclude=node_modules --exclude=.git \
  --filter='+ */' --filter='+ *' \
  /workspace/ /output/artifacts/
"""

    skill_name = skill_path.name
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{project_root}:/project:ro",
        "-v", f"{skill_path}:/skill_src:ro",
        "-v", f"{eval_dir}:/output",
        "-e", "ANTHROPIC_API_KEY",
        "-e", f"EVAL_PROMPT={eval_item['prompt']}",
        "-e", "SKILL_SRC=/skill_src",
        "-e", f"SKILL_NAME={skill_name}",
    ]
    if creds_dir:
        cmd += ["-v", f"{creds_dir}:/creds:ro"]
    if fixtures:
        cmd += ["-v", f"{fixtures_staging}:/fixtures:ro"]
    if setup_merged:
        cmd += ["-v", f"{setup_merged}:/setup:ro"]
    cmd += [DOCKER_IMAGE, "bash", "-c", container_script]

    start = time.time()
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            timeout=timeout + 30,
        )
        elapsed = time.time() - start
        return_code = proc.returncode
        stderr_tail = proc.stderr.decode("utf-8", errors="replace")[-2000:]
        if proc.stdout:
            (eval_dir / "docker.stdout.log").write_bytes(proc.stdout)
    except subprocess.TimeoutExpired as e:
        elapsed = time.time() - start
        return_code = -1
        stderr_tail = f"TIMEOUT after {timeout}s"
        if e.stderr:
            stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]

    metrics = compute_metrics(transcript_path, artifacts_dir, elapsed, return_code, stderr_tail)
    write_json(eval_dir / "metrics.json", metrics)
    shutil.rmtree(fixtures_staging, ignore_errors=True)

    return {
        "eval_id": eval_id,
        "elapsed_s": elapsed,
        "return_code": return_code,
        "transcript": str(transcript_path.relative_to(run_dir)),
        "artifacts_dir": str(artifacts_dir.relative_to(run_dir)),
        "metrics": metrics,
    }


def resolve_fixtures(files: list[str], project_root: Path) -> list[tuple[Path, str]]:
    out: list[tuple[Path, str]] = []
    for entry in files:
        candidate = (project_root / entry).resolve()
        if not candidate.is_file():
            alt = Path(entry).resolve()
            if alt.is_file():
                candidate = alt
            else:
                print(f"Warning: fixture not found: {entry}", file=sys.stderr)
                continue
        out.append((candidate, entry))
    return out


def snapshot_files(root: Path) -> set[str]:
    snap: set[str] = set()
    for p in root.rglob("*"):
        if p.is_file():
            snap.add(str(p.relative_to(root)))
    return snap


def diff_workspace(root: Path, before: set[str]) -> list[str]:
    after = snapshot_files(root)
    return sorted(after - before)


def sync_artifacts(workspace: Path, new_files: list[str], dest: Path) -> None:
    for rel in new_files:
        src = workspace / rel
        if not src.is_file():
            continue
        if any(part in (".claude", "node_modules", ".git", ".venv") for part in src.parts):
            continue
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)


def compute_metrics(transcript: Path, artifacts: Path, elapsed: float,
                    rc: int, stderr_tail: str) -> dict:
    tool_calls: dict[str, int] = {}
    total_steps = 0
    if transcript.is_file():
        for raw in transcript.read_text(encoding="utf-8", errors="replace").splitlines():
            raw = raw.strip()
            if not raw:
                continue
            try:
                evt = json.loads(raw)
            except json.JSONDecodeError:
                continue
            if evt.get("type") == "assistant":
                total_steps += 1
                for item in evt.get("message", {}).get("content", []):
                    if item.get("type") == "tool_use":
                        name = item.get("name", "?")
                        tool_calls[name] = tool_calls.get(name, 0) + 1

    output_chars = 0
    for f in artifacts.rglob("*"):
        if f.is_file():
            try:
                output_chars += f.stat().st_size
            except OSError:
                pass

    return {
        "elapsed_s": round(elapsed, 2),
        "return_code": rc,
        "tool_calls": tool_calls,
        "total_tool_calls": sum(tool_calls.values()),
        "total_steps": total_steps,
        "output_chars": output_chars,
        "transcript_chars": transcript.stat().st_size if transcript.is_file() else 0,
        "stderr_tail": stderr_tail,
    }


def main() -> int:
    parser = argparse.ArgumentParser(description="Run a skill's artifact evals in isolation")
    parser.add_argument("--skill-path", required=True, type=Path)
    parser.add_argument("--evals-file", required=True, type=Path)
    parser.add_argument("--project-root", required=True, type=Path)
    parser.add_argument("--output-dir", required=True, type=Path)
    parser.add_argument("--isolation", choices=("docker", "local"), required=True)
    parser.add_argument("--workers", type=int, default=8)
    parser.add_argument("--timeout", type=int, default=600)
    parser.add_argument("--eval-ids", default=None, help="Comma-separated subset of eval ids to run")
    parser.add_argument("--quiet", action="store_true")
    args = parser.parse_args()

    skill_path = args.skill_path.resolve()
    project_root = args.project_root.resolve()
    evals_file = args.evals_file.resolve()
    if not evals_file.is_file():
        print(f"evals file not found: {evals_file}", file=sys.stderr)
        return 2

    skill_name, _, _ = parse_skill_md(skill_path)
    data = read_json(evals_file)
    evals = data["evals"] if isinstance(data, dict) and "evals" in data else data

    if args.eval_ids:
        wanted = {x.strip() for x in args.eval_ids.split(",") if x.strip()}
        evals = [e for e in evals if str(e.get("id")) in wanted]

    run_id = new_run_id(skill_name)
    run_dir = (args.output_dir / run_id).resolve()
    run_dir.mkdir(parents=True, exist_ok=True)

    write_json(run_dir / "run.json", {
        "run_id": run_id,
        "skill_name": skill_name,
        "skill_path": str(skill_path),
        "project_root": str(project_root),
        "evals_file": str(evals_file),
        "isolation": args.isolation,
        "started_at": utc_now_iso(),
        "eval_count": len(evals),
    })

    runner = run_eval_docker if args.isolation == "docker" else run_eval_local

    results: list[dict] = []
    if not args.quiet:
        print(
            f"[run_evals] {len(evals)} evals, isolation={args.isolation}, run_dir={run_dir}",
            file=sys.stderr,
        )

    with ThreadPoolExecutor(max_workers=args.workers) as pool:
        future_to_eval = {
            pool.submit(
                runner,
                item,
                run_dir,
                skill_path,
                project_root,
                int(item.get("timeout", args.timeout)),
                discover_setup_dirs(evals_file, str(item.get("id", ""))),
            ): item
            for item in evals
        }
        for fut in as_completed(future_to_eval):
            item = future_to_eval[fut]
            try:
                res = fut.result()
            except Exception as e:
                res = {"eval_id": str(item.get("id")), "error": str(e), "return_code": -1}
            results.append(res)
            if not args.quiet:
                rc = res.get("return_code")
                status = "ok" if rc == 0 else f"rc={rc}"
                print(
                    f"  [{status}] eval {res.get('eval_id')} ({res.get('elapsed_s', 0):.1f}s)",
                    file=sys.stderr,
                )

    summary = {
        "run_id": run_id,
        "completed_at": utc_now_iso(),
        "total": len(evals),
        "executed": len(results),
        "exec_failures": sum(1 for r in results if r.get("return_code") != 0),
        "run_dir": str(run_dir),
        "results": results,
    }
    write_json(run_dir / "execution-summary.json", summary)
    print(json.dumps(summary, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())
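The transcript tally done by `compute_metrics` can be tried in isolation. The following is a minimal sketch, assuming the transcript is JSONL in which `assistant` events carry a `message.content` list containing `tool_use` items (the shape the code above reads). The name `tally_tool_calls` and the sample events are hypothetical and exist only for illustration.

```python
from __future__ import annotations

import json
from collections import Counter


def tally_tool_calls(jsonl_text: str) -> tuple[int, dict[str, int]]:
    """Return (assistant_steps, tool_call_counts) for a JSONL transcript."""
    steps = 0
    calls: Counter = Counter()
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            evt = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if not isinstance(evt, dict) or evt.get("type") != "assistant":
            continue
        steps += 1
        for item in evt.get("message", {}).get("content", []):
            if item.get("type") == "tool_use":
                calls[item.get("name", "?")] += 1
    return steps, dict(calls)


# Hypothetical transcript: one system event, one junk line, two assistant turns.
sample = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Reading the file."},
        {"type": "tool_use", "name": "Read", "input": {"file_path": "a.txt"}},
    ]}}),
    "not json",
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "name": "Read", "input": {"file_path": "b.txt"}},
        {"type": "tool_use", "name": "Write", "input": {"file_path": "c.txt"}},
    ]}}),
])

steps, calls = tally_tool_calls(sample)
print(steps, calls)
```

Unlike the function above, the sketch also skips JSON lines that are not objects, which is a defensive choice of the example and not something the source does.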
366
.claude/skills/bmad-eval-runner/scripts/run_triggers.py
Normal file
@@ -0,0 +1,366 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Run trigger evals: does the skill's description fire on each query?

Adapted from Anthropic skill-creator's run_eval.py
(https://github.com/anthropics/skills/tree/main/skills/skill-creator) with two
changes:

1. Isolation. Each query runs in either a fresh Docker container started from
   bmad-eval-runner:latest, or a fresh local tmp dir under ~/bmad-evals/<run-id>/
   with HOME overridden to a clean directory. This prevents the host's global
   CLAUDE.md and auto-memory from biasing whether the skill fires.

2. Output. Results are written to a run folder that follows the artifact eval
   run-folder layout (so triggers and artifacts can share a single report).

Usage:
    python3 run_triggers.py \\
        --skill-path PATH \\
        --triggers-file PATH/triggers.json \\
        --output-dir PATH \\
        --isolation docker|local \\
        [--workers N] [--runs-per-query N] [--timeout SECS] [--threshold 0.5]
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import subprocess
import sys
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(SCRIPT_DIR))

from utils import (  # noqa: E402
    new_run_id,
    parse_skill_md,
    read_json,
    read_macos_keychain_credentials,
    stage_credentials,
    utc_now_iso,
    write_json,
)

DOCKER_IMAGE = "bmad-eval-runner:latest"
_KEYCHAIN_CREDS: str | None = read_macos_keychain_credentials()


def write_synthetic_skill(skills_dir: Path, skill_name: str, description: str, unique_id: str) -> tuple[Path, str]:
    """Place a synthetic skill at <skills_dir>/<clean_name>/SKILL.md.

    The Skill tool only fires for entries discovered as actual skills (frontmatter
    `name` + `description` under a `.claude/skills/<name>/SKILL.md`). Slash-commands
    under `.claude/commands/` do not auto-invoke the Skill tool, so the previous
    implementation could never observe a positive trigger. This places the synthetic
    skill where Claude Code looks for skills, with a unique name so the detector
    can disambiguate it from any pre-existing skill of the same display name.
    """
    clean_name = f"{skill_name}-skill-{unique_id}"
    skill_root = skills_dir / clean_name
    skill_root.mkdir(parents=True, exist_ok=True)
    path = skill_root / "SKILL.md"
    indented_desc = "\n ".join(description.split("\n"))
    path.write_text(
        f"---\n"
        f"name: {clean_name}\n"
        f"description: |\n"
        f" {indented_desc}\n"
        f"---\n\n"
        f"# {skill_name}\n\n"
        f"This skill handles: {description}\n",
        encoding="utf-8",
    )
    return path, clean_name


def parse_stream_for_trigger(buffer: str, clean_name: str) -> tuple[bool | None, str]:
    """Return (triggered_or_none, leftover_buffer). None means undecided yet.

    `pending_tool` and `accumulated_json` are per-call state. A tool call whose
    stream events are split across two calls is therefore decided by the full
    `assistant` message handled in the fallback branch below.
    """
    triggered: bool | None = None
    pending_tool: str | None = None
    accumulated_json = ""
    leftover = ""

    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        line = line.strip()
        if not line:
            continue
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue

        if evt.get("type") == "stream_event":
            se = evt.get("event", {})
            t = se.get("type", "")
            if t == "content_block_start":
                cb = se.get("content_block", {})
                if cb.get("type") == "tool_use":
                    name = cb.get("name", "")
                    if name in ("Skill", "Read"):
                        pending_tool = name
                        accumulated_json = ""
                    else:
                        # Any other tool as the first tool call means the skill did not fire.
                        return False, ""
            elif t == "content_block_delta" and pending_tool:
                delta = se.get("delta", {})
                if delta.get("type") == "input_json_delta":
                    accumulated_json += delta.get("partial_json", "")
                    if clean_name in accumulated_json:
                        return True, ""
            elif t in ("content_block_stop", "message_stop"):
                if pending_tool:
                    return clean_name in accumulated_json, ""
                if t == "message_stop":
                    return False, ""
        elif evt.get("type") == "assistant":
            # Fallback: inspect the first tool_use item of a complete assistant message.
            for item in evt.get("message", {}).get("content", []):
                if item.get("type") != "tool_use":
                    continue
                tname = item.get("name", "")
                tinput = item.get("input", {})
                if tname == "Skill" and clean_name in tinput.get("skill", ""):
                    return True, ""
                if tname == "Read" and clean_name in tinput.get("file_path", ""):
                    return True, ""
                return False, ""
        elif evt.get("type") == "result":
            return triggered if triggered is not None else False, ""
    leftover = buffer
    return triggered, leftover


def run_query_local(query: str, skill_name: str, description: str,
                    workspace_root: Path, timeout: int) -> bool:
    workspace_root.mkdir(parents=True, exist_ok=True)
    home_dir = workspace_root / ".home"
    (home_dir / ".claude").mkdir(parents=True, exist_ok=True)
    stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)
    project_dir = workspace_root / "project"
    skills_dir = project_dir / ".claude" / "skills"
    project_dir.mkdir(parents=True, exist_ok=True)

    unique = uuid.uuid4().hex[:8]
    cmd_file, clean_name = write_synthetic_skill(skills_dir, skill_name, description, unique)

    env = {
        "HOME": str(home_dir),
        "CLAUDE_CONFIG_DIR": str(home_dir / ".claude"),
        "PATH": os.environ.get("PATH", ""),
        "ANTHROPIC_API_KEY": os.environ.get("ANTHROPIC_API_KEY", ""),
    }

    cmd = [
        "claude", "-p", query,
        "--output-format", "stream-json",
        "--verbose",
        "--include-partial-messages",
        "--dangerously-skip-permissions",
    ]

    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            cwd=str(project_dir),
            env=env,
        )
        buffer = ""
        triggered: bool | None = None
        start = time.time()
        try:
            while time.time() - start < timeout:
                if proc.poll() is not None:
                    rest = proc.stdout.read()
                    if rest:
                        buffer += rest.decode("utf-8", errors="replace")
                    break
                chunk = proc.stdout.read1(8192) if hasattr(proc.stdout, "read1") else proc.stdout.read(8192)
                if not chunk:
                    time.sleep(0.05)
                    continue
                buffer += chunk.decode("utf-8", errors="replace")
                decided, buffer = parse_stream_for_trigger(buffer, clean_name)
                if decided is not None:
                    triggered = decided
                    break
        finally:
            if proc.poll() is None:
                proc.kill()
                proc.wait()
        if triggered is None:
            decided, _ = parse_stream_for_trigger(buffer + "\n", clean_name)
            triggered = bool(decided)
        return bool(triggered)
    finally:
        try:
            shutil.rmtree(cmd_file.parent, ignore_errors=True)
        except OSError:
            pass


def run_query_docker(query: str, skill_name: str, description: str,
                     workspace_root: Path, timeout: int) -> bool:
    workspace_root.mkdir(parents=True, exist_ok=True)
    unique = uuid.uuid4().hex[:8]
    skills_in = workspace_root / "skills_in"
    skills_in.mkdir(parents=True, exist_ok=True)
    _, clean_name = write_synthetic_skill(skills_in, skill_name, description, unique)

    creds_dir: Path | None = None
    if _KEYCHAIN_CREDS:
        creds_dir = workspace_root / "creds_in"
        creds_dir.mkdir(parents=True, exist_ok=True)
        (creds_dir / ".credentials.json").write_text(_KEYCHAIN_CREDS, encoding="utf-8")

    container_script = f"""
set -e
mkdir -p /workspace/.claude/skills
cp -R /skills/. /workspace/.claude/skills/ 2>/dev/null || true
if [ -f /creds/.credentials.json ]; then
  mkdir -p /home/evaluator/.claude
  cp /creds/.credentials.json /home/evaluator/.claude/.credentials.json
fi
cd /workspace
claude -p "$EVAL_QUERY" \\
  --output-format stream-json --verbose --include-partial-messages \\
  --dangerously-skip-permissions \\
  > /output/stream.jsonl 2>/dev/null || true
"""

    output_dir = workspace_root / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    cmd = [
        "docker", "run", "--rm",
        "-v", f"{skills_in}:/skills:ro",
        "-v", f"{output_dir}:/output",
        "-e", "ANTHROPIC_API_KEY",
        "-e", f"EVAL_QUERY={query}",
    ]
    if creds_dir:
        cmd += ["-v", f"{creds_dir}:/creds:ro"]
    cmd += [DOCKER_IMAGE, "bash", "-c", container_script]

    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout + 30)
    except subprocess.TimeoutExpired:
        pass

    stream_file = output_dir / "stream.jsonl"
    if not stream_file.is_file():
        return False
    decided, _ = parse_stream_for_trigger(stream_file.read_text(encoding="utf-8", errors="replace") + "\n", clean_name)
    return bool(decided)


def main() -> int:
    parser = argparse.ArgumentParser(description="Run trigger evals in isolation")
    parser.add_argument("--skill-path", required=True, type=Path)
    parser.add_argument("--triggers-file", required=True, type=Path)
    parser.add_argument("--output-dir", required=True, type=Path)
    parser.add_argument("--isolation", choices=("docker", "local"), required=True)
    parser.add_argument("--workers", type=int, default=8)
    parser.add_argument("--runs-per-query", type=int, default=3)
    parser.add_argument("--timeout", type=int, default=45)
    parser.add_argument("--threshold", type=float, default=0.5)
    parser.add_argument("--quiet", action="store_true")
    args = parser.parse_args()

    skill_path = args.skill_path.resolve()
    triggers_file = args.triggers_file.resolve()
    if not triggers_file.is_file():
        print(f"triggers file not found: {triggers_file}", file=sys.stderr)
        return 2

    skill_name, description, _ = parse_skill_md(skill_path)
    queries = read_json(triggers_file)

    run_id = new_run_id(f"{skill_name}-triggers")
    run_dir = (args.output_dir / run_id).resolve()
    (run_dir / "queries").mkdir(parents=True, exist_ok=True)

    write_json(run_dir / "run.json", {
        "run_id": run_id,
        "skill_name": skill_name,
        "description": description,
        "isolation": args.isolation,
        "started_at": utc_now_iso(),
        "query_count": len(queries),
        "runs_per_query": args.runs_per_query,
        "threshold": args.threshold,
    })

    runner = run_query_docker if args.isolation == "docker" else run_query_local

    def run_one(idx: int, q: dict, run_idx: int) -> tuple[int, bool]:
        ws = run_dir / "queries" / f"q{idx:03d}-r{run_idx}"
        triggered = runner(q["query"], skill_name, description, ws, args.timeout)
        return idx, triggered

    per_query: dict[int, list[bool]] = {}
    if not args.quiet:
        print(f"[run_triggers] {len(queries)} queries × {args.runs_per_query} runs, isolation={args.isolation}", file=sys.stderr)

    with ThreadPoolExecutor(max_workers=args.workers) as pool:
        futures = []
        for idx, q in enumerate(queries):
            for run_idx in range(args.runs_per_query):
                futures.append(pool.submit(run_one, idx, q, run_idx))
        for fut in as_completed(futures):
            try:
                idx, triggered = fut.result()
            except Exception as e:
                print(f"Warning: query failed: {e}", file=sys.stderr)
                continue
            per_query.setdefault(idx, []).append(triggered)

    results = []
    for idx, q in enumerate(queries):
        triggers = per_query.get(idx, [])
        rate = (sum(triggers) / len(triggers)) if triggers else 0.0
        should = bool(q["should_trigger"])
        if not triggers:
            # No completed runs means no evidence either way; a should-not-trigger
            # query must not pass just because every run failed.
            passed = False
        elif should:
            passed = rate >= args.threshold
        else:
            passed = rate < args.threshold
        results.append({
            "query": q["query"],
            "should_trigger": should,
            "trigger_rate": rate,
            "triggers": int(sum(triggers)),
            "runs": len(triggers),
            "pass": passed,
        })

    output = {
        "run_id": run_id,
        "completed_at": utc_now_iso(),
        "skill_name": skill_name,
        "description": description,
        "isolation": args.isolation,
        "results": results,
        "summary": {
            "total": len(results),
            "passed": sum(1 for r in results if r["pass"]),
            "failed": sum(1 for r in results if not r["pass"]),
        },
    }
    write_json(run_dir / "triggers-result.json", output)
    print(json.dumps(output, indent=2))
    return 0


if __name__ == "__main__":
    sys.exit(main())
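The per-query grading rule in `main()` is easy to check by hand. A minimal sketch of it follows, assuming the same inputs the script collects (a list of per-run booleans, the `should_trigger` flag, and the threshold). The helper name `grade_query` is hypothetical. As a choice of this sketch, a query with zero completed runs is marked as failing, since an empty sample says nothing about whether the description fires.

```python
from __future__ import annotations


def grade_query(triggers: list[bool], should_trigger: bool, threshold: float = 0.5) -> dict:
    """Grade one query from its per-run trigger observations."""
    rate = sum(triggers) / len(triggers) if triggers else 0.0
    if not triggers:
        passed = False  # no evidence either way
    elif should_trigger:
        passed = rate >= threshold  # must fire often enough
    else:
        passed = rate < threshold  # must stay quiet often enough
    return {"trigger_rate": rate, "runs": len(triggers), "pass": passed}


# Fired in 2 of 3 runs and was expected to fire.
print(grade_query([True, True, False], should_trigger=True))
# Fired in 2 of 3 runs but was expected to stay clear.
print(grade_query([True, True, False], should_trigger=False))
# Every run errored out, so nothing was observed.
print(grade_query([], should_trigger=False))
```

With the default of three runs per query and a threshold of 0.5, a single stray trigger on a negative query (rate 1/3) still passes, while two do not.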
260
.claude/skills/bmad-eval-runner/scripts/utils.py
Normal file
@@ -0,0 +1,260 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.9"
# ///
"""Shared helpers for the eval runner."""

from __future__ import annotations

import json
import re
import shutil
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
    """Return (name, description, body) from the skill's SKILL.md frontmatter."""
    text = (skill_path / "SKILL.md").read_text(encoding="utf-8")
    fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n(.*)$", text, re.DOTALL)
    if not fm_match:
        raise ValueError(f"SKILL.md at {skill_path} is missing frontmatter")
    frontmatter, body = fm_match.group(1), fm_match.group(2)

    name = None
    description_lines: list[str] = []
    in_description = False
    for line in frontmatter.splitlines():
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
            in_description = False
        elif line.startswith("description:"):
            value = line.split(":", 1)[1].strip()
            # Block scalar headers, including the YAML chomping indicators (- and +).
            if value in ("|", ">", "|-", ">-", "|+", ">+"):
                in_description = True
            else:
                description_lines = [value]
                in_description = False
        elif in_description and line.startswith((" ", "\t")):
            description_lines.append(line.strip())
        elif in_description:
            in_description = False

    if not name:
        raise ValueError(f"SKILL.md at {skill_path} is missing a name")
    return name, " ".join(description_lines).strip(), body


def discover_project_root(skill_path: Path) -> Path:
    """Walk up from the skill looking for _bmad/ or .git; default to skill's grandparent."""
    for parent in [skill_path, *skill_path.parents]:
        if (parent / "_bmad").is_dir() or (parent / ".git").exists():
            return parent
    return skill_path.parent.parent


def discover_evals(
    skill_path: Path,
    project_root: Path,
    explicit: Path | None,
) -> dict[str, Path]:
    """Locate evals.json and triggers.json. Return dict with keys 'evals' and/or 'triggers'."""
    found: dict[str, Path] = {}

    def check_dir(d: Path) -> None:
        if not d.is_dir():
            return
        for key, fname in (("evals", "evals.json"), ("triggers", "triggers.json")):
            candidate = d / fname
            if candidate.is_file() and key not in found:
                found[key] = candidate

    if explicit is not None:
        explicit = explicit.resolve()
        if explicit.is_file():
            if explicit.name == "evals.json":
                found["evals"] = explicit
            elif explicit.name == "triggers.json":
                found["triggers"] = explicit
        elif explicit.is_dir():
            check_dir(explicit)
        return found

    skill_name = skill_path.name
    candidates: list[Path] = [
        skill_path / "evals",
        skill_path.parent.parent / "evals" / skill_name,
        project_root / "evals" / skill_name,
    ]
    for d in candidates:
        check_dir(d)
        if found:
            break

    if not found:
        evals_root = project_root / "evals"
        if evals_root.is_dir():
            for sub in evals_root.rglob(skill_name):
                if sub.is_dir():
                    check_dir(sub)
                    if found:
                        break

    return found


def utc_now_iso() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def new_run_id(skill_name: str) -> str:
    return f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{skill_name}"


def have_docker() -> bool:
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
        return result.returncode == 0
    except Exception:
        return False


def docker_image_present(image: str = "bmad-eval-runner:latest") -> bool:
    if not have_docker():
        return False
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
        return result.returncode == 0
    except Exception:
        return False


def read_macos_keychain_credentials() -> str | None:
    """Read the Claude Code OAuth credentials JSON from the macOS Keychain.

    Returns the raw JSON string stored under service "Claude Code-credentials",
    or None if unavailable (non-macOS, entry missing, or access denied).

    Called in the parent process (which owns the Keychain ACL) so the credential
    can be staged into each isolated workspace's `.claude/.credentials.json` before
    `claude -p` is launched. Without this, an isolated subprocess with HOME pointed
    at an empty dir has no auth and every eval fails with "Not logged in."
    """
    if sys.platform != "darwin":
        return None
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", "Claude Code-credentials", "-w"],
            capture_output=True,
            timeout=5,
        )
        if result.returncode != 0:
            return None
        val = result.stdout.decode("utf-8", errors="replace").strip()
        return val if val else None
    except Exception:
        return None


def stage_credentials(claude_dir: Path, credentials_json: str | None) -> None:
    """Write credentials_json to <claude_dir>/.credentials.json. No-op if None."""
    if not credentials_json:
        return
    claude_dir.mkdir(parents=True, exist_ok=True)
    creds_path = claude_dir / ".credentials.json"
    creds_path.write_text(credentials_json, encoding="utf-8")
    # The file holds an OAuth credential, so restrict it to the owning user.
    creds_path.chmod(0o600)


def write_json(path: Path, data: object) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")


def read_json(path: Path) -> object:
    return json.loads(path.read_text(encoding="utf-8"))


def parse_skill_dependencies(skill_path: Path) -> list[str]:
    """Return skill names declared under 'dependencies:' in SKILL.md frontmatter."""
    try:
        text = (skill_path / "SKILL.md").read_text(encoding="utf-8")
    except (FileNotFoundError, OSError):
        return []
    fm = re.match(r"^---\s*\n(.*?)\n---", text, re.DOTALL)
    if not fm:
        return []
    deps: list[str] = []
    in_deps = False
    for line in fm.group(1).splitlines():
        if re.match(r"^dependencies\s*:", line):
            in_deps = True
        elif in_deps:
            m = re.match(r"^\s+-\s+(\S+)", line)
            if m:
                deps.append(m.group(1))
            elif not line.startswith((" ", "\t")):
                break
    return deps


def discover_setup_dirs(evals_file: Path, eval_id: str | None = None) -> list[Path]:
    """Return ordered list of setup overlay dirs that exist.

    base:     <evals_dir>/setup/
    per-eval: <evals_dir>/<eval_id>/setup/

    Applied base-first so per-eval overlays win on conflict.
    """
    evals_dir = evals_file.parent
    dirs: list[Path] = []
    base = evals_dir / "setup"
    if base.is_dir():
        dirs.append(base)
    if eval_id:
        per_eval = evals_dir / eval_id / "setup"
        if per_eval.is_dir():
            dirs.append(per_eval)
    return dirs


def apply_setup_overlay(setup_dirs: list[Path], dest: Path) -> None:
    """Rsync each setup dir onto dest in order (base first, per-eval last)."""
    dest.mkdir(parents=True, exist_ok=True)
    for src in setup_dirs:
        if not src.is_dir():
            continue
        subprocess.run(
            ["rsync", "-a", f"{src}/", f"{dest}/"],
            check=False,
        )


__all__ = [
    "parse_skill_md",
    "discover_project_root",
    "discover_evals",
    "utc_now_iso",
    "new_run_id",
    "have_docker",
    "docker_image_present",
    "read_macos_keychain_credentials",
    "stage_credentials",
    "write_json",
    "read_json",
    "parse_skill_dependencies",
    "discover_setup_dirs",
    "apply_setup_overlay",
]
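The base-first ordering described for `discover_setup_dirs` and `apply_setup_overlay` can be illustrated without `rsync`. The sketch below is an assumption-laden stand-in, not the runner's method: it uses `shutil.copytree(..., dirs_exist_ok=True)` in place of `rsync -a`, and the helper name `overlay`, the directory layout, and the file names are hypothetical. It shows only the property that matters here, namely that a later overlay overwrites files from an earlier one.

```python
import shutil
import tempfile
from pathlib import Path


def overlay(setup_dirs: list[Path], dest: Path) -> None:
    """Copy each setup dir onto dest in order; later dirs overwrite earlier ones."""
    dest.mkdir(parents=True, exist_ok=True)
    for src in setup_dirs:
        if src.is_dir():
            # Stand-in for `rsync -a src/ dest/`; metadata handling differs.
            shutil.copytree(src, dest, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    base = root / "evals" / "setup"
    per_eval = root / "evals" / "eval-1" / "setup"
    base.mkdir(parents=True)
    per_eval.mkdir(parents=True)
    (base / "config.txt").write_text("base", encoding="utf-8")
    (base / "shared.txt").write_text("shared", encoding="utf-8")
    (per_eval / "config.txt").write_text("per-eval", encoding="utf-8")

    dest = root / "workspace"
    overlay([base, per_eval], dest)  # base first, per-eval last
    merged = {p.name: p.read_text(encoding="utf-8") for p in sorted(dest.iterdir())}

print(merged)
```

Files that exist only in the base overlay survive, and the conflicting `config.txt` ends up with the per-eval content.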