chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24) - apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004) - apps/web: React 19 + Vite 8 (ESM) - libs/shared/api-interface: Zod contract base - Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit - WDS artifacts: - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs) - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact) - Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md - AGENTS.md + README.md como entrada para devs/agentes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:34:20 +00:00
commit 17c08e6392
3631 changed files with 855518 additions and 0 deletions
--- a/.claude/skills/bmad-eval-runner/references/eval-formats.md
+++ b/.claude/skills/bmad-eval-runner/references/eval-formats.md
@@ -0,0 +1,147 @@
+# Eval Formats
+
+The runner accepts two file shapes, both compatible with Anthropic's skill-creator conventions.
+
+## Artifact evals — `evals.json`
+
+```json
+{
+  "skill_name": "bmad-product-brief",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "I want to create a brief for ...",
+      "expected_output": "A run folder with brief.md and decision-log.md ...",
+      "files": [
+        "evals/.../files/some-fixture.md"
+      ],
+      "expectations": [
+        "brief.md exists in the run folder",
+        "decision-log.md exists",
+        "brief.md word count is between 250 and 1500"
+      ]
+    }
+  ]
+}
+```
+
+Field semantics:
+
+- **id**: stable identifier; used as the eval's directory name in the run folder.
+- **prompt**: the literal user message Claude will receive. Sent verbatim to `claude -p`.
+- **expected_output**: human-readable description, used for context only — the grader reads it but does not score against it directly.
+- **files**: optional fixture paths. Resolved relative to the project root (or the evals folder). Each file is staged into the eval's workspace before execution. Path semantics:
+  - A bare filename is staged at the workspace root.
+  - A nested path (`some-brief/brief.md`) preserves the directory structure inside the workspace.
+- **expectations**: list of pass/fail assertions evaluated by the grader subagent. Each is graded independently. The grader is instructed to flag weak assertions — assertions a wrong output would also trivially pass.
+
+The grader writes `grading.json` next to each eval's artifacts; the runner aggregates.
+
+## Trigger evals — `triggers.json`
+
+```json
+[
+  { "query": "Help me write a product brief for ...", "should_trigger": true },
+  { "query": "Help me brainstorm ideas for ...",      "should_trigger": false }
+]
+```
+
+The runner creates a synthetic command file in the sandbox's `.claude/commands/<skill-name>.md` containing the skill's description, then runs each query against `claude -p` with stream-JSON output and detects whether the skill (or a Read of its SKILL.md) appears as a tool call. Each query is run `--runs-per-query` times (default 3); `trigger_rate` is the fraction of runs that fired.
+
+A query passes when:
+- `should_trigger=true` and `trigger_rate >= --trigger-threshold` (default 0.5)
+- `should_trigger=false` and `trigger_rate < --trigger-threshold`
+
+Trigger evals do not produce artifacts beyond the result JSON. They are cheap and parallelize aggressively.
+
+## Where evals can live
+
+The runner discovers evals in this order:
+
+1. `--evals <path>` — explicit. May point to a folder or a specific `*.json`.
+2. `<skill-path>/evals/` — colocated with the skill.
+3. `<skill-path>/../../evals/<skill-name>/` — sibling-of-parent. Common pattern when evals are intentionally excluded from skill distribution.
+4. `<project-root>/evals/<skill-name>/`.
+5. `<project-root>/evals/**/<skill-name>/` — fuzzy search under the project's evals tree.
+
+If both `evals.json` and `triggers.json` are found, both run unless `--mode` narrows it.
+
+## Two patterns for single-shot evals
+
+Most multi-turn workflow skills can be evaluated single-shot if you design the eval right. Two patterns cover the bulk of what you'd otherwise need a multi-turn simulator for:
+
+### Pattern A — artifact correctness (headless + rich prompt)
+
+Force the skill into headless mode and pack the prompt with everything Discovery would have surfaced. Grade what comes out: the artifact, its structure, whether it reflects the inputs without inventing.
+
+Use when:
+- The deliverable is the artifact (brief, PRD, doc, plan)
+- You can write a complete pre-Discovery prompt
+- You want regression coverage on drafting/format/extraction
+
+### Pattern B — process discipline (headless + transcript and side-artifact inspection)
+
+Same single-shot mechanics, but the expectations look at *what the skill did internally* — not just the final output. The grader reads the stream-JSON transcript for tool calls, walks side-artifacts (decision logs, addenda, distillates), checks file mtimes, and verifies phase ordering.
+
+Use when:
+- The skill enforces a protocol (decision log, polish phase, finalize sequence)
+- The skill has read-only intents (Validate must not write)
+- You need to catch "drafting works but the discipline went soft" regressions
+
+These are deterministic checks against the transcript and filesystem — no LLM judgment needed for most of them.
+
+### What single-shot can NOT cover
+
+Facilitation arc: vague-input → sharper pushback → user clarifies → better artifact. That requires a multi-turn user simulator. Defer it to a separate eval mode for skills where conversation is the value (coaching, brainstorming, design thinking).
+
+## Writing good expectations
+
+The grader's job is easier when expectations are *discriminating* — hard to pass without actually doing the work.
+
+**Weak patterns to avoid:**
+- **Filename-only checks** — "brief.md exists" passes for an empty file. Pair with a content check.
+- **Wholly subjective phrasing** — "the brief is high quality" cannot be evaluated. State the property concretely.
+- **Tautologies** — anything that follows from the prompt being understood is not a useful expectation.
+
+**Strong patterns for artifact correctness (Pattern A):**
+- Specific facts that should appear ("incorporates at least 2 specific findings from section X")
+- Structural claims a wrong output would fail ("word count between 250 and 1500")
+- Negative assertions ("does not introduce content from unrelated sections")
+- YAML frontmatter checks ("frontmatter contains title, status, created, updated as ISO 8601")
+- Bounded JSON output ("final assistant message contains a JSON object with intent='create'")
+
+**Strong patterns for process discipline (Pattern B):**
+- **Side-artifact existence + content** ("decision-log.md exists AND captures the pricing decision with rejected alternative and rationale")
+- **Transcript tool-call patterns** ("the transcript contains a Skill tool call invoking bmad-editorial-review-prose")
+- **Phase ordering** ("the polish-phase Skill calls occur after the brief body Write and before the final JSON status block")
+- **Read-only enforcement** ("the input brief.md is byte-identical to the staged fixture; no Write or Edit tool calls targeted the run folder")
+- **Bidirectional fidelity** ("every substantive entry in decision-log.md has a corresponding reflection in brief.md, AND no claim in brief.md is absent from the input prompt or decision-log.md")
+- **Timestamp checks** ("YAML frontmatter 'updated' field is later than 'created'; 'created' is unchanged from the input fixture")
+
+## Headless mode — getting the skill to behave non-interactively
+
+Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and produces a structured JSON status block at the end. To use Pattern A or B, the eval prompt needs to trigger this. Common signals:
+
+- The literal phrase `Run headless.` at the start of the prompt
+- Skill-specific flags or keywords as documented in the skill's `## Headless Mode` section
+- Sufficient context such that no clarification is genuinely needed
+
+If the skill has no headless mode, single-shot evals will halt at the first clarifying question and you have two options: (1) add a headless mode to the skill, (2) defer that skill's evals to the multi-turn simulator.
+
+## Pre-staging files (Update / Validate intents)
+
+For Update and Validate evals, the workspace needs to contain an existing brief, decision log, addendum, etc. Use the `files` field — each path is staged into the workspace at the same relative location. The eval prompt then references the staged path explicitly:
+
+```json
+{
+  "id": "B5",
+  "prompt": "Run headless. Update the brief at evals/skill-x/files/some-brief/brief.md — ...",
+  "files": [
+    "evals/skill-x/files/some-brief/brief.md",
+    "evals/skill-x/files/some-brief/decision-log.md",
+    "evals/skill-x/files/some-brief/addendum.md"
+  ]
+}
+```
+
+For Validate (read-only) expectations, pair the staged files with byte-identical assertions and a no-Write/no-Edit transcript check.
--- a/.claude/skills/bmad-eval-runner/references/isolation.md
+++ b/.claude/skills/bmad-eval-runner/references/isolation.md
@@ -0,0 +1,110 @@
+# Isolation Strategies
+
+The eval runner offers two strategies. The intent is identical in both: every eval starts from a clean slate so the result reflects the skill itself, not the host's accumulated state.
+
+## What we are isolating from
+
+- The user's global `~/.claude/CLAUDE.md` (private global instructions)
+- Any ancestor `CLAUDE.md` in the project tree above the skill
+- Auto-memory at `~/.claude/projects/.../memory/MEMORY.md`
+- Cached settings, MCP configurations, IDE integrations
+- Prior conversation context bleeding via the shell
+
+## Authentication
+
+The isolated `claude -p` subprocess needs to authenticate, but cannot read the host's `~/.claude/` (HOME is overridden) or the macOS Keychain (Keychain ACLs are scoped to the process that wrote the entry). The runner solves this in the parent process:
+
+1. On macOS, read the OAuth credential JSON from the Keychain entry `Claude Code-credentials` via `security find-generic-password -s "Claude Code-credentials" -w`. This succeeds because the parent runs as the same user that wrote the entry.
+2. Stage that JSON as `<workspace>/.home/.claude/.credentials.json` (local mode) or copy it into `/home/evaluator/.claude/.credentials.json` inside the container (Docker mode).
+3. The subprocess reads `.credentials.json` exactly the way Claude Code normally does, with no other host config bleed.
+
+If the parent has `ANTHROPIC_API_KEY` set, that env var is also forwarded — and it takes precedence over the Keychain credential. On non-macOS hosts, the Keychain step is skipped and `ANTHROPIC_API_KEY` is the only auth path.
+
+## Docker (preferred)
+
+A single image, `bmad-eval-runner:latest`, is built once per machine. It contains Node 20, Claude Code (via `npm install -g @anthropic-ai/claude-code`), Python 3, and standard tools. The image is intentionally minimal — every eval starts from this baseline.
+
+### Image build
+
+`scripts/docker_setup.py --build` builds the image from `assets/Dockerfile`. This runs once. Re-runs are a no-op unless `--rebuild` is passed.
+
+### Per-eval container
+
+Each eval gets a fresh container:
+
+```
+docker run --rm \
+  -v "<project-root>:/project:ro" \
+  -v "<output-dir>/<eval-id>:/output" \
+  -v "<fixtures-dir>:/fixtures:ro" \
+  -e ANTHROPIC_API_KEY \
+  -e EVAL_PROMPT \
+  -e EVAL_ID \
+  -e SKILL_PATH \
+  bmad-eval-runner:latest \
+  /bin/bash -c "/scripts/run_one_eval.sh"
+```
+
+Inside the container:
+
+1. The project is copied from `/project` (read-only) to `/workspace` (writable, container-local). Copy is fast because the underlying layer is shared.
+2. Fixtures are copied into `/workspace/fixtures/`.
+3. `HOME` is `/home/evaluator`, an empty directory created by the image — no global `CLAUDE.md`, no memory.
+4. `claude -p "$EVAL_PROMPT" --output-format stream-json --verbose` runs at `/workspace`.
+5. The stream-json transcript is captured to `/output/transcript.jsonl`. Any files the skill writes under `/workspace` are rsynced to `/output/artifacts/` after the run completes.
+6. The container exits and is removed automatically.
+
+The host then has `<output-dir>/<eval-id>/transcript.jsonl`, `<output-dir>/<eval-id>/artifacts/`, and timing data. Nothing on the host is touched.
+
+### Why Docker is preferred
+
+- The image is reproducible — every run starts from byte-identical state.
+- `HOME` is genuinely empty, not just overridden.
+- Filesystem isolation is real, not just convention.
+- Network can be locked down (`--network=none` for trigger evals; full network for artifact evals that may need it).
+
+## Local fallback
+
+When Docker is unavailable, the runner falls back to per-eval temp directories under `~/bmad-evals/<run-id>/<eval-id>/`. Layout:
+
+```
+~/bmad-evals/<run-id>/<eval-id>/
+  workspace/         # the eval's working directory
+    .home/           # HOME override — empty .claude/ inside
+    project/         # rsync'd copy of <project-root>
+    fixtures/        # staged fixture files
+  transcript.jsonl   # claude -p stream output
+  artifacts/         # files Claude wrote under workspace/
+  metrics.json
+```
+
+Per-eval invocation roughly:
+
+```
+HOME="$WORKSPACE/.home" \
+CLAUDE_CONFIG_DIR="$WORKSPACE/.home/.claude" \
+ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
+  claude -p "$EVAL_PROMPT" \
+    --output-format stream-json --verbose \
+    > transcript.jsonl
+```
+
+### Limitations of local mode
+
+- `HOME` override prevents global `CLAUDE.md` and memory loading, but ancestor discovery still happens from the workspace's cwd. If the workspace is created inside a directory tree that contains a `.claude/skills/` further up, the subprocess may discover those skills regardless of `HOME`. This matters most for trigger evals, where stray host skills can fire instead of the synthetic skill we're testing — **prefer Docker for trigger evals**, where filesystem isolation is real.
+- Filesystem isolation is by convention only — the skill could write outside its workspace if it tries. We don't sandbox syscalls.
+- Network is unrestricted.
+
+Tell the user clearly when local mode is in use and that it is best-effort.
+
+## Why a real skill, not a slash command, for trigger evals
+
+The trigger runner stages a synthetic skill at `<workspace>/.claude/skills/<unique-name>/SKILL.md` — not at `.claude/commands/<name>.md`. Slash commands are user-invoked (`/<name>`); they do not surface as `Skill` tool calls and so a description placed there can never be observed firing the way a real skill would. Anthropic's reference `run_eval.py` uses the commands path and is known to report 0% trigger rates as a result. Placing the synthetic at `.claude/skills/` matches how real skills load and lets the detector observe genuine `Skill` (or `Read` of the synthetic SKILL.md) tool calls.
+
+## Why not `--add-dir` only?
+
+`claude -p --add-dir <skill>` would let Claude see the skill but would still inherit the user's `CLAUDE.md` and memory from the cwd's ancestors. The whole point of this runner is to test the skill, not the host's accumulated state. So we always either Docker-isolate or temp-dir-isolate.
+
+## Artifact retention
+
+Run folders are never deleted by this skill. Disk management is the user's responsibility. The runner emits the run folder path on completion; users who want to clean up old runs can delete `~/bmad-evals/<run-id>/` directly.