Eval Formats

The runner accepts two file shapes, both compatible with Anthropic's skill-creator conventions.

Artifact evals — `evals.json`

{
  "skill_name": "bmad-product-brief",
  "evals": [
    {
      "id": 1,
      "prompt": "I want to create a brief for ...",
      "expected_output": "A run folder with brief.md and decision-log.md ...",
      "files": [
        "evals/.../files/some-fixture.md"
      ],
      "expectations": [
        "brief.md exists in the run folder",
        "decision-log.md exists",
        "brief.md word count is between 250 and 1500"
      ]
    }
  ]
}

Field semantics:

- `id`: stable identifier; used as the eval's directory name in the run folder.
- `prompt`: the literal user message Claude will receive. Sent verbatim to `claude -p`.
- `expected_output`: human-readable description, used for context only. The grader reads it but does not score against it directly.
- `files`: optional fixture paths. Resolved relative to the project root (or the evals folder). Each file is staged into the eval's workspace before execution. Path semantics:
  - A bare filename is staged at the workspace root.
  - A nested path (`some-brief/brief.md`) preserves the directory structure inside the workspace.
- `expectations`: list of pass/fail assertions evaluated by the grader subagent. Each is graded independently. The grader is instructed to flag weak assertions, meaning assertions a wrong output would also trivially pass.

The grader writes grading.json next to each eval's artifacts; the runner aggregates.
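
As a rough illustration of the shape described above, the following sketch checks an already-parsed `evals.json` document before a run. The set of required keys and the uniqueness check on `id` are assumptions made for this example; the runner's actual validation is not documented here.

```python
# Hypothetical pre-flight check; the required-key set is an assumption.
REQUIRED_EVAL_KEYS = {"id", "prompt", "expectations"}


def validate_evals(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    problems = []
    if not isinstance(doc.get("skill_name"), str):
        problems.append("skill_name must be a string")
    evals = doc.get("evals")
    if not isinstance(evals, list):
        return problems + ["evals must be a list"]

    seen_ids = set()
    for index, entry in enumerate(evals):
        if not isinstance(entry, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        missing = REQUIRED_EVAL_KEYS - entry.keys()
        if missing:
            problems.append(f"evals[{index}] is missing {sorted(missing)}")
        # ids name directories in the run folder, so duplicates would collide.
        eval_id = entry.get("id")
        if isinstance(eval_id, (str, int)):
            if eval_id in seen_ids:
                problems.append(f"evals[{index}] reuses id {eval_id!r}")
            seen_ids.add(eval_id)
        # expectations and files (optional) are both lists of strings.
        for key in ("expectations", "files"):
            value = entry.get(key, [])
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                problems.append(f"evals[{index}].{key} must be a list of strings")
    return problems
```

Calling `validate_evals(json.load(open("evals.json")))` and printing the returned list is enough to catch a malformed file before spending a run on it.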

Trigger evals — `triggers.json`

[
  { "query": "Help me write a product brief for ...", "should_trigger": true },
  { "query": "Help me brainstorm ideas for ...",      "should_trigger": false }
]

The runner creates a synthetic command file at .claude/commands/<skill-name>.md in the sandbox, containing the skill's description. It then runs each query against claude -p with stream-JSON output and checks whether the skill (or a Read of its SKILL.md) appears as a tool call. Each query is run --runs-per-query times (default 3); trigger_rate is the fraction of those runs in which the skill fired.

A query passes when:

should_trigger=true and trigger_rate >= --trigger-threshold (default 0.5)
should_trigger=false and trigger_rate < --trigger-threshold
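
The pass rule above can be written down directly. This is a minimal sketch: the function names are made up for illustration, and only the threshold default of 0.5 comes from the text.

```python
def trigger_rate(fired: list[bool]) -> float:
    """Fraction of runs in which the skill fired."""
    if not fired:
        raise ValueError("at least one run is required")
    return sum(fired) / len(fired)


def query_passes(should_trigger: bool, rate: float, threshold: float = 0.5) -> bool:
    """Apply the pass rule: >= threshold for positives, < threshold for negatives."""
    if should_trigger:
        return rate >= threshold
    return rate < threshold
```

Note the asymmetry at the boundary: with the default threshold, a rate of exactly 0.5 passes a `should_trigger=true` query and fails a `should_trigger=false` one.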

Trigger evals do not produce artifacts beyond the result JSON. They are cheap and parallelize aggressively.

Where evals can live

The runner discovers evals in this order:

1. `--evals <path>`: explicit. May point to a folder or a specific `*.json`.
2. `<skill-path>/evals/`: colocated with the skill.
3. `<skill-path>/../../evals/<skill-name>/`: sibling-of-parent. Common pattern when evals are intentionally excluded from skill distribution.
4. `<project-root>/evals/<skill-name>/`.
5. `<project-root>/evals/**/<skill-name>/`: fuzzy search under the project's evals tree.

If both evals.json and triggers.json are found, both run unless --mode narrows it.
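
The discovery order can be sketched with `pathlib` as follows. Two things here are assumptions, not documented behavior: that the skill name is the last component of the skill path, and that the first existing candidate wins.

```python
from pathlib import Path
from typing import Optional


def candidate_eval_paths(skill_path: Path, project_root: Path,
                         explicit: Optional[Path] = None) -> list[Path]:
    """List candidate locations in the documented order."""
    skill_name = skill_path.name  # assumption: folder name equals skill name
    candidates = []
    if explicit is not None:
        candidates.append(explicit)  # --evals <path>, folder or *.json
    candidates.append(skill_path / "evals")
    candidates.append(skill_path.parent.parent / "evals" / skill_name)
    candidates.append(project_root / "evals" / skill_name)
    evals_root = project_root / "evals"
    if evals_root.is_dir():
        # Fuzzy search: any directory named after the skill under evals/.
        matches = (p for p in evals_root.glob(f"**/{skill_name}") if p.is_dir())
        candidates.extend(sorted(matches))
    return candidates


def discover_evals(skill_path: Path, project_root: Path,
                   explicit: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate that exists, or None if nothing is found."""
    for candidate in candidate_eval_paths(skill_path, project_root, explicit):
        if candidate.exists():
            return candidate
    return None
```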

Two patterns for single-shot evals

Most multi-turn workflow skills can be evaluated single-shot if you design the eval right. Two patterns cover the bulk of what you'd otherwise need a multi-turn simulator for:

Pattern A — artifact correctness (headless + rich prompt)

Force the skill into headless mode and pack the prompt with everything Discovery would have surfaced. Grade what comes out: the artifact, its structure, whether it reflects the inputs without inventing.

Use when:

The deliverable is the artifact (brief, PRD, doc, plan)
You can write a complete pre-Discovery prompt
You want regression coverage on drafting/format/extraction

Pattern B — process discipline (headless + transcript and side-artifact inspection)

Same single-shot mechanics, but the expectations look at what the skill did internally — not just the final output. The grader reads the stream-JSON transcript for tool calls, walks side-artifacts (decision logs, addenda, distillates), checks file mtimes, and verifies phase ordering.

Use when:

The skill enforces a protocol (decision log, polish phase, finalize sequence)
The skill has read-only intents (Validate must not write)
You need to catch "drafting works but the discipline went soft" regressions

These are deterministic checks against the transcript and filesystem — no LLM judgment needed for most of them.
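
A phase-ordering check of this kind might look like the sketch below. It assumes each transcript line is a JSON event and that tool calls appear as `tool_use` blocks inside `assistant` messages; verify that shape against the transcripts your runner actually captures. The helper names are hypothetical.

```python
import json


def tool_calls(transcript_lines):
    """Collect (tool_name, tool_input) pairs in transcript order."""
    calls = []
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise rather than fail the whole check
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append((block.get("name"), block.get("input", {})))
    return calls


def all_calls_after(calls, anchor_name, later_name):
    """True when later_name is called at least once, and only after the first anchor_name call."""
    names = [name for name, _ in calls]
    if anchor_name not in names:
        return False
    anchor_index = names.index(anchor_name)
    later_indices = [i for i, name in enumerate(names) if name == later_name]
    return bool(later_indices) and all(i > anchor_index for i in later_indices)
```

For example, `all_calls_after(calls, "Write", "Skill")` expresses "the polish-phase Skill calls occur after the brief body Write" without involving a grader model.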

What single-shot can NOT cover

Facilitation arc: vague-input → sharper pushback → user clarifies → better artifact. That requires a multi-turn user simulator. Defer it to a separate eval mode for skills where conversation is the value (coaching, brainstorming, design thinking).

Writing good expectations

The grader's job is easier when expectations are discriminating — hard to pass without actually doing the work.

Weak patterns to avoid:

Filename-only checks — "brief.md exists" passes for an empty file. Pair with a content check.
Wholly subjective phrasing — "the brief is high quality" cannot be evaluated. State the property concretely.
Tautologies — anything that follows from the prompt being understood is not a useful expectation.

Strong patterns for artifact correctness (Pattern A):

Specific facts that should appear ("incorporates at least 2 specific findings from section X")
Structural claims a wrong output would fail ("word count between 250 and 1500")
Negative assertions ("does not introduce content from unrelated sections")
YAML frontmatter checks ("frontmatter contains title, status, created, updated as ISO 8601")
Bounded JSON output ("final assistant message contains a JSON object with intent='create'")
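
Two of these structural checks are simple enough to run as plain code. The sketch below is illustrative only: the helper names are hypothetical, the word count includes any frontmatter, and the frontmatter scan is a simplified line match, not a real YAML parser.

```python
import re


def word_count_between(text: str, low: int, high: int) -> bool:
    """Whitespace-delimited word count within [low, high]; an empty file fails."""
    return low <= len(text.split()) <= high


def frontmatter_keys(markdown: str) -> set[str]:
    """Top-level keys of a leading '---' delimited block (simplified scan)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter reached
        match = re.match(r"^([A-Za-z_][\w-]*)\s*:", line)
        if match:
            keys.add(match.group(1))
    return set()  # no closing delimiter: treat as no frontmatter
```

A check such as `{"title", "status", "created", "updated"} <= frontmatter_keys(text)` covers key presence; validating the ISO 8601 values would need a further step.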

Strong patterns for process discipline (Pattern B):

Side-artifact existence + content ("decision-log.md exists AND captures the pricing decision with rejected alternative and rationale")
Transcript tool-call patterns ("the transcript contains a Skill tool call invoking bmad-editorial-review-prose")
Phase ordering ("the polish-phase Skill calls occur after the brief body Write and before the final JSON status block")
Read-only enforcement ("the input brief.md is byte-identical to the staged fixture; no Write or Edit tool calls targeted the run folder")
Bidirectional fidelity ("every substantive entry in decision-log.md has a corresponding reflection in brief.md, AND no claim in brief.md is absent from the input prompt or decision-log.md")
Timestamp checks ("YAML frontmatter 'updated' field is later than 'created'; 'created' is unchanged from the input fixture")

Headless mode — getting the skill to behave non-interactively

Most multi-turn skills expose a headless flag or keyword that suppresses clarifying questions and produces a structured JSON status block at the end. To use Pattern A or B, the eval prompt needs to trigger this. Common signals:

The literal phrase Run headless. at the start of the prompt
Skill-specific flags or keywords as documented in the skill's ## Headless Mode section
Sufficient context such that no clarification is genuinely needed

If the skill has no headless mode, single-shot evals will halt at the first clarifying question, which leaves two options: (1) add a headless mode to the skill, or (2) defer that skill's evals to the multi-turn simulator.

Pre-staging files (Update / Validate intents)

For Update and Validate evals, the workspace needs to contain an existing brief, decision log, addendum, etc. Use the files field — each path is staged into the workspace at the same relative location. The eval prompt then references the staged path explicitly:

{
  "id": "B5",
  "prompt": "Run headless. Update the brief at evals/skill-x/files/some-brief/brief.md — ...",
  "files": [
    "evals/skill-x/files/some-brief/brief.md",
    "evals/skill-x/files/some-brief/decision-log.md",
    "evals/skill-x/files/some-brief/addendum.md"
  ]
}

For Validate (read-only) expectations, pair the staged files with byte-identical assertions and a no-Write/no-Edit transcript check.
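
A read-only check along these lines could be sketched as below. It takes tool calls as `(tool_name, tool_input)` pairs, however they were extracted from the transcript. Treating `file_path` as the input key for Write and Edit, and resolving relative paths against the run folder, are assumptions of this example.

```python
import hashlib
from pathlib import Path

WRITE_TOOLS = {"Write", "Edit"}


def is_byte_identical(fixture: Path, staged: Path) -> bool:
    """Compare SHA-256 digests of the original fixture and the staged copy."""
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    return digest(fixture) == digest(staged)


def mutating_calls(calls, run_folder: Path):
    """Return the Write/Edit calls whose target lies inside run_folder."""
    root = run_folder.resolve()
    offenders = []
    for name, tool_input in calls:
        if name not in WRITE_TOOLS:
            continue
        target = tool_input.get("file_path")  # assumed input key
        if not target:
            continue
        candidate = Path(target)
        if not candidate.is_absolute():
            candidate = root / candidate  # assumption: relative to the run folder
        candidate = candidate.resolve()
        if candidate == root or root in candidate.parents:
            offenders.append((name, tool_input))
    return offenders
```

A Validate eval passes the read-only part when `is_byte_identical(...)` holds for every staged file and `mutating_calls(...)` returns an empty list.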
