chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24) - apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004) - apps/web: React 19 + Vite 8 (ESM) - libs/shared/api-interface: Zod contract base - Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit - WDS artifacts: - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs) - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact) - Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md - AGENTS.md + README.md como entrada para devs/agentes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:34:20 +00:00
commit 17c08e6392
3631 changed files with 855518 additions and 0 deletions
--- a/.agents/skills/bmad-story-automator/data/crash-recovery.md
+++ b/.agents/skills/bmad-story-automator/data/crash-recovery.md
@@ -0,0 +1,174 @@
+# Crash Recovery Pattern
+
+**Purpose:** Handle sessions that crash or disappear unexpectedly.
+
+---
+
+## Detection
+
+The status script returns `session_state` in CSV column 6:
+- `crashed` - Session exited with non-zero exit code (column 5 = exit code, column 4 = output file)
+- `not_found` - Session disappeared (killed, crashed without trace)
+
+---
+
+## Recovery Logic
+
+| Condition | Action |
+|-----------|--------|
+| `crashed` with output file | Read output, check partial progress, retry |
+| `not_found` (no output) | Session died silently, retry immediately |
+| Retry 1 failed | Retry with `-r2` suffix in session name |
+| Retry 2 failed | Escalate to user with diagnostics |
+
+---
+
+## Retry Pattern
+
+```bash
+# On crash/not_found, spawn retry with unique suffix
+project_slug=$(basename "$PWD" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]' | cut -c1-8)
+timestamp=$(date +%y%m%d-%H%M%S)
+session_name="sa-${project_slug}-${timestamp}-e{epic}-s{story_suffix}-{step}-r2"
+
+# Clear stale state (project-scoped v2.0)
+PROJECT_HASH=$(echo -n "$PWD" | md5sum 2>/dev/null | cut -c1-8 || echo -n "$PWD" | md5 -q 2>/dev/null | cut -c1-8)
+rm -f "/tmp/.sa-${PROJECT_HASH}-session-${session_name}-state.json"
+# ... spawn and monitor as normal
+```
+
+---
+
+## Agent Fallback (v3.0.0)
+
+**Before escalating**, check if fallback agent is configured:
+
+```bash
+# Resolve agents for this story/task from agents file
+selection=$("$scripts" orchestrator-helper agents-resolve \
+  --state-file "$state_file" --story "{story_id}" --task "{task}")
+primary=$(echo "$selection" | jq -r '.primary')
+fallback=$(echo "$selection" | jq -r '.fallback')
+
+if [ "$fallback" != "false" ] && [ -n "$fallback" ]; then
+  if [ "$current_agent" = "$primary" ]; then
+    export AI_AGENT="$fallback"
+    retry_count=0
+    session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
+      --command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")
+    # Continue monitoring...
+  fi
+fi
+```
+
+**Fallback flow:**
+1. Primary agent crashes after 2 retries
+2. IF `fallback != "false"` AND haven't tried fallback yet
+3. Switch `AI_AGENT` to fallback agent
+4. Reset retry counter to 0
+5. Retry with fallback agent (gets 2 more attempts)
+6. IF fallback also fails after 2 retries → CRITICAL escalation
+
+**Log message:**
+"Primary agent (claude) failed after 2 attempts. Switching to fallback agent (codex)..."
+
+---
+
+## Escalation (after exhausting all retries)
+
+Display:
+```
+**Session crashed for Story {N}**
+
+Primary agent: {primary} - Failed after 2 attempts
+Fallback agent: {fallback} - Failed after 2 attempts
+
+Exit code: {exit_code}
+Partial progress: {tasks_completed}/{tasks_total}
+
+[R]etry with primary
+[F]allback retry
+[S]kip story (mark deferred)
+[A]bort orchestration
+```
+
+Show any partial output captured for diagnostics.
+
+---
+
+## Integration with Adaptive Retry
+
+Crash recovery is SEPARATE from adaptive retry:
+- **Adaptive retry** = session completed but FAILED (wrong output, tests failed)
+- **Crash recovery** = session DIED unexpectedly (context limit, API error, kill)
+
+Both can occur: a session might crash on attempt 1, then fail normally on attempt 2.
+Track both counters independently.
+
+---
+
+## Orchestrator Monitoring Task Crash (v1.9.0)
+
+### The Problem
+
+When the orchestrator uses background tasks (e.g., Bash with `run_in_background`) to monitor tmux sessions, the monitoring task itself can crash. This is **different** from the tmux session crashing.
+
+**Observed failure mode:**
+1. Orchestrator spawns background task to run create+dev+monitor loop
+2. Background task crashes after dev-story completes
+3. TaskOutput shows "running" but task is dead
+4. Tmux session actually completed successfully
+5. Orchestrator waits forever on dead monitoring task
+6. Code-review never runs because monitoring never returned
+
+### Detection
+
+Signs that your monitoring task has crashed (not the tmux session):
+
+| Signal | Meaning |
+|--------|---------|
+| `TaskOutput` returns empty 2+ times | Task may be dead |
+| Output file path doesn't exist | Task never wrote results |
+| "running" status but no progress | Task is stuck or dead |
+| Background task ID invalid | Task crashed |
+
+### Recovery Sequence
+
+**See `monitoring-fallback.md` for detailed fallback patterns.**
+
+Quick reference:
+1. Stop waiting on dead monitoring task
+2. Find tmux sessions: `tmux list-sessions | grep "sa-.*e{epic}-s{story}"`
+3. Check session status directly: `story-automator tmux-status-check`
+4. Verify source of truth: story file, sprint-status.yaml
+5. Resume based on verified state
+
+### Prevention
+
+**NEVER chain multiple workflow steps in a single background task:**
+
+```bash
+# ❌ WRONG - If this task crashes, all subsequent steps are lost
+for step in create dev review; do
+    session=$(...spawn...)
+    result=$(...monitor...)
+done
+
+# ✅ CORRECT - Each step is monitored separately
+# Step 1
+session=$(...spawn create...)
+result=$(...monitor...)
+# Verify state
+
+# Step 2 (only after Step 1 verified)
+session=$(...spawn dev...)
+result=$(...monitor...)
+# Verify state
+```
+
+### Key Principle
+
+**The tmux session is the source of truth for session state.**
+**The story file and sprint-status.yaml are the source of truth for workflow state.**
+
+Monitoring is just observation - if monitoring fails, verify from source of truth and continue.