chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS, per CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 canonical docs + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md as the entry point for devs/agents

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
174
.agents/skills/bmad-story-automator/data/crash-recovery.md
Normal file
@@ -0,0 +1,174 @@
# Crash Recovery Pattern

**Purpose:** Handle sessions that crash or disappear unexpectedly.

---

## Detection

The status script returns `session_state` in CSV column 6:
- `crashed` - Session exited with non-zero exit code (column 5 = exit code, column 4 = output file)
- `not_found` - Session disappeared (killed, crashed without trace)
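
As a rough illustration of reading those columns, the sketch below splits one status line on commas. The function name `parse_status_line` is hypothetical, only columns 4 to 6 are described in this document (columns 1 to 3 are ignored here), and it assumes fields contain no embedded commas or quoting.

```bash
# Hypothetical helper (not part of the status script): split one CSV status
# line and report the crash-related fields. Columns 1-3 are not described in
# this document, so they are read into throwaway variables.
parse_status_line() {
  local line="$1" _c1 _c2 _c3 output_file exit_code session_state _rest
  IFS=',' read -r _c1 _c2 _c3 output_file exit_code session_state _rest <<< "$line"
  case "$session_state" in
    crashed)   echo "crashed exit_code=${exit_code} output_file=${output_file}" ;;
    not_found) echo "not_found" ;;
    *)         echo "other state=${session_state}" ;;
  esac
}
```

A caller could branch on the first word of the result to decide whether crash recovery applies at all.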

---

## Recovery Logic

| Condition | Action |
|-----------|--------|
| `crashed` with output file | Read output, check partial progress, retry |
| `not_found` (no output) | Session died silently, retry immediately |
| Retry 1 failed | Retry with `-r2` suffix in session name |
| Retry 2 failed | Escalate to user with diagnostics |
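
One possible way to encode this table is a small decision function. The name `next_recovery_action`, its arguments, and the returned labels are illustrative assumptions, not part of the automator. The table does not cover `crashed` without an output file, so this sketch treats that case as an immediate retry.

```bash
# Hypothetical decision helper mirroring the Recovery Logic table.
# Arguments: session_state, output_file (may be empty), failed_retries so far.
next_recovery_action() {
  local session_state="$1" output_file="$2" failed_retries="$3"
  if [ "$failed_retries" -ge 2 ]; then
    echo "escalate"               # Retry 2 failed
  elif [ "$failed_retries" -eq 1 ]; then
    echo "retry-r2"               # Retry 1 failed: use the -r2 suffix
  elif [ "$session_state" = "crashed" ] && [ -n "$output_file" ]; then
    echo "inspect-then-retry"     # read output, check partial progress, retry
  else
    echo "retry-now"              # not_found, or crashed with no output (assumption)
  fi
}
```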

---

## Retry Pattern

```bash
# On crash/not_found, spawn retry with unique suffix
project_slug=$(basename "$PWD" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]' | cut -c1-8)
timestamp=$(date +%y%m%d-%H%M%S)
session_name="sa-${project_slug}-${timestamp}-e{epic}-s{story_suffix}-{step}-r2"

# Clear stale state (project-scoped v2.0)
# Pick the hash tool explicitly: a `||` after a pipeline only sees the exit
# status of the final `cut`, so it would never fall back to `md5`.
if command -v md5sum >/dev/null 2>&1; then
  PROJECT_HASH=$(echo -n "$PWD" | md5sum | cut -c1-8)
else
  PROJECT_HASH=$(echo -n "$PWD" | md5 -q | cut -c1-8)
fi
rm -f "/tmp/.sa-${PROJECT_HASH}-session-${session_name}-state.json"
# ... spawn and monitor as normal
```

---

## Agent Fallback (v3.0.0)

**Before escalating**, check if fallback agent is configured:

```bash
# Resolve agents for this story/task from agents file
selection=$("$scripts" orchestrator-helper agents-resolve \
  --state-file "$state_file" --story "{story_id}" --task "{task}")
primary=$(echo "$selection" | jq -r '.primary')
fallback=$(echo "$selection" | jq -r '.fallback')

if [ "$fallback" != "false" ] && [ -n "$fallback" ]; then
  if [ "$current_agent" = "$primary" ]; then
    export AI_AGENT="$fallback"
    retry_count=0
    session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
      --command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")
    # Continue monitoring...
  fi
fi
```

**Fallback flow:**
1. Primary agent crashes after 2 retries
2. IF `fallback != "false"` AND haven't tried fallback yet
3. Switch `AI_AGENT` to fallback agent
4. Reset retry counter to 0
5. Retry with fallback agent (gets 2 more attempts)
6. IF fallback also fails after 2 retries → CRITICAL escalation
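
The flow above can be read as a three-way decision: keep retrying, switch agents, or escalate. A minimal sketch follows, assuming a hypothetical `fallback_decision` helper that is given the current agent, the resolved primary and fallback, and the retry count. The output labels are illustrative only.

```bash
# Hypothetical helper: decide the next move in the fallback flow.
# Arguments: current_agent, primary, fallback ("false" or empty = none), retry_count.
fallback_decision() {
  local current_agent="$1" primary="$2" fallback="$3" retry_count="$4"
  if [ "$retry_count" -lt 2 ]; then
    echo "retry:${current_agent}"      # attempts remain for this agent
  elif [ "$current_agent" = "$primary" ] && [ -n "$fallback" ] && [ "$fallback" != "false" ]; then
    echo "switch:${fallback}"          # caller sets AI_AGENT and resets retry_count to 0
  else
    echo "escalate"                    # no fallback configured, or fallback already tried
  fi
}
```

Comparing `current_agent` with `primary` is what stands in for "haven't tried fallback yet" here, the same check the snippet above uses.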

**Log message:**
"Primary agent (claude) failed after 2 attempts. Switching to fallback agent (codex)..."

---

## Escalation (after exhausting all retries)

Display:
```
**Session crashed for Story {N}**

Primary agent: {primary} - Failed after 2 attempts
Fallback agent: {fallback} - Failed after 2 attempts

Exit code: {exit_code}
Partial progress: {tasks_completed}/{tasks_total}

[R]etry with primary
[F]allback retry
[S]kip story (mark deferred)
[A]bort orchestration
```

Show any partial output captured for diagnostics.
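
If the menu choice is handled in shell, the mapping from key to action might look like the sketch below. The function name `escalation_action`, the action labels, and the case-insensitive matching are all assumptions made for illustration.

```bash
# Hypothetical mapping from the escalation menu key to an action label.
# Accepting both upper and lower case is an assumption, not documented behavior.
escalation_action() {
  case "$1" in
    [Rr]) echo "retry-primary" ;;
    [Ff]) echo "retry-fallback" ;;
    [Ss]) echo "skip-deferred" ;;
    [Aa]) echo "abort" ;;
    *)    echo "invalid"; return 1 ;;
  esac
}
```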

---

## Integration with Adaptive Retry

Crash recovery is SEPARATE from adaptive retry:
- **Adaptive retry** = session completed but FAILED (wrong output, tests failed)
- **Crash recovery** = session DIED unexpectedly (context limit, API error, kill)

Both can occur: a session might crash on attempt 1, then fail normally on attempt 2.
Track both counters independently.
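
A minimal sketch of keeping the two counters apart is shown here. The variable names, the `record_outcome` helper, and the `failed` outcome label are hypothetical. Only `crashed` and `not_found` come from the Detection section.

```bash
# Two independent counters: one for crashes, one for normal failures.
crash_count=0
fail_count=0

# Hypothetical helper: bump the counter that matches the outcome.
record_outcome() {
  case "$1" in
    crashed|not_found) crash_count=$((crash_count + 1)) ;;  # session died
    failed)            fail_count=$((fail_count + 1)) ;;    # session completed but failed
  esac
}

# Example from the text: crash on attempt 1, normal failure on attempt 2.
record_outcome crashed
record_outcome failed
echo "crash_count=${crash_count} fail_count=${fail_count}"
```

Each counter can then be compared against its own limit, so a crash does not consume an adaptive-retry attempt or the reverse.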

---

## Orchestrator Monitoring Task Crash (v1.9.0)

### The Problem

When the orchestrator uses background tasks (e.g., Bash with `run_in_background`) to monitor tmux sessions, the monitoring task itself can crash. This is **different** from the tmux session crashing.

**Observed failure mode:**
1. Orchestrator spawns background task to run create+dev+monitor loop
2. Background task crashes after dev-story completes
3. TaskOutput shows "running" but task is dead
4. Tmux session actually completed successfully
5. Orchestrator waits forever on dead monitoring task
6. Code-review never runs because monitoring never returned

### Detection

Signs that your monitoring task has crashed (not the tmux session):

| Signal | Meaning |
|--------|---------|
| `TaskOutput` returns empty 2+ times | Task may be dead |
| Output file path doesn't exist | Task never wrote results |
| "running" status but no progress | Task is stuck or dead |
| Background task ID invalid | Task crashed |
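
Two of these signals are easy to check mechanically. The sketch below assumes the orchestrator tracks how many consecutive polls came back empty and knows the expected output path. `monitor_looks_dead` is a hypothetical name, and a true result is only a hint that direct verification is needed, not proof that the task died.

```bash
# Hypothetical heuristic covering two signals from the table:
# repeated empty polls, or a missing output file.
# Arguments: consecutive empty polls, expected output file path.
monitor_looks_dead() {
  local empty_polls="$1" output_path="$2"
  [ "$empty_polls" -ge 2 ] || [ ! -e "$output_path" ]
}
```

For example, `monitor_looks_dead 2 "$output_file"` succeeds after two empty polls regardless of whether the file exists.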

### Recovery Sequence

**See `monitoring-fallback.md` for detailed fallback patterns.**

Quick reference:
1. Stop waiting on dead monitoring task
2. Find tmux sessions: `tmux list-sessions | grep "sa-.*e{epic}-s{story}"`
3. Check session status directly: `story-automator tmux-status-check`
4. Verify source of truth: story file, sprint-status.yaml
5. Resume based on verified state
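
Step 2 can be made slightly stricter than a plain `grep`. The sketch below filters session names read from standard input. `filter_story_sessions` is a hypothetical helper, and the trailing `-` in the pattern assumes the `sa-…-e{epic}-s{story_suffix}-{step}` naming shown in the Retry Pattern section, so that story 3 does not also match story 31.

```bash
# Hypothetical filter: keep only session names for one epic/story pair.
# Reads one session name per line on stdin.
filter_story_sessions() {
  local epic="$1" story="$2"
  # `|| true` keeps "no match" from being treated as an error by the caller.
  grep -E "^sa-.*-e${epic}-s${story}-" || true
}
```

It could be fed with `tmux list-sessions -F '#{session_name}' 2>/dev/null | filter_story_sessions 2 3`, which prints bare session names instead of the default listing.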

### Prevention

**NEVER chain multiple workflow steps in a single background task:**

```bash
# ❌ WRONG - If this task crashes, all subsequent steps are lost
for step in create dev review; do
  session=$(...spawn...)
  result=$(...monitor...)
done

# ✅ CORRECT - Each step is monitored separately
# Step 1
session=$(...spawn create...)
result=$(...monitor...)
# Verify state

# Step 2 (only after Step 1 verified)
session=$(...spawn dev...)
result=$(...monitor...)
# Verify state
```

### Key Principle

**The tmux session is the source of truth for session state.**
**The story file and sprint-status.yaml are the source of truth for workflow state.**

Monitoring is only observation. If monitoring fails, verify against the sources of truth and continue.
|
||||