- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24) - apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004) - apps/web: React 19 + Vite 8 (ESM) - libs/shared/api-interface: Zod contract base - Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit - WDS artifacts: - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs) - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact) - Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md - AGENTS.md + README.md como entrada para devs/agentes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.2 KiB
Crash Recovery Pattern
Purpose: Handle sessions that crash or disappear unexpectedly.
Detection
The status script returns session_state in CSV column 6:
crashed- Session exited with non-zero exit code (column 5 = exit code, column 4 = output file)not_found- Session disappeared (killed, crashed without trace)
Recovery Logic
| Condition | Action |
|---|---|
crashed with output file |
Read output, check partial progress, retry |
not_found (no output) |
Session died silently, retry immediately |
| Retry 1 failed | Retry with -r2 suffix in session name |
| Retry 2 failed | Escalate to user with diagnostics |
Retry Pattern
# On crash/not_found, spawn retry with unique suffix
project_slug=$(basename "$PWD" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]' | cut -c1-8)
timestamp=$(date +%y%m%d-%H%M%S)
session_name="sa-${project_slug}-${timestamp}-e{epic}-s{story_suffix}-{step}-r2"
# Clear stale state (project-scoped v2.0)
PROJECT_HASH=$(echo -n "$PWD" | md5sum 2>/dev/null | cut -c1-8 || echo -n "$PWD" | md5 -q 2>/dev/null | cut -c1-8)
rm -f "/tmp/.sa-${PROJECT_HASH}-session-${session_name}-state.json"
# ... spawn and monitor as normal
Agent Fallback (v3.0.0)
Before escalating, check if fallback agent is configured:
# Resolve agents for this story/task from agents file
selection=$("$scripts" orchestrator-helper agents-resolve \
--state-file "$state_file" --story "{story_id}" --task "{task}")
primary=$(echo "$selection" | jq -r '.primary')
fallback=$(echo "$selection" | jq -r '.fallback')
if [ "$fallback" != "false" ] && [ -n "$fallback" ]; then
if [ "$current_agent" = "$primary" ]; then
export AI_AGENT="$fallback"
retry_count=0
session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
--command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")
# Continue monitoring...
fi
fi
Fallback flow:
- Primary agent crashes after 2 retries
- IF
fallback != "false"AND haven't tried fallback yet - Switch
AI_AGENTto fallback agent - Reset retry counter to 0
- Retry with fallback agent (gets 2 more attempts)
- IF fallback also fails after 2 retries → CRITICAL escalation
Log message: "Primary agent (claude) failed after 2 attempts. Switching to fallback agent (codex)..."
Escalation (after exhausting all retries)
Display:
**Session crashed for Story {N}**
Primary agent: {primary} - Failed after 2 attempts
Fallback agent: {fallback} - Failed after 2 attempts
Exit code: {exit_code}
Partial progress: {tasks_completed}/{tasks_total}
[R]etry with primary
[F]allback retry
[S]kip story (mark deferred)
[A]bort orchestration
Show any partial output captured for diagnostics.
Integration with Adaptive Retry
Crash recovery is SEPARATE from adaptive retry:
- Adaptive retry = session completed but FAILED (wrong output, tests failed)
- Crash recovery = session DIED unexpectedly (context limit, API error, kill)
Both can occur: a session might crash on attempt 1, then fail normally on attempt 2. Track both counters independently.
Orchestrator Monitoring Task Crash (v1.9.0)
The Problem
When the orchestrator uses background tasks (e.g., Bash with run_in_background) to monitor tmux sessions, the monitoring task itself can crash. This is different from the tmux session crashing.
Observed failure mode:
- Orchestrator spawns background task to run create+dev+monitor loop
- Background task crashes after dev-story completes
- TaskOutput shows "running" but task is dead
- Tmux session actually completed successfully
- Orchestrator waits forever on dead monitoring task
- Code-review never runs because monitoring never returned
Detection
Signs that your monitoring task has crashed (not the tmux session):
| Signal | Meaning |
|---|---|
TaskOutput returns empty 2+ times |
Task may be dead |
| Output file path doesn't exist | Task never wrote results |
| "running" status but no progress | Task is stuck or dead |
| Background task ID invalid | Task crashed |
Recovery Sequence
See monitoring-fallback.md for detailed fallback patterns.
Quick reference:
- Stop waiting on dead monitoring task
- Find tmux sessions:
tmux list-sessions | grep "sa-.*e{epic}-s{story}" - Check session status directly:
story-automator tmux-status-check - Verify source of truth: story file, sprint-status.yaml
- Resume based on verified state
Prevention
NEVER chain multiple workflow steps in a single background task:
# ❌ WRONG - If this task crashes, all subsequent steps are lost
for step in create dev review; do
session=$(...spawn...)
result=$(...monitor...)
done
# ✅ CORRECT - Each step is monitored separately
# Step 1
session=$(...spawn create...)
result=$(...monitor...)
# Verify state
# Step 2 (only after Step 1 verified)
session=$(...spawn dev...)
result=$(...monitor...)
# Verify state
Key Principle
The tmux session is the source of truth for session state. The story file and sprint-status.yaml are the source of truth for workflow state.
Monitoring is just observation - if monitoring fails, verify from source of truth and continue.