Julian/sar

Files

julian 17c08e6392 chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md como entrada para devs/agentes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 14:34:20 +00:00

5.2 KiB

Raw Permalink Blame History

Crash Recovery Pattern

Purpose: Handle sessions that crash or disappear unexpectedly.

Detection

The status script returns session_state in CSV column 6:

crashed - Session exited with non-zero exit code (column 5 = exit code, column 4 = output file)
not_found - Session disappeared (killed, crashed without trace)

Recovery Logic

Condition	Action
`crashed` with output file	Read output, check partial progress, retry
`not_found` (no output)	Session died silently, retry immediately
Retry 1 failed	Retry with `-r2` suffix in session name
Retry 2 failed	Escalate to user with diagnostics

Retry Pattern

# On crash/not_found, spawn retry with unique suffix
project_slug=$(basename "$PWD" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]' | cut -c1-8)
timestamp=$(date +%y%m%d-%H%M%S)
session_name="sa-${project_slug}-${timestamp}-e{epic}-s{story_suffix}-{step}-r2"

# Clear stale state (project-scoped v2.0)
PROJECT_HASH=$(echo -n "$PWD" | md5sum 2>/dev/null | cut -c1-8 || echo -n "$PWD" | md5 -q 2>/dev/null | cut -c1-8)
rm -f "/tmp/.sa-${PROJECT_HASH}-session-${session_name}-state.json"
# ... spawn and monitor as normal

Agent Fallback (v3.0.0)

Before escalating, check if fallback agent is configured:

# Resolve agents for this story/task from agents file
selection=$("$scripts" orchestrator-helper agents-resolve \
  --state-file "$state_file" --story "{story_id}" --task "{task}")
primary=$(echo "$selection" | jq -r '.primary')
fallback=$(echo "$selection" | jq -r '.fallback')

if [ "$fallback" != "false" ] && [ -n "$fallback" ]; then
  if [ "$current_agent" = "$primary" ]; then
    export AI_AGENT="$fallback"
    retry_count=0
    session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
      --command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")
    # Continue monitoring...
  fi
fi

Fallback flow:

Primary agent crashes after 2 retries
IF fallback != "false" AND haven't tried fallback yet
Switch AI_AGENT to fallback agent
Reset retry counter to 0
Retry with fallback agent (gets 2 more attempts)
IF fallback also fails after 2 retries → CRITICAL escalation

Log message: "Primary agent (claude) failed after 2 attempts. Switching to fallback agent (codex)..."

Escalation (after exhausting all retries)

Display:

**Session crashed for Story {N}**

Primary agent: {primary} - Failed after 2 attempts
Fallback agent: {fallback} - Failed after 2 attempts

Exit code: {exit_code}
Partial progress: {tasks_completed}/{tasks_total}

[R]etry with primary
[F]allback retry
[S]kip story (mark deferred)
[A]bort orchestration

Show any partial output captured for diagnostics.

Integration with Adaptive Retry

Crash recovery is SEPARATE from adaptive retry:

Adaptive retry = session completed but FAILED (wrong output, tests failed)
Crash recovery = session DIED unexpectedly (context limit, API error, kill)

Both can occur: a session might crash on attempt 1, then fail normally on attempt 2. Track both counters independently.

Orchestrator Monitoring Task Crash (v1.9.0)

The Problem

When the orchestrator uses background tasks (e.g., Bash with run_in_background) to monitor tmux sessions, the monitoring task itself can crash. This is different from the tmux session crashing.

Observed failure mode:

Orchestrator spawns background task to run create+dev+monitor loop
Background task crashes after dev-story completes
TaskOutput shows "running" but task is dead
Tmux session actually completed successfully
Orchestrator waits forever on dead monitoring task
Code-review never runs because monitoring never returned

Detection

Signs that your monitoring task has crashed (not the tmux session):

Signal	Meaning
`TaskOutput` returns empty 2+ times	Task may be dead
Output file path doesn't exist	Task never wrote results
"running" status but no progress	Task is stuck or dead
Background task ID invalid	Task crashed

Recovery Sequence

See monitoring-fallback.md for detailed fallback patterns.

Quick reference:

Stop waiting on dead monitoring task
Find tmux sessions: tmux list-sessions | grep "sa-.*e{epic}-s{story}"
Check session status directly: story-automator tmux-status-check
Verify source of truth: story file, sprint-status.yaml
Resume based on verified state

Prevention

NEVER chain multiple workflow steps in a single background task:

# ❌ WRONG - If this task crashes, all subsequent steps are lost
for step in create dev review; do
    session=$(...spawn...)
    result=$(...monitor...)
done

# ✅ CORRECT - Each step is monitored separately
# Step 1
session=$(...spawn create...)
result=$(...monitor...)
# Verify state

# Step 2 (only after Step 1 verified)
session=$(...spawn dev...)
result=$(...monitor...)
# Verify state

Key Principle

The tmux session is the source of truth for session state. The story file and sprint-status.yaml are the source of truth for workflow state.

Monitoring is just observation - if monitoring fails, verify from source of truth and continue.

5.2 KiB Raw Permalink Blame History

Crash Recovery Pattern

Detection

Recovery Logic

Retry Pattern

Agent Fallback (v3.0.0)

Escalation (after exhausting all retries)

Integration with Adaptive Retry

Orchestrator Monitoring Task Crash (v1.9.0)

The Problem

Detection

Recovery Sequence

Prevention

Key Principle

5.2 KiB

Raw Permalink Blame History