Files
sar/.claude/skills/bmad-story-automator/data/retry-fallback-strategy.md
julian 17c08e6392 chore: initial monorepo scaffold + WDS Phase 1+2 artifacts
- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS conforme CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 docs canônicos + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md como entrada para devs/agentes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:34:20 +00:00

3.7 KiB

Retry & Fallback Strategy

Purpose: Universal retry and fallback agent pattern for all workflow steps (create, dev, auto, review).

Version: 2.0.0


Core Principle

NEVER escalate to user on first failure. Exhaust all retry options first:

  1. Try fallback agent (if configured for this task)
  2. Retry with alternating agents up to 5 total attempts
  3. Sleep between retries if network issues detected
  4. Only escalate after all attempts exhausted

Agent Configuration (v3.0.0)

Deterministic agent resolution via agents file:

# Resolve agent for a specific task (create, dev, auto, review)
# Uses agents file generated during preflight (complexity-aware)
resolve_agent_for_task() {
    local task="$1"
    local state_file="$2"
    local story_id="$3"

    result=$("$scripts" orchestrator-helper agents-resolve \
        --state-file "$state_file" \
        --story "$story_id" \
        --task "$task")

    primary_agent=$(echo "$result" | jq -r '.primary')
    fallback_agent=$(echo "$result" | jq -r '.fallback')

    # Handle "false"/null meaning disabled
    [ "$fallback_agent" = "false" ] && fallback_agent=""
}

# Usage:
resolve_agent_for_task "review" "$state_file" "{story_id}"
echo "Review task: primary=$primary_agent, fallback=$fallback_agent"

Fallback behavior:

  • If fallback_agent is empty, "false", or same as primary → retry with primary only
  • If fallback_agent differs → alternate between agents on retries
  • Complexity overrides win per task, then per-task overrides, then defaults

Retry Sequence (5 Attempts Max)

Attempt Agent Delay Before Notes
1 primary none Initial attempt
2 fallback 0-60s Switch agent; delay if network error
3 primary 0-60s Back to primary
4 fallback 60s Always delay by attempt 4
5 primary 60s Final attempt

If no fallback configured: All 5 attempts use primary agent.


Network Error Detection

Indicators of network/transient issues:

  • Session output contains: "connection refused", "timeout", "rate limit", "503", "502"
  • Session crashed with zero output
  • story-automator monitor-session returns final_state: "crashed" with empty output
  • Session stuck at "never_active" state (no response from API)

On network error detection:

  • Sleep 60 seconds before next attempt
  • Log: "Network issue detected, waiting 60s before retry..."

Implementation & Validation Examples

Detailed bash patterns and step-specific validation examples are moved to:

  • retry-fallback-implementation.md (implementation wrapper + per-step validation)

Escalation (After All Attempts)

Only after exhausting all 5 attempts:

  1. Update state: status = "AWAITING_DECISION"
  2. Log all attempt details:
    [timestamp] ESCALATION: {step} failed after 5 attempts
    - Attempt 1 (primary): {result}
    - Attempt 2 (fallback): {result}
    - Attempt 3 (primary): {result}
    - Attempt 4 (fallback): {result}
    - Attempt 5 (primary): {result}
    
  3. Present options to user:
    • Retry with different settings
    • Skip this story
    • Abort orchestration

Integration with Adaptive Retry

This strategy replaces the simple retry logic. The adaptive-retry.md plateau detection still applies within this framework:

  • If same task plateau detected across 3+ attempts → DEFER instead of escalate
  • Plateau detection runs AFTER agent switching (so both agents hit same wall)

Logging

All retry attempts should be logged in the action log:

[timestamp] {step} attempt {N}/{max} with {agent}: {result}

On success after retry:

[timestamp] {step} succeeded on attempt {N} with {agent} (after {N-1} failures)