Agent Fallback Strategy (v3.0.0)
Multi-Agent Support: The orchestrator can use Claude or Codex as AI coding agents, with automatic fallback on failure.
Configuration
From state document (v3.0.0):
agentConfig:
  defaultPrimary: "auto"
  defaultFallback: false
  perTask:
    dev:
      primary: "codex"
      fallback: "claude"
  complexityOverrides:
    low:
      dev:
        primary: "claude"
        fallback: false
Agent selection is resolved via the deterministic agents file created in preflight:
_bmad-output/story-automator/agents/agents-{state_filename}.md
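To illustrate how the layers of `agentConfig` might combine, here is a minimal sketch. It assumes a precedence of `complexityOverrides`, then `perTask`, then the defaults; the source does not state that order. `config_get` and `resolve_agent` are hypothetical helpers, not part of story-automator, and the sketch ignores the agents file, which takes priority when present.

```bash
# Hypothetical flattened view of the example agentConfig above.
# Prints the value for a dotted key; fails (status 1) when the key is unset.
config_get() {
  case "$1" in
    complexityOverrides.low.dev.primary)  echo "claude" ;;
    complexityOverrides.low.dev.fallback) echo "false" ;;
    perTask.dev.primary)                  echo "codex" ;;
    perTask.dev.fallback)                 echo "claude" ;;
    defaultPrimary)                       echo "auto" ;;
    defaultFallback)                      echo "false" ;;
    *)                                    false ;;
  esac
}

# resolve_agent FIELD TASK COMPLEXITY
# FIELD is "primary" or "fallback". Assumed precedence, most specific first:
# complexityOverrides, then perTask, then the defaults.
resolve_agent() {
  local field="$1" task="$2" complexity="$3" default_key
  case "$field" in
    primary)  default_key="defaultPrimary" ;;
    fallback) default_key="defaultFallback" ;;
    *)        echo "unknown field: $field" >&2; return 1 ;;
  esac
  config_get "complexityOverrides.$complexity.$task.$field" ||
    config_get "perTask.$task.$field" ||
    config_get "$default_key"
}

resolve_agent primary dev low      # prints: claude
resolve_agent primary dev high     # prints: codex
resolve_agent fallback dev high    # prints: claude
resolve_agent primary review high  # prints: auto
```

With the example config, only low-complexity `dev` tasks switch to Claude; every other `dev` task keeps Codex with a Claude fallback, and tasks without a `perTask` entry fall through to the defaults.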
Agent Differences
| Agent | CLI | Prompt Style | Timeout | Todo Tracking |
|---|---|---|---|---|
| Claude | `claude --dangerously-skip-permissions` | Natural language skill prompt | 60 min | ☒/☐ checkboxes |
| Codex | `codex exec --full-auto` | Natural language prompt | 90 min (1.5x) | Not supported |
CRITICAL: Both Claude and Codex prompts must name the skill/workflow to execute and include the story ID.
The story-automator tmux-wrapper build-cmd function automatically generates the correct prompt format based on the AI_AGENT environment variable.
See workflow-commands.md for complete prompt templates.
Fallback Behavior
When to fall back:
- Primary agent session crashes (non-zero exit)
- Retries exhausted with primary agent
- `fallback` is configured for the task and not disabled ("false")
Fallback procedure:
- Log: "Primary agent ({primary}) failed after {retries} attempts. Trying fallback ({fallback})..."
- Set environment: `AI_AGENT={fallback}`
- Respawn session with fallback agent
- Monitor as normal (timeouts auto-adjust based on agent type)
- If fallback also fails → CRITICAL escalation
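The procedure above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the orchestrator's actual code: `run_agent_session` is a hypothetical stand-in for spawning and monitoring a tmux session, the fallback gets a single attempt, and exit status 2 is an invented signal for CRITICAL escalation.

```bash
# Hypothetical stand-in for "spawn a session and wait for the result".
# Simulated here so the sketch runs on its own: only Claude succeeds.
run_agent_session() {
  [ "$AI_AGENT" = "claude" ]
}

# run_with_fallback PRIMARY FALLBACK MAX_RETRIES
# Returns 0 on success, 2 (assumed code) when escalation is needed.
run_with_fallback() {
  local primary="$1" fallback="$2" max_retries="$3" attempt=1

  export AI_AGENT="$primary"
  while [ "$attempt" -le "$max_retries" ]; do
    run_agent_session && return 0
    attempt=$((attempt + 1))
  done

  # Fallback disabled or not configured: escalate immediately.
  if [ -z "$fallback" ] || [ "$fallback" = "false" ]; then
    echo "CRITICAL: primary agent ($primary) failed and no fallback is configured" >&2
    return 2
  fi

  echo "Primary agent ($primary) failed after $max_retries attempts. Trying fallback ($fallback)..." >&2
  export AI_AGENT="$fallback"
  run_agent_session && return 0

  echo "CRITICAL: fallback agent ($fallback) also failed" >&2
  return 2
}

run_with_fallback codex claude 3
```

Because `AI_AGENT` is exported before each attempt, anything that reads it (such as the prompt builder) picks up the agent currently in use.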
Environment Variable:
# Set before spawning session
export AI_AGENT="codex" # or "claude"
# story-automator tmux-wrapper reads this automatically and generates correct prompt format
session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
--command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")
Codex Monitoring Notes
- No todo checkboxes: Codex doesn't use ☒/☐, so `todos_done` and `todos_total` will be 0
- Longer waits: Status check script returns 90s wait estimate for Codex (vs 60s for Claude)
- Different activity detection: Uses output freshness + heartbeat (no marker reliance)
- Output staleness window: `CODEX_OUTPUT_STALE_SECONDS` (default: 300)
- 1.5x timeout multiplier: `story-automator monitor-session` applies 1.5x multiplier when `--agent codex`
- Fake todo progress (v2.2): When Codex is idle after activity, reports `1/1` to indicate "work done, needs verification"
- Idle vs Completed (v2.2): Codex sessions report "idle" instead of "completed" when CLI stops but no terminal markers
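The timeout multiplier and staleness window can be illustrated with two small helpers. Both function names are hypothetical, and treating output as stale only when its age strictly exceeds the window is an assumption; the source gives only the default of 300 seconds.

```bash
# effective_timeout_minutes AGENT BASE_MINUTES
# Applies the 1.5x multiplier for Codex using integer arithmetic.
effective_timeout_minutes() {
  case "$1" in
    codex) echo $(( $2 * 3 / 2 )) ;;
    *)     echo "$2" ;;
  esac
}

# is_output_stale SECONDS_SINCE_LAST_OUTPUT
# Succeeds when the output is older than CODEX_OUTPUT_STALE_SECONDS (default 300).
is_output_stale() {
  [ "$1" -gt "${CODEX_OUTPUT_STALE_SECONDS:-300}" ]
}

effective_timeout_minutes claude 60   # prints: 60
effective_timeout_minutes codex 60    # prints: 90
```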
⚠️ Codex Code-Review Limitations (v1.5.0)
CRITICAL: Codex is NOT recommended for code-review workflow.
Known Issue: Sprint-Status Not Updated
Codex code-review sessions often complete (CLI exits) WITHOUT updating sprint-status.yaml to "done". This causes:
- Monitor reports "completed" but sprint-status unchanged
- Orchestrator loops indefinitely, spawning new review cycles
- 8+ cycles with 0 progress (observed in Story 8.2)
Root Cause
Codex runs non-interactively via codex exec. When it finishes:
- Tmux session goes idle (no active CLI process)
- Monitor sees "idle" and marks as "completed"
- But workflow step 5 (update sprint-status) may not have executed
- No way to verify workflow actually finished
Recommended Configuration
agentConfig:
  defaultPrimary: "codex"
  defaultFallback: "claude"
  perTask:
    review:
      primary: "claude" # Never use Codex for code-review
      fallback: false
"incomplete" State (v2.2)
The monitoring system now detects when Codex finishes but sprint-status wasn't updated:
- `final_state: "completed"` → Verified: sprint-status shows "done"
- `final_state: "incomplete"` → Session idle but sprint-status NOT "done"
When "incomplete" is detected:
- Do NOT retry automatically (prevents infinite loop)
- Escalate to user with options:
  - Manual fix (update sprint-status yourself)
  - Run code-review with Claude
  - Skip this story
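A minimal sketch of this decision follows. The function names and the `continue`/`escalate` action labels are hypothetical, and `"review"` is only a sample sprint-status value; the source defines just the two `final_state` values and the rule against automatic retries.

```bash
# final_state_for SPRINT_STATUS
# Classifies an idle Codex session by the story's sprint-status value.
final_state_for() {
  if [ "$1" = "done" ]; then
    echo "completed"
  else
    echo "incomplete"
  fi
}

# next_action FINAL_STATE
# "escalate" hands control to the user instead of retrying automatically,
# which is what breaks the infinite review loop.
next_action() {
  case "$1" in
    completed)  echo "continue" ;;
    incomplete) echo "escalate" ;;
    *)          echo "unknown" ;;
  esac
}

next_action "$(final_state_for "done")"    # prints: continue
next_action "$(final_state_for "review")"  # prints: escalate
```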
Verification Command (v2.2)
Check if code-review actually completed:
"$scripts" orchestrator-helper verify-code-review {story_id}
# Returns: {"verified":true/false, "sprint_status":"...", ...}
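One way to act on that result is to gate the next step on the `verified` field. The sketch below assumes the helper prints the JSON on stdout; `review_verified` is a hypothetical helper and the payload is a made-up sample. Matching with `grep` keeps the example dependency-free, though a real JSON parser would be more robust.

```bash
# Reads the helper's JSON on stdin; succeeds only when "verified" is true.
review_verified() {
  grep -Eq '"verified"[[:space:]]*:[[:space:]]*true'
}

# Real usage would pipe the helper's output, for example:
#   "$scripts" orchestrator-helper verify-code-review "$story_id" | review_verified
sample='{"verified":false,"sprint_status":"review"}'   # made-up sample payload

if printf '%s\n' "$sample" | review_verified; then
  echo "code-review verified"
else
  echo "code-review NOT verified, escalate to user"
fi
```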
Backwards Compatibility
- If `agentConfig` is missing, the primary agent resolves from the active runtime provider and fallback is disabled
- If `aiCommand` is set (legacy), use it directly with the generated natural language prompt
- New orchestrations should use `agentConfig` instead of `aiCommand`
- Agents file is authoritative when present
Troubleshooting
See agent-fallback-troubleshooting.md for detailed troubleshooting steps.