julian 17c08e6392 chore: initial monorepo scaffold + WDS Phase 1+2 artifacts

- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS per CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 canonical docs + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md as entry points for devs/agents

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 14:34:20 +00:00

Agent Fallback Strategy (v3.0.0)

Multi-Agent Support: The orchestrator can use Claude or Codex as AI coding agents, with automatic fallback on failure.

Configuration

From the state document (v3.0.0):

agentConfig:
  defaultPrimary: "auto"
  defaultFallback: false
  perTask:
    dev:
      primary: "codex"
      fallback: "claude"
  complexityOverrides:
    low:
      dev:
        primary: "claude"
        fallback: false

Agent selection is resolved via the deterministic agents file created in preflight: _bmad-output/story-automator/agents/agents-{state_filename}.md
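
As a rough illustration of how the example configuration above might resolve, here is a minimal shell sketch. It assumes a precedence of complexityOverrides over perTask over the defaults, which the text does not state explicitly. `resolve_agent` is a hypothetical helper, not part of story-automator; the values are hard-coded from the example rather than read from the state document, and the `high` and `medium` complexity labels are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the example agentConfig above with hard-coded values.
# Assumed precedence (not confirmed by the source):
#   complexityOverrides > perTask > defaults
resolve_agent() {
  task=$1
  complexity=$2

  # defaultPrimary / defaultFallback
  primary="auto"
  fallback="false"

  # perTask
  case "$task" in
    dev) primary="codex"; fallback="claude" ;;
  esac

  # complexityOverrides
  case "$complexity/$task" in
    low/dev) primary="claude"; fallback="false" ;;
  esac

  printf '%s %s\n' "$primary" "$fallback"
}

resolve_agent dev low        # prints: claude false
resolve_agent dev high       # prints: codex claude
resolve_agent review medium  # prints: auto false
```

In a real run the agents file generated in preflight is what decides; this sketch only makes the layering of the three config levels easier to follow.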

Agent Differences

Agent	CLI	Prompt Style	Timeout	Todo Tracking
Claude	`claude --dangerously-skip-permissions`	Natural language skill prompt	60min	☒/☐ checkboxes
Codex	`codex exec --full-auto`	Natural language prompt	90min (1.5x)	Not supported
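
The timeout column can be expressed as a small calculation. The sketch below is illustrative only: `agent_timeout_minutes` and `BASE_TIMEOUT_MIN` are hypothetical names, and treating any agent other than Codex as using the base timeout is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: session timeout in minutes per agent, per the table above.
BASE_TIMEOUT_MIN=60

agent_timeout_minutes() {
  case "$1" in
    # Codex gets the 1.5x multiplier (written as *3/2 for integer arithmetic)
    codex) echo $((BASE_TIMEOUT_MIN * 3 / 2)) ;;
    # Assumption: every other agent uses the base timeout
    *)     echo "$BASE_TIMEOUT_MIN" ;;
  esac
}

agent_timeout_minutes claude   # prints: 60
agent_timeout_minutes codex    # prints: 90
```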

CRITICAL: Both Claude and Codex prompts must name the skill/workflow to execute and include the story ID.

The story-automator tmux-wrapper build-cmd function automatically generates the correct prompt format based on the AI_AGENT environment variable.

See workflow-commands.md for complete prompt templates.

Fallback Behavior

When to fall back:

Primary agent session crashes (non-zero exit)
Retries exhausted with primary agent
Fallback is configured for the task and not disabled (false)

Fallback procedure:

1. Log: "Primary agent ({primary}) failed after {retries} attempts. Trying fallback ({fallback})..."
2. Set environment: AI_AGENT={fallback}
3. Respawn session with fallback agent
4. Monitor as normal (timeouts auto-adjust based on agent type)
5. If fallback also fails → CRITICAL escalation
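
The procedure above could be sketched roughly as follows. This is not the orchestrator's actual code: `run_with_fallback`, `run_agent_session`, and `FAILING_AGENTS` are hypothetical names, the session function is a stand-in that never touches tmux, and the return codes are arbitrary. The text does not say what happens when the primary fails and no fallback is configured, so the sketch simply returns non-zero in that case.

```shell
#!/bin/sh
# Hypothetical stand-in for spawning and monitoring a session with the agent
# named in AI_AGENT. It fails for any agent listed in FAILING_AGENTS so the
# control flow can be exercised without tmux or a real CLI.
run_agent_session() {
  case " ${FAILING_AGENTS:-} " in
    *" $AI_AGENT "*) return 1 ;;
  esac
  return 0
}

# Usage: run_with_fallback PRIMARY FALLBACK MAX_RETRIES
# FALLBACK is "false" when no fallback is configured.
# Returns 0 on success, 1 if the primary failed with no fallback, 2 on CRITICAL.
run_with_fallback() {
  primary=$1
  fallback=$2
  retries=$3

  # Exhaust retries with the primary agent first
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    export AI_AGENT="$primary"
    if run_agent_session; then
      return 0
    fi
    attempt=$((attempt + 1))
  done

  # Fallback disabled: nothing more to try in this sketch
  if [ "$fallback" = "false" ]; then
    return 1
  fi

  echo "Primary agent ($primary) failed after $retries attempts. Trying fallback ($fallback)..."
  export AI_AGENT="$fallback"
  if run_agent_session; then
    return 0
  fi

  echo "CRITICAL: fallback agent ($fallback) also failed"
  return 2
}

# Simulate a Codex failure; the Claude fallback then succeeds.
FAILING_AGENTS="codex"
run_with_fallback codex claude 3
```

The fallback agent gets a single attempt here, matching the "if fallback also fails" step; whether the real orchestrator retries the fallback is not stated in the text.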

Environment Variable:

# Set before spawning session
export AI_AGENT="codex"  # or "claude"

# story-automator tmux-wrapper reads this automatically and generates correct prompt format
session=$("$scripts" tmux-wrapper spawn dev {epic} {story_id} \
  --command "$("$scripts" tmux-wrapper build-cmd dev {story_id})")

Codex Monitoring Notes

No todo checkboxes: Codex doesn't use ☒/☐, so todos_done and todos_total stay at 0 (apart from the synthetic 1/1 described below)
Longer waits: The status check script returns a 90s wait estimate for Codex (vs 60s for Claude)
Different activity detection: Uses output freshness + heartbeat (no reliance on markers)
Output staleness window: CODEX_OUTPUT_STALE_SECONDS (default: 300)
1.5x timeout multiplier: story-automator monitor-session applies a 1.5x multiplier when run with --agent codex
Fake todo progress (v2.2): When Codex is idle after activity, todo progress is reported as 1/1 to indicate "work done, needs verification"
Idle vs Completed (v2.2): Codex sessions report "idle" instead of "completed" when the CLI stops but no terminal markers are present
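
To make the staleness window concrete, here is a minimal sketch of an output-freshness check. `codex_output_state` is a hypothetical helper, and treating output as stale once its age reaches the window (rather than exceeds it) is an assumption; only the `CODEX_OUTPUT_STALE_SECONDS` variable and its default of 300 come from the notes above.

```shell
#!/bin/sh
# Hypothetical helper: classify Codex output freshness.
# Arguments are epoch seconds: time of the last captured output, then "now".
# Assumption: output counts as stale once its age reaches the window.
codex_output_state() {
  last_output=$1
  now=$2
  window=${CODEX_OUTPUT_STALE_SECONDS:-300}
  if [ $((now - last_output)) -ge "$window" ]; then
    echo "stale"
  else
    echo "fresh"
  fi
}

codex_output_state 1000 1120   # age 120s
codex_output_state 1000 1300   # age 300s
```

With the default window, the first call prints `fresh` and the second prints `stale`.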

⚠️ Codex Code-Review Limitations (v1.5.0)

CRITICAL: Codex is NOT recommended for code-review workflow.

Known Issue: Sprint-Status Not Updated

Codex code-review sessions often complete (CLI exits) WITHOUT updating sprint-status.yaml to "done". This causes:

Monitor reports "completed" but sprint-status unchanged
Orchestrator loops indefinitely, spawning new review cycles
8+ cycles with 0 progress (observed in Story 8.2)

Root Cause

Codex runs non-interactively via codex exec. When it finishes:

Tmux session goes idle (no active CLI process)
Monitor sees "idle" and marks as "completed"
But workflow step 5 (update sprint-status) may not have executed
No way to verify workflow actually finished

Recommended Configuration

agentConfig:
  defaultPrimary: "codex"
  defaultFallback: "claude"
  perTask:
    review:
      primary: "claude"   # Never use Codex for code-review
      fallback: false

"incomplete" State (v2.2)

The monitoring system now detects when Codex finishes but sprint-status wasn't updated:

final_state: "completed" → Verified: sprint-status shows "done"
final_state: "incomplete" → Session idle but sprint-status NOT "done"

When "incomplete" is detected:

Do NOT retry automatically (prevents infinite loop)
Escalate to user with options:
1. Manual fix (update sprint-status yourself)
2. Run code-review with Claude
3. Skip this story
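
The distinction between the two final states, and the rule against automatic retries, could be sketched like this. Both functions are hypothetical, as are the `continue` and `escalate` labels and the `review` status value; the sketch takes the story's sprint-status as a plain string instead of reading sprint-status.yaml.

```shell
#!/bin/sh
# Hypothetical helper: derive final_state for an idle session from the story's
# status in sprint-status.yaml (passed in as a plain string).
final_state() {
  if [ "$1" = "done" ]; then
    echo "completed"
  else
    echo "incomplete"
  fi
}

# Hypothetical helper: the orchestrator's next step for a given final_state.
next_action() {
  case "$1" in
    completed)  echo "continue" ;;
    incomplete) echo "escalate" ;;   # never an automatic retry
    *)          echo "unknown" ;;
  esac
}

next_action "$(final_state done)"     # prints: continue
next_action "$(final_state review)"   # prints: escalate
```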

Verification Command (v2.2)

Check if code-review actually completed:

"$scripts" orchestrator-helper verify-code-review {story_id}
# Returns: {"verified":true/false, "sprint_status":"...", ...}
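
A caller might branch on that result along these lines. `review_verdict` is a hypothetical helper, and the sample payloads are made up to follow the documented shape. It uses grep rather than a JSON parser, so it is only suitable for flat output like the example above.

```shell
#!/bin/sh
# Hypothetical helper: read verify-code-review JSON on stdin and print
# "verified" or "unverified". Pattern matching only; not a real JSON parser.
review_verdict() {
  if grep -q '"verified"[[:space:]]*:[[:space:]]*true'; then
    echo "verified"
  else
    echo "unverified"
  fi
}

# Sample payloads (illustrative, not captured from a real run)
printf '%s\n' '{"verified":true, "sprint_status":"done"}' | review_verdict
printf '%s\n' '{"verified":false, "sprint_status":"review"}' | review_verdict
```

In practice the output of the verify-code-review command would be piped into the function in place of the sample payloads.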

Backwards Compatibility

If agentConfig is missing, the primary agent resolves from the active runtime provider and fallback is disabled
If aiCommand is set (legacy), use it directly with the generated natural language prompt
New orchestrations should use agentConfig instead of aiCommand
Agents file is authoritative when present

Troubleshooting

See agent-fallback-troubleshooting.md for detailed troubleshooting steps.
