sar/.claude/skills/bmad-story-automator/data/agent-fallback-troubleshooting.md
julian 17c08e6392 chore: initial monorepo scaffold + WDS Phase 1+2 artifacts
- Nx 22.7 monorepo (pnpm 11.1, TypeScript 5.9, Node 24)
- apps/api: NestJS 11 (CJS per CODING-RULES.md PGD-DB-004)
- apps/web: React 19 + Vite 8 (ESM)
- libs/shared/api-interface: Zod contract base
- Docker Compose dev: Postgres 18, Valkey 8, MinIO, Mailpit
- WDS artifacts:
  - design-artifacts/A-Product-Brief/ (5 canonical docs + 16 dialogs)
  - design-artifacts/B-Trigger-Map/ (hub + 4 personas + feature impact)
- Stack canon: STACK.md v2.2 + CODING-RULES.md v2.0 + brand.md
- AGENTS.md + README.md as entry points for devs/agents

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:34:20 +00:00

# Agent Fallback Troubleshooting
### Issue: Session spawns Claude instead of Codex
**Symptoms:**
- Output shows Claude-specific messages (e.g., "You've used 84% of your weekly limit")
- Expected Codex but got Claude
**Cause:** The `--agent` flag must be passed to `story-automator tmux-wrapper spawn`, not to `build-cmd`.
**Correct Usage (v1.4.0+):**
```bash
# Method 1: Use --agent flag on spawn (RECOMMENDED)
session=$("$scripts" tmux-wrapper spawn dev "$epic" "$story_id" \
  --agent codex \
  --command "$("$scripts" tmux-wrapper build-cmd dev "$story_id")")

# Method 2: Set environment variable before spawn
export AI_AGENT="codex"
session=$("$scripts" tmux-wrapper spawn dev "$epic" "$story_id" \
  --command "$("$scripts" tmux-wrapper build-cmd dev "$story_id")")
```
**Wrong Usage:**
```bash
# WRONG - this doesn't work
session=$("$scripts" tmux-wrapper spawn dev "$epic" "$story_id" \
  --command "$("$scripts" tmux-wrapper build-cmd dev "$story_id" --agent codex)")
```
### Issue: Monitor reports "stuck" but Codex is active
**Symptoms:**
- `story-automator monitor-session` returns `stuck` state after 4 polls
- Manual inspection shows Codex still producing output (no prompt, output continues to grow)
**Cause:** The monitoring script relied on marker detection instead of output freshness.
**Fixed in v2.4.0:**
- Stuck detection now tracks output freshness and no longer relies on markers
- `CODEX_OUTPUT_STALE_SECONDS` controls how long Codex can stay silent before the session is reported as "stuck"
- Codex still gets a 6-poll grace period before "stuck"
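The freshness rule can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in `story-automator monitor-session`: the helper names `output_fingerprint` and `is_output_stale` are hypothetical, and so is the 300-second default. Only `CODEX_OUTPUT_STALE_SECONDS` comes from the fix described above.

```bash
# Hypothetical helpers illustrating output-freshness tracking.

# Print a checksum of the pane text read from stdin. A different checksum
# between two polls means the agent produced new output.
output_fingerprint() {
  cksum | awk '{print $1}'
}

# Succeed (exit 0) when the output has been unchanged for longer than the limit.
# $1 = epoch seconds of the last observed change, $2 = current epoch seconds.
# The 300-second default is an assumption, not the tool's documented value.
is_output_stale() {
  local last_change=$1 now=$2
  local limit=${CODEX_OUTPUT_STALE_SECONDS:-300}
  [ $((now - last_change)) -gt "$limit" ]
}
```

In a poll loop, the caller would record a new timestamp whenever the fingerprint of `tmux capture-pane -p` changes, and report "stuck" only once `is_output_stale` succeeds.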
**Verification:**
```bash
# Check if session has AI_AGENT set
tmux show-environment -t "session-name" AI_AGENT
# Manual session status check
"$scripts" tmux-status-check "session-name" --project-root "$PWD"
```
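When scripting the first check, note that `tmux show-environment` prints a set variable as `NAME=value` and a variable marked for removal as `-NAME`. The helper below is a small sketch for pulling the agent name out of that output; `session_agent` is a hypothetical name, not part of story-automator.

```bash
# Hypothetical helper: extract the agent name from the output of
#   tmux show-environment -t "session-name" AI_AGENT
# read on stdin. Prints nothing when the variable is unset or removed.
session_agent() {
  sed -n 's/^AI_AGENT=//p'
}

# Example (requires a running tmux session):
#   agent=$(tmux show-environment -t "session-name" AI_AGENT 2>/dev/null | session_agent)
#   [ "$agent" = "codex" ] || echo "session is not running Codex: ${agent:-unset}"
```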
### Issue: log command error when using --agent flag
**Symptoms:**
```
log: Unknown subcommand 'Codex agent detected - applying 1.5x timeout (90min)'
```
**Cause:** macOS ships a system command at `/usr/bin/log`. If the `log()` bash function was not defined before its first use, bash resolved `log` to that system command, which rejected the message as an unknown subcommand.
**Fixed in v1.4.0:** The `log()` function is now defined before argument parsing in `story-automator monitor-session`.
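The ordering requirement can be illustrated with a minimal sketch. The `[monitor]` prefix and the `log_is_function` check are hypothetical and only show the idea; they are not taken from the script itself.

```bash
# Define log() before anything calls it, so the name resolves to this
# function and not to an external `log` binary such as /usr/bin/log on macOS.
# The "[monitor]" prefix is illustrative, not the tool's actual format.
log() {
  printf '[monitor] %s\n' "$*" >&2
}

# For a shell function, `command -v` prints the bare name; for an external
# command it prints the full path. This makes the resolution easy to verify.
log_is_function() {
  [ "$(command -v log)" = "log" ]
}
```

Shell functions take precedence over external commands during lookup, so once `log()` is defined, later calls never reach `/usr/bin/log`.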
### Issue: Manual polling required as workaround
**If monitoring still fails**, use this manual polling approach:
```bash
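# Poll every 30 seconds, for up to 30 minutes
for i in {1..60}; do
  sleep 30
  # Check if session still exists
  if ! tmux has-session -t "session-name" 2>/dev/null; then
    echo "Session ended"
    break
  fi
  # Check for shell prompt (completion indicator).
  # Use the last non-blank line, since the pane may end in blank lines.
  last_line=$(tmux capture-pane -t "session-name" -p | grep -v '^[[:space:]]*$' | tail -n 1)
  # Match a line ending in "$" or "#", optionally followed by spaces
  if printf '%s\n' "$last_line" | grep -qE '[$#][[:space:]]*$'; then
    echo "Session complete (shell prompt detected)"
    break
  fi
done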
```
### Issue: Codex sessions explore files but don't execute full workflow (v1.4.0)
**Symptoms:**
- Session output shows file exploration (`sed`, `rg`, `cat` commands)
- No actual review findings or story updates
- Sprint-status never changes from "review" to "done"
- Session completes but workflow steps 1-5 weren't followed
**Cause:** Codex uses natural language prompts and may not follow structured workflow instructions as reliably as Claude.
**Mitigation strategies:**
1. **Use Claude for code-review by default** - More reliable at following multi-step workflows
2. **Add explicit step markers** - Tell Codex to output "STEP 1 COMPLETE", "STEP 2 COMPLETE" etc.
3. **Verify after session** - Check story file Status field, not just sprint-status
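If step markers are adopted, the captured session output can be checked for them after the run. The sketch below assumes the agent was told to print exactly `STEP <n> COMPLETE`; the function name `missing_steps` is hypothetical.

```bash
# Hypothetical helper: read session output on stdin and print the numbers
# of the steps (1..$1) that have no "STEP <n> COMPLETE" marker.
# Prints an empty line when every marker is present.
missing_steps() {
  local total=$1 output n=1 missing=""
  output=$(cat)
  while [ "$n" -le "$total" ]; do
    if ! printf '%s\n' "$output" | grep -qF "STEP $n COMPLETE"; then
      missing="$missing $n"
    fi
    n=$((n + 1))
  done
  printf '%s\n' "${missing# }"
}

# Example (requires a running tmux session):
#   tmux capture-pane -t "session-name" -p -S - | missing_steps 5
```

A non-empty result after the session ends is a signal to check the story file's Status field before trusting sprint-status.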
**Recommended agent configuration for deterministic reliability:**
```yaml
agentConfig:
  defaultPrimary: "auto"
  defaultFallback: false # Disable global fallback; opt in per task
  perTask:
    # create-story: Either agent works well
    create:
      primary: "claude"
    # dev-story: Either agent works, Codex may be faster for simple tasks
    dev:
      primary: "codex"
      fallback: "claude"
    # code-review: Claude recommended - more reliable at following workflow
    review:
      primary: "claude"
      fallback: false
```
### Issue: Code-review doesn't update sprint-status.yaml
**Symptoms:**
- Code-review session completes
- Story file shows review was done (Dev Agent Record updated)
- But sprint-status.yaml still shows "review" instead of "done"
**Cause:** Step 5 of the code-review workflow updates sprint-status, but the session may not reach step 5 or may use the wrong story key format.
**Verification (v1.4.0):**
```bash
# Check story file status directly
"$scripts" orchestrator-helper story-file-status 8.2
# Compare with sprint-status
"$scripts" orchestrator-helper sprint-status get "8-2-flipside-crypto-provider"
# If story file shows "done" but sprint-status doesn't, manually sync:
# Edit _bmad-output/implementation-artifacts/sprint-status.yaml and change "8-2-story-name: review" to "done"
```
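When the helper script is unavailable, the recorded status can also be read straight from the file. This sketch assumes that `sprint-status.yaml` stores each story as a single `key: status` line, as the manual-sync comment above suggests; `sprint_status_of` is a hypothetical name and the function is not a YAML parser.

```bash
# Hypothetical helper: print the status recorded for a story key.
# $1 = path to sprint-status.yaml, $2 = full story key (e.g. 8-2-story-name).
# Assumes one "key: status" entry per line; anything after the status word
# (such as a trailing comment) is ignored.
sprint_status_of() {
  sed -n "s/^[[:space:]]*$2:[[:space:]]*\([A-Za-z_-]*\).*/\1/p" "$1" | head -n 1
}

# Example:
#   sprint_status_of _bmad-output/implementation-artifacts/sprint-status.yaml 8-2-story-name
```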
### When to manually intervene
**Intervene immediately if:**
1. **5 code-review cycles with no progress** - Agent likely stuck in a loop
2. **Story file shows "done" but sprint-status doesn't** - Sync issue, manual fix is faster
3. **Tests passing but review keeps finding issues** - May be false positives
4. **Codex sessions consistently incomplete** - Switch to Claude for that workflow
**Steps for manual intervention:**
```bash
# 1. Check actual story status
"$scripts" orchestrator-helper story-file-status {story_id}
# 2. Run tests to verify code quality
go test ./src/... || npm test
# 3. If tests pass, manually update sprint-status
# Edit: _bmad-output/implementation-artifacts/sprint-status.yaml
# Change: "8-2-story-name: review" to "8-2-story-name: done"
# 4. Resume orchestration - it will see "done" and proceed to commit
```
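Step 3 can be scripted rather than edited by hand. The following is a sketch under the same assumption of one `key: status` line per story; `mark_story_done` is a hypothetical helper, not an `orchestrator-helper` subcommand, and it leaves lines with a trailing comment untouched.

```bash
# Hypothetical helper for step 3: flip one story from "review" to "done".
# $1 = path to sprint-status.yaml, $2 = full story key.
# Writes through a temp file because `sed -i` differs between GNU and BSD sed.
mark_story_done() {
  local file=$1 key=$2 tmp
  tmp=$(mktemp) || return 1
  sed "s/^\([[:space:]]*$key:[[:space:]]*\)review[[:space:]]*\$/\1done/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example:
#   mark_story_done _bmad-output/implementation-artifacts/sprint-status.yaml 8-2-story-name
```

Only the named key is rewritten, and only when its current value is `review`, so running the helper twice or against an already finished story changes nothing.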
### Debugging Agent Detection
```bash
# Check current agent type detection
"$scripts" tmux-wrapper agent-type
# Check what CLI command would be used
"$scripts" tmux-wrapper agent-cli
# Check what command prefix would be used
"$scripts" tmux-wrapper skill-prefix
# View session environment
tmux show-environment -t "session-name"
# Check story key normalization (v1.4.0)
"$scripts" orchestrator-helper normalize-key "8.2"
"$scripts" orchestrator-helper normalize-key "8-2-flipside-crypto-provider"
```