feat(atlas): enforce mandatory manual code review and direct boulder state checks

- VERIFICATION_REMINDER: add Step 2 manual code review (non-negotiable)
  - Require Read of EVERY changed file line by line
  - Cross-check subagent claims vs actual code
  - Verify logic correctness, completeness, edge cases, patterns
- Add Step 5: direct boulder state check via Read plan file
  - Count remaining tasks directly, no cached state
- BOULDER_CONTINUATION_PROMPT: add first rule to read plan file immediately
- verification-reminders.ts: restructure steps 5-8 for boulder/todo checks
- Atlas default.ts (Claude): enhance 3.4 QA with A/B/C/D sections
  - A: Automated verification
  - B: Manual code review (non-negotiable)
  - C: Hands-on QA (if applicable)
  - D: Check boulder state directly
- Atlas gpt.ts (GPT-5.2): apply same QA enhancements with GPT-optimized structure
- verification_rules: update both Claude and GPT versions with manual review requirements

Addresses issue where Atlas would skip manual code inspection after delegation, leading to rubber-stamping of broken or incomplete work.
2026-02-10 15:40:06 +09:00
parent f84ef532c1
commit 45dfc4ec66
4 changed files with 154 additions and 54 deletions
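
The "count remaining tasks directly, no cached state" rule from the commit message can be illustrated with a small sketch. It assumes the plan file marks tasks with `- [ ]` and `- [x]` checkbox lines, as the prompts in this diff describe; `countPlanTasks` and `PlanProgress` are hypothetical names used only for illustration and are not part of this commit.

```typescript
// Hypothetical helper (not part of this commit): derive progress from the
// plan text itself instead of trusting any cached counter.
interface PlanProgress {
  remaining: number; // lines starting with "- [ ]"
  completed: number; // lines starting with "- [x]" or "- [X]"
}

function countPlanTasks(planText: string): PlanProgress {
  let remaining = 0;
  let completed = 0;
  for (const line of planText.split(/\r?\n/)) {
    // Allow indented (nested) checkbox items.
    const trimmed = line.trimStart();
    if (trimmed.startsWith("- [ ]")) {
      remaining += 1;
    } else if (/^- \[[xX]\]/.test(trimmed)) {
      completed += 1;
    }
  }
  return { remaining, completed };
}

// Example plan text; in practice this would be the freshly read plan file.
const samplePlan = [
  "- [x] Add Step 2 manual review",
  "- [ ] Add Step 5 boulder check",
  "- [ ] Update verification_rules",
].join("\n");

const progress = countPlanTasks(samplePlan);
console.log(`${progress.remaining} remaining, ${progress.completed} completed`);
```

Re-reading the file before every count is the point of the rule: the result reflects what is on disk at that moment, not what an earlier step remembered.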
--- a/src/agents/atlas/default.ts
+++ b/src/agents/atlas/default.ts
@@ -178,34 +178,54 @@ task(
 )
 \`\`\`

-### 3.4 Verify (PROJECT-LEVEL QA)
+### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)

-**After EVERY delegation, YOU must verify:**
+**You are the QA gate. Subagents lie. Automated checks alone are NOT enough.**

-1. **Project-level diagnostics**:
-   \`lsp_diagnostics(filePath="src/")\` or \`lsp_diagnostics(filePath=".")\`
-   MUST return ZERO errors
+After EVERY delegation, complete ALL of these steps — no shortcuts:

-2. **Build verification**:
-   \`bun run build\` or \`bun run typecheck\`
-   Exit code MUST be 0
+#### A. Automated Verification
+1. \`lsp_diagnostics(filePath=".")\` → ZERO errors at project level
+2. \`bun run build\` or \`bun run typecheck\` → exit code 0
+3. \`bun test\` → ALL tests pass

-3. **Test verification**:
-   \`bun test\`
-   ALL tests MUST pass
+#### B. Manual Code Review (NON-NEGOTIABLE — DO NOT SKIP)

-4. **Manual inspection**:
-   - Read changed files
-   - Confirm changes match requirements
-   - Check for regressions
+**This is the step you are most tempted to skip. DO NOT SKIP IT.**

-**Checklist:**
+1. \`Read\` EVERY file the subagent created or modified — no exceptions
+2. For EACH file, check line by line:
+   - Does the logic actually implement the task requirement?
+   - Are there stubs, TODOs, placeholders, or hardcoded values?
+   - Are there logic errors or missing edge cases?
+   - Does it follow the existing codebase patterns?
+   - Are imports correct and complete?
+3. Cross-reference: compare what subagent CLAIMED vs what the code ACTUALLY does
+4. If anything doesn't match → resume session and fix immediately
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+
+#### C. Hands-On QA (if applicable)
+| Deliverable | Method | Tool |
+|-------------|--------|------|
+| Frontend/UI | Browser | \`/playwright\` |
+| TUI/CLI | Interactive | \`interactive_bash\` |
+| API/Backend | Real requests | curl |
+
+#### D. Check Boulder State Directly
+
+After verification, READ the plan file directly — every time, no exceptions:
 \`\`\`
-[ ] lsp_diagnostics at project level - ZERO errors
-[ ] Build command - exit 0
-[ ] Test suite - all pass
-[ ] Files exist and match requirements
-[ ] No regressions
+Read(".sisyphus/tasks/{plan-name}.yaml")
+\`\`\`
+Count remaining \`- [ ]\` tasks. This is your ground truth for what comes next.
+
+**Checklist (ALL must be checked):**
+\`\`\`
+[ ] Automated: lsp_diagnostics clean, build passes, tests pass
+[ ] Manual: Read EVERY changed file, verified logic matches requirements
+[ ] Cross-check: Subagent claims match actual code
+[ ] Boulder: Read plan file, confirmed current progress
 \`\`\`

 **If verification fails**: Resume the SAME session with the ACTUAL error output:
@@ -325,22 +345,25 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 4..

 You are the QA gate. Subagents lie. Verify EVERYTHING.

-**After each delegation**:
-1. \`lsp_diagnostics\` at PROJECT level (not file level)
-2. Run build command
-3. Run test suite
-4. Read changed files manually
-5. Confirm requirements met
+**After each delegation — BOTH automated AND manual verification are MANDATORY:**
+
+1. \`lsp_diagnostics\` at PROJECT level → ZERO errors
+2. Run build command → exit 0
+3. Run test suite → ALL pass
+4. **\`Read\` EVERY changed file line by line** → logic matches requirements
+5. **Cross-check**: subagent's claims vs actual code — do they match?
+6. **Check boulder state**: Read the plan file directly, count remaining tasks

 **Evidence required**:
 | Action | Evidence |
 |--------|----------|
-| Code change | lsp_diagnostics clean at project level |
+| Code change | lsp_diagnostics clean + manual Read of every changed file |
 | Build | Exit code 0 |
 | Tests | All pass |
-| Delegation | Verified independently |
+| Logic correct | You read the code and can explain what it does |
+| Boulder state | Read plan file, confirmed progress |

-**No evidence = not complete.**
+**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
 </verification_rules>

 <boundaries>
--- a/src/agents/atlas/gpt.ts
+++ b/src/agents/atlas/gpt.ts
@@ -182,19 +182,51 @@ Extract wisdom → include in prompt.
 task(category="[cat]", load_skills=["[skills]"], run_in_background=false, prompt=\`[6-SECTION PROMPT]\`)
 \`\`\`

-### 3.4 Verify (PROJECT-LEVEL QA)
+### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)

-After EVERY delegation:
+After EVERY delegation, complete ALL steps — no shortcuts:
+
+#### A. Automated Verification
 1. \`lsp_diagnostics(filePath=".")\` → ZERO errors
 2. \`Bash("bun run build")\` → exit 0
 3. \`Bash("bun test")\` → all pass
-4. \`Read\` changed files → confirm requirements met

-Checklist:
-- [ ] lsp_diagnostics clean
-- [ ] Build passes
-- [ ] Tests pass
-- [ ] Files match requirements
+#### B. Manual Code Review (NON-NEGOTIABLE)
+1. \`Read\` EVERY file the subagent touched — no exceptions
+2. For each file, verify line by line:
+
+| Check | What to Look For |
+|-------|------------------|
+| Logic correctness | Does implementation match task requirements? |
+| Completeness | No stubs, TODOs, placeholders, hardcoded values? |
+| Edge cases | Off-by-one, null checks, error paths handled? |
+| Patterns | Follows existing codebase conventions? |
+| Imports | Correct, complete, no unused? |
+
+3. Cross-check: subagent's claims vs actual code — do they match?
+4. If mismatch found → resume session with \`session_id\` and fix
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+
+#### C. Hands-On QA (if applicable)
+| Deliverable | Method | Tool |
+|-------------|--------|------|
+| Frontend/UI | Browser | \`/playwright\` |
+| TUI/CLI | Interactive | \`interactive_bash\` |
+| API/Backend | Real requests | curl |
+
+#### D. Check Boulder State Directly
+After verification, READ the plan file — every time:
+\`\`\`
+Read(".sisyphus/tasks/{plan-name}.yaml")
+\`\`\`
+Count remaining \`- [ ]\` tasks. This is your ground truth.
+
+Checklist (ALL required):
+- [ ] Automated: diagnostics clean, build passes, tests pass
+- [ ] Manual: Read EVERY changed file, logic matches requirements
+- [ ] Cross-check: subagent claims match actual code
+- [ ] Boulder: Read plan file, confirmed current progress

 ### 3.5 Handle Failures

@@ -269,15 +301,23 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 3..
 <verification_rules>
 You are the QA gate. Subagents lie. Verify EVERYTHING.

-**After each delegation**:
+**After each delegation — BOTH automated AND manual verification are MANDATORY**:
+
 | Step | Tool | Expected |
 |------|------|----------|
 | 1 | \`lsp_diagnostics(".")\` | ZERO errors |
 | 2 | \`Bash("bun run build")\` | exit 0 |
 | 3 | \`Bash("bun test")\` | all pass |
-| 4 | \`Read\` changed files | matches requirements |
+| 4 | \`Read\` EVERY changed file | logic matches requirements |
+| 5 | Cross-check claims vs code | subagent's report matches reality |
+| 6 | \`Read\` plan file | boulder state confirmed |

-**No evidence = not complete.**
+**Manual code review (Step 4) is NON-NEGOTIABLE:**
+- Read every line of every changed file
+- Verify logic correctness, completeness, edge cases
+- If you can't explain what the code does, you haven't reviewed it
+
+**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
 </verification_rules>

 <boundaries>
--- a/src/hooks/atlas/system-reminder-templates.ts
+++ b/src/hooks/atlas/system-reminder-templates.ts
@@ -33,6 +33,7 @@ export const BOULDER_CONTINUATION_PROMPT = `${createSystemDirective(SystemDirect
 You have an active work plan with incomplete tasks. Continue working.

 RULES:
+- **FIRST**: Read the plan file NOW to check exact current progress — count remaining \`- [ ]\` tasks
 - Proceed without asking for permission
 - Change \`- [ ]\` to \`- [x]\` in the plan file when done
 - Use the notepad at .sisyphus/notepads/{PLAN_NAME}/ to record learnings
@@ -48,15 +49,36 @@ Tests FAILING, code has ERRORS, implementation INCOMPLETE - but they say "done".

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

-**STEP 1: VERIFY WITH YOUR OWN TOOL CALLS (DO THIS NOW)**
+**STEP 1: AUTOMATED VERIFICATION (DO THIS FIRST)**

 Run these commands YOURSELF - do NOT trust agent's claims:
 1. \`lsp_diagnostics\` on changed files → Must be CLEAN
 2. \`bash\` to run tests → Must PASS
 3. \`bash\` to run build/typecheck → Must succeed
-4. \`Read\` the actual code → Must match requirements

-**STEP 2: DETERMINE IF HANDS-ON QA IS NEEDED**
+**STEP 2: MANUAL CODE REVIEW (NON-NEGOTIABLE — DO NOT SKIP)**
+
+Automated checks are NECESSARY but INSUFFICIENT. You MUST read the actual code.
+
+**RIGHT NOW — \`Read\` EVERY file the subagent touched. No exceptions.**
+
+For EACH changed file, verify:
+1. Does the implementation logic ACTUALLY match the task requirements?
+2. Are there incomplete stubs (TODO comments, placeholder code, hardcoded values)?
+3. Are there logic errors, off-by-one bugs, or missing edge cases?
+4. Does it follow existing codebase patterns and conventions?
+5. Are imports correct? No unused or missing imports?
+6. Is error handling present where needed?
+
+**Cross-check the subagent's claims against reality:**
+- Subagent said "Updated X" → READ X. Is it actually updated?
+- Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior?
+- Subagent said "Follows patterns" → COMPARE with reference. Does it actually?
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+**If you skip this step, you are rubber-stamping broken work.**
+
+**STEP 3: DETERMINE IF HANDS-ON QA IS NEEDED**

 | Deliverable Type | QA Method | Tool |
 |------------------|-----------|------|
@@ -66,7 +88,7 @@ Run these commands YOURSELF - do NOT trust agent's claims:

 Static analysis CANNOT catch: visual bugs, animation issues, user flow breakages.

-**STEP 3: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
+**STEP 4: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**

 \`\`\`
 todowrite([
@@ -76,7 +98,8 @@ todowrite([

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

-**BLOCKING: DO NOT proceed to Step 4 until Steps 1-3 are VERIFIED.**`
+**BLOCKING: DO NOT proceed until Steps 1-4 are ALL completed.**
+**Skipping Step 2 (manual code review) = unverified work = FAILURE.**`

 export const ORCHESTRATOR_DELEGATION_REQUIRED = `

--- a/src/hooks/atlas/verification-reminders.ts
+++ b/src/hooks/atlas/verification-reminders.ts
@@ -26,7 +26,16 @@ export function buildOrchestratorReminder(

 ${buildVerificationReminder(sessionId)}

-**STEP 4: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**
+**STEP 5: CHECK BOULDER STATE DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
+
+Do NOT rely on cached progress. Read the plan file NOW:
+\`\`\`
+Read(".sisyphus/tasks/${planName}.yaml")
+\`\`\`
+Count exactly: how many \`- [ ]\` remain? How many \`- [x]\` completed?
+This is YOUR ground truth. Use it to decide what comes next.
+
+**STEP 6: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**

 RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.

@@ -36,14 +45,14 @@ Update the plan file \`.sisyphus/tasks/${planName}.yaml\`:

 **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**

-**STEP 5: COMMIT ATOMIC UNIT**
+**STEP 7: COMMIT ATOMIC UNIT**

 - Stage ONLY the verified changes
 - Commit with clear message describing what was done

-**STEP 6: PROCEED TO NEXT TASK**
+**STEP 8: PROCEED TO NEXT TASK**

-- Read the plan file to identify the next \`- [ ]\` task
+- Read the plan file AGAIN to identify the next \`- [ ]\` task
 - Start immediately - DO NOT STOP

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@@ -57,7 +66,12 @@ export function buildStandaloneVerificationReminder(sessionId: string): string {

 ${buildVerificationReminder(sessionId)}

-**STEP 4: UPDATE TODO STATUS (IMMEDIATELY)**
+**STEP 5: CHECK YOUR PROGRESS DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
+
+Do NOT rely on memory or cached state. Run \`todoread\` NOW to see exact current state.
+Count pending vs completed tasks. This is your ground truth for what comes next.
+
+**STEP 6: UPDATE TODO STATUS (IMMEDIATELY)**

 RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.

@@ -66,15 +80,15 @@ RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.

 **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**

-**STEP 5: EXECUTE QA TASKS (IF ANY)**
+**STEP 7: EXECUTE QA TASKS (IF ANY)**

 If QA tasks exist in your todo list:
 - Execute them BEFORE proceeding
 - Mark each QA task complete after successful verification

-**STEP 6: PROCEED TO NEXT PENDING TASK**
+**STEP 8: PROCEED TO NEXT PENDING TASK**

-- Identify the next \`pending\` task from your todo list
+- Run \`todoread\` AGAIN to identify the next \`pending\` task
 - Start immediately - DO NOT STOP

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━