From 45dfc4ec66f1e3a16ba44e3e63629503a2ff7f53 Mon Sep 17 00:00:00 2001
From: YeonGyu-Kim <code.yeon.gyu@gmail.com>
Date: Tue, 10 Feb 2026 15:40:06 +0900
Subject: [PATCH] feat(atlas): enforce mandatory manual code review and direct
 boulder state checks

- VERIFICATION_REMINDER: add Step 2 manual code review (non-negotiable)
  - Require Read of EVERY changed file line by line
  - Cross-check subagent claims vs actual code
  - Verify logic correctness, completeness, edge cases, patterns
- Add Step 5: direct boulder state check via Read plan file
  - Count remaining tasks directly, no cached state
- BOULDER_CONTINUATION_PROMPT: add first rule to read plan file immediately
- verification-reminders.ts: restructure steps 5-8 for boulder/todo checks
- Atlas default.ts (Claude): enhance 3.4 QA with A/B/C/D sections
  - A: Automated verification
  - B: Manual code review (non-negotiable)
  - C: Hands-on QA (if applicable)
  - D: Check boulder state directly
- Atlas gpt.ts (GPT-5.2): apply same QA enhancements with GPT-optimized structure
- verification_rules: update both Claude and GPT versions with manual review requirements

Addresses issue where Atlas would skip manual code inspection after delegation,
leading to rubber-stamping of broken or incomplete work.
---
 src/agents/atlas/default.ts                  | 83 +++++++++++++-------
 src/agents/atlas/gpt.ts                      | 62 ++++++++++++---
 src/hooks/atlas/system-reminder-templates.ts | 33 ++++++--
 src/hooks/atlas/verification-reminders.ts    | 30 +++++--
 4 files changed, 154 insertions(+), 54 deletions(-)

diff --git a/src/agents/atlas/default.ts b/src/agents/atlas/default.ts
index 2568d5111..5f3447992 100644
--- a/src/agents/atlas/default.ts
+++ b/src/agents/atlas/default.ts
@@ -178,34 +178,54 @@ task(
 )
 \`\`\`
 
-### 3.4 Verify (PROJECT-LEVEL QA)
+### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)
 
-**After EVERY delegation, YOU must verify:**
+**You are the QA gate. Subagents lie. Automated checks alone are NOT enough.**
 
-1. **Project-level diagnostics**:
-   \`lsp_diagnostics(filePath="src/")\` or \`lsp_diagnostics(filePath=".")\`
-   MUST return ZERO errors
+After EVERY delegation, complete ALL of these steps — no shortcuts:
 
-2. **Build verification**:
-   \`bun run build\` or \`bun run typecheck\`
-   Exit code MUST be 0
+#### A. Automated Verification
+1. \`lsp_diagnostics(filePath=".")\` → ZERO errors at project level
+2. \`bun run build\` or \`bun run typecheck\` → exit code 0
+3. \`bun test\` → ALL tests pass
 
-3. **Test verification**:
-   \`bun test\`
-   ALL tests MUST pass
+#### B. Manual Code Review (NON-NEGOTIABLE — DO NOT SKIP)
 
-4. **Manual inspection**:
-   - Read changed files
-   - Confirm changes match requirements
-   - Check for regressions
+**This is the step you are most tempted to skip. DO NOT SKIP IT.**
 
-**Checklist:**
+1. \`Read\` EVERY file the subagent created or modified — no exceptions
+2. For EACH file, check line by line:
+   - Does the logic actually implement the task requirement?
+   - Are there stubs, TODOs, placeholders, or hardcoded values?
+   - Are there logic errors or missing edge cases?
+   - Does it follow the existing codebase patterns?
+   - Are imports correct and complete?
+3. Cross-reference: compare what subagent CLAIMED vs what the code ACTUALLY does
+4. If anything doesn't match → resume session and fix immediately
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+
+#### C. Hands-On QA (if applicable)
+| Deliverable | Method | Tool |
+|-------------|--------|------|
+| Frontend/UI | Browser | \`/playwright\` |
+| TUI/CLI | Interactive | \`interactive_bash\` |
+| API/Backend | Real requests | curl |
+
+#### D. Check Boulder State Directly
+
+After verification, READ the plan file directly — every time, no exceptions:
 \`\`\`
-[ ] lsp_diagnostics at project level - ZERO errors
-[ ] Build command - exit 0
-[ ] Test suite - all pass
-[ ] Files exist and match requirements
-[ ] No regressions
+Read(".sisyphus/tasks/{plan-name}.yaml")
+\`\`\`
+Count remaining \`- [ ]\` tasks. This is your ground truth for what comes next.
+
+**Checklist (ALL must be checked):**
+\`\`\`
+[ ] Automated: lsp_diagnostics clean, build passes, tests pass
+[ ] Manual: Read EVERY changed file, verified logic matches requirements
+[ ] Cross-check: Subagent claims match actual code
+[ ] Boulder: Read plan file, confirmed current progress
 \`\`\`
 
 **If verification fails**: Resume the SAME session with the ACTUAL error output:
@@ -325,22 +345,25 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 4..
 
 You are the QA gate. Subagents lie. Verify EVERYTHING.
 
-**After each delegation**:
-1. \`lsp_diagnostics\` at PROJECT level (not file level)
-2. Run build command
-3. Run test suite
-4. Read changed files manually
-5. Confirm requirements met
+**After each delegation — BOTH automated AND manual verification are MANDATORY:**
+
+1. \`lsp_diagnostics\` at PROJECT level → ZERO errors
+2. Run build command → exit 0
+3. Run test suite → ALL pass
+4. **\`Read\` EVERY changed file line by line** → logic matches requirements
+5. **Cross-check**: subagent's claims vs actual code — do they match?
+6. **Check boulder state**: Read the plan file directly, count remaining tasks
 
 **Evidence required**:
 | Action | Evidence |
 |--------|----------|
-| Code change | lsp_diagnostics clean at project level |
+| Code change | lsp_diagnostics clean + manual Read of every changed file |
 | Build | Exit code 0 |
 | Tests | All pass |
-| Delegation | Verified independently |
+| Logic correct | You read the code and can explain what it does |
+| Boulder state | Read plan file, confirmed progress |
 
-**No evidence = not complete.**
+**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
 </verification_rules>
 
 <boundaries>
diff --git a/src/agents/atlas/gpt.ts b/src/agents/atlas/gpt.ts
index d81620e69..9d0879b2e 100644
--- a/src/agents/atlas/gpt.ts
+++ b/src/agents/atlas/gpt.ts
@@ -182,19 +182,51 @@ Extract wisdom → include in prompt.
 task(category="[cat]", load_skills=["[skills]"], run_in_background=false, prompt=\`[6-SECTION PROMPT]\`)
 \`\`\`
 
-### 3.4 Verify (PROJECT-LEVEL QA)
+### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)
 
-After EVERY delegation:
+After EVERY delegation, complete ALL steps — no shortcuts:
+
+#### A. Automated Verification
 1. \`lsp_diagnostics(filePath=".")\` → ZERO errors
 2. \`Bash("bun run build")\` → exit 0
 3. \`Bash("bun test")\` → all pass
-4. \`Read\` changed files → confirm requirements met
 
-Checklist:
-- [ ] lsp_diagnostics clean
-- [ ] Build passes
-- [ ] Tests pass
-- [ ] Files match requirements
+#### B. Manual Code Review (NON-NEGOTIABLE)
+1. \`Read\` EVERY file the subagent touched — no exceptions
+2. For each file, verify line by line:
+
+| Check | What to Look For |
+|-------|------------------|
+| Logic correctness | Does implementation match task requirements? |
+| Completeness | No stubs, TODOs, placeholders, hardcoded values? |
+| Edge cases | Off-by-one, null checks, error paths handled? |
+| Patterns | Follows existing codebase conventions? |
+| Imports | Correct, complete, no unused? |
+
+3. Cross-check: subagent's claims vs actual code — do they match?
+4. If mismatch found → resume session with \`session_id\` and fix
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+
+#### C. Hands-On QA (if applicable)
+| Deliverable | Method | Tool |
+|-------------|--------|------|
+| Frontend/UI | Browser | \`/playwright\` |
+| TUI/CLI | Interactive | \`interactive_bash\` |
+| API/Backend | Real requests | curl |
+
+#### D. Check Boulder State Directly
+After verification, READ the plan file — every time:
+\`\`\`
+Read(".sisyphus/tasks/{plan-name}.yaml")
+\`\`\`
+Count remaining \`- [ ]\` tasks. This is your ground truth.
+
+Checklist (ALL required):
+- [ ] Automated: diagnostics clean, build passes, tests pass
+- [ ] Manual: Read EVERY changed file, logic matches requirements
+- [ ] Cross-check: subagent claims match actual code
+- [ ] Boulder: Read plan file, confirmed current progress
 
 ### 3.5 Handle Failures
 
@@ -269,15 +301,23 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 3..
 <verification_rules>
 You are the QA gate. Subagents lie. Verify EVERYTHING.
 
-**After each delegation**:
+**After each delegation — BOTH automated AND manual verification are MANDATORY**:
+
 | Step | Tool | Expected |
 |------|------|----------|
 | 1 | \`lsp_diagnostics(".")\` | ZERO errors |
 | 2 | \`Bash("bun run build")\` | exit 0 |
 | 3 | \`Bash("bun test")\` | all pass |
-| 4 | \`Read\` changed files | matches requirements |
+| 4 | \`Read\` EVERY changed file | logic matches requirements |
+| 5 | Cross-check claims vs code | subagent's report matches reality |
+| 6 | \`Read\` plan file | boulder state confirmed |
 
-**No evidence = not complete.**
+**Manual code review (Step 4) is NON-NEGOTIABLE:**
+- Read every line of every changed file
+- Verify logic correctness, completeness, edge cases
+- If you can't explain what the code does, you haven't reviewed it
+
+**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
 </verification_rules>
 
 <boundaries>
diff --git a/src/hooks/atlas/system-reminder-templates.ts b/src/hooks/atlas/system-reminder-templates.ts
index c9d4cd749..8458a3424 100644
--- a/src/hooks/atlas/system-reminder-templates.ts
+++ b/src/hooks/atlas/system-reminder-templates.ts
@@ -33,6 +33,7 @@ export const BOULDER_CONTINUATION_PROMPT = `${createSystemDirective(SystemDirect
 You have an active work plan with incomplete tasks. Continue working.
 
 RULES:
+- **FIRST**: Read the plan file NOW to check exact current progress — count remaining \`- [ ]\` tasks
 - Proceed without asking for permission
 - Change \`- [ ]\` to \`- [x]\` in the plan file when done
 - Use the notepad at .sisyphus/notepads/{PLAN_NAME}/ to record learnings
@@ -48,15 +49,36 @@ Tests FAILING, code has ERRORS, implementation INCOMPLETE - but they say "done".
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 
-**STEP 1: VERIFY WITH YOUR OWN TOOL CALLS (DO THIS NOW)**
+**STEP 1: AUTOMATED VERIFICATION (DO THIS FIRST)**
 
 Run these commands YOURSELF - do NOT trust agent's claims:
 1. \`lsp_diagnostics\` on changed files → Must be CLEAN
 2. \`bash\` to run tests → Must PASS
 3. \`bash\` to run build/typecheck → Must succeed
-4. \`Read\` the actual code → Must match requirements
 
-**STEP 2: DETERMINE IF HANDS-ON QA IS NEEDED**
+**STEP 2: MANUAL CODE REVIEW (NON-NEGOTIABLE — DO NOT SKIP)**
+
+Automated checks are NECESSARY but INSUFFICIENT. You MUST read the actual code.
+
+**RIGHT NOW — \`Read\` EVERY file the subagent touched. No exceptions.**
+
+For EACH changed file, verify:
+1. Does the implementation logic ACTUALLY match the task requirements?
+2. Are there incomplete stubs (TODO comments, placeholder code, hardcoded values)?
+3. Are there logic errors, off-by-one bugs, or missing edge cases?
+4. Does it follow existing codebase patterns and conventions?
+5. Are imports correct? No unused or missing imports?
+6. Is error handling present where needed?
+
+**Cross-check the subagent's claims against reality:**
+- Subagent said "Updated X" → READ X. Is it actually updated?
+- Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior?
+- Subagent said "Follows patterns" → COMPARE with reference. Does it actually?
+
+**If you cannot explain what the changed code does, you have not reviewed it.**
+**If you skip this step, you are rubber-stamping broken work.**
+
+**STEP 3: DETERMINE IF HANDS-ON QA IS NEEDED**
 
 | Deliverable Type | QA Method | Tool |
 |------------------|-----------|------|
@@ -66,7 +88,7 @@ Run these commands YOURSELF - do NOT trust agent's claims:
 
 Static analysis CANNOT catch: visual bugs, animation issues, user flow breakages.
 
-**STEP 3: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
+**STEP 4: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
 
 \`\`\`
 todowrite([
@@ -76,7 +98,8 @@ todowrite([
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 
-**BLOCKING: DO NOT proceed to Step 4 until Steps 1-3 are VERIFIED.**`
+**BLOCKING: DO NOT proceed until Steps 1-4 are ALL completed.**
+**Skipping Step 2 (manual code review) = unverified work = FAILURE.**`
 
 export const ORCHESTRATOR_DELEGATION_REQUIRED = `
 
diff --git a/src/hooks/atlas/verification-reminders.ts b/src/hooks/atlas/verification-reminders.ts
index e792bf3e5..c405f4ec1 100644
--- a/src/hooks/atlas/verification-reminders.ts
+++ b/src/hooks/atlas/verification-reminders.ts
@@ -26,7 +26,16 @@ export function buildOrchestratorReminder(
 
 ${buildVerificationReminder(sessionId)}
 
-**STEP 4: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**
+**STEP 5: CHECK BOULDER STATE DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
+
+Do NOT rely on cached progress. Read the plan file NOW:
+\`\`\`
+Read(".sisyphus/tasks/${planName}.yaml")
+\`\`\`
+Count exactly: how many \`- [ ]\` remain? How many \`- [x]\` completed?
+This is YOUR ground truth. Use it to decide what comes next.
+
+**STEP 6: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**
 
 RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
 
@@ -36,14 +45,14 @@ Update the plan file \`.sisyphus/tasks/${planName}.yaml\`:
 
 **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**
 
-**STEP 5: COMMIT ATOMIC UNIT**
+**STEP 7: COMMIT ATOMIC UNIT**
 
 - Stage ONLY the verified changes
 - Commit with clear message describing what was done
 
-**STEP 6: PROCEED TO NEXT TASK**
+**STEP 8: PROCEED TO NEXT TASK**
 
-- Read the plan file to identify the next \`- [ ]\` task
+- Read the plan file AGAIN to identify the next \`- [ ]\` task
 - Start immediately - DO NOT STOP
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@@ -57,7 +66,12 @@ export function buildStandaloneVerificationReminder(sessionId: string): string {
 
 ${buildVerificationReminder(sessionId)}
 
-**STEP 4: UPDATE TODO STATUS (IMMEDIATELY)**
+**STEP 5: CHECK YOUR PROGRESS DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
+
+Do NOT rely on memory or cached state. Run \`todoread\` NOW to see exact current state.
+Count pending vs completed tasks. This is your ground truth for what comes next.
+
+**STEP 6: UPDATE TODO STATUS (IMMEDIATELY)**
 
 RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
 
@@ -66,15 +80,15 @@ RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
 
 **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**
 
-**STEP 5: EXECUTE QA TASKS (IF ANY)**
+**STEP 7: EXECUTE QA TASKS (IF ANY)**
 
 If QA tasks exist in your todo list:
 - Execute them BEFORE proceeding
 - Mark each QA task complete after successful verification
 
-**STEP 6: PROCEED TO NEXT PENDING TASK**
+**STEP 8: PROCEED TO NEXT PENDING TASK**
 
-- Identify the next \`pending\` task from your todo list
+- Run \`todoread\` AGAIN to identify the next \`pending\` task
 - Start immediately - DO NOT STOP
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━