From 45dfc4ec66f1e3a16ba44e3e63629503a2ff7f53 Mon Sep 17 00:00:00 2001 From: YeonGyu-Kim Date: Tue, 10 Feb 2026 15:40:06 +0900 Subject: [PATCH] feat(atlas): enforce mandatory manual code review and direct boulder state checks - VERIFICATION_REMINDER: add Step 2 manual code review (non-negotiable) - Require Read of EVERY changed file line by line - Cross-check subagent claims vs actual code - Verify logic correctness, completeness, edge cases, patterns - Add Step 5: direct boulder state check via Read plan file - Count remaining tasks directly, no cached state - BOULDER_CONTINUATION_PROMPT: add first rule to read plan file immediately - verification-reminders.ts: restructure steps 5-8 for boulder/todo checks - Atlas default.ts (Claude): enhance 3.4 QA with A/B/C/D sections - A: Automated verification - B: Manual code review (non-negotiable) - C: Hands-on QA (if applicable) - D: Check boulder state directly - Atlas gpt.ts (GPT-5.2): apply same QA enhancements with GPT-optimized structure - verification_rules: update both Claude and GPT versions with manual review requirements Addresses issue where Atlas would skip manual code inspection after delegation, leading to rubber-stamping of broken or incomplete work. --- src/agents/atlas/default.ts | 83 +++++++++++++------- src/agents/atlas/gpt.ts | 62 ++++++++++++--- src/hooks/atlas/system-reminder-templates.ts | 33 ++++++-- src/hooks/atlas/verification-reminders.ts | 30 +++++-- 4 files changed, 154 insertions(+), 54 deletions(-) diff --git a/src/agents/atlas/default.ts b/src/agents/atlas/default.ts index 2568d5111..5f3447992 100644 --- a/src/agents/atlas/default.ts +++ b/src/agents/atlas/default.ts @@ -178,34 +178,54 @@ task( ) \`\`\` -### 3.4 Verify (PROJECT-LEVEL QA) +### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION) -**After EVERY delegation, YOU must verify:** +**You are the QA gate. Subagents lie. Automated checks alone are NOT enough.** -1. **Project-level diagnostics**: - \`lsp_diagnostics(filePath="src/")\` or \`lsp_diagnostics(filePath=".")\` - MUST return ZERO errors +After EVERY delegation, complete ALL of these steps — no shortcuts: -2. **Build verification**: - \`bun run build\` or \`bun run typecheck\` - Exit code MUST be 0 +#### A. Automated Verification +1. \`lsp_diagnostics(filePath=".")\` → ZERO errors at project level +2. \`bun run build\` or \`bun run typecheck\` → exit code 0 +3. \`bun test\` → ALL tests pass -3. **Test verification**: - \`bun test\` - ALL tests MUST pass +#### B. Manual Code Review (NON-NEGOTIABLE — DO NOT SKIP) -4. **Manual inspection**: - - Read changed files - - Confirm changes match requirements - - Check for regressions +**This is the step you are most tempted to skip. DO NOT SKIP IT.** -**Checklist:** +1. \`Read\` EVERY file the subagent created or modified — no exceptions +2. For EACH file, check line by line: + - Does the logic actually implement the task requirement? + - Are there stubs, TODOs, placeholders, or hardcoded values? + - Are there logic errors or missing edge cases? + - Does it follow the existing codebase patterns? + - Are imports correct and complete? +3. Cross-reference: compare what subagent CLAIMED vs what the code ACTUALLY does +4. If anything doesn't match → resume session and fix immediately + +**If you cannot explain what the changed code does, you have not reviewed it.** + +#### C. Hands-On QA (if applicable) +| Deliverable | Method | Tool | +|-------------|--------|------| +| Frontend/UI | Browser | \`/playwright\` | +| TUI/CLI | Interactive | \`interactive_bash\` | +| API/Backend | Real requests | curl | + +#### D. Check Boulder State Directly + +After verification, READ the plan file directly — every time, no exceptions: \`\`\` -[ ] lsp_diagnostics at project level - ZERO errors -[ ] Build command - exit 0 -[ ] Test suite - all pass -[ ] Files exist and match requirements -[ ] No regressions +Read(".sisyphus/tasks/{plan-name}.yaml") +\`\`\` +Count remaining \`- [ ]\` tasks. This is your ground truth for what comes next. + +**Checklist (ALL must be checked):** +\`\`\` +[ ] Automated: lsp_diagnostics clean, build passes, tests pass +[ ] Manual: Read EVERY changed file, verified logic matches requirements +[ ] Cross-check: Subagent claims match actual code +[ ] Boulder: Read plan file, confirmed current progress \`\`\` **If verification fails**: Resume the SAME session with the ACTUAL error output: @@ -325,22 +345,25 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 4.. You are the QA gate. Subagents lie. Verify EVERYTHING. -**After each delegation**: -1. \`lsp_diagnostics\` at PROJECT level (not file level) -2. Run build command -3. Run test suite -4. Read changed files manually -5. Confirm requirements met +**After each delegation — BOTH automated AND manual verification are MANDATORY:** + +1. \`lsp_diagnostics\` at PROJECT level → ZERO errors +2. Run build command → exit 0 +3. Run test suite → ALL pass +4. **\`Read\` EVERY changed file line by line** → logic matches requirements +5. **Cross-check**: subagent's claims vs actual code — do they match? +6. **Check boulder state**: Read the plan file directly, count remaining tasks **Evidence required**: | Action | Evidence | |--------|----------| -| Code change | lsp_diagnostics clean at project level | +| Code change | lsp_diagnostics clean + manual Read of every changed file | | Build | Exit code 0 | | Tests | All pass | -| Delegation | Verified independently | +| Logic correct | You read the code and can explain what it does | +| Boulder state | Read plan file, confirmed progress | -**No evidence = not complete.** +**No evidence = not complete. Skipping manual review = rubber-stamping broken work.** diff --git a/src/agents/atlas/gpt.ts b/src/agents/atlas/gpt.ts index d81620e69..9d0879b2e 100644 --- a/src/agents/atlas/gpt.ts +++ b/src/agents/atlas/gpt.ts @@ -182,19 +182,51 @@ Extract wisdom → include in prompt. task(category="[cat]", load_skills=["[skills]"], run_in_background=false, prompt=\`[6-SECTION PROMPT]\`) \`\`\` -### 3.4 Verify (PROJECT-LEVEL QA) +### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION) -After EVERY delegation: +After EVERY delegation, complete ALL steps — no shortcuts: + +#### A. Automated Verification 1. \`lsp_diagnostics(filePath=".")\` → ZERO errors 2. \`Bash("bun run build")\` → exit 0 3. \`Bash("bun test")\` → all pass -4. \`Read\` changed files → confirm requirements met -Checklist: -- [ ] lsp_diagnostics clean -- [ ] Build passes -- [ ] Tests pass -- [ ] Files match requirements +#### B. Manual Code Review (NON-NEGOTIABLE) +1. \`Read\` EVERY file the subagent touched — no exceptions +2. For each file, verify line by line: + +| Check | What to Look For | +|-------|------------------| +| Logic correctness | Does implementation match task requirements? | +| Completeness | No stubs, TODOs, placeholders, hardcoded values? | +| Edge cases | Off-by-one, null checks, error paths handled? | +| Patterns | Follows existing codebase conventions? | +| Imports | Correct, complete, no unused? | + +3. Cross-check: subagent's claims vs actual code — do they match? +4. If mismatch found → resume session with \`session_id\` and fix + +**If you cannot explain what the changed code does, you have not reviewed it.** + +#### C. Hands-On QA (if applicable) +| Deliverable | Method | Tool | +|-------------|--------|------| +| Frontend/UI | Browser | \`/playwright\` | +| TUI/CLI | Interactive | \`interactive_bash\` | +| API/Backend | Real requests | curl | + +#### D. Check Boulder State Directly +After verification, READ the plan file — every time: +\`\`\` +Read(".sisyphus/tasks/{plan-name}.yaml") +\`\`\` +Count remaining \`- [ ]\` tasks. This is your ground truth. + +Checklist (ALL required): +- [ ] Automated: diagnostics clean, build passes, tests pass +- [ ] Manual: Read EVERY changed file, logic matches requirements +- [ ] Cross-check: subagent claims match actual code +- [ ] Boulder: Read plan file, confirmed current progress ### 3.5 Handle Failures @@ -269,15 +301,23 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 3.. You are the QA gate. Subagents lie. Verify EVERYTHING. -**After each delegation**: +**After each delegation — BOTH automated AND manual verification are MANDATORY**: + | Step | Tool | Expected | |------|------|----------| | 1 | \`lsp_diagnostics(".")\` | ZERO errors | | 2 | \`Bash("bun run build")\` | exit 0 | | 3 | \`Bash("bun test")\` | all pass | -| 4 | \`Read\` changed files | matches requirements | +| 4 | \`Read\` EVERY changed file | logic matches requirements | +| 5 | Cross-check claims vs code | subagent's report matches reality | +| 6 | \`Read\` plan file | boulder state confirmed | -**No evidence = not complete.** +**Manual code review (Step 4) is NON-NEGOTIABLE:** +- Read every line of every changed file +- Verify logic correctness, completeness, edge cases +- If you can't explain what the code does, you haven't reviewed it + +**No evidence = not complete. Skipping manual review = rubber-stamping broken work.** diff --git a/src/hooks/atlas/system-reminder-templates.ts b/src/hooks/atlas/system-reminder-templates.ts index c9d4cd749..8458a3424 100644 --- a/src/hooks/atlas/system-reminder-templates.ts +++ b/src/hooks/atlas/system-reminder-templates.ts @@ -33,6 +33,7 @@ export const BOULDER_CONTINUATION_PROMPT = `${createSystemDirective(SystemDirect You have an active work plan with incomplete tasks. Continue working. RULES: +- **FIRST**: Read the plan file NOW to check exact current progress — count remaining \`- [ ]\` tasks - Proceed without asking for permission - Change \`- [ ]\` to \`- [x]\` in the plan file when done - Use the notepad at .sisyphus/notepads/{PLAN_NAME}/ to record learnings @@ -48,15 +49,36 @@ Tests FAILING, code has ERRORS, implementation INCOMPLETE - but they say "done". ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -**STEP 1: VERIFY WITH YOUR OWN TOOL CALLS (DO THIS NOW)** +**STEP 1: AUTOMATED VERIFICATION (DO THIS FIRST)** Run these commands YOURSELF - do NOT trust agent's claims: 1. \`lsp_diagnostics\` on changed files → Must be CLEAN 2. \`bash\` to run tests → Must PASS 3. \`bash\` to run build/typecheck → Must succeed -4. \`Read\` the actual code → Must match requirements -**STEP 2: DETERMINE IF HANDS-ON QA IS NEEDED** +**STEP 2: MANUAL CODE REVIEW (NON-NEGOTIABLE — DO NOT SKIP)** + +Automated checks are NECESSARY but INSUFFICIENT. You MUST read the actual code. + +**RIGHT NOW — \`Read\` EVERY file the subagent touched. No exceptions.** + +For EACH changed file, verify: +1. Does the implementation logic ACTUALLY match the task requirements? +2. Are there incomplete stubs (TODO comments, placeholder code, hardcoded values)? +3. Are there logic errors, off-by-one bugs, or missing edge cases? +4. Does it follow existing codebase patterns and conventions? +5. Are imports correct? No unused or missing imports? +6. Is error handling present where needed? + +**Cross-check the subagent's claims against reality:** +- Subagent said "Updated X" → READ X. Is it actually updated? +- Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior? +- Subagent said "Follows patterns" → COMPARE with reference. Does it actually? + +**If you cannot explain what the changed code does, you have not reviewed it.** +**If you skip this step, you are rubber-stamping broken work.** + +**STEP 3: DETERMINE IF HANDS-ON QA IS NEEDED** | Deliverable Type | QA Method | Tool | |------------------|-----------|------| @@ -66,7 +88,7 @@ Run these commands YOURSELF - do NOT trust agent's claims: Static analysis CANNOT catch: visual bugs, animation issues, user flow breakages. -**STEP 3: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY** +**STEP 4: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY** \`\`\` todowrite([ @@ -76,7 +98,8 @@ todowrite([ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -**BLOCKING: DO NOT proceed to Step 4 until Steps 1-3 are VERIFIED.**` +**BLOCKING: DO NOT proceed until Steps 1-4 are ALL completed.** +**Skipping Step 2 (manual code review) = unverified work = FAILURE.**` export const ORCHESTRATOR_DELEGATION_REQUIRED = ` diff --git a/src/hooks/atlas/verification-reminders.ts b/src/hooks/atlas/verification-reminders.ts index e792bf3e5..c405f4ec1 100644 --- a/src/hooks/atlas/verification-reminders.ts +++ b/src/hooks/atlas/verification-reminders.ts @@ -26,7 +26,16 @@ export function buildOrchestratorReminder( ${buildVerificationReminder(sessionId)} -**STEP 4: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)** +**STEP 5: CHECK BOULDER STATE DIRECTLY (EVERY TIME — NO EXCEPTIONS)** + +Do NOT rely on cached progress. Read the plan file NOW: +\`\`\` +Read(".sisyphus/tasks/${planName}.yaml") +\`\`\` +Count exactly: how many \`- [ ]\` remain? How many \`- [x]\` completed? +This is YOUR ground truth. Use it to decide what comes next. + +**STEP 6: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)** RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY. @@ -36,14 +45,14 @@ Update the plan file \`.sisyphus/tasks/${planName}.yaml\`: **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.** -**STEP 5: COMMIT ATOMIC UNIT** +**STEP 7: COMMIT ATOMIC UNIT** - Stage ONLY the verified changes - Commit with clear message describing what was done -**STEP 6: PROCEED TO NEXT TASK** +**STEP 8: PROCEED TO NEXT TASK** -- Read the plan file to identify the next \`- [ ]\` task +- Read the plan file AGAIN to identify the next \`- [ ]\` task - Start immediately - DO NOT STOP ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ @@ -57,7 +66,12 @@ export function buildStandaloneVerificationReminder(sessionId: string): string { ${buildVerificationReminder(sessionId)} -**STEP 4: UPDATE TODO STATUS (IMMEDIATELY)** +**STEP 5: CHECK YOUR PROGRESS DIRECTLY (EVERY TIME — NO EXCEPTIONS)** + +Do NOT rely on memory or cached state. Run \`todoread\` NOW to see exact current state. +Count pending vs completed tasks. This is your ground truth for what comes next. + +**STEP 6: UPDATE TODO STATUS (IMMEDIATELY)** RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY. @@ -66,15 +80,15 @@ RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY. **DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.** -**STEP 5: EXECUTE QA TASKS (IF ANY)** +**STEP 7: EXECUTE QA TASKS (IF ANY)** If QA tasks exist in your todo list: - Execute them BEFORE proceeding - Mark each QA task complete after successful verification -**STEP 6: PROCEED TO NEXT PENDING TASK** +**STEP 8: PROCEED TO NEXT PENDING TASK** -- Identify the next \`pending\` task from your todo list +- Run \`todoread\` AGAIN to identify the next \`pending\` task - Start immediately - DO NOT STOP ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━