feat(atlas): enforce mandatory manual code review and direct boulder state checks

- VERIFICATION_REMINDER: add Step 2 manual code review (non-negotiable)
  - Require Read of EVERY changed file line by line
  - Cross-check subagent claims vs actual code
  - Verify logic correctness, completeness, edge cases, patterns
- Add Step 5: direct boulder state check via Read plan file
  - Count remaining tasks directly, no cached state
- BOULDER_CONTINUATION_PROMPT: add first rule to read plan file immediately
- verification-reminders.ts: restructure steps 5-8 for boulder/todo checks
- Atlas default.ts (Claude): enhance 3.4 QA with A/B/C/D sections
  - A: Automated verification
  - B: Manual code review (non-negotiable)
  - C: Hands-on QA (if applicable)
  - D: Check boulder state directly
- Atlas gpt.ts (GPT-5.2): apply same QA enhancements with GPT-optimized structure
- verification_rules: update both Claude and GPT versions with manual review requirements

Addresses issue where Atlas would skip manual code inspection after delegation,
leading to rubber-stamping of broken or incomplete work.
YeonGyu-Kim
2026-02-10 15:40:06 +09:00
parent f84ef532c1
commit 45dfc4ec66
4 changed files with 154 additions and 54 deletions

@@ -178,34 +178,54 @@ task(
)
\`\`\`
### 3.4 Verify (PROJECT-LEVEL QA)
### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)
**After EVERY delegation, YOU must verify:**
**You are the QA gate. Subagents lie. Automated checks alone are NOT enough.**
1. **Project-level diagnostics**:
\`lsp_diagnostics(filePath="src/")\` or \`lsp_diagnostics(filePath=".")\`
MUST return ZERO errors
After EVERY delegation, complete ALL of these steps — no shortcuts:
2. **Build verification**:
\`bun run build\` or \`bun run typecheck\`
Exit code MUST be 0
#### A. Automated Verification
1. \`lsp_diagnostics(filePath=".")\` → ZERO errors at project level
2. \`bun run build\` or \`bun run typecheck\` → exit code 0
3. \`bun test\` → ALL tests pass
3. **Test verification**:
\`bun test\`
ALL tests MUST pass
#### B. Manual Code Review (NON-NEGOTIABLE — DO NOT SKIP)
4. **Manual inspection**:
- Read changed files
- Confirm changes match requirements
- Check for regressions
**This is the step you are most tempted to skip. DO NOT SKIP IT.**
**Checklist:**
1. \`Read\` EVERY file the subagent created or modified — no exceptions
2. For EACH file, check line by line:
- Does the logic actually implement the task requirement?
- Are there stubs, TODOs, placeholders, or hardcoded values?
- Are there logic errors or missing edge cases?
- Does it follow the existing codebase patterns?
- Are imports correct and complete?
3. Cross-reference: compare what subagent CLAIMED vs what the code ACTUALLY does
4. If anything doesn't match → resume session and fix immediately
**If you cannot explain what the changed code does, you have not reviewed it.**
#### C. Hands-On QA (if applicable)
| Deliverable | Method | Tool |
|-------------|--------|------|
| Frontend/UI | Browser | \`/playwright\` |
| TUI/CLI | Interactive | \`interactive_bash\` |
| API/Backend | Real requests | curl |
#### D. Check Boulder State Directly
After verification, READ the plan file directly — every time, no exceptions:
\`\`\`
[ ] lsp_diagnostics at project level - ZERO errors
[ ] Build command - exit 0
[ ] Test suite - all pass
[ ] Files exist and match requirements
[ ] No regressions
Read(".sisyphus/tasks/{plan-name}.yaml")
\`\`\`
Count remaining \`- [ ]\` tasks. This is your ground truth for what comes next.
**Checklist (ALL must be checked):**
\`\`\`
[ ] Automated: lsp_diagnostics clean, build passes, tests pass
[ ] Manual: Read EVERY changed file, verified logic matches requirements
[ ] Cross-check: Subagent claims match actual code
[ ] Boulder: Read plan file, confirmed current progress
\`\`\`
**If verification fails**: Resume the SAME session with the ACTUAL error output:
@@ -325,22 +345,25 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 4..
You are the QA gate. Subagents lie. Verify EVERYTHING.
**After each delegation**:
1. \`lsp_diagnostics\` at PROJECT level (not file level)
2. Run build command
3. Run test suite
4. Read changed files manually
5. Confirm requirements met
**After each delegation — BOTH automated AND manual verification are MANDATORY:**
1. \`lsp_diagnostics\` at PROJECT level → ZERO errors
2. Run build command → exit 0
3. Run test suite → ALL pass
4. **\`Read\` EVERY changed file line by line** → logic matches requirements
5. **Cross-check**: subagent's claims vs actual code — do they match?
6. **Check boulder state**: Read the plan file directly, count remaining tasks
**Evidence required**:
| Action | Evidence |
|--------|----------|
| Code change | lsp_diagnostics clean at project level |
| Code change | lsp_diagnostics clean + manual Read of every changed file |
| Build | Exit code 0 |
| Tests | All pass |
| Delegation | Verified independently |
| Logic correct | You read the code and can explain what it does |
| Boulder state | Read plan file, confirmed progress |
**No evidence = not complete.**
**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
</verification_rules>
<boundaries>
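
Section D above asks the orchestrator to read the plan file and count the remaining `- [ ]` tasks rather than trust cached state. A minimal sketch of that count follows. It assumes the plan text has already been obtained (for example through the `Read` tool) and that each task is a line beginning with `- [ ]` or `- [x]`, possibly indented. `BoulderState` and `countPlanTasks` are hypothetical names for illustration and are not part of this commit.

```typescript
// Hypothetical helper, not part of this commit: derive boulder state
// from the raw plan text instead of relying on cached progress.
interface BoulderState {
  remaining: number; // lines starting with "- [ ]"
  completed: number; // lines starting with "- [x]" (or "- [X]")
}

function countPlanTasks(planText: string): BoulderState {
  let remaining = 0;
  let completed = 0;
  for (const line of planText.split("\n")) {
    const trimmed = line.trimStart(); // tolerate indented sub-tasks
    if (trimmed.startsWith("- [ ]")) {
      remaining += 1;
    } else if (trimmed.startsWith("- [x]") || trimmed.startsWith("- [X]")) {
      completed += 1;
    }
  }
  return { remaining, completed };
}

// Usage with an inline sample plan (assumed layout).
const samplePlan = [
  "- [x] Task 1: add parser",
  "- [ ] Task 2: wire parser into CLI",
  "  - [ ] Task 2a: add flag",
  "notes: free text is ignored",
].join("\n");

const boulder = countPlanTasks(samplePlan);
console.log(`${boulder.remaining} remaining, ${boulder.completed} completed`);
```

Counting from the file text on every call mirrors the "no cached state" rule: the function keeps no memory between invocations.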

@@ -182,19 +182,51 @@ Extract wisdom → include in prompt.
task(category="[cat]", load_skills=["[skills]"], run_in_background=false, prompt=\`[6-SECTION PROMPT]\`)
\`\`\`
### 3.4 Verify (PROJECT-LEVEL QA)
### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)
After EVERY delegation:
After EVERY delegation, complete ALL steps — no shortcuts:
#### A. Automated Verification
1. \`lsp_diagnostics(filePath=".")\` → ZERO errors
2. \`Bash("bun run build")\` → exit 0
3. \`Bash("bun test")\` → all pass
4. \`Read\` changed files → confirm requirements met
Checklist:
- [ ] lsp_diagnostics clean
- [ ] Build passes
- [ ] Tests pass
- [ ] Files match requirements
#### B. Manual Code Review (NON-NEGOTIABLE)
1. \`Read\` EVERY file the subagent touched — no exceptions
2. For each file, verify line by line:
| Check | What to Look For |
|-------|------------------|
| Logic correctness | Does implementation match task requirements? |
| Completeness | No stubs, TODOs, placeholders, hardcoded values? |
| Edge cases | Off-by-one, null checks, error paths handled? |
| Patterns | Follows existing codebase conventions? |
| Imports | Correct, complete, no unused? |
3. Cross-check: subagent's claims vs actual code — do they match?
4. If mismatch found → resume session with \`session_id\` and fix
**If you cannot explain what the changed code does, you have not reviewed it.**
#### C. Hands-On QA (if applicable)
| Deliverable | Method | Tool |
|-------------|--------|------|
| Frontend/UI | Browser | \`/playwright\` |
| TUI/CLI | Interactive | \`interactive_bash\` |
| API/Backend | Real requests | curl |
#### D. Check Boulder State Directly
After verification, READ the plan file — every time:
\`\`\`
Read(".sisyphus/tasks/{plan-name}.yaml")
\`\`\`
Count remaining \`- [ ]\` tasks. This is your ground truth.
Checklist (ALL required):
- [ ] Automated: diagnostics clean, build passes, tests pass
- [ ] Manual: Read EVERY changed file, logic matches requirements
- [ ] Cross-check: subagent claims match actual code
- [ ] Boulder: Read plan file, confirmed current progress
### 3.5 Handle Failures
@@ -269,15 +301,23 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 3..
<verification_rules>
You are the QA gate. Subagents lie. Verify EVERYTHING.
**After each delegation**:
**After each delegation — BOTH automated AND manual verification are MANDATORY**:
| Step | Tool | Expected |
|------|------|----------|
| 1 | \`lsp_diagnostics(".")\` | ZERO errors |
| 2 | \`Bash("bun run build")\` | exit 0 |
| 3 | \`Bash("bun test")\` | all pass |
| 4 | \`Read\` changed files | matches requirements |
| 4 | \`Read\` EVERY changed file | logic matches requirements |
| 5 | Cross-check claims vs code | subagent's report matches reality |
| 6 | \`Read\` plan file | boulder state confirmed |
**No evidence = not complete.**
**Manual code review (Step 4) is NON-NEGOTIABLE:**
- Read every line of every changed file
- Verify logic correctness, completeness, edge cases
- If you can't explain what the code does, you haven't reviewed it
**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
</verification_rules>
<boundaries>
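
The six-step table above pairs each verification step with an expected result and ends with "No evidence = not complete." One way to picture that gate is a record of collected evidence plus a function that lists whatever is still missing. This is only a sketch: `VerificationEvidence` and `missingEvidence` are hypothetical names, and the prompts in this commit describe the steps to the agent in prose rather than enforce them in code.

```typescript
// Hypothetical evidence record, one field per step of the table above.
interface VerificationEvidence {
  diagnosticsErrors: number; // step 1: lsp_diagnostics error count
  buildExitCode: number;     // step 2: exit code of the build command
  testsPassed: boolean;      // step 3: test suite result
  changedFiles: string[];    // files the subagent created or modified
  filesRead: string[];       // step 4: files the orchestrator actually read
  claimsMatchCode: boolean;  // step 5: subagent report matches the code
  planFileRead: boolean;     // step 6: boulder state confirmed
}

// Returns the steps that lack evidence; an empty array means "complete".
function missingEvidence(e: VerificationEvidence): string[] {
  const missing: string[] = [];
  if (e.diagnosticsErrors !== 0) missing.push("diagnostics");
  if (e.buildExitCode !== 0) missing.push("build");
  if (!e.testsPassed) missing.push("tests");

  // Manual review counts only if EVERY changed file was read.
  const read = new Set(e.filesRead);
  const unread = e.changedFiles.filter((file) => !read.has(file));
  if (unread.length > 0) missing.push(`manual review: ${unread.join(", ")}`);

  if (!e.claimsMatchCode) missing.push("cross-check");
  if (!e.planFileRead) missing.push("boulder state");
  return missing;
}

// Usage: automated checks pass, but one changed file was never read.
const gaps = missingEvidence({
  diagnosticsErrors: 0,
  buildExitCode: 0,
  testsPassed: true,
  changedFiles: ["src/a.ts", "src/b.ts"],
  filesRead: ["src/a.ts"],
  claimsMatchCode: true,
  planFileRead: true,
});
console.log(gaps.length === 0 ? "complete" : `not complete: ${gaps.join("; ")}`);
```

The usage example shows the failure mode the commit message targets: green automated checks with an unread file still count as not complete.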

@@ -33,6 +33,7 @@ export const BOULDER_CONTINUATION_PROMPT = `${createSystemDirective(SystemDirect
You have an active work plan with incomplete tasks. Continue working.
RULES:
- **FIRST**: Read the plan file NOW to check exact current progress — count remaining \`- [ ]\` tasks
- Proceed without asking for permission
- Change \`- [ ]\` to \`- [x]\` in the plan file when done
- Use the notepad at .sisyphus/notepads/{PLAN_NAME}/ to record learnings
@@ -48,15 +49,36 @@ Tests FAILING, code has ERRORS, implementation INCOMPLETE - but they say "done".
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
**STEP 1: VERIFY WITH YOUR OWN TOOL CALLS (DO THIS NOW)**
**STEP 1: AUTOMATED VERIFICATION (DO THIS FIRST)**
Run these commands YOURSELF - do NOT trust agent's claims:
1. \`lsp_diagnostics\` on changed files → Must be CLEAN
2. \`bash\` to run tests → Must PASS
3. \`bash\` to run build/typecheck → Must succeed
4. \`Read\` the actual code → Must match requirements
**STEP 2: DETERMINE IF HANDS-ON QA IS NEEDED**
**STEP 2: MANUAL CODE REVIEW (NON-NEGOTIABLE — DO NOT SKIP)**
Automated checks are NECESSARY but INSUFFICIENT. You MUST read the actual code.
**RIGHT NOW — \`Read\` EVERY file the subagent touched. No exceptions.**
For EACH changed file, verify:
1. Does the implementation logic ACTUALLY match the task requirements?
2. Are there incomplete stubs (TODO comments, placeholder code, hardcoded values)?
3. Are there logic errors, off-by-one bugs, or missing edge cases?
4. Does it follow existing codebase patterns and conventions?
5. Are imports correct? No unused or missing imports?
6. Is error handling present where needed?
**Cross-check the subagent's claims against reality:**
- Subagent said "Updated X" → READ X. Is it actually updated?
- Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior?
- Subagent said "Follows patterns" → COMPARE with reference. Does it actually?
**If you cannot explain what the changed code does, you have not reviewed it.**
**If you skip this step, you are rubber-stamping broken work.**
**STEP 3: DETERMINE IF HANDS-ON QA IS NEEDED**
| Deliverable Type | QA Method | Tool |
|------------------|-----------|------|
@@ -66,7 +88,7 @@ Run these commands YOURSELF - do NOT trust agent's claims:
Static analysis CANNOT catch: visual bugs, animation issues, user flow breakages.
**STEP 3: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
**STEP 4: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
\`\`\`
todowrite([
@@ -76,7 +98,8 @@ todowrite([
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
**BLOCKING: DO NOT proceed to Step 4 until Steps 1-3 are VERIFIED.**`
**BLOCKING: DO NOT proceed until Steps 1-4 are ALL completed.**
**Skipping Step 2 (manual code review) = unverified work = FAILURE.**`
export const ORCHESTRATOR_DELEGATION_REQUIRED = `
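
Step 2 above asks whether each changed file contains incomplete stubs (TODO comments, placeholder code, hardcoded values). A small scan can point a reviewer at suspicious lines before the line-by-line read. The sketch below is an aid under stated assumptions, not a substitute for the manual review the prompt requires: `findStubMarkers` and the marker list are hypothetical, and a clean scan says nothing about logic errors or hardcoded values.

```typescript
// Hypothetical pre-pass for manual review: flag lines with stub markers.
interface StubHit {
  line: number; // 1-based line number in the scanned source
  text: string; // trimmed content of the flagged line
}

// Assumed marker list; extend it to match the codebase's own conventions.
const STUB_PATTERN = /\b(TODO|FIXME|XXX|placeholder|not implemented)\b/i;

function findStubMarkers(source: string): StubHit[] {
  const hits: StubHit[] = [];
  source.split("\n").forEach((text, index) => {
    if (STUB_PATTERN.test(text)) {
      hits.push({ line: index + 1, text: text.trim() });
    }
  });
  return hits;
}

// Usage on an inline sample that a subagent might report as "done".
const sampleSource = [
  "export function total(items: number[]): number {",
  "  // TODO: handle empty input",
  "  return 0; // placeholder",
  "}",
].join("\n");

for (const hit of findStubMarkers(sampleSource)) {
  console.log(`line ${hit.line}: ${hit.text}`);
}
```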

@@ -26,7 +26,16 @@ export function buildOrchestratorReminder(
${buildVerificationReminder(sessionId)}
**STEP 4: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**
**STEP 5: CHECK BOULDER STATE DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
Do NOT rely on cached progress. Read the plan file NOW:
\`\`\`
Read(".sisyphus/tasks/${planName}.yaml")
\`\`\`
Count exactly: how many \`- [ ]\` remain? How many \`- [x]\` completed?
This is YOUR ground truth. Use it to decide what comes next.
**STEP 6: MARK COMPLETION IN PLAN FILE (IMMEDIATELY)**
RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
@@ -36,14 +45,14 @@ Update the plan file \`.sisyphus/tasks/${planName}.yaml\`:
**DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**
**STEP 5: COMMIT ATOMIC UNIT**
**STEP 7: COMMIT ATOMIC UNIT**
- Stage ONLY the verified changes
- Commit with clear message describing what was done
**STEP 6: PROCEED TO NEXT TASK**
**STEP 8: PROCEED TO NEXT TASK**
- Read the plan file to identify the next \`- [ ]\` task
- Read the plan file AGAIN to identify the next \`- [ ]\` task
- Start immediately - DO NOT STOP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@@ -57,7 +66,12 @@ export function buildStandaloneVerificationReminder(sessionId: string): string {
${buildVerificationReminder(sessionId)}
**STEP 4: UPDATE TODO STATUS (IMMEDIATELY)**
**STEP 5: CHECK YOUR PROGRESS DIRECTLY (EVERY TIME — NO EXCEPTIONS)**
Do NOT rely on memory or cached state. Run \`todoread\` NOW to see exact current state.
Count pending vs completed tasks. This is your ground truth for what comes next.
**STEP 6: UPDATE TODO STATUS (IMMEDIATELY)**
RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
@@ -66,15 +80,15 @@ RIGHT NOW - Do not delay. Verification passed → Mark IMMEDIATELY.
**DO THIS BEFORE ANYTHING ELSE. Unmarked = Untracked = Lost progress.**
**STEP 5: EXECUTE QA TASKS (IF ANY)**
**STEP 7: EXECUTE QA TASKS (IF ANY)**
If QA tasks exist in your todo list:
- Execute them BEFORE proceeding
- Mark each QA task complete after successful verification
**STEP 6: PROCEED TO NEXT PENDING TASK**
**STEP 8: PROCEED TO NEXT PENDING TASK**
- Identify the next \`pending\` task from your todo list
- Run \`todoread\` AGAIN to identify the next \`pending\` task
- Start immediately - DO NOT STOP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
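
Steps 6 and 8 of the orchestrator reminder above describe marking a finished task in the plan file and then reading the file again to find the next `- [ ]` task. The sketch below illustrates both operations on plan text. It assumes one task per line in the form `- [ ] description` with `\n` line endings; `nextPendingTask` and `markTaskDone` are hypothetical names, and writing the result back to disk is left out.

```typescript
// Hypothetical helpers, not part of this commit.

// Returns the description of the first unchecked task, or null if none remain.
function nextPendingTask(planText: string): string | null {
  for (const line of planText.split("\n")) {
    const match = /^\s*- \[ \] (.+)$/.exec(line);
    if (match) return match[1].trim();
  }
  return null;
}

// Flips the first unchecked line whose description equals `task` to "- [x]".
// Returns the plan text unchanged if no such line exists.
function markTaskDone(planText: string, task: string): string {
  let done = false;
  return planText
    .split("\n")
    .map((line) => {
      if (done) return line;
      const match = /^(\s*)- \[ \] (.+)$/.exec(line);
      if (match && match[2].trim() === task) {
        done = true;
        return `${match[1]}- [x] ${match[2]}`; // keep the original indentation
      }
      return line;
    })
    .join("\n");
}

// Usage: finish the current task, then re-derive the next one from the text.
let plan = ["- [x] Task 1", "- [ ] Task 2", "- [ ] Task 3"].join("\n");
const current = nextPendingTask(plan);
if (current !== null) {
  plan = markTaskDone(plan, current);
}
console.log(nextPendingTask(plan));
```

Deriving the next task from the updated text, rather than from a remembered index, matches the reminder's instruction to read the plan file again before proceeding.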