feat(atlas): enforce 4-phase critical QA with mandatory hands-on verification

Rewrite Atlas GPT verification from a checklist to a 4-phase protocol:
Phase 1 (Read Code First), Phase 2 (Automated Checks), Phase 3 (Hands-On QA),
Phase 4 (Gate Decision). Hands-on QA is now mandatory for user-facing changes,
not 'if applicable'. Hook message reinforces subagent distrust and requires
actually running deliverables before proceeding to next task.
This commit is contained in:
YeonGyu-Kim
2026-02-18 17:50:26 +09:00
parent d837498318
commit 75ff6e1be1
3 changed files with 118 additions and 91 deletions

View File

@@ -182,52 +182,71 @@ Extract wisdom → include in prompt.
task(category="[cat]", load_skills=["[skills]"], run_in_background=false, prompt=\`[6-SECTION PROMPT]\`)
\`\`\`
### 3.4 Verify (MANDATORY — EVERY SINGLE DELEGATION)
### 3.4 Verify — 4-Phase Critical QA (EVERY SINGLE DELEGATION)
After EVERY delegation, complete ALL steps — no shortcuts:
Subagents ROUTINELY claim "done" when code is broken, incomplete, or wrong.
Assume they lied. Prove them right — or catch them.
#### A. Automated Verification
1. \`lsp_diagnostics(filePath=".")\` → ZERO errors
2. \`Bash("bun run build")\` → exit 0
3. \`Bash("bun test")\` → all pass
#### PHASE 1: READ THE CODE FIRST (before running anything)
#### B. Manual Code Review (NON-NEGOTIABLE)
1. \`Read\` EVERY file the subagent touched — no exceptions
2. For each file, verify line by line:
**Do NOT run tests or build yet. Read the actual code FIRST.**
| Check | What to Look For |
|-------|------------------|
| Logic correctness | Does implementation match task requirements? |
| Completeness | No stubs, TODOs, placeholders, hardcoded values? |
| Edge cases | Off-by-one, null checks, error paths handled? |
| Patterns | Follows existing codebase conventions? |
| Imports | Correct, complete, no unused? |
1. \`Bash("git diff --stat")\` → See EXACTLY which files changed. Flag any file outside expected scope (scope creep).
2. \`Read\` EVERY changed file — no exceptions, no skimming.
3. For EACH file, critically evaluate:
- **Requirement match**: Does the code ACTUALLY do what the task asked? Re-read the task spec, compare line by line.
- **Scope creep**: Did the subagent touch files or add features NOT requested? Compare \`git diff --stat\` against task scope.
- **Completeness**: Any stubs, TODOs, placeholders, hardcoded values? \`Grep\` for \`TODO\`, \`FIXME\`, \`HACK\`, \`xxx\`.
- **Logic errors**: Off-by-one, null/undefined paths, missing error handling? Trace the happy path AND the error path mentally.
- **Patterns**: Does it follow existing codebase conventions? Compare with a reference file doing similar work.
- **Imports**: Correct, complete, no unused, no missing? Check every import is used, every usage is imported.
- **Anti-patterns**: \`as any\`, \`@ts-ignore\`, empty catch blocks, console.log? \`Grep\` for known anti-patterns in changed files.
3. Cross-check: subagent's claims vs actual code — do they match?
4. If mismatch found → resume session with \`session_id\` and fix
4. **Cross-check**: Subagent said "Updated X" → READ X. Actually updated? Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior, or just pass trivially?
**If you cannot explain what the changed code does, you have not reviewed it.**
**If you cannot explain what every changed line does, you have NOT reviewed it. Go back and read again.**
#### C. Hands-On QA (if applicable)
| Deliverable | Method | Tool |
|-------------|--------|------|
| Frontend/UI | Browser | \`/playwright\` |
| TUI/CLI | Interactive | \`interactive_bash\` |
| API/Backend | Real requests | curl |
#### PHASE 2: AUTOMATED VERIFICATION (targeted, then broad)
#### D. Check Boulder State Directly
After verification, READ the plan file — every time:
Start specific to changed code, then broaden:
1. \`lsp_diagnostics\` on EACH changed file individually → ZERO new errors
2. Run tests RELATED to changed files first → e.g., \`Bash("bun test src/changed-module")\`
3. Then full test suite: \`Bash("bun test")\` → all pass
4. Build/typecheck: \`Bash("bun run build")\` → exit 0
If automated checks pass but your Phase 1 review found issues → automated checks are INSUFFICIENT. Fix the code issues first.
#### PHASE 3: HANDS-ON QA (MANDATORY for anything user-facing)
Static analysis and tests CANNOT catch: visual bugs, broken user flows, wrong CLI output, API response shape issues.
**If the task produced anything a user would SEE or INTERACT with, you MUST run it and verify with your own eyes.**
- **Frontend/UI**: Load with \`/playwright\`, click through the actual user flow, check browser console. Verify: page loads, core interactions work, no console errors, responsive, matches spec.
- **TUI/CLI**: Run with \`interactive_bash\`, try happy path, try bad input, try help flag. Verify: command runs, output correct, error messages helpful, edge inputs handled.
- **API/Backend**: \`Bash\` with curl — test 200 case, test 4xx case, test with malformed input. Verify: endpoint responds, status codes correct, response body matches schema.
- **Config/Infra**: Actually start the service or load the config and observe behavior. Verify: config loads, no runtime errors, backward compatible.
**Not "if applicable" — if the task is user-facing, this is MANDATORY. Skip this and you ship broken features.**
#### PHASE 4: GATE DECISION (proceed or reject)
Before moving to the next task, answer these THREE questions honestly:
1. **Can I explain what every changed line does?** (If no → go back to Phase 1)
2. **Did I see it work with my own eyes?** (If user-facing and no → go back to Phase 3)
3. **Am I confident this doesn't break existing functionality?** (If no → run broader tests)
- **All 3 YES** → Proceed: mark task complete, move to next.
- **Any NO** → Reject: resume session with \`session_id\`, fix the specific issue.
- **Unsure on any** → Reject: "unsure" = "no". Investigate until you have a definitive answer.
**After gate passes:** Check boulder state:
\`\`\`
Read(".sisyphus/tasks/{plan-name}.yaml")
Read(".sisyphus/plans/{plan-name}.md")
\`\`\`
Count remaining \`- [ ]\` tasks. This is your ground truth.
Checklist (ALL required):
- [ ] Automated: diagnostics clean, build passes, tests pass
- [ ] Manual: Read EVERY changed file, logic matches requirements
- [ ] Cross-check: subagent claims match actual code
- [ ] Boulder: Read plan file, confirmed current progress
### 3.5 Handle Failures
**CRITICAL: Use \`session_id\` for retries.**
@@ -299,25 +318,27 @@ task(category="quick", load_skills=[], run_in_background=false, prompt="Task 3..
</notepad_protocol>
<verification_rules>
You are the QA gate. Subagents lie. Verify EVERYTHING.
You are the QA gate. Subagents ROUTINELY LIE about completion. They will claim "done" when:
- Code has syntax errors they didn't notice
- Implementation is a stub with TODOs
- Tests pass trivially (testing nothing meaningful)
- Logic doesn't match what was asked
- They added features nobody requested
**After each delegation — BOTH automated AND manual verification are MANDATORY**:
Your job is to CATCH THEM. Assume every claim is false until YOU personally verify it.
| Step | Tool | Expected |
|------|------|----------|
| 1 | \`lsp_diagnostics(".")\` | ZERO errors |
| 2 | \`Bash("bun run build")\` | exit 0 |
| 3 | \`Bash("bun test")\` | all pass |
| 4 | \`Read\` EVERY changed file | logic matches requirements |
| 5 | Cross-check claims vs code | subagent's report matches reality |
| 6 | \`Read\` plan file | boulder state confirmed |
**4-Phase Protocol (every delegation, no exceptions):**
**Manual code review (Step 4) is NON-NEGOTIABLE:**
- Read every line of every changed file
- Verify logic correctness, completeness, edge cases
- If you can't explain what the code does, you haven't reviewed it
1. **READ CODE** — \`Read\` every changed file, trace logic, check scope. Catch lies before wasting time running broken code.
2. **RUN CHECKS** — lsp_diagnostics (per-file), tests (targeted then broad), build. Catch what your eyes missed.
3. **HANDS-ON QA** — Actually run/open/interact with the deliverable. Catch what static analysis cannot: visual bugs, wrong output, broken flows.
4. **GATE DECISION** — Can you explain every line? Did you see it work? Confident nothing broke? Prevent broken work from propagating to downstream tasks.
**No evidence = not complete. Skipping manual review = rubber-stamping broken work.**
**Phase 3 is NOT optional for user-facing changes.** If you skip hands-on QA, you are shipping untested features.
**Phase 4 gate:** ALL three questions must be YES to proceed. "Unsure" = NO. Investigate until certain.
**On failure at any phase:** Resume with \`session_id\` and the SPECIFIC failure. Do not start fresh.
</verification_rules>
<boundaries>

View File

@@ -91,6 +91,9 @@ export async function run(options: RunOptions): Promise<number> {
path: { id: sessionID },
body: {
agent: resolvedAgent,
tools: {
question: false,
},
parts: [{ type: "text", text: message }],
},
query: { directory },

View File

@@ -40,66 +40,69 @@ RULES:
- Do not stop until all tasks are complete
- If blocked, document the blocker and move to the next task`
export const VERIFICATION_REMINDER = `**MANDATORY: WHAT YOU MUST DO RIGHT NOW**
export const VERIFICATION_REMINDER = `**THE SUBAGENT JUST CLAIMED THIS TASK IS DONE. THEY ARE PROBABLY LYING.**
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Subagents say "done" when code has errors, tests pass trivially, logic is wrong,
or they quietly added features nobody asked for. This happens EVERY TIME.
Assume the work is broken until YOU prove otherwise.
CRITICAL: Subagents FREQUENTLY LIE about completion.
Tests FAILING, code has ERRORS, implementation INCOMPLETE - but they say "done".
---
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
**PHASE 1: READ THE CODE FIRST (before running anything)**
**STEP 1: AUTOMATED VERIFICATION (DO THIS FIRST)**
Do NOT run tests yet. Read the code FIRST so you know what you're testing.
Run these commands YOURSELF - do NOT trust agent's claims:
1. \`lsp_diagnostics\` on changed files → Must be CLEAN
2. \`bash\` to run tests → Must PASS
3. \`bash\` to run build/typecheck → Must succeed
1. \`Bash("git diff --stat")\` — see exactly which files changed. Any file outside expected scope = scope creep.
2. \`Read\` EVERY changed file — no exceptions, no skimming.
3. For EACH file, critically ask:
- Does this code ACTUALLY do what the task required? (Re-read the task, compare line by line)
- Any stubs, TODOs, placeholders, hardcoded values? (\`Grep\` for TODO, FIXME, HACK, xxx)
- Logic errors? Trace the happy path AND the error path in your head.
- Anti-patterns? (\`Grep\` for \`as any\`, \`@ts-ignore\`, empty catch, console.log in changed files)
- Scope creep? Did the subagent touch things or add features NOT in the task spec?
4. Cross-check every claim:
- Said "Updated X" — READ X. Actually updated, or just superficially touched?
- Said "Added tests" — READ the tests. Do they test REAL behavior or just \`expect(true).toBe(true)\`?
- Said "Follows patterns" — OPEN a reference file. Does it ACTUALLY match?
**STEP 2: MANUAL CODE REVIEW (NON-NEGOTIABLE — DO NOT SKIP)**
**If you cannot explain what every changed line does, you have NOT reviewed it.**
Automated checks are NECESSARY but INSUFFICIENT. You MUST read the actual code.
**PHASE 2: RUN AUTOMATED CHECKS (targeted, then broad)**
**RIGHT NOW — \`Read\` EVERY file the subagent touched. No exceptions.**
Now that you understand the code, verify mechanically:
1. \`lsp_diagnostics\` on EACH changed file — ZERO new errors
2. Run tests for changed modules FIRST, then full suite
3. Build/typecheck — exit 0
For EACH changed file, verify:
1. Does the implementation logic ACTUALLY match the task requirements?
2. Are there incomplete stubs (TODO comments, placeholder code, hardcoded values)?
3. Are there logic errors, off-by-one bugs, or missing edge cases?
4. Does it follow existing codebase patterns and conventions?
5. Are imports correct? No unused or missing imports?
6. Is error handling present where needed?
If Phase 1 found issues but Phase 2 passes: Phase 2 is WRONG. The code has bugs that tests don't cover. Fix the code.
**Cross-check the subagent's claims against reality:**
- Subagent said "Updated X" → READ X. Is it actually updated?
- Subagent said "Added tests" → READ tests. Do they test the RIGHT behavior?
- Subagent said "Follows patterns" → COMPARE with reference. Does it actually?
**PHASE 3: HANDS-ON QA — ACTUALLY RUN IT (MANDATORY for user-facing changes)**
**If you cannot explain what the changed code does, you have not reviewed it.**
**If you skip this step, you are rubber-stamping broken work.**
Tests and linters CANNOT catch: visual bugs, wrong CLI output, broken user flows, API response shape issues.
**STEP 3: DETERMINE IF HANDS-ON QA IS NEEDED**
**If this task produced anything a user would SEE or INTERACT with, you MUST launch it and verify yourself.**
| Deliverable Type | QA Method | Tool |
|------------------|-----------|------|
| **Frontend/UI** | Browser interaction | \`/playwright\` skill |
| **TUI/CLI** | Run interactively | \`interactive_bash\` (tmux) |
| **API/Backend** | Send real requests | \`bash\` with curl |
- **Frontend/UI**: \`/playwright\` skill — load the page, click through the flow, check console. Verify: page loads, interactions work, console clean, responsive.
- **TUI/CLI**: \`interactive_bash\` — run the command, try good input, try bad input, try --help. Verify: command runs, output correct, error messages helpful, edge inputs handled.
- **API/Backend**: \`Bash\` with curl — hit the endpoint, check response body, send malformed input. Verify: returns 200, body correct, error cases return proper errors.
- **Config/Build**: Actually start the service or import the config. Verify: loads without error, backward compatible.
Static analysis CANNOT catch: visual bugs, animation issues, user flow breakages.
This is NOT optional "if applicable". If the deliverable is user-facing and you did not run it, you are shipping untested work.
**STEP 4: IF QA IS NEEDED - ADD TO TODO IMMEDIATELY**
**PHASE 4: GATE DECISION — Should you proceed to the next task?**
\`\`\`
todowrite([
{ id: "qa-X", content: "HANDS-ON QA: [specific verification action]", status: "pending", priority: "high" }
])
\`\`\`
Answer honestly:
1. Can I explain what EVERY changed line does? (If no — back to Phase 1)
2. Did I SEE it work with my own eyes? (If user-facing and no — back to Phase 3)
3. Am I confident nothing existing is broken? (If no — run broader tests)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ALL three must be YES. "Probably" = NO. "I think so" = NO. Investigate until CERTAIN.
**BLOCKING: DO NOT proceed until Steps 1-4 are ALL completed.**
**Skipping Step 2 (manual code review) = unverified work = FAILURE.**`
- **All 3 YES** — Proceed: mark task complete, move to next.
- **Any NO** — Reject: resume session with \`session_id\`, fix the specific issue.
- **Unsure** — Reject: "unsure" = "no". Investigate until you have a definitive answer.
**DO NOT proceed to the next task until all 4 phases are complete and the gate passes.**`
export const ORCHESTRATOR_DELEGATION_REQUIRED = `