refactor: compress plan template while recovering lost specificity guidelines

Reduce plan-template from 541 to 335 lines by removing redundant verbose
examples while recovering 3 lost context items: tool-type mapping table in
QA Policy, scenario specificity requirements (selectors/data/assertions/
timing/negative) in TODO template, and structured output format hints for
each Final Verification agent.
YeonGyu-Kim
2026-02-16 15:25:10 +09:00
parent 130aaaf910
commit dd11d5df1b


@@ -70,108 +70,25 @@ Generate plan to: \`.sisyphus/plans/{name}.md\`
## Verification Strategy (MANDATORY)
> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
>
> ALL tasks in this plan MUST be verifiable WITHOUT any human action.
> This is NOT conditional — it applies to EVERY task, regardless of test strategy.
>
> **FORBIDDEN** — acceptance criteria that require:
> - "User manually tests..." / "사용자가 직접 테스트..."
> - "User visually confirms..." / "사용자가 눈으로 확인..."
> - "User interacts with..." / "사용자가 직접 조작..."
> - "Ask user to verify..." / "사용자에게 확인 요청..."
> - ANY step where a human must perform an action
>
> **ALL verification is executed by the agent** using tools (Playwright, interactive_bash, curl, etc.). No exceptions.
> **ZERO HUMAN INTERVENTION** — ALL verification is agent-executed. No exceptions.
> Acceptance criteria requiring "user manually tests/confirms" are FORBIDDEN.
### Test Decision
- **Infrastructure exists**: [YES/NO]
- **Automated tests**: [TDD / Tests-after / None]
- **Framework**: [bun test / vitest / jest / pytest / none]
- **If TDD**: Each task follows RED (failing test) → GREEN (minimal impl) → REFACTOR
### If TDD Enabled
### QA Policy
Every task MUST include agent-executed QA scenarios (see TODO template below).
Evidence saved to \`.sisyphus/evidence/task-{N}-{scenario-slug}.{ext}\`.
Each TODO follows RED-GREEN-REFACTOR:
**Task Structure:**
1. **RED**: Write failing test first (see the sketch after this list)
- Test file: \`[path].test.ts\`
- Test command: \`bun test [file]\`
- Expected: FAIL (test exists, implementation doesn't)
2. **GREEN**: Implement minimum code to pass
- Command: \`bun test [file]\`
- Expected: PASS
3. **REFACTOR**: Clean up while keeping green
- Command: \`bun test [file]\`
- Expected: PASS (still)
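As referenced in the RED step, a minimal sketch of what a failing-first test could look like, assuming bun's built-in test runner and a hypothetical `slugify` module that does not exist yet:

```typescript
// src/slugify.test.ts (sketch): RED stage. The import targets a module that
// has not been implemented yet, so `bun test src/slugify.test.ts` fails until
// the GREEN stage creates it.
import { expect, test } from "bun:test";
import { slugify } from "./slugify"; // hypothetical module, not yet written

test("slugify lowercases and hyphenates words", () => {
  expect(slugify("Hello World")).toBe("hello-world");
});
```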
**Test Setup Task (if infrastructure doesn't exist):**
- [ ] 0. Setup Test Infrastructure
- Install: \`bun add -d [test-framework]\`
- Config: Create \`[config-file]\`
- Verify: \`bun test --help\` → shows help
- Example: Create \`src/__tests__/example.test.ts\`
- Verify: \`bun test\` → 1 test passes
### Agent-Executed QA Scenarios (MANDATORY — ALL tasks)
> Whether TDD is enabled or not, EVERY task MUST include Agent-Executed QA Scenarios.
> - **With TDD**: QA scenarios complement unit tests at integration/E2E level
> - **Without TDD**: QA scenarios are the PRIMARY verification method
>
> These describe how the executing agent DIRECTLY verifies the deliverable
> by running it — opening browsers, executing commands, sending API requests.
> The agent performs what a human tester would do, but automated via tools.
**Verification Tool by Deliverable Type:**
| Type | Tool | How Agent Verifies |
|------|------|-------------------|
| **Frontend/UI** | Playwright (playwright skill) | Navigate, interact, assert DOM, screenshot |
| **TUI/CLI** | interactive_bash (tmux) | Run command, send keystrokes, validate output |
| **API/Backend** | Bash (curl/httpie) | Send requests, parse responses, assert fields |
| **Library/Module** | Bash (bun/node REPL) | Import, call functions, compare output |
| **Config/Infra** | Bash (shell commands) | Apply config, run state checks, validate |
**Each Scenario MUST Follow This Format:**
\`\`\`
Scenario: [Descriptive name — what user action/flow is being verified]
Tool: [Playwright / interactive_bash / Bash]
Preconditions: [What must be true before this scenario runs]
Steps:
1. [Exact action with specific selector/command/endpoint]
2. [Next action with expected intermediate state]
3. [Assertion with exact expected value]
Expected Result: [Concrete, observable outcome]
Failure Indicators: [What would indicate failure]
Evidence: [Screenshot path / output capture / response body path]
\`\`\`
**Scenario Detail Requirements:**
- **Selectors**: Specific CSS selectors (\`.login-button\`, not "the login button")
- **Data**: Concrete test data (\`"test@example.com"\`, not \`"[email]"\`)
- **Assertions**: Exact values (\`text contains "Welcome back"\`, not "verify it works")
- **Timing**: Include wait conditions where relevant (\`Wait for .dashboard (timeout: 10s)\`)
- **Negative Scenarios**: At least ONE failure/error scenario per feature
- **Evidence Paths**: Specific file paths (\`.sisyphus/evidence/task-N-scenario-name.png\`)
**Anti-patterns (NEVER write scenarios like this):**
- ❌ "Verify the login page works correctly"
- ❌ "Check that the API returns the right data"
- ❌ "Test the form validation"
- ❌ "User opens browser and confirms..."
**Write scenarios like this instead** (a Playwright sketch of the first example follows this list):
- ✅ \`Navigate to /login → Fill input[name="email"] with "test@example.com" → Fill input[name="password"] with "Pass123!" → Click button[type="submit"] → Wait for /dashboard → Assert h1 contains "Welcome"\`
- ✅ \`POST /api/users {"name":"Test","email":"new@test.com"} → Assert status 201 → Assert response.id is UUID → GET /api/users/{id} → Assert name equals "Test"\`
- ✅ \`Run ./cli --config test.yaml → Wait for "Loaded" in stdout → Send "q" → Assert exit code 0 → Assert stdout contains "Goodbye"\`
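For the first ✅ example, a sketch of how the executing agent could drive it with Playwright's library API. The base URL, selectors, and evidence path are illustrative assumptions taken from the example, not fixed values:

```typescript
// Sketch: agent-executed login scenario. Assumes the app is served at
// http://localhost:3000 and playwright is installed (`bun add -d playwright`).
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto("http://localhost:3000/login");
await page.fill('input[name="email"]', "test@example.com");
await page.fill('input[name="password"]', "Pass123!");
await page.click('button[type="submit"]');

// Wait for the post-login redirect, then assert the welcome heading.
await page.waitForURL("**/dashboard", { timeout: 10_000 });
const heading = await page.textContent("h1");
if (!heading?.includes("Welcome")) {
  throw new Error(`Expected h1 to contain "Welcome", got: ${heading}`);
}

// Evidence path follows the .sisyphus/evidence/ convention above.
await page.screenshot({ path: ".sisyphus/evidence/task-1-login-success.png" });
await browser.close();
```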
**Evidence Requirements:**
- Screenshots: \`.sisyphus/evidence/\` for all UI verifications
- Terminal output: Captured for CLI/TUI verifications
- Response bodies: Saved for API verifications
- All evidence referenced by specific file path in acceptance criteria
| Deliverable Type | Verification Tool | Method |
|------------------|-------------------|--------|
| Frontend/UI | Playwright (playwright skill) | Navigate, interact, assert DOM, screenshot |
| TUI/CLI | interactive_bash (tmux) | Run command, send keystrokes, validate output |
| API/Backend | Bash (curl) | Send requests, assert status + response fields |
| Library/Module | Bash (bun/node REPL) | Import, call functions, compare output |
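The API/Backend row specifies curl from Bash; the same assertions can also be scripted. A bun-runnable sketch using `fetch`, with the endpoint, payload, and evidence path mirroring the earlier POST /api/users example (all illustrative assumptions):

```typescript
// Sketch: agent-executed API verification. Assumes a local server on
// http://localhost:3000 exposing the /api/users endpoint from the example.
const base = "http://localhost:3000";

const created = await fetch(`${base}/api/users`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "Test", email: "new@test.com" }),
});
if (created.status !== 201) throw new Error(`Expected 201, got ${created.status}`);

const body = await created.json();
// Save the response body as evidence for the acceptance criteria.
await Bun.write(".sisyphus/evidence/task-2-create-user.json", JSON.stringify(body, null, 2));

const fetched = await fetch(`${base}/api/users/${body.id}`);
const user = await fetched.json();
if (user.name !== "Test") throw new Error(`Expected name "Test", got ${user.name}`);
```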
---
@@ -347,6 +264,13 @@ Max Concurrent: 7 (Waves 1 & 2)
Evidence: .sisyphus/evidence/task-{N}-{scenario-slug}-error.{ext}
\\\`\\\`\\\`
> **Specificity requirements — every scenario MUST use:**
> - **Selectors**: Specific CSS selectors (\`.login-button\`, not "the login button")
> - **Data**: Concrete test data (\`"test@example.com"\`, not \`"[email]"\`)
> - **Assertions**: Exact values (\`text contains "Welcome back"\`, not "verify it works")
> - **Timing**: Wait conditions where relevant (\`timeout: 10s\`)
> - **Negative**: At least ONE failure/error scenario per task
>
> **Anti-patterns (your scenario is INVALID if it looks like this):**
> - ❌ "Verify it works correctly" — HOW? What does "correctly" mean?
> - ❌ "Check the API returns data" — WHAT data? What fields? What values?
@@ -366,153 +290,23 @@ Max Concurrent: 7 (Waves 1 & 2)
## Final Verification Wave (MANDATORY — after ALL implementation tasks)
> **ALL 4 review agents run in PARALLEL after every implementation task is complete.**
> **ALL 4 must APPROVE before the plan is considered done.**
> **If ANY agent rejects, fix issues and re-run the rejecting agent(s).**
> 4 review agents run in PARALLEL. ALL must APPROVE. Rejection → fix → re-run.
- [ ] F1. Plan Compliance Audit
- [ ] F1. **Plan Compliance Audit** — \`oracle\`
Read the plan end-to-end. For each "Must Have": verify implementation exists (read file, curl endpoint, run command). For each "Must NOT Have": search codebase for forbidden patterns — reject with file:line if found. Check evidence files exist in .sisyphus/evidence/. Compare deliverables against plan.
Output: \`Must Have [N/N] | Must NOT Have [N/N] | Tasks [N/N] | VERDICT: APPROVE/REJECT\`
**Agent**: oracle (read-only consultation)
- [ ] F2. **Code Quality Review** — \`unspecified-high\`
Run \`tsc --noEmit\` + linter + \`bun test\`. Review all changed files for: \`as any\`/\`@ts-ignore\`, empty catches, console.log in prod, commented-out code, unused imports. Check AI slop: excessive comments, over-abstraction, generic names (data/result/item/temp).
Output: \`Build [PASS/FAIL] | Lint [PASS/FAIL] | Tests [N pass/N fail] | Files [N clean/N issues] | VERDICT\`
**What this agent does**:
Read the original work plan (.sisyphus/plans/{name}.md) and verify EVERY requirement was fulfilled.
- [ ] F3. **Real Manual QA** — \`unspecified-high\` (+ \`playwright\` skill if UI)
Start from a clean state. Execute EVERY QA scenario from EVERY task — follow the exact steps, capture evidence. Test cross-task integration (features working together, not in isolation). Test edge cases: empty state, invalid input, rapid actions. Save to \`.sisyphus/evidence/final-qa/\`.
Output: \`Scenarios [N/N pass] | Integration [N/N] | Edge Cases [N tested] | VERDICT\`
**Exact verification steps**:
1. Read the plan file end-to-end
2. For EACH item in "Must Have": verify the implementation exists and works
- Run the verification command listed in "Definition of Done"
- Check the file/endpoint/feature actually exists (read the file, curl the endpoint)
3. For EACH item in "Must NOT Have": verify it was NOT implemented
- Search codebase for forbidden patterns (grep, ast_grep_search)
- If found → REJECT with specific file:line reference
4. For EACH TODO task: verify acceptance criteria were met
- Check evidence files exist in .sisyphus/evidence/
- Verify test results match expected outcomes
5. Compare final deliverables against "Concrete Deliverables" list
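For step 4's evidence check, a small sketch of how the agent could confirm the referenced evidence files exist; the expected paths are illustrative assumptions read from the plan's tasks:

```typescript
// Sketch: verify that the evidence files referenced by each task exist.
// The list of expected paths is an assumption supplied by the plan.
import { existsSync } from "node:fs";

const expectedEvidence = [
  ".sisyphus/evidence/task-1-login-success.png",
  ".sisyphus/evidence/task-2-create-user.json",
];

for (const path of expectedEvidence) {
  console.log(`${existsSync(path) ? "✅" : "❌"} ${path}`);
}
```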
**Output format**:
\\\`\\\`\\\`
## Plan Compliance Report
### Must Have: [N/N passed]
- [✅/❌] [requirement]: [evidence]
### Must NOT Have: [N/N clean]
- [✅/❌] [guardrail]: [evidence]
### Task Completion: [N/N verified]
- [✅/❌] Task N: [criteria status]
### VERDICT: APPROVE / REJECT
### Rejection Reasons (if any): [specific issues]
\\\`\\\`\\\`
- [ ] F2. Code Quality Review
**Agent**: unspecified-high
**What this agent does**:
Review ALL changed/created files for production readiness. This is NOT a rubber stamp.
**Exact verification steps**:
1. Run full type check: \`bunx tsc --noEmit\` (or project equivalent) → must exit 0
2. Run linter if configured: \`bunx biome check .\` / \`bunx eslint .\` → must pass
3. Run full test suite: \`bun test\` → all tests pass, zero failures (a combined sketch of steps 1-3 follows this checklist)
4. For EACH new/modified file, check:
- No \`as any\`, \`@ts-ignore\`, \`@ts-expect-error\`
- No empty catch blocks \`catch(e) {}\`
- No console.log left in production code (unless intentional logging)
- No commented-out code blocks
- No TODO/FIXME/HACK comments without linked issue
- Consistent naming with existing codebase conventions
- Imports are clean (no unused imports)
5. Check for AI slop patterns:
- Excessive inline comments explaining obvious code
- Over-abstraction (unnecessary wrapper functions)
- Generic variable names (data, result, item, temp)
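A sketch of how the F2 agent could chain the three gates from steps 1-3 and print the PASS/FAIL summary used in the report below; it relies on Bun's shell API, and `biome` is only one of the linter options named above:

```typescript
// Sketch: run the F2 build/lint/test gates and print a summary per gate.
// .nothrow() keeps the script going so every gate is reported even on failure.
import { $ } from "bun";

const typecheck = await $`bunx tsc --noEmit`.nothrow();
const lint = await $`bunx biome check .`.nothrow();
const tests = await $`bun test`.nothrow();

const verdict = (exitCode: number) => (exitCode === 0 ? "PASS" : "FAIL");
console.log(`Build: ${verdict(typecheck.exitCode)}`);
console.log(`Lint: ${verdict(lint.exitCode)}`);
console.log(`Tests: ${verdict(tests.exitCode)}`);
```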
**Output format**:
\\\`\\\`\\\`
## Code Quality Report
### Build: [PASS/FAIL] — tsc exit code, error count
### Lint: [PASS/FAIL] — linter output summary
### Tests: [PASS/FAIL] — N passed, N failed, N skipped
### File Review: [N files reviewed]
- [file]: [issues found or "clean"]
### AI Slop Check: [N issues]
- [file:line]: [pattern detected]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F3. Real Manual QA
**Agent**: unspecified-high (with \`playwright\` skill if UI involved)
**What this agent does**:
Actually RUN the deliverable end-to-end as a real user would. No mocks, no shortcuts.
**Exact verification steps**:
1. Start the application/service from scratch (clean state)
2. Execute EVERY QA scenario from EVERY task in the plan sequentially:
- Follow the exact steps written in each task's QA Scenarios section
- Capture evidence (screenshots, terminal output, response bodies)
- Compare actual behavior against expected results
3. Test cross-task integration:
- Does feature A work correctly WITH feature B? (not just in isolation)
- Does the full user flow work end-to-end?
4. Test edge cases not covered by individual tasks:
- Empty state / first-time use
- Rapid repeated actions
- Invalid/malformed input
- Network interruption (if applicable)
5. Save ALL evidence to .sisyphus/evidence/final-qa/
**Output format**:
\\\`\\\`\\\`
## Manual QA Report
### Scenarios Executed: [N/N passed]
- [✅/❌] Task N - Scenario name: [result]
### Integration Tests: [N/N passed]
- [✅/❌] [flow name]: [result]
### Edge Cases: [N tested]
- [✅/❌] [case]: [result]
### Evidence: .sisyphus/evidence/final-qa/
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F4. Scope Fidelity Check
**Agent**: deep
**What this agent does**:
Verify that EACH task implemented EXACTLY what was specified — no more, no less.
Catches scope creep, missing features, and unauthorized additions.
**Exact verification steps**:
1. For EACH completed task in the plan:
a. Read the task's "What to do" section
b. Read the actual diff/files created for that task (git log, git diff, file reads)
c. Verify 1:1 correspondence:
- Everything in "What to do" was implemented → no missing features
- Nothing BEYOND "What to do" was implemented → no scope creep
d. Read the task's "Must NOT do" section
e. Verify NONE of the forbidden items were implemented
2. Check for unauthorized cross-task contamination (see the sketch after this list):
- Did Task 5 accidentally implement something that belongs to Task 8?
- Are there files modified that don't belong to any task?
3. Verify each task's boundaries are respected:
- No task touches files outside its stated scope
- No task implements functionality assigned to a different task
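For steps 2-3, a sketch of the contamination and unaccounted-changes check, assuming the plan supplies a base ref and a per-task file scope (both hypothetical values here):

```typescript
// Sketch: flag changed files that no task's stated scope claims.
// baseRef and taskScopes are assumptions read from the plan, not values
// defined by this template.
import { $ } from "bun";

const baseRef = "main";
const taskScopes: Record<string, string[]> = {
  "Task 1": ["src/auth/login.ts", "src/auth/login.test.ts"],
};

const range = `${baseRef}...HEAD`;
const changed = (await $`git diff --name-only ${range}`.text())
  .split("\n")
  .filter(Boolean);

const claimed = new Set(Object.values(taskScopes).flat());
const unaccounted = changed.filter((file) => !claimed.has(file));

console.log(
  unaccounted.length === 0
    ? "Unaccounted Changes: CLEAN"
    : `Unaccounted Changes: ${unaccounted.length} files (${unaccounted.join(", ")})`,
);
```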
**Output format**:
\\\`\\\`\\\`
## Scope Fidelity Report
### Task-by-Task Audit: [N/N compliant]
- [✅/❌] Task N: [compliance status]
- Implemented: [list of what was done]
- Missing: [anything from "What to do" not found]
- Excess: [anything done that wasn't in "What to do"]
- "Must NOT do" violations: [list or "none"]
### Cross-Task Contamination: [CLEAN / N issues]
### Unaccounted Changes: [CLEAN / N files]
### VERDICT: APPROVE / REJECT
\\\`\\\`\\\`
- [ ] F4. **Scope Fidelity Check** — \`deep\`
For each task: read "What to do", then read the actual diff (git log/diff). Verify 1:1 correspondence — everything in the spec was built (nothing missing), nothing beyond the spec was built (no creep). Check "Must NOT do" compliance. Detect cross-task contamination: Task N touching Task M's files. Flag unaccounted changes.
Output: \`Tasks [N/N compliant] | Contamination [CLEAN/N issues] | Unaccounted [CLEAN/N files] | VERDICT\`
---