oh-my-openagent/.opencode/skills/work-with-pr-workspace/iteration-1/benchmark.md

# Benchmark: work-with-pr (Iteration 1)

## Summary

| Metric | With Skill | Without Skill | Delta |
|--------|-----------|---------------|-------|
| Pass Rate | 96.8% (30/31) | 51.6% (16/31) | +45.2 pp |
| Mean Duration | 340.2s | 303.0s | +37.2s |
| Duration Stddev | 169.3s | 77.8s | +91.5s |

## Per-Eval Breakdown

| Eval | With Skill | Without Skill | Delta |
|------|-----------|---------------|-------|
| happy-path-feature-config-option | 100% (10/10) | 40% (4/10) | +60 pp |
| bugfix-atlas-null-check | 100% (6/6) | 67% (4/6) | +33 pp |
| refactor-split-constants | 100% (5/5) | 40% (2/5) | +60 pp |
| new-mcp-arxiv-casual | 100% (5/5) | 60% (3/5) | +40 pp |
| regex-fix-false-positive | 80% (4/5) | 60% (3/5) | +20 pp |
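
As a sanity check, the summary pass rates can be recomputed by pooling the per-eval assertion counts. The sketch below assumes the summary pools all 31 assertions rather than averaging per-eval percentages; the dict layout and names are illustrative and not taken from the benchmark harness.

```python
# Per-eval (passed, total) counts copied from the table above.
# The structure and names here are illustrative assumptions, not harness code.
per_eval = {
    "happy-path-feature-config-option": {"with": (10, 10), "without": (4, 10)},
    "bugfix-atlas-null-check": {"with": (6, 6), "without": (4, 6)},
    "refactor-split-constants": {"with": (5, 5), "without": (2, 5)},
    "new-mcp-arxiv-casual": {"with": (5, 5), "without": (3, 5)},
    "regex-fix-false-positive": {"with": (4, 5), "without": (3, 5)},
}


def pooled_rate(condition):
    """Return (passed, total, percent) pooled over all evals for one condition."""
    passed = sum(counts[condition][0] for counts in per_eval.values())
    total = sum(counts[condition][1] for counts in per_eval.values())
    return passed, total, 100.0 * passed / total


with_passed, with_total, with_pct = pooled_rate("with")
without_passed, without_total, without_pct = pooled_rate("without")
delta_pp = with_pct - without_pct  # difference in percentage points

print(f"With skill:    {with_pct:.1f}% ({with_passed}/{with_total})")
print(f"Without skill: {without_pct:.1f}% ({without_passed}/{without_total})")
print(f"Delta:         {delta_pp:+.1f} percentage points")
```

Pooling weights each eval by its assertion count, so the 10-assertion happy-path eval counts for twice as much as a 5-assertion eval.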

## Key Discriminators

- **three-gates** (CI + review-work + Cubic): 5/5 with skill vs 0/5 without; strongest signal
- **worktree-isolation**: 5/5 vs 1/5
- **atomic-commits**: 2/2 vs 0/2
- **cubic-check-method**: 1/1 vs 0/1
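
The order of this list is consistent with ranking each assertion group by the raw difference in passing counts between the two conditions. Whether the ranking was actually produced this way is an assumption; the sketch below only illustrates that reading, using the counts from the list.

```python
# (with_passed, without_passed, total) per assertion group, copied from the list above.
# Ranking by count gap is an assumed reading of the list order, not a documented method.
discriminators = {
    "worktree-isolation": (5, 1, 5),
    "cubic-check-method": (1, 0, 1),
    "three-gates": (5, 0, 5),
    "atomic-commits": (2, 0, 2),
}


def count_gap(name):
    """Number of extra passing assertions in the with-skill condition."""
    with_passed, without_passed, _total = discriminators[name]
    return with_passed - without_passed


ranked = sorted(discriminators, key=count_gap, reverse=True)

for name in ranked:
    with_passed, without_passed, total = discriminators[name]
    print(f"{name}: {with_passed}/{total} vs {without_passed}/{total} (gap {count_gap(name)})")
```

A count-based gap favors groups with more observations, which is why three-gates (gap of 5 over 5 trials) ranks above cubic-check-method (gap of 1 over a single trial) even though both show a 100% vs 0% split.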

## Non-Discriminating Assertions

- References actual files: passes in both conditions
- PR targets dev: passes in both conditions
- Runs local checks before pushing: passes in both conditions

## Only With-Skill Failure

- **eval-5 minimal-change** (regex-fix-false-positive, 4/5): The skill-guided agent proposed config schema changes and a Go binary update for what should have been a minimal regex fix. The skill may encourage over-engineering in fix scenarios.

## Analyst Notes

- The skill adds the most value for procedural knowledge (verification gates, worktree workflow) that agents cannot infer from the codebase alone.
- The duration cost is modest (+12% mean duration, +37.2s) and acceptable given the +45.2 pp pass-rate improvement.
- Consider adding explicit "fix-type tasks: stay minimal" guidance in iteration 2.
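
The +12% duration figure in the notes above follows from the mean durations in the Summary table. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Mean durations in seconds, copied from the Summary table.
mean_with_s = 340.2
mean_without_s = 303.0

overhead_s = mean_with_s - mean_without_s
# Relative overhead is measured against the without-skill baseline.
overhead_pct = 100.0 * overhead_s / mean_without_s

print(f"{overhead_s:+.1f}s ({overhead_pct:+.1f}%)")  # prints +37.2s (+12.3%)
```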