oh-my-openagent/.opencode/skills/work-with-pr-workspace/iteration-1/benchmark.json
YeonGyu-Kim c7518eae2d add skills
2026-03-14 12:45:58 +09:00

{
"skill_name": "work-with-pr",
"iteration": 1,
"summary": {
"with_skill": {
"pass_rate": 0.968,
"mean_duration_seconds": 340.2,
"stddev_duration_seconds": 169.3
},
"without_skill": {
"pass_rate": 0.516,
"mean_duration_seconds": 303.0,
"stddev_duration_seconds": 77.8
},
"delta": {
"pass_rate": 0.452,
"mean_duration_seconds": 37.2,
"stddev_duration_seconds": 91.5
}
},
"evals": [
{
"eval_name": "happy-path-feature-config-option",
"with_skill": {
"pass_rate": 1.0,
"passed": 10,
"total": 10,
"duration_seconds": 292,
"failed_assertions": []
},
"without_skill": {
"pass_rate": 0.4,
"passed": 4,
"total": 10,
"duration_seconds": 365,
"failed_assertions": [
{"assertion": "Plan uses git worktree in a sibling directory", "reason": "Uses git checkout -b, no worktree isolation"},
{"assertion": "Plan specifies multiple atomic commits for multi-file changes", "reason": "Steps listed sequentially but no atomic commit strategy mentioned"},
{"assertion": "Verification loop includes all 3 gates: CI, review-work, and Cubic", "reason": "Only mentions CI pipeline in step 6. No review-work or Cubic."},
{"assertion": "Gates are checked in order: CI first, then review-work, then Cubic", "reason": "No gate ordering - only CI mentioned"},
{"assertion": "Cubic check uses gh api to check cubic-dev-ai[bot] reviews", "reason": "No mention of Cubic at all"},
{"assertion": "Plan includes worktree cleanup after merge", "reason": "No worktree used, no cleanup needed"}
]
}
},
{
"eval_name": "bugfix-atlas-null-check",
"with_skill": {
"pass_rate": 1.0,
"passed": 6,
"total": 6,
"duration_seconds": 506,
"failed_assertions": []
},
"without_skill": {
"pass_rate": 0.667,
"passed": 4,
"total": 6,
"duration_seconds": 325,
"failed_assertions": [
{"assertion": "Plan uses git worktree in a sibling directory", "reason": "No worktree. Steps go directly to creating branch and modifying files."},
{"assertion": "Verification loop includes all 3 gates", "reason": "Only mentions CI pipeline (step 5). No review-work or Cubic."}
]
}
},
{
"eval_name": "refactor-split-constants",
"with_skill": {
"pass_rate": 1.0,
"passed": 5,
"total": 5,
"duration_seconds": 181,
"failed_assertions": []
},
"without_skill": {
"pass_rate": 0.4,
"passed": 2,
"total": 5,
"duration_seconds": 229,
"failed_assertions": [
{"assertion": "Plan uses git worktree in a sibling directory", "reason": "git checkout -b only, no worktree"},
{"assertion": "Uses 2+ commits for the multi-file refactor", "reason": "Single atomic commit: 'refactor: split delegate-task constants and category model requirements'"},
{"assertion": "Verification loop includes all 3 gates", "reason": "Only mentions typecheck/test/build. No review-work or Cubic."}
]
}
},
{
"eval_name": "new-mcp-arxiv-casual",
"with_skill": {
"pass_rate": 1.0,
"passed": 5,
"total": 5,
"duration_seconds": 152,
"failed_assertions": []
},
"without_skill": {
"pass_rate": 0.6,
"passed": 3,
"total": 5,
"duration_seconds": 197,
"failed_assertions": [
{"assertion": "Verification loop includes all 3 gates", "reason": "Only mentions bun test/typecheck/build. No review-work or Cubic."}
]
}
},
{
"eval_name": "regex-fix-false-positive",
"with_skill": {
"pass_rate": 0.8,
"passed": 4,
"total": 5,
"duration_seconds": 570,
"failed_assertions": [
{"assertion": "Only modifies regex and adds tests — no unrelated changes", "reason": "Also proposes config schema change (exclude_patterns) and Go binary update — goes beyond minimal fix"}
]
},
"without_skill": {
"pass_rate": 0.6,
"passed": 3,
"total": 5,
"duration_seconds": 399,
"failed_assertions": [
{"assertion": "Plan uses git worktree in a sibling directory", "reason": "git checkout -b, no worktree"},
{"assertion": "Verification loop includes all 3 gates", "reason": "Only bun test and typecheck. No review-work or Cubic."}
]
}
}
],
"analyst_observations": [
"Three-gates assertion (CI + review-work + Cubic) is the strongest discriminator: 5/5 with-skill vs 0/5 without-skill. Without the skill, agents never know about Cubic or review-work gates.",
"Worktree isolation is nearly as discriminating (5/5 vs 1/5). One without-skill run (eval-4) independently chose worktree, suggesting some agents already know worktree patterns, but the skill makes it consistent.",
"The skill's only failure (eval-5 minimal-change) reveals a potential over-engineering tendency: the skill-guided agent proposed config schema changes and Go binary updates for what should have been a minimal regex fix. Consider adding explicit guidance for fix-type tasks to stay minimal.",
"Duration tradeoff: with-skill is 12% slower on average (340s vs 303s), driven mainly by eval-2 (bugfix) and eval-5 (regex fix) where the skill's thorough verification planning adds overhead. For eval-1 and eval-3-4, with-skill was actually faster.",
"Without-skill duration has lower variance (stddev 78s vs 169s), suggesting the skill introduces more variable execution paths depending on task complexity.",
"Non-discriminating assertions: 'References actual files', 'PR targets dev', 'Runs local checks' — these pass regardless of skill. They validate baseline agent competence, not skill value. Consider removing or downweighting in future iterations.",
"Atomic commits assertion discriminates moderately (2/2 with-skill tested vs 0/2 without-skill tested). Without the skill, agents default to single commits even for multi-file refactors."
]
}
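
The "summary" block can be cross-checked against the per-eval entries. The sketch below is not part of the benchmark tooling; it assumes the summary pass rate is pooled over all assertions (passed / total across evals) and that the duration stddev is the population standard deviation, since those two choices reproduce the reported figures. The names `WITH_SKILL`, `WITHOUT_SKILL`, `summarize`, and `delta` are hypothetical.

```python
from statistics import mean, pstdev

# (passed, total, duration_seconds) per eval, copied from the "evals" array.
WITH_SKILL = [(10, 10, 292), (6, 6, 506), (5, 5, 181), (5, 5, 152), (4, 5, 570)]
WITHOUT_SKILL = [(4, 10, 365), (4, 6, 325), (2, 5, 229), (3, 5, 197), (3, 5, 399)]


def summarize(runs):
    """Recompute one side of the summary from per-eval results."""
    passed = sum(p for p, _, _ in runs)
    total = sum(t for _, t, _ in runs)
    durations = [d for _, _, d in runs]
    return {
        # Assumption: pooled over assertions, not a mean of per-eval pass rates.
        "pass_rate": round(passed / total, 3),
        "mean_duration_seconds": round(mean(durations), 1),
        # Assumption: population stddev (pstdev), not sample stddev.
        "stddev_duration_seconds": round(pstdev(durations), 1),
    }


def delta(a, b):
    """Field-by-field difference a - b of two summaries."""
    return {key: round(a[key] - b[key], 3) for key in a}


with_skill = summarize(WITH_SKILL)
without_skill = summarize(WITHOUT_SKILL)
skill_delta = delta(with_skill, without_skill)
```

Under these assumptions, `with_skill`, `without_skill`, and `skill_delta` match the "with_skill", "without_skill", and "delta" entries in the summary. A mean of per-eval pass rates would instead give 0.96 for the with-skill side, which is why the pooled reading is assumed here.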