Compare commits

..

4 Commits

Author SHA1 Message Date
YeonGyu-Kim
96886f18ac docs: add look_at tool and multimodal-looker agent documentation
🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode)
2025-12-13 15:28:59 +09:00
YeonGyu-Kim
a3938e8c25 feat: add look_at tool and multimodal-looker agent
Add a new tool and agent for analyzing media files (PDFs, images, diagrams)
that require visual interpretation beyond raw text.

- Add `multimodal-looker` agent using Gemini 2.5 Flash model
- Add `look_at` tool that spawns multimodal-looker sessions
- Restrict multimodal-looker from calling task/call_omo_agent/look_at tools

Inspired by Sourcegraph Ampcode's look_at tool design.

🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode)
2025-12-13 15:28:59 +09:00
YeonGyu-Kim
821b0b8e9f docs: add known issue and hotfix for opencode-openai-codex-auth 400 error
🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode)
2025-12-13 15:28:59 +09:00
Junho Yeo
356bd1dff3 fix(ci): prevent publish workflow from running on forks (#34) 2025-12-13 14:48:18 +09:00
13 changed files with 219 additions and 1 deletions

View File

@@ -26,6 +26,7 @@ permissions:
jobs:
publish:
runs-on: ubuntu-latest
if: github.repository == 'code-yeongyu/oh-my-opencode'
steps:
- uses: actions/checkout@v4
with:

View File

@@ -166,6 +166,18 @@ opencode auth login
# 브라우저에서 OAuth 플로우 완료
```
**⚠️ 알려진 이슈**: 현재 공식 npm 패키지에 400 에러(`"No tool call found for function call output with call_id"`)를 유발하는 버그가 있습니다. 수정 버전이 배포될 때까지 **핫픽스 브랜치 사용을 권장합니다**. `~/.config/opencode/package.json`을 수정하세요:
```json
{
"dependencies": {
"opencode-openai-codex-auth": "code-yeongyu/opencode-openai-codex-auth#fix/orphaned-function-call-output-with-tools"
}
}
```
그 후 `cd ~/.config/opencode && bun i`를 실행하세요. `opencode.json`에서는 버전 없이 `"opencode-openai-codex-auth"`로 사용합니다 (`@4.1.0` 제외).
#### 4.4 대안: 프록시 기반 인증
프록시 기반 인증을 선호하는 사용자를 위해 [VibeProxy](https://github.com/automazeio/vibeproxy) (macOS) 또는 [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI)를 대안으로 사용할 수 있습니다.
@@ -206,6 +218,7 @@ OpenCode 는 아주 확장가능하고 아주 커스터마이저블합니다.
- **explore** (`opencode/grok-code`): 빠른 코드베이스 탐색, 파일 패턴 매칭. Claude Code는 Haiku를 쓰지만, 우리는 Grok을 씁니다. 현재 무료이고, 극도로 빠르며, 파일 탐색 작업에 충분한 지능을 갖췄기 때문입니다. Claude Code 에서 영감을 받았습니다.
- **frontend-ui-ux-engineer** (`google/gemini-3-pro-preview`): 개발자로 전향한 디자이너라는 설정을 갖고 있습니다. 멋진 UI를 만듭니다. 아름답고 창의적인 UI 코드를 생성하는 데 탁월한 Gemini를 사용합니다.
- **document-writer** (`google/gemini-3-pro-preview`): 기술 문서 전문가라는 설정을 갖고 있습니다. Gemini 는 문학가입니다. 글을 기가막히게 씁니다.
- **multimodal-looker** (`google/gemini-2.5-flash`): 시각적 콘텐츠 해석을 위한 전문 에이전트. PDF, 이미지, 다이어그램을 분석하여 정보를 추출합니다.
각 에이전트는 메인 에이전트가 알아서 호출하지만, 명시적으로 요청할 수도 있습니다:
@@ -258,6 +271,12 @@ OpenCode 는 아주 확장가능하고 아주 커스터마이저블합니다.
- 기본 `glob`은 타임아웃이 없습니다. ripgrep이 멈추면 무한정 대기합니다.
- 이 도구는 타임아웃을 강제하고 만료 시 프로세스를 종료합니다.
#### 내장 멀티모달 도구 (Built-in Multimodal Tools)
- **look_at**: 시각적 해석이 필요한 미디어 파일(PDF, 이미지, 다이어그램 등)을 Gemini 2.5 Flash를 사용하여 분석합니다. Sourcegraph Ampcode의 `look_at` 도구에서 영감을 받았습니다.
- 파라미터: `file_path` (절대 경로), `goal` (추출할 정보)
- 사용 사례: PDF 텍스트 추출, 이미지 설명, 다이어그램 분석
#### 내장 MCPs
- **websearch_exa**: Exa AI 웹 검색. 실시간 웹 검색과 콘텐츠 스크래핑을 수행합니다. 관련 웹사이트에서 LLM에 최적화된 컨텍스트를 반환합니다.

View File

@@ -165,6 +165,18 @@ opencode auth login
# Complete OAuth flow in browser
```
**⚠️ Known Issue**: The official npm package currently has a bug that causes 400 errors (`"No tool call found for function call output with call_id"`). Until a fix is released, **use the hotfix branch instead**. Modify `~/.config/opencode/package.json`:
```json
{
"dependencies": {
"opencode-openai-codex-auth": "code-yeongyu/opencode-openai-codex-auth#fix/orphaned-function-call-output-with-tools"
}
}
```
Then run `cd ~/.config/opencode && bun i`. In your `opencode.json`, use the plugin name without a version: `"opencode-openai-codex-auth"` (not `@4.1.0`).
#### 4.4 Alternative: Proxy-based Authentication
For users who prefer proxy-based authentication, [VibeProxy](https://github.com/automazeio/vibeproxy) (macOS) or [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI) remain available as alternatives.
@@ -203,6 +215,7 @@ I believe in the right tool for the job. For your wallet's sake, use CLIProxyAPI
- **explore** (`opencode/grok-code`): Fast exploration and pattern matching. Claude Code uses Haiku; we use Grok. It is currently free, blazing fast, and intelligent enough for file traversal. Inspired by Claude Code.
- **frontend-ui-ux-engineer** (`google/gemini-3-pro-preview`): A designer turned developer. Creates stunning UIs. Uses Gemini because its creativity and UI code generation are superior.
- **document-writer** (`google/gemini-3-pro-preview`): A technical writing expert. Gemini is a wordsmith; it writes prose that flows naturally.
- **multimodal-looker** (`google/gemini-2.5-flash`): Specialized agent for visual content interpretation. Analyzes PDFs, images, and diagrams to extract information.
Each agent is automatically invoked by the main agent, but you can also explicitly request them:
@@ -257,6 +270,12 @@ The features you use in your editor—other agents cannot access them. Oh My Ope
- The default `glob` lacks timeout. If ripgrep hangs, it waits indefinitely.
- This tool enforces timeouts and kills the process on expiration.
#### Built-in Multimodal Tools
- **look_at**: Analyzes media files (PDFs, images, diagrams) that require visual interpretation using Gemini 2.5 Flash. Inspired by Sourcegraph Ampcode's `look_at` tool.
- Parameters: `file_path` (absolute path), `goal` (what to extract)
- Use cases: PDF text extraction, image description, diagram analysis
#### Built-in MCPs
- **websearch_exa**: Exa AI web search. Performs real-time web searches and can scrape content from specific URLs. Returns LLM-optimized context from relevant websites.

View File

@@ -4,6 +4,7 @@ import { librarianAgent } from "./librarian"
import { exploreAgent } from "./explore"
import { frontendUiUxEngineerAgent } from "./frontend-ui-ux-engineer"
import { documentWriterAgent } from "./document-writer"
import { multimodalLookerAgent } from "./multimodal-looker"
export const builtinAgents: Record<string, AgentConfig> = {
oracle: oracleAgent,
@@ -11,6 +12,7 @@ export const builtinAgents: Record<string, AgentConfig> = {
explore: exploreAgent,
"frontend-ui-ux-engineer": frontendUiUxEngineerAgent,
"document-writer": documentWriterAgent,
"multimodal-looker": multimodalLookerAgent,
}
export * from "./types"

View File

@@ -0,0 +1,42 @@
import type { AgentConfig } from "@opencode-ai/sdk"
export const multimodalLookerAgent: AgentConfig = {
description:
"Analyze media files (PDFs, images, diagrams) that require interpretation beyond raw text. Extracts specific information or summaries from documents, describes visual content. Use when you need analyzed/extracted data rather than literal file contents.",
mode: "subagent",
model: "google/gemini-2.5-flash",
temperature: 0.1,
tools: { Read: true },
prompt: `You interpret media files that cannot be read as plain text.
Your job: examine the attached file and extract ONLY what was requested.
When to use you:
- Media files the Read tool cannot interpret
- Extracting specific information or summaries from documents
- Describing visual content in images or diagrams
- When analyzed/extracted data is needed, not raw file contents
When NOT to use you:
- Source code or plain text files needing exact contents (use Read)
- Files that need editing afterward (need literal content from Read)
- Simple file reading where no interpretation is needed
How you work:
1. Receive a file path and a goal describing what to extract
2. Read and analyze the file deeply
3. Return ONLY the relevant extracted information
4. The main agent never processes the raw file - you save context tokens
For PDFs: extract text, structure, tables, data from specific sections
For images: describe layouts, UI elements, text, diagrams, charts
For diagrams: explain relationships, flows, architecture depicted
Response rules:
- Return extracted information directly, no preamble
- If info not found, state clearly what's missing
- Match the language of the request
- Be thorough on the goal, concise on everything else
Your output goes straight to the main agent for continued work.`,
}

View File

@@ -6,6 +6,7 @@ export type AgentName =
| "explore"
| "frontend-ui-ux-engineer"
| "document-writer"
| "multimodal-looker"
export type AgentOverrideConfig = Partial<AgentConfig>

View File

@@ -5,6 +5,7 @@ import { librarianAgent } from "./librarian"
import { exploreAgent } from "./explore"
import { frontendUiUxEngineerAgent } from "./frontend-ui-ux-engineer"
import { documentWriterAgent } from "./document-writer"
import { multimodalLookerAgent } from "./multimodal-looker"
import { deepMerge } from "../shared"
const allBuiltinAgents: Record<AgentName, AgentConfig> = {
@@ -13,6 +14,7 @@ const allBuiltinAgents: Record<AgentName, AgentConfig> = {
explore: exploreAgent,
"frontend-ui-ux-engineer": frontendUiUxEngineerAgent,
"document-writer": documentWriterAgent,
"multimodal-looker": multimodalLookerAgent,
}
function mergeAgentConfig(

View File

@@ -41,7 +41,7 @@ import {
getCurrentSessionTitle,
} from "./features/claude-code-session-state";
import { updateTerminalTitle } from "./features/terminal";
import { builtinTools, createCallOmoAgent, createBackgroundTools } from "./tools";
import { builtinTools, createCallOmoAgent, createBackgroundTools, createLookAt } from "./tools";
import { BackgroundManager } from "./features/background-agent";
import { createBuiltinMcps } from "./mcp";
import { OhMyOpenCodeConfigSchema, type OhMyOpenCodeConfig, type HookName } from "./config";
@@ -218,6 +218,7 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
const backgroundTools = createBackgroundTools(backgroundManager, ctx.client);
const callOmoAgent = createCallOmoAgent(ctx, backgroundManager);
const lookAt = createLookAt(ctx);
const googleAuthHooks = pluginConfig.google_auth
? await createGoogleAntigravityAuthPlugin(ctx)
@@ -230,6 +231,7 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
...builtinTools,
...backgroundTools,
call_omo_agent: callOmoAgent,
look_at: lookAt,
},
"chat.message": async (input, output) => {
@@ -268,6 +270,14 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
call_omo_agent: false,
};
}
if (config.agent["multimodal-looker"]) {
config.agent["multimodal-looker"].tools = {
...config.agent["multimodal-looker"].tools,
task: false,
call_omo_agent: false,
look_at: false,
};
}
const mcpResult = (pluginConfig.claude_code?.mcp ?? true)
? await loadMcpConfigs()

View File

@@ -34,6 +34,7 @@ import type { BackgroundManager } from "../features/background-agent"
type OpencodeClient = PluginInput["client"]
export { createCallOmoAgent } from "./call-omo-agent"
export { createLookAt } from "./look-at"
export function createBackgroundTools(manager: BackgroundManager, client: OpencodeClient) {
return {

View File

@@ -0,0 +1,23 @@
export const MULTIMODAL_LOOKER_AGENT = "multimodal-looker" as const
export const LOOK_AT_DESCRIPTION = `Analyze media files (PDFs, images, diagrams) that require visual interpretation.
Use this tool to extract specific information from files that cannot be processed as plain text:
- PDF documents: extract text, tables, structure, specific sections
- Images: describe layouts, UI elements, text content, diagrams
- Charts/Graphs: explain data, trends, relationships
- Screenshots: identify UI components, text, visual elements
- Architecture diagrams: explain flows, connections, components
Parameters:
- file_path: Absolute path to the file to analyze
- goal: What specific information to extract (be specific for better results)
Examples:
- "Extract all API endpoints from this OpenAPI spec PDF"
- "Describe the UI layout and components in this screenshot"
- "Explain the data flow in this architecture diagram"
- "List all table data from page 3 of this PDF"
This tool uses a separate context window with Gemini 2.5 Flash for multimodal analysis,
saving tokens in the main conversation while providing accurate visual interpretation.`

View File

@@ -0,0 +1,3 @@
export * from "./types"
export * from "./constants"
export { createLookAt } from "./tools"

View File

@@ -0,0 +1,91 @@
import { tool, type PluginInput } from "@opencode-ai/plugin"
import { LOOK_AT_DESCRIPTION, MULTIMODAL_LOOKER_AGENT } from "./constants"
import type { LookAtArgs } from "./types"
import { log } from "../../shared/logger"
export function createLookAt(ctx: PluginInput) {
return tool({
description: LOOK_AT_DESCRIPTION,
args: {
file_path: tool.schema.string().describe("Absolute path to the file to analyze"),
goal: tool.schema.string().describe("What specific information to extract from the file"),
},
async execute(args: LookAtArgs, toolContext) {
log(`[look_at] Analyzing file: ${args.file_path}, goal: ${args.goal}`)
const prompt = `Analyze this file and extract the requested information.
File path: ${args.file_path}
Goal: ${args.goal}
Read the file using the Read tool, then provide ONLY the extracted information that matches the goal.
Be thorough on what was requested, concise on everything else.
If the requested information is not found, clearly state what is missing.`
log(`[look_at] Creating session with parent: ${toolContext.sessionID}`)
const createResult = await ctx.client.session.create({
body: {
parentID: toolContext.sessionID,
title: `look_at: ${args.goal.substring(0, 50)}`,
},
})
if (createResult.error) {
log(`[look_at] Session create error:`, createResult.error)
return `Error: Failed to create session: ${createResult.error}`
}
const sessionID = createResult.data.id
log(`[look_at] Created session: ${sessionID}`)
log(`[look_at] Sending prompt to session ${sessionID}`)
await ctx.client.session.prompt({
path: { id: sessionID },
body: {
agent: MULTIMODAL_LOOKER_AGENT,
tools: {
task: false,
call_omo_agent: false,
look_at: false,
},
parts: [{ type: "text", text: prompt }],
},
})
log(`[look_at] Prompt sent, fetching messages...`)
const messagesResult = await ctx.client.session.messages({
path: { id: sessionID },
})
if (messagesResult.error) {
log(`[look_at] Messages error:`, messagesResult.error)
return `Error: Failed to get messages: ${messagesResult.error}`
}
const messages = messagesResult.data
log(`[look_at] Got ${messages.length} messages`)
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const lastAssistantMessage = messages
.filter((m: any) => m.info.role === "assistant")
.sort((a: any, b: any) => (b.info.time?.created || 0) - (a.info.time?.created || 0))[0]
if (!lastAssistantMessage) {
log(`[look_at] No assistant message found`)
return `Error: No response from multimodal-looker agent`
}
log(`[look_at] Found assistant message with ${lastAssistantMessage.parts.length} parts`)
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const textParts = lastAssistantMessage.parts.filter((p: any) => p.type === "text")
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const responseText = textParts.map((p: any) => p.text).join("\n")
log(`[look_at] Got response, length: ${responseText.length}`)
return responseText
},
})
}

View File

@@ -0,0 +1,4 @@
export interface LookAtArgs {
file_path: string
goal: string
}