docs: add look_at tool and multimodal-looker agent documentation

🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode)
feat: add look_at tool and multimodal-looker agent
13 changed files with 219 additions and 1 deletion
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -26,6 +26,7 @@ permissions:
 jobs:
  publish:
    runs-on: ubuntu-latest
+    if: github.repository == 'code-yeongyu/oh-my-opencode'
    steps:
      - uses: actions/checkout@v4
        with:
--- a/README.ko.md
+++ b/README.ko.md
@@ -166,6 +166,18 @@ opencode auth login
 # 브라우저에서 OAuth 플로우 완료
 ```

+**⚠️ 알려진 이슈**: 현재 공식 npm 패키지에 400 에러(`"No tool call found for function call output with call_id"`)를 유발하는 버그가 있습니다. 수정 버전이 배포될 때까지 **핫픽스 브랜치 사용을 권장합니다**. `~/.config/opencode/package.json`을 수정하세요:
+
+```json
+{
+  "dependencies": {
+    "opencode-openai-codex-auth": "code-yeongyu/opencode-openai-codex-auth#fix/orphaned-function-call-output-with-tools"
+  }
+}
+```
+
+그 후 `cd ~/.config/opencode && bun i`를 실행하세요. `opencode.json`에서는 버전 없이 `"opencode-openai-codex-auth"`로 사용합니다 (`@4.1.0` 제외).
+
 #### 4.4 대안: 프록시 기반 인증

 프록시 기반 인증을 선호하는 사용자를 위해 [VibeProxy](https://github.com/automazeio/vibeproxy) (macOS) 또는 [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI)를 대안으로 사용할 수 있습니다.
@@ -206,6 +218,7 @@ OpenCode 는 아주 확장가능하고 아주 커스터마이저블합니다.
 - **explore** (`opencode/grok-code`): 빠른 코드베이스 탐색, 파일 패턴 매칭. Claude Code는 Haiku를 쓰지만, 우리는 Grok을 씁니다. 현재 무료이고, 극도로 빠르며, 파일 탐색 작업에 충분한 지능을 갖췄기 때문입니다. Claude Code 에서 영감을 받았습니다.
 - **frontend-ui-ux-engineer** (`google/gemini-3-pro-preview`): 개발자로 전향한 디자이너라는 설정을 갖고 있습니다. 멋진 UI를 만듭니다. 아름답고 창의적인 UI 코드를 생성하는 데 탁월한 Gemini를 사용합니다.
 - **document-writer** (`google/gemini-3-pro-preview`): 기술 문서 전문가라는 설정을 갖고 있습니다. Gemini 는 문학가입니다. 글을 기가막히게 씁니다.
+- **multimodal-looker** (`google/gemini-2.5-flash`): 시각적 콘텐츠 해석을 위한 전문 에이전트. PDF, 이미지, 다이어그램을 분석하여 정보를 추출합니다.

 각 에이전트는 메인 에이전트가 알아서 호출하지만, 명시적으로 요청할 수도 있습니다:

@@ -258,6 +271,12 @@ OpenCode 는 아주 확장가능하고 아주 커스터마이저블합니다.
  - 기본 `glob`은 타임아웃이 없습니다. ripgrep이 멈추면 무한정 대기합니다.
  - 이 도구는 타임아웃을 강제하고 만료 시 프로세스를 종료합니다.

+#### 내장 멀티모달 도구 (Built-in Multimodal Tools)
+
+- **look_at**: 시각적 해석이 필요한 미디어 파일(PDF, 이미지, 다이어그램 등)을 Gemini 2.5 Flash를 사용하여 분석합니다. Sourcegraph Ampcode의 `look_at` 도구에서 영감을 받았습니다.
+  - 파라미터: `file_path` (절대 경로), `goal` (추출할 정보)
+  - 사용 사례: PDF 텍스트 추출, 이미지 설명, 다이어그램 분석
+
 #### 내장 MCPs

 - **websearch_exa**: Exa AI 웹 검색. 실시간 웹 검색과 콘텐츠 스크래핑을 수행합니다. 관련 웹사이트에서 LLM에 최적화된 컨텍스트를 반환합니다.
--- a/README.md
+++ b/README.md
@@ -165,6 +165,18 @@ opencode auth login
 # Complete OAuth flow in browser
 ```

+**⚠️ Known Issue**: The official npm package currently has a bug that causes 400 errors (`"No tool call found for function call output with call_id"`). Until a fix is released, **use the hotfix branch instead**. Modify `~/.config/opencode/package.json`:
+
+```json
+{
+  "dependencies": {
+    "opencode-openai-codex-auth": "code-yeongyu/opencode-openai-codex-auth#fix/orphaned-function-call-output-with-tools"
+  }
+}
+```
+
+Then run `cd ~/.config/opencode && bun i`. In your `opencode.json`, use the plugin name without a version: `"opencode-openai-codex-auth"` (not `@4.1.0`).
+
 #### 4.4 Alternative: Proxy-based Authentication

 For users who prefer proxy-based authentication, [VibeProxy](https://github.com/automazeio/vibeproxy) (macOS) or [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI) remain available as alternatives.
@@ -203,6 +215,7 @@ I believe in the right tool for the job. For your wallet's sake, use CLIProxyAPI
 - **explore** (`opencode/grok-code`): Fast exploration and pattern matching. Claude Code uses Haiku; we use Grok. It is currently free, blazing fast, and intelligent enough for file traversal. Inspired by Claude Code.
 - **frontend-ui-ux-engineer** (`google/gemini-3-pro-preview`): A designer turned developer. Creates stunning UIs. Uses Gemini because its creativity and UI code generation are superior.
 - **document-writer** (`google/gemini-3-pro-preview`): A technical writing expert. Gemini is a wordsmith; it writes prose that flows naturally.
+- **multimodal-looker** (`google/gemini-2.5-flash`): Specialized agent for visual content interpretation. Analyzes PDFs, images, and diagrams to extract information.

 Each agent is automatically invoked by the main agent, but you can also explicitly request them:

@@ -257,6 +270,12 @@ The features you use in your editor—other agents cannot access them. Oh My Ope
  - The default `glob` lacks timeout. If ripgrep hangs, it waits indefinitely.
  - This tool enforces timeouts and kills the process on expiration.

+#### Built-in Multimodal Tools
+
+- **look_at**: Uses Gemini 2.5 Flash to analyze media files (PDFs, images, diagrams) that require visual interpretation. Inspired by Sourcegraph Ampcode's `look_at` tool.
+  - Parameters: `file_path` (absolute path), `goal` (what to extract)
+  - Use cases: PDF text extraction, image description, diagram analysis
+
 #### Built-in MCPs

 - **websearch_exa**: Exa AI web search. Performs real-time web searches and can scrape content from specific URLs. Returns LLM-optimized context from relevant websites.
--- a/src/agents/index.ts
+++ b/src/agents/index.ts
@@ -4,6 +4,7 @@ import { librarianAgent } from "./librarian"
 import { exploreAgent } from "./explore"
 import { frontendUiUxEngineerAgent } from "./frontend-ui-ux-engineer"
 import { documentWriterAgent } from "./document-writer"
+import { multimodalLookerAgent } from "./multimodal-looker"

 export const builtinAgents: Record<string, AgentConfig> = {
  oracle: oracleAgent,
@@ -11,6 +12,7 @@ export const builtinAgents: Record<string, AgentConfig> = {
  explore: exploreAgent,
  "frontend-ui-ux-engineer": frontendUiUxEngineerAgent,
  "document-writer": documentWriterAgent,
+  "multimodal-looker": multimodalLookerAgent,
 }

 export * from "./types"
--- a/src/agents/multimodal-looker.ts
+++ b/src/agents/multimodal-looker.ts
@@ -0,0 +1,42 @@
+import type { AgentConfig } from "@opencode-ai/sdk"
+
+export const multimodalLookerAgent: AgentConfig = {
+  description:
+    "Analyze media files (PDFs, images, diagrams) that require interpretation beyond raw text. Extracts specific information or summaries from documents, describes visual content. Use when you need analyzed/extracted data rather than literal file contents.",
+  mode: "subagent",
+  model: "google/gemini-2.5-flash",
+  temperature: 0.1,
+  tools: { Read: true },
+  prompt: `You interpret media files that cannot be read as plain text.
+
+Your job: examine the attached file and extract ONLY what was requested.
+
+When to use you:
+- Media files the Read tool cannot interpret
+- Extracting specific information or summaries from documents
+- Describing visual content in images or diagrams
+- When analyzed/extracted data is needed, not raw file contents
+
+When NOT to use you:
+- Source code or plain text files needing exact contents (use Read)
+- Files that need editing afterward (need literal content from Read)
+- Simple file reading where no interpretation is needed
+
+How you work:
+1. Receive a file path and a goal describing what to extract
+2. Read and analyze the file deeply
+3. Return ONLY the relevant extracted information
+4. The main agent never processes the raw file - you save context tokens
+
+For PDFs: extract text, structure, tables, data from specific sections
+For images: describe layouts, UI elements, text, diagrams, charts
+For diagrams: explain relationships, flows, architecture depicted
+
+Response rules:
+- Return extracted information directly, no preamble
+- If info not found, state clearly what's missing
+- Match the language of the request
+- Be thorough on the goal, concise on everything else
+
+Your output goes straight to the main agent for continued work.`,
+}
--- a/src/agents/types.ts
+++ b/src/agents/types.ts
@@ -6,6 +6,7 @@ export type AgentName =
  | "explore"
  | "frontend-ui-ux-engineer"
  | "document-writer"
+  | "multimodal-looker"

 export type AgentOverrideConfig = Partial<AgentConfig>

--- a/src/agents/utils.ts
+++ b/src/agents/utils.ts
@@ -5,6 +5,7 @@ import { librarianAgent } from "./librarian"
 import { exploreAgent } from "./explore"
 import { frontendUiUxEngineerAgent } from "./frontend-ui-ux-engineer"
 import { documentWriterAgent } from "./document-writer"
+import { multimodalLookerAgent } from "./multimodal-looker"
 import { deepMerge } from "../shared"

 const allBuiltinAgents: Record<AgentName, AgentConfig> = {
@@ -13,6 +14,7 @@ const allBuiltinAgents: Record<AgentName, AgentConfig> = {
  explore: exploreAgent,
  "frontend-ui-ux-engineer": frontendUiUxEngineerAgent,
  "document-writer": documentWriterAgent,
+  "multimodal-looker": multimodalLookerAgent,
 }

 function mergeAgentConfig(
--- a/src/index.ts
+++ b/src/index.ts
@@ -41,7 +41,7 @@ import {
  getCurrentSessionTitle,
 } from "./features/claude-code-session-state";
 import { updateTerminalTitle } from "./features/terminal";
-import { builtinTools, createCallOmoAgent, createBackgroundTools } from "./tools";
+import { builtinTools, createCallOmoAgent, createBackgroundTools, createLookAt } from "./tools";
 import { BackgroundManager } from "./features/background-agent";
 import { createBuiltinMcps } from "./mcp";
 import { OhMyOpenCodeConfigSchema, type OhMyOpenCodeConfig, type HookName } from "./config";
@@ -218,6 +218,7 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
  const backgroundTools = createBackgroundTools(backgroundManager, ctx.client);

  const callOmoAgent = createCallOmoAgent(ctx, backgroundManager);
+  const lookAt = createLookAt(ctx);

  const googleAuthHooks = pluginConfig.google_auth
    ? await createGoogleAntigravityAuthPlugin(ctx)
@@ -230,6 +231,7 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
      ...builtinTools,
      ...backgroundTools,
      call_omo_agent: callOmoAgent,
+      look_at: lookAt,
    },

    "chat.message": async (input, output) => {
@@ -268,6 +270,14 @@ const OhMyOpenCodePlugin: Plugin = async (ctx) => {
          call_omo_agent: false,
        };
      }
+      if (config.agent["multimodal-looker"]) {
+        config.agent["multimodal-looker"].tools = {
+          ...config.agent["multimodal-looker"].tools,
+          task: false,
+          call_omo_agent: false,
+          look_at: false,
+        };
+      }

      const mcpResult = (pluginConfig.claude_code?.mcp ?? true)
        ? await loadMcpConfigs()
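
Review note: the hunk above disables `task`, `call_omo_agent`, and `look_at` for the `multimodal-looker` agent so the subagent cannot spawn further agents or call itself. The sketch below isolates that spread-then-override pattern. The `AgentEntry` shape and the `restrictRecursiveTools` helper are hypothetical simplifications for illustration, not part of the plugin or of `@opencode-ai/sdk`.

```typescript
// Hypothetical, simplified shapes; the real AgentConfig comes from @opencode-ai/sdk.
type ToolToggles = Record<string, boolean>;

interface AgentEntry {
  tools?: ToolToggles;
}

// The three tool names disabled for multimodal-looker in the diff.
const RECURSIVE_TOOLS = ["task", "call_omo_agent", "look_at"] as const;

function restrictRecursiveTools(agent: AgentEntry): AgentEntry {
  const disabled: ToolToggles = {};
  for (const name of RECURSIVE_TOOLS) {
    disabled[name] = false;
  }
  // Spread order matters: the overrides come last, so they win over any
  // value the user configured for the same key.
  return { ...agent, tools: { ...agent.tools, ...disabled } };
}

const userConfigured: AgentEntry = { tools: { Read: true, look_at: true } };
const restricted = restrictRecursiveTools(userConfigured);
console.log(restricted.tools);
```

Because the overrides are spread last, a user config that sets `look_at: true` for this agent still ends up with `look_at: false`, while unrelated keys such as `Read` are left alone.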
--- a/src/tools/index.ts
+++ b/src/tools/index.ts
@@ -34,6 +34,7 @@ import type { BackgroundManager } from "../features/background-agent"
 type OpencodeClient = PluginInput["client"]

 export { createCallOmoAgent } from "./call-omo-agent"
+export { createLookAt } from "./look-at"

 export function createBackgroundTools(manager: BackgroundManager, client: OpencodeClient) {
  return {
--- a/src/tools/look-at/constants.ts
+++ b/src/tools/look-at/constants.ts
@@ -0,0 +1,23 @@
+export const MULTIMODAL_LOOKER_AGENT = "multimodal-looker" as const
+
+export const LOOK_AT_DESCRIPTION = `Analyze media files (PDFs, images, diagrams) that require visual interpretation.
+
+Use this tool to extract specific information from files that cannot be processed as plain text:
+- PDF documents: extract text, tables, structure, specific sections
+- Images: describe layouts, UI elements, text content, diagrams
+- Charts/Graphs: explain data, trends, relationships
+- Screenshots: identify UI components, text, visual elements
+- Architecture diagrams: explain flows, connections, components
+
+Parameters:
+- file_path: Absolute path to the file to analyze
+- goal: What specific information to extract (be specific for better results)
+
+Examples:
+- "Extract all API endpoints from this OpenAPI spec PDF"
+- "Describe the UI layout and components in this screenshot"
+- "Explain the data flow in this architecture diagram"
+- "List all table data from page 3 of this PDF"
+
+This tool uses a separate context window with Gemini 2.5 Flash for multimodal analysis,
+saving tokens in the main conversation while providing accurate visual interpretation.`
--- a/src/tools/look-at/index.ts
+++ b/src/tools/look-at/index.ts
@@ -0,0 +1,3 @@
+export * from "./types"
+export * from "./constants"
+export { createLookAt } from "./tools"
--- a/src/tools/look-at/tools.ts
+++ b/src/tools/look-at/tools.ts
@@ -0,0 +1,91 @@
+import { tool, type PluginInput } from "@opencode-ai/plugin"
+import { LOOK_AT_DESCRIPTION, MULTIMODAL_LOOKER_AGENT } from "./constants"
+import type { LookAtArgs } from "./types"
+import { log } from "../../shared/logger"
+
+export function createLookAt(ctx: PluginInput) {
+  return tool({
+    description: LOOK_AT_DESCRIPTION,
+    args: {
+      file_path: tool.schema.string().describe("Absolute path to the file to analyze"),
+      goal: tool.schema.string().describe("What specific information to extract from the file"),
+    },
+    async execute(args: LookAtArgs, toolContext) {
+      log(`[look_at] Analyzing file: ${args.file_path}, goal: ${args.goal}`)
+
+      const prompt = `Analyze this file and extract the requested information.
+
+File path: ${args.file_path}
+Goal: ${args.goal}
+
+Read the file using the Read tool, then provide ONLY the extracted information that matches the goal.
+Be thorough on what was requested, concise on everything else.
+If the requested information is not found, clearly state what is missing.`
+
+      log(`[look_at] Creating session with parent: ${toolContext.sessionID}`)
+      const createResult = await ctx.client.session.create({
+        body: {
+          parentID: toolContext.sessionID,
+          title: `look_at: ${args.goal.substring(0, 50)}`,
+        },
+      })
+
+      if (createResult.error) {
+        log(`[look_at] Session create error:`, createResult.error)
+        return `Error: Failed to create session: ${JSON.stringify(createResult.error)}`
+      }
+
+      const sessionID = createResult.data.id
+      log(`[look_at] Created session: ${sessionID}`)
+
+      log(`[look_at] Sending prompt to session ${sessionID}`)
+      await ctx.client.session.prompt({
+        path: { id: sessionID },
+        body: {
+          agent: MULTIMODAL_LOOKER_AGENT,
+          tools: {
+            task: false,
+            call_omo_agent: false,
+            look_at: false,
+          },
+          parts: [{ type: "text", text: prompt }],
+        },
+      })
+
+      log(`[look_at] Prompt sent, fetching messages...`)
+
+      const messagesResult = await ctx.client.session.messages({
+        path: { id: sessionID },
+      })
+
+      if (messagesResult.error) {
+        log(`[look_at] Messages error:`, messagesResult.error)
+        return `Error: Failed to get messages: ${JSON.stringify(messagesResult.error)}`
+      }
+
+      const messages = messagesResult.data
+      log(`[look_at] Got ${messages.length} messages`)
+
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      const lastAssistantMessage = messages
+        .filter((m: any) => m.info.role === "assistant")
+        .sort((a: any, b: any) => (b.info.time?.created || 0) - (a.info.time?.created || 0))[0]
+
+      if (!lastAssistantMessage) {
+        log(`[look_at] No assistant message found`)
+        return `Error: No response from multimodal-looker agent`
+      }
+
+      log(`[look_at] Found assistant message with ${lastAssistantMessage.parts.length} parts`)
+
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      const textParts = lastAssistantMessage.parts.filter((p: any) => p.type === "text")
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      const responseText = textParts.map((p: any) => p.text).join("\n")
+
+      log(`[look_at] Got response, length: ${responseText.length}`)
+
+      return responseText
+    },
+  })
+}
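
Review note: the tail of `execute` picks the most recently created assistant message and joins its text parts with newlines. A typed sketch of that selection step follows. The `Part` and `SessionMessage` interfaces and the `latestAssistantText` helper are hypothetical stand-ins that mirror only the fields the diff reads (`info.role`, `info.time.created`, `parts[].type`, `parts[].text`); they are not the SDK's actual types.

```typescript
// Hypothetical, simplified message shapes for illustration only.
interface Part {
  type: string;
  text?: string;
}

interface SessionMessage {
  info: { role: "user" | "assistant"; time?: { created?: number } };
  parts: Part[];
}

function latestAssistantText(messages: SessionMessage[]): string | undefined {
  // filter() returns a new array, so sorting it does not reorder the input.
  const latest = messages
    .filter((m) => m.info.role === "assistant")
    .sort((a, b) => (b.info.time?.created ?? 0) - (a.info.time?.created ?? 0))[0];

  if (!latest) return undefined;

  // Keep only text parts; tool calls and other part types are dropped.
  return latest.parts
    .filter((p) => p.type === "text")
    .map((p) => p.text ?? "")
    .join("\n");
}

const sample: SessionMessage[] = [
  { info: { role: "user", time: { created: 1 } }, parts: [{ type: "text", text: "goal" }] },
  { info: { role: "assistant", time: { created: 2 } }, parts: [{ type: "text", text: "draft" }] },
  {
    info: { role: "assistant", time: { created: 3 } },
    parts: [{ type: "tool" }, { type: "text", text: "line 1" }, { type: "text", text: "line 2" }],
  },
];

console.log(latestAssistantText(sample));
```

Typing the messages this way would also remove the need for the `any` annotations and their `eslint-disable-next-line` comments, assuming the SDK's real message type exposes the same fields.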
--- a/src/tools/look-at/types.ts
+++ b/src/tools/look-at/types.ts
@@ -0,0 +1,4 @@
+export interface LookAtArgs {
+  file_path: string
+  goal: string
+}
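
Review note: the tool description asks for an absolute `file_path`, but nothing in the diff enforces it. One possible guard is sketched below, assuming a Node-compatible runtime where `node:path` is available. `validateLookAtArgs` is a hypothetical helper, not something this commit adds; `LookAtArgs` is repeated from the diff so the snippet is self-contained.

```typescript
import { isAbsolute } from "node:path";

// Same shape as LookAtArgs in src/tools/look-at/types.ts.
interface LookAtArgs {
  file_path: string;
  goal: string;
}

// Hypothetical guard: returns an error string, or null when the args look usable.
function validateLookAtArgs(args: LookAtArgs): string | null {
  if (!isAbsolute(args.file_path)) {
    return `Error: file_path must be an absolute path, got: ${args.file_path}`;
  }
  if (args.goal.trim().length === 0) {
    return "Error: goal must describe what to extract";
  }
  return null;
}

console.log(validateLookAtArgs({ file_path: "/tmp/spec.pdf", goal: "List all API endpoints" }));
console.log(validateLookAtArgs({ file_path: "docs/spec.pdf", goal: "List all API endpoints" }));
```

Returning a string mirrors how `execute` already reports failures (`Error: ...`), so a check like this could run before the session is created without changing the tool's return type.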
| Author | SHA1 | Message | Date |
|---|---|---|---|
| YeonGyu-Kim | 96886f18ac | docs: add look_at tool and multimodal-looker agent documentation 🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode) | 2025-12-13 15:28:59 +09:00 |
| YeonGyu-Kim | a3938e8c25 | feat: add look_at tool and multimodal-looker agent. Add a new tool and agent for analyzing media files (PDFs, images, diagrams) that require visual interpretation beyond raw text. Add `multimodal-looker` agent using Gemini 2.5 Flash model; add `look_at` tool that spawns multimodal-looker sessions; restrict multimodal-looker from calling task/call_omo_agent/look_at tools. Inspired by Sourcegraph Ampcode's look_at tool design. 🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode) | 2025-12-13 15:28:59 +09:00 |
| YeonGyu-Kim | 821b0b8e9f | docs: add known issue and hotfix for opencode-openai-codex-auth 400 error 🤖 GENERATED WITH ASSISTANCE OF [OhMyOpenCode](https://github.com/code-yeongyu/oh-my-opencode) | 2025-12-13 15:28:59 +09:00 |
| Junho Yeo | 356bd1dff3 | fix(ci): prevent publish workflow from running on forks (#34) | 2025-12-13 14:48:18 +09:00 |