Image-to-Text tool (#3562622) · Issues · project / ai

Image-to-Text tool

>>> [!note] Migrated issue   Reported by: [marcus_johansson](https://www.drupal.org/user/385947) Related to !1078 >>> <p>[Tracker]<br> <strong>Update Summary: </strong>Introduce a unified Image-to-Text tool definition with required provider/model and support for prompt-based extraction instructions.<br> <strong>Short Description: </strong>[One-line issue summary for stakeholders]<br> <strong>Check-in Date: </strong>MM/DD/YYYY<br> <em>Metadata is used by the <a href="https://www.drupalstarforge.ai/" title="AI Tracker">AI Tracker</a>. Docs and additional fields <a href="https://www.drupalstarforge.ai/ai-dashboard/docs" title="AI Issue Tracker Documentation">here</a>.</em><br> [/Tracker]</p> <h3 id="summary-problem-motivation">Problem/Motivation</h3> <p> The AI module needs <a href="https://www.drupal.org/project/tool">Tool API</a> tools for each operation it exposes. Drupal AI does not yet define a dedicated Image-to-Text tool for extracting structured or unstructured text from images.<br> Providers implement OCR, captioning, and visual question answering differently, so UIs and AI Agents have no consistent interface to build against.<br> A standard ImageToText tool is needed so that modules can pass an image, together with an optional prompt explaining what information the model should extract, and reliably receive text output in a shared schema. 
</p> <h3 id="summary-proposed-resolution">Proposed resolution</h3> <ul> <li>Create a new <code>ImageToText</code> tool type within the Drupal AI Tool API.</li> <li>Define required inputs: <ul> <li><code>provider</code> — backend used for text extraction.</li> <li><code>model</code> — the specific OCR/captioning/VQA model.</li> <li><code>image</code> — binary image input.</li> </ul> </li> <li>Add optional inputs: <ul> <li><code>prompt</code> — instruction describing what text or information to extract (e.g., “read the handwritten section,” “describe the scene,” “list all visible numbers,” “extract table contents”).</li> <li><code>options</code> — provider-specific parameters.</li> </ul> </li> <li>Define output format: <ul> <li><code>text</code> — extracted or generated text as a UTF-8 string.</li> </ul> </li> <li>Ensure consistent handling so UIs and Agents can build flows for OCR, captioning, and visual Q&A across all providers.</li> </ul>
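
The input/output contract proposed above can be sketched in a language-neutral way. The following Python is only an illustration of the schema, not Drupal or Tool API code: the class names `ImageToTextInput` and `ImageToTextOutput`, the `run_image_to_text` helper, and the `backends` registry are all hypothetical, and the real tool would be implemented as a Tool API plugin in PHP.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ImageToTextInput:
    """Hypothetical mirror of the proposed tool inputs."""
    provider: str                   # required: backend used for text extraction
    model: str                      # required: specific OCR/captioning/VQA model
    image: bytes                    # required: binary image input
    prompt: Optional[str] = None    # optional: what to extract or describe
    options: Dict[str, Any] = field(default_factory=dict)  # provider-specific

    def __post_init__(self) -> None:
        # Reject missing required inputs before any provider is called.
        if not self.provider:
            raise ValueError("provider is required")
        if not self.model:
            raise ValueError("model is required")
        if not isinstance(self.image, (bytes, bytearray)) or not self.image:
            raise ValueError("image must be non-empty binary data")


@dataclass(frozen=True)
class ImageToTextOutput:
    """Hypothetical mirror of the proposed tool output."""
    text: str  # extracted or generated text


# Assumption: each provider is modelled as a callable returning str or bytes.
Backend = Callable[[ImageToTextInput], Any]


def run_image_to_text(
    request: ImageToTextInput, backends: Dict[str, Backend]
) -> ImageToTextOutput:
    """Dispatch to the named provider and normalise its result to text."""
    backend = backends.get(request.provider)
    if backend is None:
        raise LookupError(f"unknown provider: {request.provider}")
    raw = backend(request)
    if isinstance(raw, (bytes, bytearray)):
        raw = bytes(raw).decode("utf-8")  # output is defined as a UTF-8 string
    return ImageToTextOutput(text=str(raw))


if __name__ == "__main__":
    # Stand-in backend for demonstration; a real one would call a provider.
    def fake_backend(req: ImageToTextInput) -> str:
        return f"{req.model}: {req.prompt or 'describe the image'}"

    request = ImageToTextInput(
        provider="demo",
        model="demo-ocr",
        image=b"\x89PNG...",
        prompt="list all visible numbers",
    )
    print(run_image_to_text(request, {"demo": fake_backend}).text)
```

Keeping `prompt` and `options` optional with neutral defaults means a caller that only wants plain OCR supplies the three required inputs, while captioning or visual Q&A flows add a prompt without changing the output shape.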
