Image-to-Text tool
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3562622. -->
Reported by: [marcus_johansson](https://www.drupal.org/user/385947)

Related to !1078
>>>

<p>[Tracker]<br> <strong>Update Summary: </strong>Introduce a unified Image-to-Text tool definition with required provider/model and support for prompt-based extraction instructions.<br> <strong>Short Description: </strong>[One-line issue summary for stakeholders]<br> <strong>Check-in Date: </strong>MM/DD/YYYY<br> <em>Metadata is used by the <a href="https://www.drupalstarforge.ai/" title="AI Tracker">AI Tracker</a>. Docs and additional fields <a href="https://www.drupalstarforge.ai/ai-dashboard/docs" title="AI Issue Tracker Documentation">here</a>.</em><br> [/Tracker]</p>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p> The AI module needs <a href="https://www.drupal.org/project/tool">Tool API</a> tools for each operation it exposes. Drupal AI does not yet define a dedicated Image-to-Text tool for extracting structured or unstructured text from images.<br> Providers implement OCR, captioning, and visual question answering differently, so UIs and AI agents have no consistent interface to build on.<br> A standard <code>ImageToText</code> tool would let modules pass an image, together with an optional prompt describing what information the model should extract, and reliably receive text output in a shared schema.
</p> <h3 id="summary-proposed-resolution">Proposed resolution</h3> <ul> <li>Create a new <code>ImageToText</code> tool type within the Drupal AI Tool API.</li> <li>Define required inputs: <ul> <li><code>provider</code> &mdash; backend used for text extraction.</li> <li><code>model</code> &mdash; the specific OCR/captioning/VQA model.</li> <li><code>image</code> &mdash; binary image input.</li> </ul> </li> <li>Add optional inputs: <ul> <li><code>prompt</code> &mdash; instruction describing what text or information to extract (e.g., &ldquo;read the handwritten section,&rdquo; &ldquo;describe the scene,&rdquo; &ldquo;list all visible numbers,&rdquo; &ldquo;extract table contents&rdquo;).</li> <li><code>options</code> &mdash; provider-specific parameters.</li> </ul> </li> <li>Define output format: <ul> <li><code>text</code> &mdash; extracted or generated text as a UTF-8 string.</li> </ul> </li> <li>Ensure consistent handling so UIs and Agents can build flows for OCR, captioning, and visual Q&amp;A across all providers.</li> </ul>
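<p>The inputs and output proposed above could be sketched roughly as follows. This is a language-neutral illustration written in Python, not the Drupal Tool API: the class names <code>ImageToTextInput</code> and <code>ImageToTextOutput</code>, the helper <code>run_image_to_text</code>, and the <code>backend</code> callable are all hypothetical, and only the field names (<code>provider</code>, <code>model</code>, <code>image</code>, <code>prompt</code>, <code>options</code>, <code>text</code>) come from this issue.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Union


@dataclass(frozen=True)
class ImageToTextInput:
    """Hypothetical input schema mirroring the fields proposed in this issue."""

    provider: str                    # required: backend used for text extraction
    model: str                       # required: specific OCR/captioning/VQA model
    image: bytes                     # required: binary image input
    prompt: Optional[str] = None     # optional: what text or information to extract
    options: Dict[str, Any] = field(default_factory=dict)  # optional: provider-specific

    def __post_init__(self) -> None:
        # Reject missing or blank required string inputs.
        for name in ("provider", "model"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"'{name}' is required and must be a non-empty string")
        # Reject missing or empty image data.
        if not isinstance(self.image, (bytes, bytearray)) or len(self.image) == 0:
            raise ValueError("'image' is required and must be non-empty binary data")


@dataclass(frozen=True)
class ImageToTextOutput:
    """Hypothetical output schema: a single text field."""

    text: str


def run_image_to_text(
    tool_input: ImageToTextInput,
    backend: Callable[[ImageToTextInput], Union[str, bytes]],
) -> ImageToTextOutput:
    """Call a provider backend (an assumed callable) and normalize its result."""
    raw = backend(tool_input)
    if isinstance(raw, (bytes, bytearray)):
        # Providers returning raw bytes are assumed to use UTF-8.
        raw = bytes(raw).decode("utf-8")
    return ImageToTextOutput(text=str(raw))


# Usage with a stand-in backend; a real one would call the chosen provider.
request = ImageToTextInput(
    provider="example_provider",
    model="example_model",
    image=b"\x89PNG...",
    prompt="list all visible numbers",
)
result = run_image_to_text(request, lambda _input: "4, 8, 15")
print(result.text)
```

<p>Validating the required inputs when the request is constructed, and normalizing every provider result into the same <code>text</code> field, is one way to give UIs and agents the same contract regardless of whether the backend performs OCR, captioning, or visual Q&amp;A.</p>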
issue