Image-to-Text tool
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3562622. -->
Reported by: [marcus_johansson](https://www.drupal.org/user/385947)
Related to !1078
>>>
<p>[Tracker]<br>
<strong>Update Summary: </strong>Introduce a unified Image-to-Text tool definition with required provider/model and support for prompt-based extraction instructions.<br>
<strong>Short Description: </strong>[One-line issue summary for stakeholders]<br>
<strong>Check-in Date: </strong>MM/DD/YYYY<br>
<em>Metadata is used by the <a href="https://www.drupalstarforge.ai/" title="AI Tracker">AI Tracker.</a> Docs and additional fields <a href="https://www.drupalstarforge.ai/ai-dashboard/docs" title="AI Issue Tracker Documentation">here</a>.</em><br>
[/Tracker]</p>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>
The operations that the AI module exposes need corresponding tools for the <a href="https://www.drupal.org/project/tool">Tool API</a>. Drupal AI does not yet define a dedicated Image-to-Text tool for extracting structured or unstructured text from images.<br>
Providers implement OCR, captioning, and visual question answering differently, so UIs and AI Agents have no consistent interface to build on.<br>
A standard ImageToText tool is needed so modules can pass an image together with an optional prompt explaining what information the model should extract, and reliably receive text output using a shared schema.
</p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<ul>
<li>Create a new <code>ImageToText</code> tool type within the Drupal AI Tool API.</li>
<li>Define required inputs:
<ul>
<li><code>provider</code> — backend used for text extraction.</li>
<li><code>model</code> — the specific OCR/captioning/VQA model.</li>
<li><code>image</code> — binary image input.</li>
</ul>
</li>
<li>Add optional inputs:
<ul>
<li><code>prompt</code> — instruction describing what text or information to extract (e.g., “read the handwritten section,” “describe the scene,” “list all visible numbers,” “extract table contents”).</li>
<li><code>options</code> — provider-specific parameters.</li>
</ul>
</li>
<li>Define output format:
<ul>
<li><code>text</code> — extracted or generated text as a UTF-8 string.</li>
</ul>
</li>
<li>Ensure inputs and output are handled consistently across all providers so UIs and Agents can build flows for OCR, captioning, and visual question answering.</li>
</ul>
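<p>The input and output shape listed above could be sketched roughly as follows. This is an illustrative, language-neutral model written in Python, not the Tool API implementation: the class names <code>ImageToTextInput</code> and <code>ImageToTextOutput</code>, the <code>BACKENDS</code> registry, <code>run_image_to_text()</code>, and the <code>echo</code> backend are all hypothetical and exist only to show how required and optional inputs might be validated before a provider is called.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ImageToTextInput:
    """Hypothetical input model; field names mirror the proposal."""
    provider: str                  # required: backend used for text extraction
    model: str                     # required: specific OCR/captioning/VQA model
    image: bytes                   # required: binary image input
    prompt: Optional[str] = None   # optional: what text or information to extract
    options: Dict[str, Any] = field(default_factory=dict)  # optional: provider-specific

    def __post_init__(self) -> None:
        # Reject missing required values early so every backend sees valid input.
        if not self.provider:
            raise ValueError("provider is required")
        if not self.model:
            raise ValueError("model is required")
        if not isinstance(self.image, (bytes, bytearray)) or len(self.image) == 0:
            raise ValueError("image must be non-empty binary data")


@dataclass(frozen=True)
class ImageToTextOutput:
    """Hypothetical output model: a single text field."""
    text: str


# Assumption: backends are looked up by provider name. The real Tool API
# may resolve providers in a completely different way.
Backend = Callable[[ImageToTextInput], str]
BACKENDS: Dict[str, Backend] = {}


def run_image_to_text(request: ImageToTextInput) -> ImageToTextOutput:
    """Dispatch a validated request to the named backend and wrap the result."""
    backend = BACKENDS.get(request.provider)
    if backend is None:
        raise LookupError(f"unknown provider: {request.provider}")
    return ImageToTextOutput(text=backend(request))


def _echo_backend(request: ImageToTextInput) -> str:
    """Stand-in backend for illustration only; it performs no real OCR."""
    prompt = request.prompt or "no prompt"
    return f"[{request.model}] {prompt} ({len(request.image)} bytes)"


BACKENDS["echo"] = _echo_backend
```

<p>With a shape like this, a caller only fills in the shared fields and reads <code>text</code> from the result, regardless of whether the backend performs OCR, captioning, or visual question answering.</p>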