Create Document Classification Recipe (#3569202) · Issues · project / ai_initiative

Create Document Classification Recipe

>>> [!note] Migrated issue   Reported by: [marcus_johansson](https://www.drupal.org/user/385947) >>> <p>[Tracker]<br> <strong>Update Summary: </strong>[One-line status update for stakeholders]<br> <strong>Short Description: </strong>Create document classification recipes using Easy and Unstructured approaches<br> <strong>Check-in Date: </strong>MM/DD/YYYY<br> [/Tracker]</p> <h3 id="summary-problem-motivation">Problem/Motivation</h3> <p> Document classification is a common and high-value production use case for AI, especially when working with mixed media inputs such as PDFs, office documents, and structured text files. While the AI ecosystem already contains building blocks for document ingestion and extraction, there is currently no clear, recipe-based way to enable document classification end-to-end in a reusable and opinionated manner.</p> <h3 id="summary-proposed-resolution">Proposed resolution</h3> <h4>Original concept</h4> <p> The original plan proposed two separate recipes, one using simple text extraction (ai_simple_pdf_to_text) and one using Unstructured.io, with a shared baseline that stores extracted text in a long string field on the Document media type.</p> <h4>Revised approach</h4> <p> Based on feedback from fago (<a href="https://www.drupal.org/project/ai_initiative/issues/3569202#comment-16519113">https://www.drupal.org/project/ai_initiative/issues/3569202#comment-16519113</a>) and discussion with Marcus Johansson, the approach was simplified to a single recipe using document_loader as a unified extraction backend. Key changes from the original concept:</p> <ul> <li>One recipe instead of two: document_loader supports all file types and the extraction backend is swappable. Installing alternative document_loader plugins (e.g. Unstructured.io) automatically improves extraction quality without recipe changes.</li> <li>No stored extracted text: the full document text is never persisted to the database, which avoids database bloat for large documents. 
Instead, the recipe generates an LLM summary and classifies taxonomy terms from that summary.</li> <li>Same pattern as the image classification recipe (image → description → tags): file → summary → category/topic classification.</li> </ul> <h4>Implementation</h4> <ul> <li>Adds field_document_summary (text_long) to the Document media type for the LLM-generated summary.</li> <li>Adds field_document_category (entity_reference, cardinality 3) and field_document_topic (entity_reference, cardinality 5) for taxonomy classification.</li> <li>Configures three automator rules in sequence: <ul> <li>LlmDocumentText (weight 100): Extracts text from the uploaded file via document_loader and sends it to the LLM for summarization. For large documents that exceed the context window, the text is chunked and summarized iteratively (map-reduce) using the existing TextChunker/Tokenizer utilities.</li> <li>llm_taxonomy (weight 200): Classifies the summary into up to 3 broad categories.</li> <li>llm_taxonomy (weight 300): Identifies up to 5 specific topics from the summary.</li> </ul> </li> <li>The LlmDocumentText plugin is proposed as an upstream contribution to the AI module: <a href="https://www.drupal.org/project/ai/issues/3582848">https://www.drupal.org/project/ai/issues/3582848</a></li> <li>A document_loader Drupal Token for reusable file text extraction has been suggested as a separate follow-up.</li> </ul> <p> <strong>Project:</strong> <a href="https://www.drupal.org/project/ai_recipe_document_classification">https://www.drupal.org/project/ai_recipe_document_classification</a><br> <strong>Code:</strong> <a href="https://git.drupalcode.org/project/ai_recipe_document_classification">https://git.drupalcode.org/project/ai_recipe_document_classification</a> (branch 1.0.x)</p> <p> <strong>Dependencies:</strong></p> <ul> <li>drupal/ai ^1.1 (specifically ai_automators)</li> <li>drupal/ai_file_to_text ^1.0 (file extractors for document_loader)</li> <li>drupal/document_loader ^2.0</li> </ul> <h3 id="summary-ai-usage">AI usage 
(if applicable)</h3> <p> [x] AI Assisted Issue<br> This issue was generated with AI assistance, but was reviewed and refined by the creator.</p>
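
The iterative (map-reduce) summarization described under Implementation for LlmDocumentText can be illustrated with a minimal sketch. This is not the plugin's code and does not use the AI module's TextChunker/Tokenizer utilities: it is plain Python, `count_tokens`, `chunk_text`, and `summarize_document` are hypothetical names, the whitespace word count is a crude stand-in for a real tokenizer, and `summarize` stands in for whatever LLM call the automator makes.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int) -> List[str]:
    """Split text into consecutive chunks of at most max_tokens words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def summarize_document(
    text: str,
    summarize: Callable[[str], str],
    max_tokens: int,
) -> str:
    """Summarize text, chunking and reducing while it exceeds max_tokens.

    `summarize` is a hypothetical callable wrapping the LLM request.
    """
    while count_tokens(text) > max_tokens:
        # Map step: summarize every chunk independently.
        partials = [summarize(chunk) for chunk in chunk_text(text, max_tokens)]
        # Reduce step: the joined partial summaries become the next input.
        combined = " ".join(partials)
        if count_tokens(combined) >= count_tokens(text):
            # Guard against a summarizer that does not shorten its input.
            raise RuntimeError("summaries are not shrinking; aborting")
        text = combined
    # The text now fits the budget, so one final call produces the summary.
    return summarize(text)
```

Only the final summary would be stored in field_document_summary in this picture; the intermediate chunk summaries are discarded, which matches the goal of not persisting the full document text.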
