Configure HtmlConverter to use ATX headers
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3568251. -->
Reported by: [arialblack](https://www.drupal.org/user/986636)

Related to !27
>>>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>The AI Search module uses league/html-to-markdown to convert HTML field content to Markdown before chunking and embedding. The current default configuration produces unnecessary Markdown artifacts that significantly increase token counts:</p>
<p>1. <strong>Setext-style heading underlines</strong> for H1 and H2:</p>
<pre>
Main Heading
=======================

Section Heading
-----------------------
</pre>
<p>Each heading adds 20-30 extra characters with no semantic value.<br>
<del><br>
2. <strong>Markdown image references</strong>:</del></p>
<pre>
![alt text](/sites/default/files/image.jpg?itok=abc123)
</pre>
<p><br>
<del><br>
These add file paths and URLs but provide no actual image content to the embedding model.</del></p>
<p><strong>Impact:</strong><br>
- Higher token counts lead to more chunks per item (observed: 8-9 chunks vs expected 3-5)<br>
- More chunks = more embedding API calls = higher costs and rate limit issues<br>
- For text-based semantic search, embeddings work on plain text and don't require Markdown formatting<br>
- Setext underlines and image paths add no semantic value for embeddings</p>
<h4 id="summary-steps-reproduce">Steps to reproduce</h4>
<p>1. Create a node with H1/H2 headings and images<br>
2. Index it with AI Search and use Rendered item<br>
3. Enable debug logging and observe chunk content<br>
4. Note Setext underlines (<code>---</code>, <code>===</code>) and image references <code>![](path)</code><br>
5. Count chunks per node (typically 8-9)</p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>Configure <code>League\HTMLToMarkdown\HtmlConverter</code> to use more compact representations by setting two options in the <code>EmbeddingStrategyPluginBase</code> constructor:</p>
<p>1. <code>header_style: 'atx'</code> - Use ATX-style headings (<code># Heading</code>, <code>## Heading</code>) instead of Setext underlines<br>
<del>2. <code>remove_nodes: 'img'</code> - Completely remove image elements instead of converting them to Markdown references</del></p>
<p>Both options are supported by the library:<br>
- Documentation: <a href="https://github.com/thephpleague/html-to-markdown">https://github.com/thephpleague/html-to-markdown</a><br>
- See "Style notes" section for <code>header_style</code><br>
- See "Conversion options" section for <code>remove_nodes</code></p>
<p><strong>Before:</strong></p>
<pre>
Main Heading
============

Section heading
-----------------------

![Description](/sites/default/files/image.jpg?itok=abc)
</pre>
<p><strong>After:</strong></p>
<pre>
# Main Heading

## Section heading

[Images completely removed]
</pre>
<p><strong>Benefits:</strong><br>
- Reduced token count: Headings use 2-4 characters instead of 20-30<br>
- Fewer chunks per indexed item<br>
- Lower API costs: Fewer embedding API calls<br>
- Better rate limit compliance: Reduced API request volume<br>
- Preserved semantics: Heading text and structure remain clear</p>
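<p>The proposed <code>header_style</code> setting can be sketched as a standalone script. This is a minimal illustration, not the module's actual code: it assumes league/html-to-markdown is installed via Composer with the autoloader at <code>vendor/autoload.php</code>, and the sample HTML is made up. How <code>EmbeddingStrategyPluginBase</code> actually instantiates the converter is not shown in this issue, so the constructor call here is an assumption about where the option would go.</p>

```php
<?php

// Assumption: the library was installed with
// `composer require league/html-to-markdown`.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Hypothetical sample content with H1/H2 headings.
$html = '<h1>Main Heading</h1><h2>Section heading</h2><p>Body text.</p>';

// Library default: H1 and H2 are rendered with Setext underlines.
$defaultConverter = new HtmlConverter();

// Proposed configuration: render H1 and H2 as ATX headings
// ("# Heading", "## Heading") instead.
$atxConverter = new HtmlConverter(['header_style' => 'atx']);

$setextMarkdown = $defaultConverter->convert($html);
$atxMarkdown = $atxConverter->convert($html);

// Compare the two outputs and their lengths to see the saving.
echo $setextMarkdown, "\n\n", $atxMarkdown, "\n\n";
printf(
    "Setext: %d characters, ATX: %d characters\n",
    strlen($setextMarkdown),
    strlen($atxMarkdown)
);
```

<p>Passing the option in the constructor keeps the change to a single line. The library also allows setting it on an existing converter with <code>$converter->getConfig()->setOption('header_style', 'atx')</code>, which may fit better if the converter is created elsewhere.</p>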