Configure HtmlConverter to use ATX headers
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3568251. -->
Reported by: [arialblack](https://www.drupal.org/user/986636)
Related to !27
>>>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>The AI Search module uses league/html-to-markdown to convert HTML field content to Markdown before chunking and embedding. The current default configuration produces unnecessary Markdown artifacts that significantly increase token counts:</p>
<p>1. <strong>Setext-style heading underlines</strong> for H1 and H2:</p>
<pre>
Main Heading
=======================
Section Heading
-----------------------
</pre><p>Each heading adds 20-30 extra characters with no semantic value.</p>
<p><del>2. <strong>Markdown image references</strong>:</del></p>
<pre>

</pre><p><del>These add file paths and URLs but provide no actual image content to the embedding model.</del></p>
<p><strong>Impact:</strong><br>
- Higher token counts lead to more chunks per item (observed: 8-9 chunks vs expected 3-5)<br>
- More chunks = more embedding API calls = higher costs and rate limit issues<br>
- For text-based semantic search, embeddings work on plain text and don't require Markdown formatting<br>
- Setext underlines and image paths add no semantic value for embeddings</p>
<h4 id="summary-steps-reproduce">Steps to reproduce</h4>
<p>1. Create a node with H1/H2 headings and images<br>
2. Index it with AI Search using Rendered item<br>
3. Enable debug logging and observe chunk content<br>
4. Note the Setext underlines (<code>---</code>, <code>===</code>) and the Markdown image references<br>
5. Count chunks per node (typically 8-9)</p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>Configure <code>League\HTMLToMarkdown\HtmlConverter</code> to use more compact representations by setting two options in the <code>EmbeddingStrategyPluginBase</code> constructor:</p>
<p>1. <code>header_style: 'atx'</code> - Use ATX-style headings (<code># Heading</code>, <code>## Heading</code>) instead of Setext underlines<br>
<del>2. <code>remove_nodes: 'img'</code> - Completely remove image elements instead of converting them to Markdown references</del></p>
<p>Both options are supported by the library:<br>
- Documentation: <a href="https://github.com/thephpleague/html-to-markdown">https://github.com/thephpleague/html-to-markdown</a><br>
- See "Style notes" section for <code>header_style</code><br>
- See "Conversion options" section for <code>remove_nodes</code></p>
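<p>As a minimal sketch of the <code>header_style</code> option, assuming the library is installed through Composer: the snippet below passes the option to the <code>HtmlConverter</code> constructor and converts a small, hypothetical HTML sample. It is a standalone illustration, not the actual <code>EmbeddingStrategyPluginBase</code> code, and the sample markup and autoload path are assumptions.</p>

```php
<?php

// Sketch only: assumes league/html-to-markdown was installed with Composer
// and that vendor/autoload.php sits next to this script.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// 'header_style' => 'atx' makes H1/H2 render with # prefixes
// instead of the default Setext underlines.
$converter = new HtmlConverter(['header_style' => 'atx']);

// Hypothetical sample input, not taken from the module.
$html = '<h1>Main Heading</h1><h2>Section heading</h2><p>Body text.</p>';

echo $converter->convert($html);
```

<p>The same option can also be set on an existing converter with <code>$converter->getConfig()->setOption('header_style', 'atx')</code>, which may fit better if the converter is created elsewhere.</p>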
<p><strong>Before:</strong></p>
<pre>
Main Heading
============
Section heading
-----------------------

</pre><p><strong>After:</strong></p>
<pre>
# Main Heading
## Section heading
[Images completely removed]
</pre><p><strong>Benefits:</strong><br>
- Reduced token count: Headings use 2-4 characters instead of 20-30<br>
- Fewer chunks per indexed item<br>
- Lower API costs: Fewer embedding API calls<br>
- Better rate limit compliance: Reduced API request volume<br>
- Preserved semantics: Heading text and structure remain clear</p>