Change how users select `tokenizer chat model` on AI Search / Search API server
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3472212. -->
Reported by: [jackbravo](https://www.drupal.org/user/138388)
>>>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>The <strong>Tokenizer chat model</strong> select input on the AI Search Server configuration page can show models not supported by the current implementation of the <strong>Drupal\ai\Utility\Tokenizer</strong> class.</p>
<h4 id="summary-steps-reproduce">Steps to reproduce</h4>
<p>1. Enable an AI provider other than OpenAI, such as Ollama, with any model (for example llama3.1, qwen2, or gemma2).<br>
2. Configure an AI Search server (with Milvus, which is currently the only option).<br>
3. The `Tokenizer chat model` select input lists those models as options.<br>
4. Configure an AI Search index.<br>
5. Run indexing. It fails with the following error:</p>
<p><code>InvalidArgumentException: Unknown model name: llama3.1:latest in Yethee\Tiktoken\EncoderProvider->getForModel() (line 123 of /var/www/html/vendor/yethee/tiktoken/src/EncoderProvider.php).</code></p>
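<p>The exception is raised by the model lookup itself, so the failure can be guarded at that call. The following is only a minimal sketch, assuming the <code>yethee/tiktoken</code> package is installed via Composer. The helper name <code>encoder_for_model()</code> and the choice of <code>cl100k_base</code> as a fallback are hypothetical and not part of the module.</p>

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: look up an encoder, falling back to a default encoding.
 *
 * $provider is expected to be a Yethee\Tiktoken\EncoderProvider. It is typed
 * as object only so that this sketch can be exercised with a stub.
 */
function encoder_for_model(object $provider, string $model, string $fallback = 'cl100k_base'): object
{
    try {
        // Throws InvalidArgumentException for names such as "llama3.1:latest".
        return $provider->getForModel($model);
    } catch (\InvalidArgumentException $e) {
        // Assumption: an approximate token count is acceptable for chunking,
        // so an encoding that does not match the model is still usable.
        return $provider->get($fallback);
    }
}
```

<p>With the real library this might be called as <code>encoder_for_model(new \Yethee\Tiktoken\EncoderProvider(), 'llama3.1:latest')</code>. Token counts from a fallback encoding will not match the model's own tokenizer, so they should be treated as estimates.</p>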
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>A few suggestions:</p>
<ol>
<li>Instead of showing every enabled model (many of which may not be supported by the current <strong>Yethee\Tiktoken\EncoderProvider</strong>), show only the currently supported models.</li>
<li>Also provide a simple character-splitting option alongside the specialized token-splitting one.</li>
<li>Provide clear help text to guide users of the module in choosing between the available options.</li>
<li>Provide a sensible default value, perhaps <strong>gpt-4 -> cl100k_base</strong>.</li>
</ol>
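<p>The character-splitting option from the list above could look roughly like the sketch below. It is an illustration only: the function name <code>split_by_characters()</code> and its parameters are hypothetical, and it assumes the <code>mbstring</code> extension is available so that multibyte characters are not cut in half.</p>

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: split text into chunks of at most $chunkSize
 * characters, with $overlap characters shared between neighbouring chunks.
 *
 * @return string[]
 */
function split_by_characters(string $text, int $chunkSize, int $overlap = 0): array
{
    if ($chunkSize < 1 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new \InvalidArgumentException('Require chunkSize >= 1 and 0 <= overlap < chunkSize.');
    }

    $length = mb_strlen($text);
    $step = $chunkSize - $overlap;
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        // mb_substr counts characters, not bytes, so UTF-8 text stays intact.
        $chunks[] = mb_substr($text, $start, $chunkSize);

        // Stop once this chunk has reached the end of the text, so the
        // overlap does not produce a trailing chunk of repeated characters.
        if ($start + $chunkSize >= $length) {
            break;
        }
    }

    return $chunks;
}
```

<p>Because it needs no model-specific vocabulary, a splitter like this would work with any provider, at the cost of chunk sizes that only loosely track the model's real token limits.</p>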
<h3 id="summary-remaining-tasks">Remaining tasks</h3>
<h3 id="summary-ui-changes">User interface changes</h3>
<h3 id="summary-api-changes">API changes</h3>
<h3 id="summary-data-model-changes">Data model changes</h3>