Chunked AI Search embeddings should have a unique ID
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3470652. -->
Reported by: [vivek panicker](https://www.drupal.org/user/3540616)
Related to !53
>>>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>There is no unique ID for the individual chunks that are indexed. This means that back-ends that require an ID on creation have to generate the ID themselves. Given the embeddings are generated by AI Search, it would seem reasonable for the ID to be generated by AI Search too for consistency and to reduce the effort for provider implementations.</p>
<p><code>EmbeddingStrategyInterface::getEmbedding</code> already generates an ID for the embedding, which is then stored in <code>drupal_long_id</code> for indexing. However, both implementations in the AI module use the Search API Item ID directly. That value is not unique per chunk, and it also duplicates the <code>drupal_entity_id</code> metadata field.</p>
<h4 id="summary-steps-reproduce">Steps to reproduce</h4>
<p>Index data, then inspect the index to check whether a unique ID was created for each chunk.</p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<ul>
<li><code>EmbeddingStrategyInterface::getEmbedding</code> implementations should be updated to provide a <em>unique</em> identifier for the chunk. This should be a combination of the Search API Item ID and a unique suffix (e.g. delta).
<ul>
<li><del>MetadataAveragePoolEmbeddingStrategy.php - This needs work</del></li>
<li><del>MetadataEmbeddingBase.php - This needs work</del></li>
<li><del>BasicEmbeddingStrategy.php - This is fine</del></li>
</ul>
</li>
<li><del>Throw an exception in SearchApiAiSearchBackend::indexItems() if the ID is missing or not unique per chunk</del></li>
</ul>
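<p>A minimal sketch of the ID scheme described above, not the module's actual implementation. The helper name <code>build_chunk_id</code>, the <code>:</code> separator, and the sample item ID are assumptions made for illustration only.</p>

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the AI module): builds a per-chunk ID.
 *
 * Combines the Search API Item ID with the chunk delta so that every chunk
 * of the same item gets a distinct, reproducible identifier.
 */
function build_chunk_id(string $item_id, int $delta): string {
  // The ':' separator is an assumption; any separator works as long as
  // the result stays unique per chunk.
  return $item_id . ':' . $delta;
}

// Example value in the usual Search API item ID format.
$item_id = 'entity:node/1:en';
$chunks = ['First chunk of text.', 'Second chunk of text.', 'Third chunk of text.'];

$chunk_ids = [];
foreach ($chunks as $delta => $chunk) {
  $chunk_ids[] = build_chunk_id($item_id, $delta);
}

foreach ($chunk_ids as $chunk_id) {
  echo $chunk_id, PHP_EOL;
}
```

<p>Because the suffix is derived from the delta rather than from a random value, re-indexing the same item would produce the same IDs, so a back-end could overwrite existing chunks instead of accumulating duplicates.</p>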
<h3 id="summary-remaining-tasks">Remaining tasks</h3>
<ul>
<li>Implement the changes</li>
<li>Issue a change notice</li>
<li>Decide whether we should provide a more structured return value to help enforce the correct behaviour, e.g. a DTO or phpdoc array shapes. The function currently returns a plain array with no documentation or enforcement of its contents, which could lead to errors in embedding implementations.</li>
</ul>
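<p>To make the DTO option concrete, here is one possible shape, offered as a sketch only. The class <code>EmbeddingChunk</code>, its properties, and the function <code>assert_unique_chunk_ids</code> are hypothetical and do not exist in the AI module. The sketch requires PHP 8.1 or later for <code>readonly</code> properties.</p>

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical value object for a single embedded chunk.
 *
 * Not part of the AI module; shown only to illustrate the DTO option.
 */
final class EmbeddingChunk {

  /**
   * @param string $id
   *   Unique chunk ID, e.g. the Search API Item ID plus a delta.
   * @param float[] $values
   *   The embedding vector.
   * @param array<string, mixed> $metadata
   *   Additional metadata to index alongside the vector.
   */
  public function __construct(
    public readonly string $id,
    public readonly array $values,
    public readonly array $metadata = [],
  ) {
    if ($id === '') {
      throw new \InvalidArgumentException('Chunk ID must not be empty.');
    }
  }

}

/**
 * Hypothetical check that every chunk in a batch has a distinct ID.
 *
 * @param EmbeddingChunk[] $chunks
 */
function assert_unique_chunk_ids(array $chunks): void {
  $seen = [];
  foreach ($chunks as $chunk) {
    if (isset($seen[$chunk->id])) {
      throw new \InvalidArgumentException("Duplicate chunk ID: {$chunk->id}");
    }
    $seen[$chunk->id] = TRUE;
  }
}
```

<p>A typed object like this would reject a missing ID at construction time, while phpdoc array shapes would only document the expectation for static analysis.</p>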
<h3 id="summary-ui-changes">User interface changes</h3>
<p>None.</p>
<h3 id="summary-api-changes">API changes</h3>
<p><code>drupal_long_id</code>, returned from <code>EmbeddingStrategyInterface::getEmbedding</code>, will now need to be a unique ID.</p>
<h3 id="summary-data-model-changes">Data model changes</h3>
<p><code>drupal_long_id</code> metadata will have a unique value.</p>
<h3 id="summary-original-report">Original report by vivek panicker</h3>
<p>When the item to be indexed is created, a unique ID is no longer generated for it, as was previously done in the Search API AI module.<br>
AI module code:<br>
<img src="https://www.drupal.org/files/issues/2024-08-28/Screenshot%20from%202024-08-28%2011-49-14.png" alt="alt"><br>
Search API AI module code:<br>
<img src="https://www.drupal.org/files/issues/2024-08-28/Screenshot%20from%202024-08-28%2012-19-22.png" alt="alt"></p>
<p>Vector database services such as Pinecone require each indexed item to have a unique ID.<br>
<img src="https://www.drupal.org/files/issues/2024-08-28/Screenshot%20from%202024-08-28%2013-07-36.png" alt="alt"></p>
<p>I believe that if we keep using the same ID for every chunk, the chunks will overwrite one another in the index.</p>