Refactor EmbeddingBase and RagTool into smaller protected helpers to ease customization of chunk and search-result formatting (#3584013) · Issues · project / ai_search

### Problem/Motivation

The exact text format used at two distinct stages of an ai_search-backed RAG pipeline can have an impact on the quality of LLM responses:

1. **The chunk text** built by an embedding strategy (`groupFieldData()` + `prepareChunkText()` + `getChunks()` in `EmbeddingBase`). This text is both vectorized *and* retrieved verbatim from the vector DB as the chunk content the LLM ultimately sees as context.
2. **The aggregated search-results string** built by `RagTool::execute()` and handed to the LLM as the result of the function call.

In practice, fairly small formatting tweaks at either stage (e.g. bolding contextual content labels and placing them above the main chunk, replacing the triple-backtick code fences between results with explicit `---` separators, or omitting/renaming the title prefix) can noticeably reduce attribution errors where the LLM cites the wrong source for a quote. (Some of the underlying reasoning around this comes up in [the Anthropic prompt-engineering guidance on long-context and structure](https://docs.anthropic.com/claude/docs/long-context-tips).)

In the current code, doing this is harder than it should be:

- `EmbeddingBase::groupFieldData()` is a single ~60-line method that owns: iterating fields, resolving label/main/contextual classification via index config, flattening values, *and* the per-field templating (`$field->getLabel() . ": " . $value . "\n\n"`). To change the per-field templating (bold labels, single newline, label-then-value vs. value-then-label, etc.), a subclass either has to reimplement the whole method or, as we did, post-process the resulting string with regex, which is fragile.
- `EmbeddingBase::getChunks()` is a single ~50-line method that owns: budget arithmetic for contextual-vs-main share, the "does contextual content fit in N%?" branch, calling `TextChunker`, *and* the cross-join assembly of main × contextual chunks. To inject custom segmentation (e.g. structural boundary markers instead of pure token-based chunking) a subclass has to re-implement most of this, even when the budgeting and assembly logic is unchanged.
- `EmbeddingBase::prepareChunkText()` hardcodes the title rendering (`'# ' . strtoupper($title)`) and the contextual-after-main ordering. Changing either requires overriding the whole method.
- `RagTool::execute()` owns: parameter collection, query construction, result iteration, per-result templating (`"Search result: #$i:\n\`\`\`\n" . $content . "\n\`\`\`\n\n"`), the overall wrapping copy ("Results from searching in the rag index … for the following prompt …"), and the no-results / exception fallbacks. Just to change the per-result template and the separator between results, a subclass has to override the entire `execute()` method, which means duplicating the query/exception/wrap logic and locking the subclass to the parent's current implementation.

The net result is that consumers wanting to experiment with chunk- and result-formatting end up duplicating substantial chunks of base-class logic, which both discourages experimentation and creates ongoing maintenance friction whenever the parent classes evolve.

### Proposed resolution

Break the large methods in `EmbeddingBase` and `RagTool` into small, single-responsibility protected helpers so that subclasses only need to override the formatting concern they care about.

### AI usage (if applicable)

[x] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.

[ ] AI Assisted Code
This code was mainly generated by a human, with AI autocompleting or parts AI generated, but under full human supervision.

[ ] AI Generated Code
This code was mainly generated by an AI with human guidance, and reviewed, tested, and refined by a human.

[ ] Vibe Coded
This code was generated by an AI and has only been functionally tested.
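
To illustrate the proposed resolution, here is a minimal, self-contained sketch of what the split could look like for the result-formatting side of `RagTool::execute()`. It is not the module's code: `RagToolSketch` is a stand-in class, the search step is replaced by a plain array of chunk strings, and all helper names (`formatResult()`, `getResultSeparator()`, `wrapResults()`, `formatNoResults()`) as well as the wrapper and no-results wording are assumptions made up for this example.

```php
<?php

/**
 * Stand-in for RagTool (NOT the real class). It models only the formatting
 * half of execute(): the search itself is replaced by an array of chunk
 * strings. Every helper name here is hypothetical.
 */
class RagToolSketch {

  /**
   * Orchestrates the helpers; subclasses should rarely need to override it.
   *
   * @param string[] $chunks
   *   Chunk contents as they might come back from the vector DB.
   */
  public function execute(string $index, string $prompt, array $chunks): string {
    if ($chunks === []) {
      return $this->formatNoResults($index, $prompt);
    }
    $formatted = [];
    foreach (array_values($chunks) as $i => $content) {
      $formatted[] = $this->formatResult($i + 1, $content);
    }
    $body = implode($this->getResultSeparator(), $formatted);
    return $this->wrapResults($index, $prompt, $body);
  }

  /** Per-result template, approximating the fenced format quoted above. */
  protected function formatResult(int $position, string $content): string {
    // Built with str_repeat() only to keep literal fences out of this example.
    $fence = str_repeat('`', 3);
    return "Search result: #$position:\n$fence\n$content\n$fence";
  }

  /** String placed between consecutive results. */
  protected function getResultSeparator(): string {
    return "\n\n";
  }

  /** Overall wrapper; placeholder wording, not the module's actual copy. */
  protected function wrapResults(string $index, string $prompt, string $body): string {
    return "Results for \"$prompt\" in index $index:\n\n" . $body;
  }

  /** No-results fallback; placeholder wording as well. */
  protected function formatNoResults(string $index, string $prompt): string {
    return "No results for \"$prompt\" in index $index.";
  }

}

/** Example subclass: only the two formatting concerns are overridden. */
class SeparatorRagToolSketch extends RagToolSketch {

  protected function formatResult(int $position, string $content): string {
    return "**Result $position**\n" . $content;
  }

  protected function getResultSeparator(): string {
    return "\n\n---\n\n";
  }

}

$tool = new SeparatorRagToolSketch();
$output = $tool->execute('docs', 'How do I reindex?', ['First chunk.', 'Second chunk.']);
echo $output, PHP_EOL;
```

With this shape, the subclass swaps the code fences for `---` separators without touching query construction, the wrapper, or the fallbacks. The same approach would presumably apply to the per-field templating in `EmbeddingBase::groupFieldData()` and the title rendering in `prepareChunkText()`, though the exact helper boundaries there are left open by this issue.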
