Feature: Translation result caching and cross-field deduplication (#3585529)

## Problem/Motivation

`ai_translate` re-sends every field to the LLM on every translation run, with no memory of previous translations. This causes two avoidable costs:

1. **Re-translation of unchanged content.** On a re-translation run (fixing one typo, updating a single field), every unchanged field is sent to the LLM again, even though the output would be identical to the previous run.
2. **Repeated translation of identical strings.** Paragraph-heavy entities and Layout Builder pages often contain the same short string in multiple components (shared CTA labels, navigation titles, standard headings). Each occurrence is translated independently, even though a single translation would suffice.

On a large multilingual site both effects can compound significantly: avoidable token spend and rate-limit pressure accumulate across every translation run.

This issue is related to #3585527 (batch multiple fields per request), which addresses per-field redundancy within a single run. **Caching** and **deduplication** address redundancy across runs and across identical strings within a run. Both optimizations are complementary and independent.

---

## Proposed resolution

Add an opt-in cache and deduplication layer inside the translation pipeline, between field extraction and the LLM call. No changes to existing interfaces or behaviour when the feature is not enabled.

**Deduplication (within a single run):**

- After extracting all field texts for an entity, hash each unique string (SHA-256).
- Group field keys by hash; send each unique string to the LLM only once.
- Map the result back to every field that shared that string.

**Caching (across runs):**

- Before any LLM call, check a Drupal cache backend keyed on `ai_translate:src_lang:tgt_lang:sha256`.
- On a cache hit, return the cached translation immediately.
- On a cache miss, translate and write the result to the cache.
- Use Drupal's standard cache tag and lifetime infrastructure; no special management needed.

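
The deduplication and caching steps above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the module's PHP code: `cache_key`, `translate_fields`, and `fake_translate` are hypothetical names, a plain dict stands in for the Drupal cache backend, and the stub translator stands in for the LLM call. Only the key layout `ai_translate:src_lang:tgt_lang:sha256` is taken from the proposal.

```python
import hashlib
from typing import Callable, Dict, List


def cache_key(src_lang: str, tgt_lang: str, text: str) -> str:
    """Build a key in the proposed ai_translate:src_lang:tgt_lang:sha256 layout."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"ai_translate:{src_lang}:{tgt_lang}:{digest}"


def translate_fields(
    fields: Dict[str, str],
    src_lang: str,
    tgt_lang: str,
    translate: Callable[[str], str],
    cache: Dict[str, str],
) -> Dict[str, str]:
    """Translate field texts, calling `translate` once per unique uncached string."""
    # Deduplication: group field names by the hash-based key of their text.
    groups: Dict[str, List[str]] = {}
    sources: Dict[str, str] = {}
    for field_name, text in fields.items():
        key = cache_key(src_lang, tgt_lang, text)
        groups.setdefault(key, []).append(field_name)
        sources[key] = text

    results: Dict[str, str] = {}
    for key, field_names in groups.items():
        # Caching: only a miss reaches the translator; the result is stored.
        if key not in cache:
            cache[key] = translate(sources[key])
        # Map the single translation back to every field that shared the string.
        for field_name in field_names:
            results[field_name] = cache[key]
    return results


# Stand-in for the LLM call (an assumption); it records every string it is sent.
calls: List[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return f"[de] {text}"


cache: Dict[str, str] = {}  # stand-in for a Drupal cache bin
fields = {"cta_1": "Learn more", "cta_2": "Learn more", "title": "Welcome"}

first = translate_fields(fields, "en", "de", fake_translate, cache)
second = translate_fields(fields, "en", "de", fake_translate, cache)
```

In this sketch the first run sends two strings to the translator for three fields, and the second run sends none. Because the key contains the hash of the source text, an edited field produces a different key and falls through to a fresh translation.
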
**Effect:**

- Re-translating an entity whose content has not changed costs 0 LLM requests.
- A Layout Builder page with 10 identical CTA labels triggers 1 LLM call instead of 10.
- Cache is invalidated automatically by Drupal's cache lifecycle.

**Implementation approach we have working:**

We built this in a custom module during a production project. The core logic lives in a `translateMetadata()` method that intercepts the batch of extracted field items, runs the hash/cache lookup pipeline, calls the bulk or single-field translator for misses only, and maps results back. We are happy to share the implementation as a starting point or patch.

---

## Remaining tasks

- [ ] Decide on cache backend (injectable; default `cache.default`)
- [ ] Decide on opt-in surface (config flag vs. always-on)
- [ ] Align implementation with `TextTranslator` / `TextTranslatorInterface` conventions
- [ ] Tests: cache hit, cache miss, deduplication within a run, cross-run invalidation
- [ ] phpcs (Drupal + DrupalPractice) clean; GitLab CI green
- [ ] Maintainer review

---

## Questions for maintainers

1. Is there interest in receiving this as a contribution?
2. Preferred scope: always-on optimisation or opt-in config flag?
3. Preferred cache backend: `cache.default`, a dedicated `cache.ai_translate` bin, or injectable?

## User interface changes

None. This is a transparent performance optimisation.

## API changes

Additive only. A new optional cache layer inserted into the translation pipeline. `TextTranslator::translateContent()` and all existing interfaces remain unchanged.
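
One way to picture "additive only" is a wrapper that exposes the same method as the translator it wraps and passes calls straight through unless the feature is switched on. The sketch below is a Python illustration under stated assumptions, not the module's API: `Translator`, `CachingTranslator`, `CountingTranslator`, the `enabled` flag, and the `translate_content(text, src_lang, tgt_lang)` signature are all hypothetical, and the real signature of `TextTranslator::translateContent()` is not given in this issue.

```python
import hashlib
from typing import Dict, Optional, Protocol


class Translator(Protocol):
    # Hypothetical signature, used only for illustration.
    def translate_content(self, text: str, src_lang: str, tgt_lang: str) -> str:
        ...


class CachingTranslator:
    """Wraps a translator without changing its interface; caching is opt-in."""

    def __init__(
        self,
        inner: Translator,
        cache: Optional[Dict[str, str]] = None,
        enabled: bool = False,
    ) -> None:
        self.inner = inner
        self.cache = cache if cache is not None else {}  # stand-in for a cache bin
        self.enabled = enabled

    def translate_content(self, text: str, src_lang: str, tgt_lang: str) -> str:
        # Feature off: behave exactly like the wrapped translator.
        if not self.enabled:
            return self.inner.translate_content(text, src_lang, tgt_lang)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = f"ai_translate:{src_lang}:{tgt_lang}:{digest}"
        if key not in self.cache:
            self.cache[key] = self.inner.translate_content(text, src_lang, tgt_lang)
        return self.cache[key]


class CountingTranslator:
    """Stub standing in for the LLM-backed translator; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def translate_content(self, text: str, src_lang: str, tgt_lang: str) -> str:
        self.calls += 1
        return f"[{tgt_lang}] {text}"


# Feature disabled (the default): every call reaches the wrapped translator.
plain = CountingTranslator()
disabled = CachingTranslator(plain)
disabled.translate_content("Welcome", "en", "de")
disabled.translate_content("Welcome", "en", "de")

# Feature enabled: a repeated string for the same language pair is served from cache.
backend = CountingTranslator()
enabled = CachingTranslator(backend, enabled=True)
out1 = enabled.translate_content("Welcome", "en", "de")
out2 = enabled.translate_content("Welcome", "en", "de")
out_fr = enabled.translate_content("Welcome", "en", "fr")
```

Here the cache store is passed in through the constructor, which mirrors the "injectable" option in the questions above. Callers see the same method either way, so turning the flag off restores the uncached behaviour.
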
