Add a Moderation Guardrail plugin (configurable moderation provider/model as a guardrail) (#3586531)

## Goal

Add a single configurable **Moderation Guardrail** plugin to the **Guardrails** system, so any moderation provider/model can be used as a guardrail. This makes "each moderation provider becomes a possible guardrail" achievable through configuration of one plugin, and is the foundation for folding external moderation into Guardrails.

This issue covers **only creating the plugin**. The migration of existing `ai.external_moderation` configuration onto guardrail sets, and the deprecation of the old moderation runner, are handled separately in **#3586528** (which is blocked by this issue).

## Background

External moderation currently lives in AI Core as a parallel mechanism to Guardrails (#3479913):

- **Config:** `ai.external_moderation` → key `moderations` (schema `type: ignore`). Each entry is `{ provider: <chat provider id>, models: ["<providerId>__<modelId>", …], tags: "<comma,separated>" }`.
- **Runtime:** `src/EventSubscriber/ModeratePreRequestEventSubscriber.php` subscribes to `PreGenerateResponseEvent`, matches entries by the request's **provider id** and **tags** (`matchConfigs()`), runs each configured moderation model via `$provider->moderation($input, $model_id)->getNormalized()->isFlagged()`, and throws `AiUnsafePromptException` when flagged.

So moderation is currently: **pre-request only, input only, hard-stop, scoped by provider + tags**, invoked through the `moderation` operation type.

Guardrails (also core) is the newer abstraction for exactly this job:

- **Plugin type** `ai_guardrail`: attribute `Drupal\ai\Attribute\AiGuardrail`, manager `plugin.manager.ai_guardrail`, namespace `Plugin/AiGuardrail`, base `AiGuardrailPluginBase`, interface `AiGuardrailInterface` (`label()`, `isAvailable()`, `processInput(InputInterface): GuardrailResultInterface`, `processOutput(OutputInterface): GuardrailResultInterface`). `NonDeterministicGuardrailInterface` marks guardrails that need an AI provider; moderation is exactly this.
- **Result types:** `PassResult`, `StopResult` (carries `getScore()`), `RewriteInputResult`, `RewriteOutputResult`.
- **Config entities:** `ai_guardrail` (a configured plugin instance: `guardrail` + `guardrail_settings`) and `ai_guardrail_set` (`stop_threshold`, `pre_generate_guardrails`, `post_generate_guardrails`).

Making moderation "just another guardrail" lets it reuse the existing global/agent/automator wiring (e.g. the agent `guardrail_set` property, and #3586447 for Automators).

## Proposed approach

Add a **Moderation Guardrail** plugin in `src/Plugin/AiGuardrail/ModerationGuardrail.php`:

- `#[AiGuardrail(id: 'moderation_guardrail', label: new TranslatableMarkup('Moderation Guardrail'), description: …)]`
- implements `NonDeterministicGuardrailInterface` (needs an AI provider) and `NonStreamableGuardrailInterface`.
- A **single configurable plugin** (not a per-provider deriver): its `guardrail_settings` select the moderation **provider** + **model** (and optional provider config). Site builders create one configured `ai_guardrail` entity per moderation provider/model they want.
- **Input only:** implement `processInput()`: run `$provider->moderation($input, $model_id)->getNormalized()->isFlagged()` and return a `StopResult` (with a score) when flagged, `PassResult` otherwise. `processOutput()` returns a pass (no output moderation in this scope). This preserves today's input-only behaviour; output moderation can be a follow-up.
- **Stop via result, not exception:** the throw-based `AiUnsafePromptException` stop is replaced by returning `StopResult`; the guardrail set's `stop_threshold` decides whether the request halts. A `StopResult` is sufficient; no BC shim for the old exception is planned.
- New schema: `ai.guardrail.settings.moderation_guardrail` (provider id, model id, optional `llm_config`-style mapping, violation message).
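
As a rough illustration of the settings described above, the new schema might look something like the sketch below. Only the schema name `ai.guardrail.settings.moderation_guardrail` comes from this issue. The key names (`provider`, `model`, `provider_config`, `violation_message`) are hypothetical placeholders, and leaving the optional provider configuration as `type: ignore` is an assumption, not a decision.

```yaml
# Hypothetical sketch of the plugin settings schema.
# Key names are illustrative assumptions, not decided in this issue.
ai.guardrail.settings.moderation_guardrail:
  type: mapping
  label: 'Moderation Guardrail settings'
  mapping:
    provider:
      type: string
      label: 'Moderation provider ID'
    model:
      type: string
      label: 'Moderation model ID'
    provider_config:
      # Optional llm_config-style mapping; left untyped in this sketch.
      type: ignore
      label: 'Provider configuration'
    violation_message:
      type: label
      label: 'Violation message'
```

Under the same assumptions, a configured `ai_guardrail` entity would carry those values under `guardrail_settings`. Only the properties relevant here are shown, and the ID, label, provider and model values are made-up examples:

```yaml
# Hypothetical configured guardrail; all values are placeholders.
id: example_moderation
label: 'Example moderation'
guardrail: moderation_guardrail
guardrail_settings:
  provider: example_provider
  model: example_moderation_model
  violation_message: 'This prompt was flagged by moderation.'
```

One such entity per moderation provider/model could then be referenced from a guardrail set's `pre_generate_guardrails`.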
## Resolved decisions

* **Stop semantics** → returning a `StopResult` (with `stop_threshold`) is sufficient; the old `AiUnsafePromptException` path is dropped, no compatibility shim.
* **Plugin shape** → a single configurable `moderation_guardrail` plugin (provider/model chosen in settings), not a per-provider deriver.
* **Output moderation** → input only for this issue; `processOutput()` is a no-op pass. Output moderation can be a follow-up.
* **Naming** → the plugin is named **Moderation Guardrail** (`moderation_guardrail`). See #3586471.

## Resources

* Migration / deprecation sibling (blocked by this issue): #3586528
* External moderation → core: #3479913
* Guardrails naming: #3586471 · Guardrails on Automators: #3586447 · During-generate modes: #3586491
* Current runner (reference for behaviour to reproduce): `src/EventSubscriber/ModeratePreRequestEventSubscriber.php`
* Guardrail plugin type: `src/Attribute/AiGuardrail.php`, `src/Guardrail/AiGuardrailInterface.php`, `src/Guardrail/AiGuardrailPluginManager.php`
* Guardrail config entities: `src/Entity/AiGuardrail.php`, `src/Entity/AiGuardrailSet.php`

## Decision
