checksum-validated PII guardrail plugins (Luhn for credit cards, MOD-97 for IBANs)
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3580692. -->
Reported by: [ajv009](https://www.drupal.org/user/3653917)
>>>
<p>[Tracker]<br>
<strong>Update Summary: </strong>Feature request for checksum-validated PII guardrail plugins (Luhn for credit cards, MOD-97 for IBANs)<br>
<strong>Short Description: </strong>Add guardrail plugins with algorithmic validation to reduce false positives in PII detection<br>
<strong>Check-in Date: </strong>03/22/2026<br>
<em>Metadata is used by the <a href="https://www.drupalstarforge.ai/" title="AI Tracker">AI Tracker.</a> Docs and additional fields <a href="https://www.drupalstarforge.ai/ai-dashboard/docs" title="AI Issue Tracker Documentation">here</a>.</em><br>
[/Tracker]</p>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>The current <code>regexp_guardrail</code> plugin detects PII by pattern shape only. It has no way to validate whether a matched string is actually a valid credit card number, IBAN, or other structured identifier. This causes a high rate of false positives that block legitimate AI responses.</p>
<p>The <a href="https://www.drupal.org/project/ai_recipe_guardrails_pii">AI Guardrails PII recipe</a> documents this problem explicitly in its README. The false positives are not theoretical; they affect common content that LLMs generate regularly.</p>
<p>I wrote a <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">PHP test script</a> that implements both checksum algorithms and tests them against real-world data. Run it with <code>php pii-checksum-validation-test.php</code> to reproduce everything below.</p>
<h4>Credit card detection: regex vs regex + Luhn</h4>
<p>The credit card regex <code>/(?&lt;!\d)(?:\d[\s-]?){12,19}\d(?!\d)/</code> matches any sequence of 13-20 digits, with an optional whitespace character or hyphen after each digit. The <a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> (ISO/IEC 7812-1) is the standard checksum used by all major card networks. Results from the test script:</p>
<table>
<tr>
<th>Category</th>
<th>Number</th>
<th>Description</th>
<th>Regex</th>
<th>Luhn</th>
<th>Correct?</th>
</tr>
<tr>
<td>Real card</td>
<td><code>4111111111111111</code></td>
<td>Visa test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>4242424242424242</code></td>
<td>Stripe Visa test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>5555555555554444</code></td>
<td>Mastercard test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>2223003122003222</code></td>
<td>Mastercard 2-series test</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>378282246310005</code></td>
<td>Amex test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>6011111111111117</code></td>
<td>Discover test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>3530111333300000</code></td>
<td>JCB test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>30569309025904</code></td>
<td>Diners Club test card</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>4111 1111 1111 1111</code></td>
<td>Visa with spaces</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real card</td>
<td><code>5555-5555-5555-4444</code></td>
<td>Mastercard with dashes</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td colspan="6"></td>
</tr>
<tr>
<td>ISBN</td>
<td><code>9780131103627</code></td>
<td>The C Programming Language</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>ISBN</td>
<td><code>9780596009205</code></td>
<td>Head First Design Patterns</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>ISBN</td>
<td><code>978-3-16-148410-0</code></td>
<td>ISBN-13 with hyphens</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>ISBN</td>
<td><code>9780062316110</code></td>
<td>Thinking, Fast and Slow</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>ISBN</td>
<td><code>9780735211292</code></td>
<td>Atomic Habits</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Barcode</td>
<td><code>4006381333931</code></td>
<td>EAN-13: German product barcode</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Barcode</td>
<td><code>5901234123457</code></td>
<td>EAN-13: Polish product barcode</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Timestamp</td>
<td><code>20250322143052</code></td>
<td>2025-03-22 14:30:52</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Timestamp</td>
<td><code>20230101000000</code></td>
<td>2023-01-01 00:00:00</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Timestamp</td>
<td><code>20260322150000</code></td>
<td>2026-03-22 15:00:00</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Device ID</td>
<td><code>353456789012345</code></td>
<td>IMEI: mobile device identifier</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Device ID</td>
<td><code>490154203237518</code></td>
<td>IMEI: another device</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Tracking</td>
<td><code>92748999985493569564</code></td>
<td>USPS tracking number (20 digits)</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Math</td>
<td><code>3141592653589793</code></td>
<td>Pi digits</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Math</td>
<td><code>2718281828459045</code></td>
<td>Euler number</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Database ID</td>
<td><code>1234567890123456</code></td>
<td>Sequential 16-digit database ID</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Database ID</td>
<td><code>9876543210987</code></td>
<td>Sequential 13-digit revision ID</td>
<td>MATCH</td>
<td>PASS</td>
<td><strong>FP!</strong></td>
</tr>
<tr>
<td>Financial</td>
<td><code>021000021123456789</code></td>
<td>US routing + account number</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Network</td>
<td><code>1921681001721631</code></td>
<td>Two IP addresses concatenated</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Numeric</td>
<td><code>1111111111111111</code></td>
<td>All 1s (16 digits)</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
</table>
<p><strong>Result: 21 false positives with regex only, 7 with regex + Luhn. Luhn eliminates 66.7% of false positives.</strong> Most of the 7 surviving false positives (marked FP!) pass Luhn by coincidence: roughly 10% of random digit strings do, which is <a href="https://stripe.com/resources/more/how-to-use-the-luhn-algorithm-a-guide-in-applications-for-businesses">consistent with the mathematical expectation</a>. The IMEI is the exception: IMEI check digits are themselves computed with Luhn, so valid IMEIs always pass. The survivors can be reduced further by adding <a href="https://en.wikipedia.org/wiki/Payment_card_number#Issuer_identification_number_(IIN)">BIN/IIN prefix validation</a> (for example, Visa numbers start with 4 and are typically 16 digits; Mastercard numbers start with 51-55 or 2221-2720 and are 16 digits).</p>
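<p>The two-step check described above can be sketched as follows. This is a minimal illustration in Python, not the PHP from the gist; the names <code>CARD_RE</code>, <code>luhn_valid</code> and <code>find_card_numbers</code> are hypothetical, and the regex is the one quoted in this issue, compiled with ASCII-only digit matching as an added assumption.</p>

```python
import re
from typing import List

# Candidate pattern quoted in this issue: 13-20 digits, each optionally
# followed by a whitespace character or hyphen. re.ASCII (an assumption of
# this sketch) restricts \d to 0-9.
CARD_RE = re.compile(r"(?<!\d)(?:\d[\s-]?){12,19}\d(?!\d)", re.ASCII)


def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk right to left and double every second digit; a doubled value
    # above 9 contributes the sum of its two digits (value - 9).
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return only the regex matches that also pass Luhn."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]
```

<p>With this sketch, <code>find_card_numbers("ISBN 9780131103627, card 4111 1111 1111 1111.")</code> keeps the Visa test number and drops the ISBN, while an ISBN such as <code>9780062316110</code> still passes, matching the FP! rows in the table.</p>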
<h4>IBAN detection: regex vs regex + MOD-97</h4>
<p>The IBAN regex <code>/(?&lt;!\w)[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}(?!\w)/i</code> matches any 2 letters + 2 digits + 11-30 alphanumerics. The <a href="https://en.wikipedia.org/wiki/International_Bank_Account_Number#Validating_the_IBAN">MOD-97 algorithm</a> (ISO 7064) is the standard checksum all real IBANs must pass:</p>
<table>
<tr>
<th>Category</th>
<th>String</th>
<th>Description</th>
<th>Regex</th>
<th>MOD-97</th>
<th>Correct?</th>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>DE89370400440532013000</code></td>
<td>Germany (Deutsche Bank)</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>GB29NWBK60161331926819</code></td>
<td>UK (NatWest)</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>FR7630006000011234567890189</code></td>
<td>France</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>NL91ABNA0417164300</code></td>
<td>Netherlands (ABN AMRO)</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>ES9121000418450200051332</code></td>
<td>Spain</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>BE68539007547034</code></td>
<td>Belgium</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>CH9300762011623852957</code></td>
<td>Switzerland</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>AT611904300234573201</code></td>
<td>Austria</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>PL61109010140000071219812874</code></td>
<td>Poland</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td>Real IBAN</td>
<td><code>IE29AIBK93115212345678</code></td>
<td>Ireland</td>
<td>MATCH</td>
<td>PASS</td>
<td>YES</td>
</tr>
<tr>
<td colspan="6"></td>
</tr>
<tr>
<td>Config ID</td>
<td><code>AB12CDEF1234567890</code></td>
<td>Drupal module machine name</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Config ID</td>
<td><code>AI01GUARDRAIL000001</code></td>
<td>AI module config entity ID</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Build/VCS</td>
<td><code>VR12BUILD2024032201</code></td>
<td>Software version/build string</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Serial</td>
<td><code>SN01DELL2024XPS1599</code></td>
<td>Dell serial number</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Serial</td>
<td><code>HP15LAPTOP20240101AB</code></td>
<td>HP product serial number</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Error code</td>
<td><code>EX12ABCDEF12345678</code></td>
<td>Exception/error reference code</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Tracking</td>
<td><code>DH12SHIPMENT0012345</code></td>
<td>DHL-like tracking code</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Tracking</td>
<td><code>FE95EXPRESS00054321</code></td>
<td>FedEx-like tracking code</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>API key</td>
<td><code>SK01LIVE0000TESTKEY1</code></td>
<td>Stripe-like API key prefix</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Tax ref</td>
<td><code>TX45ASSESSMENT00001</code></td>
<td>Tax assessment reference</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Hex ID</td>
<td><code>FF00CC99AA88BB77EE</code></td>
<td>Hex color/identifier string</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
<tr>
<td>Crypto</td>
<td><code>BC10QW20ER30TY40UI50</code></td>
<td>Crypto-like address fragment</td>
<td>MATCH</td>
<td>FAIL</td>
<td>YES</td>
</tr>
</table>
<p><strong>Result: 12 false positives with regex only, 0 with regex + MOD-97. MOD-97 eliminates 100% of false positives in this sample.</strong> The theoretical pass rate for a random string under MOD-97 is only 1/97 (~1%).</p>
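<p>A minimal Python sketch of the regex + MOD-97 combination, for illustration only (the gist's PHP script remains the reference). <code>IBAN_RE</code>, <code>mod97_valid</code> and <code>find_ibans</code> are hypothetical names; the pattern is the one quoted in this issue.</p>

```python
import re
from typing import List

# IBAN candidate pattern quoted in this issue (case-insensitive).
IBAN_RE = re.compile(
    r"(?<!\w)[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}(?!\w)",
    re.IGNORECASE | re.ASCII,
)


def mod97_valid(candidate: str) -> bool:
    """Return True if `candidate` passes the IBAN MOD-97 check."""
    compact = re.sub(r"\s+", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move country code + check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_ibans(text: str) -> List[str]:
    """Return only the regex matches that also pass MOD-97."""
    return [m.group(0) for m in IBAN_RE.finditer(text) if mod97_valid(m.group(0))]
```

<p>Python integers are arbitrary precision, so the remainder can be taken in one step. A PHP port would likely need to compute it in chunks or with an arbitrary-precision function such as <code>bcmod</code>, since the converted number exceeds the native integer range.</p>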
<h4>Bonus finding: IBAN regex word-eating bug</h4>
<p>The <code>(?:\s?[A-Z0-9])</code> group in the IBAN regex treats <code>space + letter</code> as a valid IBAN character. When an IBAN appears mid-sentence, the regex consumes following words:</p>
<pre>
Input:   "IBAN DE89370400440532013000 is valid"
Matches: "DE89370400440532013000 is valid"
                                ^^^^^^^^^ consumed by regex
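</pre><p>This happens because <code>\s?</code> inside the repeating group lets the pattern match a space followed by a letter from the next word, and the <code>/i</code> flag makes lowercase letters count as IBAN characters. The match keeps growing until it reaches the 30-character limit or a character that is neither whitespace nor alphanumeric. This may warrant a separate fix in the recipe's regex pattern.</p>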
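<p>The behaviour can be reproduced with a few lines of Python. The second pattern is a hypothetical variant (the same regex without the case-insensitive flag), shown only to illustrate one possible mitigation; it is not a fix proposed by the recipe.</p>

```python
import re

TEXT = "IBAN DE89370400440532013000 is valid"
PATTERN = r"(?<!\w)[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}(?!\w)"

# Pattern as quoted in this issue: case-insensitive.
ORIGINAL = re.compile(PATTERN, re.IGNORECASE)

# Hypothetical variant: identical pattern, but case-sensitive.
UPPERCASE_ONLY = re.compile(PATTERN)

original_match = ORIGINAL.search(TEXT).group(0)
variant_match = UPPERCASE_ONLY.search(TEXT).group(0)

print(repr(original_match))  # 'DE89370400440532013000 is valid'
print(repr(variant_match))   # 'DE89370400440532013000'
```

<p>The variant has trade-offs: it misses IBANs written in lowercase and still consumes trailing words that are in uppercase. The bug also matters for checksum validation, because a checksum computed over the over-long match would most likely fail even though the text contains a valid IBAN.</p>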
<h3>Industry context</h3>
<p>These checksum algorithms are industry standards, not novel ideas:</p>
<p><strong>Luhn algorithm (ISO/IEC 7812-1) for credit cards:</strong></p>
<ul>
<li>Used by all major card networks; payment processors such as <a href="https://stripe.com/resources/more/how-to-use-the-luhn-algorithm-a-guide-in-applications-for-businesses">Stripe</a>, Adyen, and Square use Luhn as their first validation step</li>
<li>Trivially implementable in PHP (roughly 10 lines of code, <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">see test script</a>)</li>
<li>Can be combined with <a href="https://en.wikipedia.org/wiki/Payment_card_number#Issuer_identification_number_(IIN)">BIN/IIN prefix validation</a> for even higher accuracy</li>
</ul>
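<p>BIN/IIN prefix validation could look roughly like the Python sketch below. The prefix table is deliberately incomplete (Visa, Mastercard and American Express only) and <code>card_network</code> is a hypothetical helper; a real plugin would need a maintained list of ranges and lengths.</p>

```python
from typing import Optional


def card_network(number: str) -> Optional[str]:
    """Return a network name if prefix and length fit, else None.

    Illustrative subset only: networks not listed here return None.
    """
    digits = "".join(ch for ch in number if "0" <= ch <= "9")
    length = len(digits)
    # Visa: starts with 4; 13, 16 or 19 digits.
    if digits.startswith("4") and length in (13, 16, 19):
        return "visa"
    # Mastercard: 51-55 or 2221-2720; 16 digits.
    if length == 16 and (
        51 <= int(digits[:2]) <= 55 or 2221 <= int(digits[:4]) <= 2720
    ):
        return "mastercard"
    # American Express: 34 or 37; 15 digits.
    if length == 15 and digits[:2] in ("34", "37"):
        return "amex"
    return None
```

<p>Applied to the seven Luhn survivors from the table, this sketch rejects six of them. The Euler digits <code>2718281828459045</code> still pass, because 2718 falls inside the Mastercard 2-series range, so prefix checks reduce false positives without eliminating them.</p>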
<p><strong>MOD-97 algorithm (ISO 7064) for IBANs:</strong></p>
<ul>
<li>Moves the first 4 characters (country code + check digits) to the end, converts letters to numbers (A=10 through Z=35), and checks that the remainder after division by 97 equals 1</li>
<li>Can be combined with <a href="https://en.wikipedia.org/wiki/International_Bank_Account_Number#IBAN_formats_by_country">country-specific length validation</a> (DE=22, FR=27, GB=22, etc.)</li>
</ul>
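<p>Country-specific length validation is a cheap extra filter that can run before the checksum. The Python sketch below is illustrative: <code>IBAN_LENGTHS</code> covers only the ten countries that appear in the table in this issue, and <code>iban_length_ok</code> with its optional allowlist parameter is a hypothetical helper.</p>

```python
from typing import Iterable, Optional

# Total IBAN length per country code. Illustrative subset: only the
# countries used in this issue's examples.
IBAN_LENGTHS = {
    "AT": 20, "BE": 16, "CH": 21, "DE": 22, "ES": 24,
    "FR": 27, "GB": 22, "IE": 22, "NL": 18, "PL": 28,
}


def iban_length_ok(
    candidate: str, allowed_countries: Optional[Iterable[str]] = None
) -> bool:
    """Return True if the country is known (and allowed) and the length fits."""
    compact = "".join(candidate.split()).upper()
    country = compact[:2]
    if allowed_countries is not None and country not in set(allowed_countries):
        return False
    expected = IBAN_LENGTHS.get(country)
    return expected is not None and len(compact) == expected
```

<p>In this sketch an unknown country code is rejected outright, which is the stricter choice. A plugin that must accept countries outside its table could instead skip the length check for them and rely on MOD-97 alone.</p>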
<p>Major PII detection frameworks pair pattern matching with a further validation or scoring step:</p>
<ul>
<li><a href="https://github.com/microsoft/presidio">Microsoft Presidio</a> (open-source): <a href="https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/credit_card_recognizer.py"><code>CreditCardRecognizer</code></a> sets confidence to 1.0 after passing Luhn, 0.0 if it fails. <a href="https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/iban_recognizer.py"><code>IbanRecognizer</code></a> validates MOD-97.</li>
<li><a href="https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html">AWS Comprehend</a>: ML-based detection with high confidence scoring for structured PII types</li>
<li><a href="https://cloud.google.com/sensitive-data-protection/docs/concepts-infotypes">Google Cloud DLP</a>: Pattern matching + checksum validation + context-aware likelihood scoring</li>
<li><a href="https://protectai.com/llm-guard">LLM Guard</a> (Protect AI): Combines regex patterns with validation logic for structured PII</li>
</ul>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>Add checksum-validated guardrail plugins to the AI module. Two possible approaches:</p>
<p><strong>Option A: New dedicated guardrail plugins</strong></p>
<p>Ship two new guardrail plugins alongside <code>regexp_guardrail</code>:</p>
<ul>
<li><code>CreditCardGuardrail</code>: Regex matching + Luhn checksum + optional BIN prefix validation. Configuration: violation message, optional list of card network prefixes to match.</li>
<li><code>IbanGuardrail</code>: Regex matching + MOD-97 checksum + optional country code allowlist. Configuration: violation message, optional list of accepted country codes.</li>
</ul>
<p>These would be deterministic (no LLM cost), fast, and significantly more accurate than regex alone.</p>
<p><strong>Option B: Add a validator callback to regexp_guardrail</strong></p>
<p>Extend <code>regexp_guardrail</code> with an optional "post-match validator" config. Built-in validators: <code>luhn</code>, <code>mod97</code>, <code>none</code> (default). The regex runs first, and if it matches, the validator runs as a second check. Only if both pass does the guardrail fire.</p>
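<p>The control flow of Option B can be sketched in Python as a validator lookup applied to each regex match. Everything here is hypothetical (the <code>VALIDATORS</code> registry, the <code>guardrail_fires</code> function and its parameters); it only illustrates the "regex first, validator second" order and does not reflect the AI module's plugin API.</p>

```python
import re
from typing import Callable, Dict


def luhn(match: str) -> bool:
    """Luhn checksum over the ASCII digits in `match`."""
    digits = [int(ch) for ch in match if "0" <= ch <= "9"]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            # Double every second digit from the right; subtract 9 if > 9.
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return bool(digits) and total % 10 == 0


def mod97(match: str) -> bool:
    """IBAN MOD-97 check: rearrange, convert letters, remainder must be 1."""
    compact = "".join(match.split()).upper()
    if len(compact) < 5 or not compact.isascii() or not compact.isalnum():
        return False
    rearranged = compact[4:] + compact[:4]
    return int("".join(str(int(ch, 36)) for ch in rearranged)) % 97 == 1


# Built-in validators named in the proposal; "none" keeps regex-only behaviour.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "none": lambda match: True,
    "luhn": luhn,
    "mod97": mod97,
}


def guardrail_fires(
    text: str, pattern: str, validator: str = "none", flags: int = 0
) -> bool:
    """Fire only if some regex match also passes the chosen validator."""
    check = VALIDATORS[validator]
    return any(check(m.group(0)) for m in re.finditer(pattern, text, flags))
```

<p>With the default <code>none</code> validator the function behaves like a plain regex guardrail, so existing configurations would keep working unchanged under this design.</p>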
<p><strong>Recommendation:</strong> Option A is cleaner because credit card and IBAN detection have different configuration needs (BIN prefixes vs country codes) and the plugins can have purpose-built configuration forms with clear labels. Option B is simpler but mixes different concerns into one plugin.</p>
<h3 id="summary-remaining-tasks">Remaining tasks</h3>
<ul>
<li>Decide on approach (Option A vs Option B)</li>
<li>Implement the Luhn validation logic in PHP</li>
<li>Implement the MOD-97 validation logic in PHP</li>
<li>Add configuration forms with clear descriptions</li>
<li>Write tests (test data from the <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">gist</a> can be reused)</li>
<li>Update the PII recipe to use the new plugins (if Option A) or add validator config (if Option B)</li>
<li>Consider whether phone number validation (E.164 format) should also be included</li>
<li>Consider fixing the IBAN regex word-eating bug separately in the recipe</li>
</ul>
<h3>Related issues and projects</h3>
<ul>
<li><a href="https://www.drupal.org/project/ai_initiative/issues/3577498">#3577498</a>: [Meta] AI PII Guardrails Recipe - the recipe whose README documents the false positive problem</li>
<li><a href="https://www.drupal.org/project/ai_recipe_guardrails_pii">AI Guardrails PII recipe</a> - would be updated to use the new plugins</li>
<li><a href="https://www.drupal.org/project/ai/issues/3577790">#3577790</a>: Add validation to regex guardrail configuration (RTBC) - related guardrail improvement</li>
<li><a href="https://www.drupal.org/project/mi_preflight">MI Preflight</a> - separate module with AI-powered PII detection for content compliance (does not use the guardrail plugin system)</li>
<li><a href="https://www.drupal.org/project/analyze_ai_content_security_audit">Analyze AI Content Security Audit</a> - AI-powered content security scanning (separate from guardrails)</li>
</ul>
<h3 id="summary-ai-usage">AI usage (if applicable)</h3>
<p>[x] AI Assisted Issue<br>
This issue was generated with AI assistance, but was reviewed and refined by the creator.</p>
<p>[ ] AI Assisted Code<br>
This code was mainly generated by a human, with AI autocompleting or parts AI generated, but under full human supervision.</p>
<p>[ ] AI Generated Code<br>
This code was mainly generated by an AI with human guidance, and reviewed, tested, and refined by a human.</p>
<p>[ ] Vibe Coded<br>
This code was generated by an AI and has only been functionally tested.</p>