checksum-validated PII guardrail plugins (Luhn for credit cards, MOD-97 for IBANs)
>>> [!note] Migrated issue <!-- Drupal.org comment --> <!-- Migrated from issue #3580692. --> Reported by: [ajv009](https://www.drupal.org/user/3653917) >>> <p>[Tracker]<br> <strong>Update Summary: </strong>Feature request for checksum-validated PII guardrail plugins (Luhn for credit cards, MOD-97 for IBANs)<br> <strong>Short Description: </strong>Add guardrail plugins with algorithmic validation to reduce false positives in PII detection<br> <strong>Check-in Date: </strong>03/22/2026<br> <em>Metadata is used by the <a href="https://www.drupalstarforge.ai/" title="AI Tracker">AI Tracker.</a> Docs and additional fields <a href="https://www.drupalstarforge.ai/ai-dashboard/docs" title="AI Issue Tracker Documentation">here</a>.</em><br> [/Tracker]</p> <h3 id="summary-problem-motivation">Problem/Motivation</h3> <p>The current <code>regexp_guardrail</code> plugin detects PII by pattern shape only. It cannot check whether a matched string is actually a valid credit card number, IBAN, or other structured identifier. This causes a high rate of false positives that block legitimate AI responses.</p> <p>The <a href="https://www.drupal.org/project/ai_recipe_guardrails_pii">AI Guardrails PII recipe</a> documents this problem explicitly in its README. The false positives are not theoretical; they affect common content that LLMs generate regularly.</p> <p>I wrote a <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">PHP test script</a> that implements both checksum algorithms and tests them against real-world data. Run it with <code>php pii-checksum-validation-test.php</code> to reproduce everything below.</p> <h4>Credit card detection: regex vs regex + Luhn</h4> <p>The credit card regex <code>/(?&lt;!\d)(?:\d[\s-]?){12,19}\d(?!\d)/</code> matches any sequence of 13 to 20 digits, optionally separated by spaces or hyphens. The <a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> (ISO/IEC 7812-1) is the standard checksum used by all major card networks. 
Results from the test script:</p> <table> <tr> <th>Category</th> <th>Number</th> <th>Description</th> <th>Regex</th> <th>Luhn</th> <th>Correct?</th> </tr> <tr> <td>Real card</td> <td><code>4111111111111111</code></td> <td>Visa test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>4242424242424242</code></td> <td>Stripe Visa test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>5555555555554444</code></td> <td>Mastercard test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>2223003122003222</code></td> <td>Mastercard 2-series test</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>378282246310005</code></td> <td>Amex test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>6011111111111117</code></td> <td>Discover test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>3530111333300000</code></td> <td>JCB test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>30569309025904</code></td> <td>Diners Club test card</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>4111 1111 1111 1111</code></td> <td>Visa with spaces</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real card</td> <td><code>5555-5555-5555-4444</code></td> <td>Mastercard with dashes</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td colspan="6"></td> </tr> <tr> <td>ISBN</td> <td><code>9780131103627</code></td> <td>The C Programming Language</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>ISBN</td> <td><code>9780596009205</code></td> <td>Head First Design Patterns</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>ISBN</td> <td><code>978-3-16-148410-0</code></td> <td>ISBN-13 with hyphens</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> 
<tr> <td>ISBN</td> <td><code>9780062316110</code></td> <td>Thinking, Fast and Slow</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>ISBN</td> <td><code>9780735211292</code></td> <td>Atomic Habits</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Barcode</td> <td><code>4006381333931</code></td> <td>EAN-13: German product barcode</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Barcode</td> <td><code>5901234123457</code></td> <td>EAN-13: Polish product barcode</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Timestamp</td> <td><code>20250322143052</code></td> <td>2025-03-22 14:30:52</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Timestamp</td> <td><code>20230101000000</code></td> <td>2023-01-01 00:00:00</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Timestamp</td> <td><code>20260322150000</code></td> <td>2026-03-22 15:00:00</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Device ID</td> <td><code>353456789012345</code></td> <td>IMEI: mobile device identifier</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Device ID</td> <td><code>490154203237518</code></td> <td>IMEI: another device</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Tracking</td> <td><code>92748999985493569564</code></td> <td>USPS tracking number (20 digits)</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Math</td> <td><code>3141592653589793</code></td> <td>Pi digits</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Math</td> <td><code>2718281828459045</code></td> <td>Euler number</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Database ID</td> <td><code>1234567890123456</code></td> <td>Sequential 16-digit database ID</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Database ID</td> <td><code>9876543210987</code></td> <td>Sequential 
13-digit revision ID</td> <td>MATCH</td> <td>PASS</td> <td><strong>FP!</strong></td> </tr> <tr> <td>Financial</td> <td><code>021000021123456789</code></td> <td>US routing + account number</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Network</td> <td><code>1921681001721631</code></td> <td>Two IP addresses concatenated</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Numeric</td> <td><code>1111111111111111</code></td> <td>All 1s (16 digits)</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> </table> <p><strong>Result: 21 false positives with regex only, 7 with regex + Luhn. Luhn eliminates 66.7% of false positives.</strong> The 7 surviving false positives (marked FP!) are coincidental Luhn matches: about 10% of random numbers pass Luhn, which is <a href="https://stripe.com/resources/more/how-to-use-the-luhn-algorithm-a-guide-in-applications-for-businesses">consistent with the mathematical expectation</a>. These can be further reduced by adding <a href="https://en.wikipedia.org/wiki/Payment_card_number#Issuer_identification_number_(IIN)">BIN/IIN prefix validation</a> (Visa starts with 4 and is typically 16 digits, Mastercard starts with 51-55 or 2221-2720 and is 16 digits, etc.).</p> <h4>IBAN detection: regex vs regex + MOD-97</h4> <p>The IBAN regex <code>/(?&lt;!\w)[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}(?!\w)/i</code> matches any two letters followed by two digits and 11 to 30 alphanumeric characters, each optionally preceded by a whitespace character. 
The <a href="https://en.wikipedia.org/wiki/International_Bank_Account_Number#Validating_the_IBAN">MOD-97 algorithm</a> (ISO 7064) is the standard checksum all real IBANs must pass:</p> <table> <tr> <th>Category</th> <th>String</th> <th>Description</th> <th>Regex</th> <th>MOD-97</th> <th>Correct?</th> </tr> <tr> <td>Real IBAN</td> <td><code>DE89370400440532013000</code></td> <td>Germany (Deutsche Bank)</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>GB29NWBK60161331926819</code></td> <td>UK (NatWest)</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>FR7630006000011234567890189</code></td> <td>France</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>NL91ABNA0417164300</code></td> <td>Netherlands (ABN AMRO)</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>ES9121000418450200051332</code></td> <td>Spain</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>BE68539007547034</code></td> <td>Belgium</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>CH9300762011623852957</code></td> <td>Switzerland</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>AT611904300234573201</code></td> <td>Austria</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>PL61109010140000071219812874</code></td> <td>Poland</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td>Real IBAN</td> <td><code>IE29AIBK93115212345678</code></td> <td>Ireland</td> <td>MATCH</td> <td>PASS</td> <td>YES</td> </tr> <tr> <td colspan="6"></td> </tr> <tr> <td>Config ID</td> <td><code>AB12CDEF1234567890</code></td> <td>Drupal module machine name</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Config ID</td> <td><code>AI01GUARDRAIL000001</code></td> <td>AI module config entity ID</td> <td>MATCH</td> <td>FAIL</td> 
<td>YES</td> </tr> <tr> <td>Build/VCS</td> <td><code>VR12BUILD2024032201</code></td> <td>Software version/build string</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Serial</td> <td><code>SN01DELL2024XPS1599</code></td> <td>Dell serial number</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Serial</td> <td><code>HP15LAPTOP20240101AB</code></td> <td>HP product serial number</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Error code</td> <td><code>EX12ABCDEF12345678</code></td> <td>Exception/error reference code</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Tracking</td> <td><code>DH12SHIPMENT0012345</code></td> <td>DHL-like tracking code</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Tracking</td> <td><code>FE95EXPRESS00054321</code></td> <td>FedEx-like tracking code</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>API key</td> <td><code>SK01LIVE0000TESTKEY1</code></td> <td>Stripe-like API key prefix</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Tax ref</td> <td><code>TX45ASSESSMENT00001</code></td> <td>Tax assessment reference</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Hex ID</td> <td><code>FF00CC99AA88BB77EE</code></td> <td>Hex color/identifier string</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> <tr> <td>Crypto</td> <td><code>BC10QW20ER30TY40UI50</code></td> <td>Crypto-like address fragment</td> <td>MATCH</td> <td>FAIL</td> <td>YES</td> </tr> </table> <p><strong>Result: 12 false positives with regex only, 0 with regex + MOD-97. MOD-97 eliminates 100% of false positives.</strong> The theoretical random pass rate for MOD-97 is only 1/97 (~1%).</p> <h4>Bonus finding: IBAN regex word-eating bug</h4> <p>The <code>(?:\s?[A-Z0-9])</code> group in the IBAN regex treats <code>space + letter</code> as a valid IBAN character. 
When an IBAN appears mid-sentence, the regex consumes the words that follow:</p> <pre>
Input:   "IBAN DE89370400440532013000 is valid"
Matches: "DE89370400440532013000 is valid"
                                ^^^^^^^^^^ consumed by regex
</pre><p>This is because <code>\s?</code> inside the repeating group lets the pattern match a space followed by a letter from the next word. This may warrant a separate fix in the recipe's regex pattern.</p> <h3>Industry context</h3> <p>These checksum algorithms are industry standards, not novel ideas:</p> <p><strong>Luhn algorithm (ISO/IEC 7812-1) for credit cards:</strong></p> <ul> <li>Used by all major card networks; payment processors such as <a href="https://stripe.com/resources/more/how-to-use-the-luhn-algorithm-a-guide-in-applications-for-businesses">Stripe</a>, Adyen, and Square all use Luhn as their first validation step</li> <li>Trivially implementable in PHP (roughly 10 lines of code, <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">see test script</a>)</li> <li>Can be combined with <a href="https://en.wikipedia.org/wiki/Payment_card_number#Issuer_identification_number_(IIN)">BIN/IIN prefix validation</a> for even higher accuracy</li> </ul> <p><strong>MOD-97 algorithm (ISO 7064) for IBANs:</strong></p> <ul> <li>Moves the first 4 characters (country code + check digits) to the end, converts each letter to a number (A=10 through Z=35), and checks that the resulting integer leaves a remainder of 1 when divided by 97</li> <li>Can be combined with <a href="https://en.wikipedia.org/wiki/International_Bank_Account_Number#IBAN_formats_by_country">country-specific length validation</a> (DE=22, FR=27, GB=22, etc.)</li> </ul> <p>Major PII detection frameworks use a similar layered approach:</p> <ul> <li><a href="https://github.com/microsoft/presidio">Microsoft Presidio</a> (open-source): <a href="https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/credit_card_recognizer.py"><code>CreditCardRecognizer</code></a> sets confidence to 1.0 after passing Luhn, 0.0 if it fails. 
<a href="https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/iban_recognizer.py"><code>IbanRecognizer</code></a> validates MOD-97.</li> <li><a href="https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html">AWS Comprehend</a>: ML-based detection with high confidence scoring for structured PII types</li> <li><a href="https://cloud.google.com/sensitive-data-protection/docs/concepts-infotypes">Google Cloud DLP</a>: Pattern matching + checksum validation + context-aware likelihood scoring</li> <li><a href="https://protectai.com/llm-guard">LLM Guard</a> (Protect AI): Combines regex patterns with validation logic for structured PII</li> </ul> <h3 id="summary-proposed-resolution">Proposed resolution</h3> <p>Add checksum-validated guardrail plugins to the AI module. Two possible approaches:</p> <p><strong>Option A: New dedicated guardrail plugins</strong></p> <p>Ship two new guardrail plugins alongside <code>regexp_guardrail</code>:</p> <ul> <li><code>CreditCardGuardrail</code>: Regex matching + Luhn checksum + optional BIN prefix validation. Configuration: violation message, optional list of card network prefixes to match.</li> <li><code>IbanGuardrail</code>: Regex matching + MOD-97 checksum + optional country code allowlist. Configuration: violation message, optional list of accepted country codes.</li> </ul> <p>These would be deterministic (no LLM cost), fast, and significantly more accurate than regex alone.</p> <p><strong>Option B: Add a validator callback to regexp_guardrail</strong></p> <p>Extend <code>regexp_guardrail</code> with an optional "post-match validator" config. Built-in validators: <code>luhn</code>, <code>mod97</code>, <code>none</code> (default). The regex runs first, and if it matches, the validator runs as a second check. 
Only if both pass does the guardrail fire.</p> <p><strong>Recommendation:</strong> Option A is cleaner because credit card and IBAN detection have different configuration needs (BIN prefixes vs country codes) and the plugins can have purpose-built configuration forms with clear labels. Option B is simpler but mixes different concerns into one plugin.</p> <h3 id="summary-remaining-tasks">Remaining tasks</h3> <ul> <li>Decide on approach (Option A vs Option B)</li> <li>Implement the Luhn validation logic in PHP</li> <li>Implement the MOD-97 validation logic in PHP</li> <li>Add configuration forms with clear descriptions</li> <li>Write tests (test data from the <a href="https://gist.github.com/AJV009/04ccd05ad79998d4cfc3426505dff890">gist</a> can be reused)</li> <li>Update the PII recipe to use the new plugins (if Option A) or add validator config (if Option B)</li> <li>Consider whether phone number validation (E.164 format) should also be included</li> <li>Consider fixing the IBAN regex word-eating bug separately in the recipe</li> </ul> <h3>Related issues and projects</h3> <ul> <li><a href="https://www.drupal.org/project/ai_initiative/issues/3577498">#3577498</a>: [Meta] AI PII Guardrails Recipe - the recipe whose README documents the false positive problem</li> <li><a href="https://www.drupal.org/project/ai_recipe_guardrails_pii">AI Guardrails PII recipe</a> - would be updated to use the new plugins</li> <li><a href="https://www.drupal.org/project/ai/issues/3577790">#3577790</a>: Add validation to regex guardrail configuration (RTBC) - related guardrail improvement</li> <li><a href="https://www.drupal.org/project/mi_preflight">MI Preflight</a> - separate module with AI-powered PII detection for content compliance (does not use the guardrail plugin system)</li> <li><a href="https://www.drupal.org/project/analyze_ai_content_security_audit">Analyze AI Content Security Audit</a> - AI-powered content security scanning (separate from guardrails)</li> </ul> <h3 
id="summary-ai-usage">AI usage (if applicable)</h3> <p>[x] AI Assisted Issue<br> This issue was generated with AI assistance, but was reviewed and refined by the creator.</p> <p>[ ] AI Assisted Code<br> This code was mainly generated by a human, with AI autocompleting or parts AI generated, but under full human supervision.</p> <p>[ ] AI Generated Code<br> This code was mainly generated by an AI with human guidance, and reviewed, tested, and refined by a human.</p> <p>[ ] Vibe Coded<br> This code was generated by an AI and has only been functionally tested.</p>
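<h3>Appendix: checksum validation sketch</h3> <p>The Luhn and MOD-97 checks described under Industry context, together with the "regex first, validator second" flow from the proposed resolution, can be sketched roughly as follows. This is a minimal illustration in Python. It is not the PHP from the linked gist and not an existing API of the AI module. The names <code>luhn_valid</code>, <code>mod97_valid</code>, and <code>find_card_numbers</code> are hypothetical; the card regex is the one quoted in Problem/Motivation.</p>

```python
import re

# Card regex as quoted in Problem/Motivation: 13 to 20 digits,
# optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)(?:\d[\s-]?){12,19}\d(?!\d)")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    compact = re.sub(r"[\s-]", "", candidate)
    if not (compact.isascii() and compact.isdigit()) or len(compact) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def mod97_valid(candidate: str) -> bool:
    """Return True if `candidate` passes the IBAN MOD-97 check."""
    compact = re.sub(r"\s+", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]+", compact):
        return False
    # Move country code + check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Map 0-9 to themselves and A-Z to 10-35 (base-36 digit values).
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_card_numbers(text: str) -> list[str]:
    """Hypothetical helper: regex match first, Luhn as the second check."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]


if __name__ == "__main__":
    sample = "Order 4111 1111 1111 1111 shipped, ISBN 9780131103627"
    print(find_card_numbers(sample))
    print(mod97_valid("DE89370400440532013000"))
```

<p>In this sketch the ISBN is matched by the regex but dropped by the Luhn step, which is the behavior the tables above describe. Country-specific IBAN lengths and BIN/IIN prefixes are deliberately left out; a real plugin would add them as separate, configurable checks.</p>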