fix: #3591904 Narrow restore chain-write lock via event-first ordering (!10) · Merge requests · project / audit_trail

Summary

Narrows ChainArchiver::restore()'s chain-write lock from "held for the full row-replay loop + segment UPDATE" down to "held only for the brief segment_restored event INSERT". Concurrent logger writes on the same chain now proceed unblocked during multi-million-row restores.

Sibling to !9 (merged) / #3591896 (which did the same narrowing for archiveSegment()).

Closes #3591904.

Problem

restore() previously acquired the chain-write lock and held it for:

The full row-replay loop (one INSERT per archived row).
The bounded segment_restored chain event emission.
The bounded segment row UPDATE.

For million-row archives the loop outlived the 30s lock TTL, so #3591878 introduced a renewer pattern. That kept the lock alive but did not solve the underlying issue: concurrent logger writes on the same chain are blocked for the full restore duration.

The same narrowing applied to archiveSegment() in #3591896 could not be applied to restore() because the row-replay loop has to run inside a DB transaction (atomicity over partial inserts), and acquiring the chain-write lock inside a transaction holds an InnoDB X-lock on the semaphore row for the full transaction duration -- the chain-write lock cannot be moved to wrap a narrow chain event INSERT inside the transaction without re-introducing exactly the DB-level blocking we want to avoid.

Approach: event-first ordering

The key insight is that the verifier already accepts every transitional state of a partially-applied restore without modification:

segment_restored is non-cross-checkable (no *_event_id column to match against on the segment row).
The verifier does not require rows in [from_id, to_id] to be absent when segment.live_purged_at != 0; it walks visible rows and validates their hash + previous_hash linkage like any other row.
The existing live-purge supersession rule already accepts segment.live_purged_event_id pointing at a later segment_restored event (the post-Step-3 state).

This lets restore() decompose into three steps with very different lock scopes:

Step	What	Lock	Why
0 + 1	Reject if in-flight event exists; emit `segment_restored` chain event	Chain-write lock (brief)	Guards chain anchor + per-chain id sequence for the single INSERT. Step 0 lookup inside the lock so two concurrent same-segment restores cannot both pass + both emit.
2	Replay rows + acks in a DB transaction	None	Logger writes proceed unblocked; archived ids are always below the live head, so logger AUTO_INCREMENT ids never collide with explicit-id restore INSERTs.
3	Pointer move on segment row (`live_purged_at = 0`, `live_purged_event_id`, re-signed `lifecycle_hmac`)	None	Idempotent UPDATE on one row.

The chain attestation is committed BEFORE rows land in the live table. This is the property that lets Step 2 run without holding the chain-write lock.

Step 0 reject

If a previous restore() call crashed between Step 1 and Step 3, the chain has a segment_restored event referencing the segment, but the segment row's live_purged_event_id still points at the pre-restore purge event. A retry detects this via:

SELECT id FROM audit_trail
WHERE chain = :chain
  AND action = 'segment_restored'
  AND resource = 'segment:<id>'
  AND id > segment.live_purged_event_id
LIMIT 1

If a row exists, restore() refuses with a message naming the in-flight event id and the [from_id, to_id] window. The operator inspects the partial state and finalizes manually; the API does not offer an automated resume path.

Operational impact

During a multi-million-row restore, concurrent logger writes on the same chain proceed unblocked. Same property #3591896 unlocked for archive.
Same-segment concurrent restore is caught up front by Step 0, not by mid-flight collisions.
Crashed restores leave a detectable signature (the segment_restored event without finalized segment row) and are rejected on retry.

Deleted

CHAIN_WRITE_LOCK_TTL_S constant.
CHAIN_WRITE_LOCK_RENEW_INTERVAL_S constant.
renewChainWriteLockIfStale() method.

No long-running chain-write lock means no renewer.

What is NOT in this MR

No verifier changes. The transitional state verifies cleanly against the existing rules.
No schema change. The detection signature is "segment_restored event id > segment.live_purged_event_id", which uses existing columns.
No automated resume / recovery command. Partial-state recovery remains a manual operator path; can be added in a follow-up if operationally needed.

Test plan

phpcs clean on audit_trail (Drupal + DrupalPractice standards).
phpstan clean on touched files.
Full kernel suite green: 233/233 (was 232 + 1 new test).
New test testRestoreStep0RejectsRetryAfterPartialRestoreLeavesEvent pins the partial-state error wrap message AND the Step 0 retry refusal in one flow.
Pre-existing tests pass unchanged (PurgeRestoreTest, ChainArchiverTest, CronArchiveHookTest); docstrings updated where they referenced the now-removed restoreLocked() method.

Docs

docs/architecture.md: new "Restore ordering: event first, rows second, segment row last" subsection.
docs/verification.md: new "Mid-restore transitional states verify cleanly" paragraph documenting why no verifier rule was added.

Edited May 25, 2026 by Frank Mably

fix: #3591904 Narrow restore chain-write lock via event-first ordering