fix: #3591904 Narrow restore chain-write lock via event-first ordering

Summary

Narrows ChainArchiver::restore()'s chain-write lock from "held for the full row-replay loop + segment UPDATE" down to "held only for the brief segment_restored event INSERT". Concurrent logger writes on the same chain now proceed unblocked during multi-million-row restores.

Sibling to !9 (merged) / #3591896 (which did the same narrowing for archiveSegment()).

Closes #3591904.

Problem

restore() previously acquired the chain-write lock and held it for:

  1. The full row-replay loop (one INSERT per archived row).
  2. The bounded segment_restored chain event emission.
  3. The bounded segment row UPDATE.

For million-row archives the loop outlived the 30s lock TTL, so #3591878 introduced a renewer pattern. That kept the lock alive but did not solve the underlying issue: concurrent logger writes on the same chain are blocked for the full restore duration.

The same narrowing applied to archiveSegment() in #3591896 could not be applied to restore() because the row-replay loop has to run inside a DB transaction (atomicity over partial inserts), and acquiring the chain-write lock inside a transaction holds an InnoDB X-lock on the semaphore row for the full transaction duration -- the chain-write lock cannot be moved to wrap a narrow chain event INSERT inside the transaction without re-introducing exactly the DB-level blocking we want to avoid.

Approach: event-first ordering

The key insight is that the verifier already accepts every transitional state of a partially-applied restore without modification:

  • segment_restored is non-cross-checkable (no *_event_id column to match against on the segment row).
  • The verifier does not require rows in [from_id, to_id] to be absent when segment.live_purged_at != 0; it walks visible rows and validates their hash + previous_hash linkage like any other row.
  • The existing live-purge supersession rule already accepts segment.live_purged_event_id pointing at a later segment_restored event (the post-Step-3 state).

This lets restore() decompose into three steps with very different lock scopes:

Step What Lock Why
0 + 1 Reject if in-flight event exists; emit segment_restored chain event Chain-write lock (brief) Guards chain anchor + per-chain id sequence for the single INSERT. Step 0 lookup inside the lock so two concurrent same-segment restores cannot both pass + both emit.
2 Replay rows + acks in a DB transaction None Logger writes proceed unblocked; archived ids are always below the live head, so logger AUTO_INCREMENT ids never collide with explicit-id restore INSERTs.
3 Pointer move on segment row (live_purged_at = 0, live_purged_event_id, re-signed lifecycle_hmac) None Idempotent UPDATE on one row.

The chain attestation is committed BEFORE rows land in the live table. This is the property that lets Step 2 run without holding the chain-write lock.

Step 0 reject

If a previous restore() call crashed between Step 1 and Step 3, the chain has a segment_restored event referencing the segment, but the segment row's live_purged_event_id still points at the pre-restore purge event. A retry detects this via:

SELECT id FROM audit_trail
WHERE chain = :chain
  AND action = 'segment_restored'
  AND resource = 'segment:<id>'
  AND id > segment.live_purged_event_id
LIMIT 1

If a row exists, restore() refuses with a message naming the in-flight event id and the [from_id, to_id] window. The operator inspects the partial state and finalizes manually; the API does not offer an automated resume path.

Operational impact

  • During a multi-million-row restore, concurrent logger writes on the same chain proceed unblocked. Same property #3591896 unlocked for archive.
  • Same-segment concurrent restore is caught up front by Step 0, not by mid-flight collisions.
  • Crashed restores leave a detectable signature (the segment_restored event without finalized segment row) and are rejected on retry.

Deleted

  • CHAIN_WRITE_LOCK_TTL_S constant.
  • CHAIN_WRITE_LOCK_RENEW_INTERVAL_S constant.
  • renewChainWriteLockIfStale() method.

No long-running chain-write lock means no renewer.

What is NOT in this MR

  • No verifier changes. The transitional state verifies cleanly against the existing rules.
  • No schema change. The detection signature is "segment_restored event id > segment.live_purged_event_id", which uses existing columns.
  • No automated resume / recovery command. Partial-state recovery remains a manual operator path; can be added in a follow-up if operationally needed.

Test plan

  • phpcs clean on audit_trail (Drupal + DrupalPractice standards).
  • phpstan clean on touched files.
  • Full kernel suite green: 233/233 (was 232 + 1 new test).
  • New test testRestoreStep0RejectsRetryAfterPartialRestoreLeavesEvent pins the partial-state error wrap message AND the Step 0 retry refusal in one flow.
  • Pre-existing tests pass unchanged (PurgeRestoreTest, ChainArchiverTest, CronArchiveHookTest); docstrings updated where they referenced the now-removed restoreLocked() method.

Docs

  • docs/architecture.md: new "Restore ordering: event first, rows second, segment row last" subsection.
  • docs/verification.md: new "Mid-restore transitional states verify cleanly" paragraph documenting why no verifier rule was added.
Edited by Frank Mably

Merge request reports

Loading