fix: #3591904 Narrow restore chain-write lock via event-first ordering
Summary
Narrows ChainArchiver::restore()'s chain-write lock from "held for the full row-replay loop + segment UPDATE" down to "held only for the brief segment_restored event INSERT". Concurrent logger writes on the same chain now proceed unblocked during multi-million-row restores.
Sibling to !9 (merged) / #3591896 (which did the same narrowing for archiveSegment()).
Closes #3591904.
Problem
restore() previously acquired the chain-write lock and held it for:
- The full row-replay loop (one
INSERTper archived row). - The bounded
segment_restoredchain event emission. - The bounded segment row UPDATE.
For million-row archives the loop outlived the 30s lock TTL, so #3591878 introduced a renewer pattern. That kept the lock alive but did not solve the underlying issue: concurrent logger writes on the same chain are blocked for the full restore duration.
The same narrowing applied to archiveSegment() in #3591896 could not be applied to restore() because the row-replay loop has to run inside a DB transaction (atomicity over partial inserts), and acquiring the chain-write lock inside a transaction holds an InnoDB X-lock on the semaphore row for the full transaction duration -- the chain-write lock cannot be moved to wrap a narrow chain event INSERT inside the transaction without re-introducing exactly the DB-level blocking we want to avoid.
Approach: event-first ordering
The key insight is that the verifier already accepts every transitional state of a partially-applied restore without modification:
segment_restoredis non-cross-checkable (no*_event_idcolumn to match against on the segment row).- The verifier does not require rows in
[from_id, to_id]to be absent whensegment.live_purged_at != 0; it walks visible rows and validates theirhash+previous_hashlinkage like any other row. - The existing live-purge supersession rule already accepts
segment.live_purged_event_idpointing at a latersegment_restoredevent (the post-Step-3 state).
This lets restore() decompose into three steps with very different lock scopes:
| Step | What | Lock | Why |
|---|---|---|---|
| 0 + 1 | Reject if in-flight event exists; emit segment_restored chain event |
Chain-write lock (brief) | Guards chain anchor + per-chain id sequence for the single INSERT. Step 0 lookup inside the lock so two concurrent same-segment restores cannot both pass + both emit. |
| 2 | Replay rows + acks in a DB transaction | None | Logger writes proceed unblocked; archived ids are always below the live head, so logger AUTO_INCREMENT ids never collide with explicit-id restore INSERTs. |
| 3 | Pointer move on segment row (live_purged_at = 0, live_purged_event_id, re-signed lifecycle_hmac) |
None | Idempotent UPDATE on one row. |
The chain attestation is committed BEFORE rows land in the live table. This is the property that lets Step 2 run without holding the chain-write lock.
Step 0 reject
If a previous restore() call crashed between Step 1 and Step 3, the chain has a segment_restored event referencing the segment, but the segment row's live_purged_event_id still points at the pre-restore purge event. A retry detects this via:
SELECT id FROM audit_trail
WHERE chain = :chain
AND action = 'segment_restored'
AND resource = 'segment:<id>'
AND id > segment.live_purged_event_id
LIMIT 1If a row exists, restore() refuses with a message naming the in-flight event id and the [from_id, to_id] window. The operator inspects the partial state and finalizes manually; the API does not offer an automated resume path.
Operational impact
- During a multi-million-row restore, concurrent logger writes on the same chain proceed unblocked. Same property #3591896 unlocked for archive.
- Same-segment concurrent restore is caught up front by Step 0, not by mid-flight collisions.
- Crashed restores leave a detectable signature (the
segment_restoredevent without finalized segment row) and are rejected on retry.
Deleted
CHAIN_WRITE_LOCK_TTL_Sconstant.CHAIN_WRITE_LOCK_RENEW_INTERVAL_Sconstant.renewChainWriteLockIfStale()method.
No long-running chain-write lock means no renewer.
What is NOT in this MR
- No verifier changes. The transitional state verifies cleanly against the existing rules.
- No schema change. The detection signature is "
segment_restoredevent id >segment.live_purged_event_id", which uses existing columns. - No automated resume / recovery command. Partial-state recovery remains a manual operator path; can be added in a follow-up if operationally needed.
Test plan
-
phpcsclean onaudit_trail(Drupal + DrupalPractice standards). -
phpstanclean on touched files. - Full kernel suite green: 233/233 (was 232 + 1 new test).
- New test
testRestoreStep0RejectsRetryAfterPartialRestoreLeavesEventpins the partial-state error wrap message AND the Step 0 retry refusal in one flow. - Pre-existing tests pass unchanged (PurgeRestoreTest, ChainArchiverTest, CronArchiveHookTest); docstrings updated where they referenced the now-removed
restoreLocked()method.
Docs
docs/architecture.md: new "Restore ordering: event first, rows second, segment row last" subsection.docs/verification.md: new "Mid-restore transitional states verify cleanly" paragraph documenting why no verifier rule was added.