fix: #3591896 Narrow archiveSegment() chain-write lock scope

Fix for #3591896

Narrows the chain-write lock scope on ChainArchiver::archiveSegment() so the long Phase A (NDJSON write loop + hash_file() + HMAC computations) runs unlocked, and only the bounded Phase B (segment row UPDATE + segment_archived chain event INSERT + rename(temp, final)) holds the lock.

Replaces the TTL renewal pattern shipped in #3591878 for archive. Restore is NOT narrowed in this MR — see the deferred section below and follow-up #3591904.

Approach

archiveSegment (narrowed)

Before: lock acquired around the full archive operation including writeNdjson + hash_file (data-dependent runtime). Multi-million-row segments needed TTL renewal to avoid lock expiry mid-loop.

After:

Phase A (no lock):
  writeNdjson()                  // long row loop, disjoint id window
  appendArchiveRecordFooter()
  hash_file('sha256')
  archiveContentHmac()
  lifecycleHmac(initial)

Phase B (chain-write lock held + DB transaction):
  UPDATE audit_trail_segment
    WHERE archived_at = 0          // optimistic concurrency predicate
  chainedWrite(SegmentLifecycleAction::ARCHIVED, acquire_lock: FALSE)
  UPDATE audit_trail_segment       // finalize with archived_event_id
  rename(temp, final)
COMMIT

Same-segment concurrent archive: caught by the WHERE predicate

If cron and drush both pass the early loadSegmentRow() check on the same segment (no lock around it) and both reach Phase B, the second caller's UPDATE affects 0 rows. The catch throws cleanly, rolls back, unlinks the temp file. First wins, second errors with Segment #N: concurrent archive landed.

Operational benefit

While a long archive is running, logger writes and other chain-write ops on the chain proceed unblocked. No more tens-of-seconds outages on user-facing audit log writes during multi-million-row archives.

What restoreLocked() looks like (unchanged in this MR)

Attempted to narrow restore with the same pattern. Audit surfaced a fundamental constraint: the DB transaction has to wrap the row-replay loop + chain-event INSERT + segment UPDATE for atomicity. Three placements for the lock relative to the transaction were considered:

  • Lock inside transaction: Drupal's DatabaseLockBackend shares the DB connection → semaphore INSERT joins the transaction → InnoDB X-lock on semaphore row held for the full transaction. Concurrent logger writes block at the DB level (below Drupal's retry loop) and surface raw lock_wait_timeout errors after 50s. PHP-side release runs early but has no effect on DB-level mutual exclusion.
  • Lock before transaction (current behavior): lock effectively held for the full transaction anyway, via either the renewer or chainedWrite's same-holder re-acquire which also joins the transaction. Same operational behavior as MR !8 (merged)'s wide-lock pattern.
  • Two transactions (T1 row replay, T2 chain event + segment UPDATE): breaks atomicity. Partial-state failure (T1 committed, T2 didn't) looks indistinguishable from a tamper at the verifier level.

No placement preserves both atomicity and DB-level narrowing without a new per-segment lifecycle marker. Tracked at #3591904 with the proposed restore_in_progress schema change + verifier tolerance + retry-detection design.

restoreLocked() in this MR keeps the wide-lock + TTL-renewal pattern from #3591878 unchanged.

Companion correctness fixes (surfaced during the audit)

Two chainedWrite calls inside DB transactions were missing acquire_lock: FALSE:

  • ChainArchiver::purge() line ~1035 (segment_live_purged event)
  • ChainArchiver::restoreLocked() line ~2571 (segment_restored event)

The outer chain-write lock is already held by both methods. Default acquire_lock: TRUE caused a same-holder MERGE on the semaphore row inside the transaction, holding the InnoDB X-lock on that row for the transaction duration. For purge() the impact was negligible (transaction is bounded); for restoreLocked() it contributed to the same lock-during-transaction issue tracked at #3591904. Both fixed in this MR.

Reverted from #3591878 in this MR

  • writeNdjson() signature dropped $lock_name and $renewed_at parameters; no renewer call in its loop anymore.
  • archiveSegment no longer passes a $renewed_at into writeNdjson.

Kept from #3591878 (needed by restoreLocked until #3591904 lands)

  • renewChainWriteLockIfStale() helper
  • CHAIN_WRITE_LOCK_TTL_S / CHAIN_WRITE_LOCK_RENEW_INTERVAL_S constants
  • Renewer call inside restoreLocked's per-row loop

All three are scheduled for deletion in #3591904.

Verification

  • Full kernel suite: 232 passing, 985 assertions.
  • phpcs clean on touched files (the 2 pre-existing AuditTrailSettingsForm non-capturing-catch errors are out of scope).
  • FF-merges cleanly against 1.x.
Edited by Frank Mably

Merge request reports

Loading