Problem/Motivation
This is a follow-up to #3591878 (which shipped the TTL-renewal pattern on the chain-write lock to keep long archive + restore operations from outliving the lock's 30s TTL).
The renewal pattern is a workaround. The underlying issue is that the chain-write lock is held for the entire archive operation when it is only strictly needed for the brief chain-event emission at the end. The long row-loop in the middle does not produce any write that can conflict with concurrent chain writers, and the DB-level UNIQUE(chain, previous_hash) index prevents chain fork regardless. Narrowing the lock scope on archiveSegment() solves the TTL problem at the design level rather than patching around it, AND removes a real operational issue: while the over-broad lock is held, every chain-write op on the chain blocks or fails. On a large archive, that means user-facing audit log writes drop for tens of seconds.
Scope
This issue narrows archiveSegment() only. restoreLocked() is NOT included -- see the "Why restore is not narrowed here" section below for the design constraint and the open follow-up.
Analysis
ChainArchiver::archiveSegment() (via writeNdjson()) holds the chain-write lock for a data-dependent duration:
- Verify segment state (read-only, no lock needed).
- Acquire chain-write lock.
- Per-row loop:
writeNdjson()writes one NDJSON line per row in [from_id, to_id], plushash_file('sha256')on the whole file. - Update segment row lifecycle stamps.
- Emit
segment_archivedchain event at the live head (one INSERT intoaudit_trail). - Rename temp NDJSON to its final path.
- Release lock.
Of these, only steps 4-5 actually need the chain-write lock for chain-integrity reasons. The row loop in step 3 operates on a past id window that does not intersect the head; concurrent logger writes land at the head with the current head's hash as their previous_hash, and the archive's NDJSON file captures the pre-loop snapshot. The two cannot fork. The UNIQUE(chain, previous_hash) index is the load-bearing protection.
Operational impact
While the chain-write lock is held by archive, every other code path on the same chain blocks or fails:
- Logger writes (
AuditTrailChainWriter::writeChainedRow()): block up to 5s then drop with drop-counter bump and throw. - Cron archive / purge / file-purge / transient-purge / orphan-heal / createBareSegment / transientPurgeSegment: block up to 30s then throw.
safelyRunPasscatches each but the chain is skipped for that cron tick.
For a 1M-row archive (~100s), this is effectively a 100-second outage of all chain-write activity on the chain. With the TTL-renewal pattern in #3591878, the outage is preserved-by-design: renewal is what keeps the lock held that long.
Proposed resolution
Refactor archiveSegment() to acquire the chain-write lock only around the bounded Phase B (segment row UPDATE + segment_archived chain event INSERT + rename). The long Phase A (writeNdjson + hash_file + HMAC computation) runs without the lock. Concurrent-archive-of-the-same-segment race is caught by an optimistic WHERE archived_at = 0 predicate on the Phase B UPDATE -- second caller sees affected_rows = 0, throws, rolls back, unlinks temp file.
Why restore is NOT narrowed here
Initially this issue proposed narrowing restoreLocked() with the same pattern. Audit review surfaced a fundamental constraint: restoreLocked() needs a single DB transaction wrapping the row-replay loop AND the segment_restored chain event INSERT AND the segment row UPDATE -- atomicity is required for chain integrity, and a partial restore (rows present but no segment_restored event) is indistinguishable from a tamper at the verifier level.
Three placements were considered:
- Lock acquired inside the transaction: Drupal's
DatabaseLockBackendshares the same DB connection, so the semaphore INSERT joins the transaction. The InnoDB row-level lock on the semaphore row is held for the full transaction duration. Concurrent logger writes block at the DB level (below Drupal's controlled retry) and surface rawlock_wait_timeouterrors after 50s, bypassing the drop-counter contract. PHP-side lock release runs early but has no effect on the DB-level mutual exclusion. - Lock acquired before the transaction: lock held for the full transaction duration anyway. Same operational behavior as #3591878's wide-lock pattern. No narrowing achieved.
- Two transactions (T1 row replay, T2 chain event + segment UPDATE): breaks atomicity. If T2 fails after T1 committed, rows are restored but the segment row still says
live_purged_at != 0. Verifier flags this as a tamper. Operator retry hits PRIMARY KEY conflicts on the already-restored rows.
None of the three preserves both atomicity and DB-level narrowing without additional infrastructure. The two-transaction option becomes viable with a per-segment restore_in_progress lifecycle stamp + verifier tolerance for that intermediate state + restore retry detection of "T1 already committed" -- a schema change and verifier update that is out of scope here.
Open follow-up: #3591904 (to be filed) -- tracks the restore-narrowing work with the schema + verifier design.
Related
- #3591878 (parent: shipped the TTL-renewal workaround that this issue's archive half replaces with a proper scope fix; the restore half stays as-is for now).
- #3591871 (grandparent: introduced the renewal pattern on the coverage lock).
Remaining tasks
- Refactor
archiveSegment()to narrow the chain-write lock scope (Phase A no lock, Phase B with lock). - Revert the
writeNdjson()signature change (drop$lock_nameand$renewed_atparameters, drop renewer calls in the loop). - Keep
renewChainWriteLockIfStale(),CHAIN_WRITE_LOCK_TTL_S,CHAIN_WRITE_LOCK_RENEW_INTERVAL_S: now used only byrestoreLocked()'s row-replay loop pending the follow-up. - Restore-side changes from #3591878 stay unchanged.
- Kernel test: long-archive plus concurrent logger write, verifier accepts the resulting chain.
API changes
None for external callers.
Data model changes
None.
Issue fork audit_trail-3591896
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
mably commentedComment #3
mably commentedComment #5
mably commentedComment #6
mably commentedComment #8
mably commented