Problem/Motivation
The module uses six distinct lock names. After the deadline-bug fix and renewal pattern introduced in #3591871, an audit of every remaining lock surfaced two issues that deserve attention as a follow-up.
Lock inventory
Each row: lock name -> acquirer -> typical critical-section duration -> TTL -> status.
audit_trail.write:[chain]viaChainArchiver::acquireChainWriteLock(). Most callers (purge / file-purge / createBareSegment / transientPurgeSegment) hold the lock for one bounded operation, ~50ms. Two callers run a per-row loop under the lock whose duration is data-dependent:archiveSegment()writes NDJSON line-by-line for the segment range, andrestoreLocked()re-INSERTs each line back into the live table. Both can exceed the 30s TTL on multi-million-row segments. Issue A below.audit_trail.coverage:[chain]viaChainArchiver::acquireCoverageLock(), up to 50 INSERTs. TTL 60s + 20s renewal. Fixed in #3591871.audit_trail.write:[chain]viaAuditTrailChainWriter::writeChainedRow(), ~10-20ms per chain row. TTL 5s. Issue B below.audit_trail.checkpoint:[chain]viaAuditTrailVerifier::mintCheckpoint(), ~milliseconds. TTL 5s. Single try-acquire, no deadline loop. No issue.audit_trail.chain_writer.dropped:counterviaAuditTrailChainWriter, ~microseconds (one State::set). TTL 1s. Single try-acquire, no issue.audit_trail.cron_archive_throttle:[chain]viaCronArchiveHook::cron(), throttle gate only. TTL 1s. Single try-acquire, no issue.
Issue A: chain-write lock TTL can expire mid-loop in archiveSegment() and restoreLocked()
Two chain-write-lock holders run a per-row loop whose runtime scales linearly with row count, and can therefore outlive the 30s TTL on large segments:
ChainArchiver::archiveSegment()writes one NDJSON line per row in the segment range, then hashes the whole file withhash_file('sha256'). Disk-bandwidth bound.ChainArchiver::restoreLocked()fgets()-loops through the archive file, re-canonicalizes and re-INSERTs each row.
Realistic durations:
- 100 rows: ~1s.
- 1000 rows: ~10s.
- 10000+ rows: 100s+. Crosses the 30s TTL.
For restoreLocked() specifically, the lock is explicitly documented as defensive at ChainArchiver::restore() (lines 2201-2210): chain integrity is actually protected at the DB level by UNIQUE(chain, previous_hash). Restore touches a disjoint id window from the live head, so even if the TTL expires mid-loop and a concurrent logger write claims the chain-write lock, the two writers operate on disjoint slots and disjoint previous_hash regions. No fork, no corruption -- but the defensive layer is temporarily gone.
For archiveSegment(), the consequence of TTL expiry is more concerning: a concurrent logger write on the chain could land WHILE the archive writer is mid-file. The archive's NDJSON envelope captures the live row state at archive time; if a sibling write lands at the head between the start of the archive loop and its end, the chain row sequence the archive snapshotted may not be the sequence on disk in the live table at the moment the segment row is stamped. Verifier reconciliation still works (the NDJSON file is the source of truth for the archived range), but operators inspecting timing forensically could see an off-by-one between archive boundaries and concurrent activity.
Real correctness risk: low for both. Defensive guarantee: degraded.
Issue B: AuditTrailChainWriter has the same deadline-vs-short-circuit bug fixed in #3591871
AuditTrailChainWriter::writeChainedRow() line ~131:
The retry loop reads roughly: set deadline = now + 5s; while we can't acquire the lock, check whether the wait returned FALSE (lock free) AND the deadline has passed; only then bump the drop counter and throw. The conjunction is the bug.
Same bug shape as ChainArchiver::acquireChainWriteLock() had before #3591871: when wait() returns TRUE (timed out without the lock becoming available), !wait() is FALSE and the "and" short-circuits before the deadline check. If a foreign holder keeps the lock for longer than 5s, every wait() returns TRUE, the deadline check never fires, the loop spins forever instead of giving up and bumping the drop counter.
Real correctness risk: medium. The drop-counter contract -- "give up after 5s and surface the drop on /admin/reports/status" -- is silently violated; chained writes block indefinitely under sustained contention instead.
Proposed resolution
For Issue A, apply the same TTL + renewal pattern that #3591871 introduced for the coverage lock: CHAIN_WRITE_LOCK_TTL_S + CHAIN_WRITE_LOCK_RENEW_INTERVAL_S constants, plus a renewChainWriteLockIfStale() helper called inside the two per-row loops -- one call site in archiveSegment() (per-line NDJSON write loop) and one in restoreLocked() (per-line replay loop). Every other chain-write-lock holder is bounded and does not need the renewer.
For Issue B, replace the the "wait result AND deadline" conjunction conjunction with an unconditional deadline check, identical to the fix already shipped for ChainArchiver::acquireChainWriteLock() in #3591871. Use wait() only for its back-off side effect; the deadline check stands on its own.
Why not bump the TTL for Issue A instead of adding renewal?
A 600s TTL covers practical restore sizes but leaves the lock held for 10 minutes after a crashed restore worker -- blocking every legitimate caller for the same window. The renewal pattern gives both bounded recovery on crash (lock frees within TTL) and unbounded happy-path runtime.
Related
- #3591871 (parent: introduced the TTL + renewal pattern on the coverage lock; fixed the same deadline short-circuit bug on
ChainArchiver::acquireChainWriteLock()). - #3591838 (grandparent: cron coverage-pass refactor).
Remaining tasks
- Issue A: add chain-write TTL + renew constants,
renewChainWriteLockIfStale()helper, wire into BOTHarchiveSegment()'s NDJSON-write loop ANDrestoreLocked()'s per-line replay loop. Kernel tests for long-archive and long-restore renewal. - Issue B: fix the deadline short-circuit in
AuditTrailChainWriter::writeChainedRow(). Kernel test pinning that a contended chain-write actually drops after 5s instead of blocking forever.
API changes
None for external callers.
Data model changes
None.
Issue fork audit_trail-3591878
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
mably commentedComment #3
mably commentedComment #5
mably commentedComment #7
mably commented