Lock-handling audit: TTL renewal and deadline-loop correctness across all module locks [#3591878]

Problem/Motivation

The module uses six distinct lock names. After the deadline-bug fix and renewal pattern introduced in #3591871, an audit of every remaining lock surfaced two issues that deserve attention as a follow-up.

Lock inventory

Each row: lock name -> acquirer -> typical critical-section duration -> TTL -> status.

audit_trail.write:[chain] via ChainArchiver::acquireChainWriteLock(). Most callers (purge / file-purge / createBareSegment / transientPurgeSegment) hold the lock for one bounded operation, ~50ms. Two callers run a per-row loop under the lock whose duration is data-dependent: archiveSegment() writes NDJSON line-by-line for the segment range, and restoreLocked() re-INSERTs each line back into the live table. Both can exceed the 30s TTL on multi-million-row segments. Issue A below.
audit_trail.coverage:[chain] via ChainArchiver::acquireCoverageLock(), up to 50 INSERTs. TTL 60s + 20s renewal. Fixed in #3591871.
audit_trail.write:[chain] via AuditTrailChainWriter::writeChainedRow(), ~10-20ms per chain row. TTL 5s. Issue B below.
audit_trail.checkpoint:[chain] via AuditTrailVerifier::mintCheckpoint(), ~milliseconds. TTL 5s. Single try-acquire, no deadline loop. No issue.
audit_trail.chain_writer.dropped:counter via AuditTrailChainWriter, ~microseconds (one State::set). TTL 1s. Single try-acquire, no issue.
audit_trail.cron_archive_throttle:[chain] via CronArchiveHook::cron(), throttle gate only. TTL 1s. Single try-acquire, no issue.

Issue A: chain-write lock TTL can expire mid-loop in archiveSegment() and restoreLocked()

Two chain-write-lock holders run a per-row loop whose runtime scales linearly with row count, and can therefore outlive the 30s TTL on large segments:

ChainArchiver::archiveSegment() writes one NDJSON line per row in the segment range, then hashes the whole file with hash_file('sha256'). Disk-bandwidth bound.
ChainArchiver::restoreLocked() fgets()-loops through the archive file, re-canonicalizes and re-INSERTs each row.

Realistic durations:

100 rows: ~1s.
1000 rows: ~10s.
10000+ rows: 100s+. Crosses the 30s TTL.

For restoreLocked() specifically, the lock is explicitly documented as defensive at ChainArchiver::restore() (lines 2201-2210): chain integrity is actually protected at the DB level by UNIQUE(chain, previous_hash). Restore touches a disjoint id window from the live head, so even if the TTL expires mid-loop and a concurrent logger write claims the chain-write lock, the two writers operate on disjoint slots and disjoint previous_hash regions. No fork, no corruption -- but the defensive layer is temporarily gone.

For archiveSegment(), the consequence of TTL expiry is more concerning: a concurrent logger write on the chain could land WHILE the archive writer is mid-file. The archive's NDJSON envelope captures the live row state at archive time; if a sibling write lands at the head between the start of the archive loop and its end, the chain row sequence the archive snapshotted may not be the sequence on disk in the live table at the moment the segment row is stamped. Verifier reconciliation still works (the NDJSON file is the source of truth for the archived range), but operators inspecting timing forensically could see an off-by-one between archive boundaries and concurrent activity.

Real correctness risk: low for both. Defensive guarantee: degraded.

Issue B: AuditTrailChainWriter has the same deadline-vs-short-circuit bug fixed in #3591871

AuditTrailChainWriter::writeChainedRow() line ~131:

The retry loop reads roughly: set deadline = now + 5s; while we can't acquire the lock, check whether the wait returned FALSE (lock free) AND the deadline has passed; only then bump the drop counter and throw. The conjunction is the bug.

Same bug shape as ChainArchiver::acquireChainWriteLock() had before #3591871: when wait() returns TRUE (timed out without the lock becoming available), !wait() is FALSE and the "and" short-circuits before the deadline check. If a foreign holder keeps the lock for longer than 5s, every wait() returns TRUE, the deadline check never fires, the loop spins forever instead of giving up and bumping the drop counter.

Real correctness risk: medium. The drop-counter contract -- "give up after 5s and surface the drop on /admin/reports/status" -- is silently violated; chained writes block indefinitely under sustained contention instead.

Proposed resolution

For Issue A, apply the same TTL + renewal pattern that #3591871 introduced for the coverage lock: CHAIN_WRITE_LOCK_TTL_S + CHAIN_WRITE_LOCK_RENEW_INTERVAL_S constants, plus a renewChainWriteLockIfStale() helper called inside the two per-row loops -- one call site in archiveSegment() (per-line NDJSON write loop) and one in restoreLocked() (per-line replay loop). Every other chain-write-lock holder is bounded and does not need the renewer.

For Issue B, replace the the "wait result AND deadline" conjunction conjunction with an unconditional deadline check, identical to the fix already shipped for ChainArchiver::acquireChainWriteLock() in #3591871. Use wait() only for its back-off side effect; the deadline check stands on its own.

Why not bump the TTL for Issue A instead of adding renewal?

A 600s TTL covers practical restore sizes but leaves the lock held for 10 minutes after a crashed restore worker -- blocking every legitimate caller for the same window. The renewal pattern gives both bounded recovery on crash (lock frees within TTL) and unbounded happy-path runtime.

#3591871 (parent: introduced the TTL + renewal pattern on the coverage lock; fixed the same deadline short-circuit bug on ChainArchiver::acquireChainWriteLock()).
#3591838 (grandparent: cron coverage-pass refactor).

Remaining tasks

Issue A: add chain-write TTL + renew constants, renewChainWriteLockIfStale() helper, wire into BOTH archiveSegment()'s NDJSON-write loop AND restoreLocked()'s per-line replay loop. Kernel tests for long-archive and long-restore renewal.
Issue B: fix the deadline short-circuit in AuditTrailChainWriter::writeChainedRow(). Kernel test pinning that a contended chain-write actually drops after 5s instead of blocking forever.

API changes

None for external callers.

Data model changes

None.

Issue fork audit_trail-3591878

Show commands

Start within a Git clone of the project using the version control instructions.

Add & fetch this issue fork’s repository

Or, if you do not have SSH keys set up on git.drupalcode.org:

Add & fetch this issue fork’s repository

3591878-lock-handling-audit-ttl changes, plain diff MR !8
Check out this branch for the first time

Check out existing branch, if you already have it locally

About issue forks

Comments

Comment #1

25 May 2026 at 10:04

mably created an issue. See original summary.

Comment #2

mably commented 25 May 2026 at 10:09

Title:	Chain-write lock TTL can expire mid-restore() on very large archives	» Lock-handling audit: TTL renewal and deadline-loop correctness across all module locks
Issue summary:	View changes

Comment #3

mably commented 25 May 2026 at 10:49

Issue summary:

View changes

Comment #4

25 May 2026 at 11:13

mably opened merge request !8

Comment #5

mably commented 25 May 2026 at 12:16

Status:

Active

» Needs review

Comment #6

25 May 2026 at 12:31

mably committed c942d42e on 1.x

fix: #3591878 Lock-handling audit: TTL renewal and deadline-loop...

Comment #7

mably commented 25 May 2026 at 12:31

Status:

Needs review

» Fixed

Comment #8

25 May 2026 at 12:31

Now that this issue is closed, review the contribution record.

As a contributor, attribute any organization that helped you, or if you volunteered your own time.

Maintainers, credit people who helped resolve this issue.

Comment #9

8 June 2026 at 12:35

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Lock-handling audit: TTL renewal and deadline-loop correctness across all module locks

Problem/Motivation

Lock inventory

Issue A: chain-write lock TTL can expire mid-loop in archiveSegment() and restoreLocked()

Issue B: AuditTrailChainWriter has the same deadline-vs-short-circuit bug fixed in #3591871

Proposed resolution

Why not bump the TTL for Issue A instead of adding renewal?

Related

Remaining tasks

API changes

Data model changes

Issue fork audit_trail-3591878

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Parent issue

Child issues

News items

Our community

Documentation

Drupal code base

Governance of community