Problem/Motivation
Every type of segment lifecycle event (segment_archived, segment_transient_purged, segment_live_purged, segment_file_purged, segment_restored) accumulates indefinitely in the live audit_trail table. They sit in chain-id gaps between archived segment [from_id, to_id] ranges and are never deleted.
Observed on a real test chain (audit_trail_entity):
chain               action                    n  min_id  max_id
audit_trail_entity  segment_archived          8  1509    1870
audit_trail_entity  segment_file_purged       3  1823    1872
audit_trail_entity  segment_live_purged       7  1547    1871
audit_trail_entity  segment_transient_purged  8  1507    1869
The chain's archived segments cover [1441,1459] [1482,1489] [1523,1523] [1569,1663] [1718,1721] [1749,1749] [1801,1801] [1846,1846]. Every lifecycle event id sits outside those ranges, either in a gap between two of them or past the last one. None is inside any segment's row range, so none is deleted by live-purge.
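The observation above can be checked mechanically. The following is a minimal sketch, not code from the module: the function name audit_trail_sketch_is_covered() is hypothetical, and it only tests the min_id and max_id values from the table, since the individual ids in between are not listed here.

```php
<?php

declare(strict_types=1);

/**
 * Returns TRUE when $id falls inside any inclusive [from_id, to_id] range.
 *
 * Hypothetical helper for illustration only; not part of the module.
 *
 * @param int $id
 *   A chain row id.
 * @param array<int, array{0: int, 1: int}> $segments
 *   Archived segment ranges as [from_id, to_id] pairs.
 */
function audit_trail_sketch_is_covered(int $id, array $segments): bool {
  foreach ($segments as [$from_id, $to_id]) {
    if ($id >= $from_id && $id <= $to_id) {
      return TRUE;
    }
  }
  return FALSE;
}

// Archived segment ranges reported for the test chain.
$segments = [
  [1441, 1459], [1482, 1489], [1523, 1523], [1569, 1663],
  [1718, 1721], [1749, 1749], [1801, 1801], [1846, 1846],
];

// min_id / max_id of each lifecycle action from the table above.
$observed_ids = [1509, 1870, 1823, 1872, 1547, 1871, 1507, 1869];

// Keep only the ids that no segment covers.
$stranded = array_filter(
  $observed_ids,
  fn (int $id): bool => !audit_trail_sketch_is_covered($id, $segments)
);
```

With the numbers above, every sampled id ends up in $stranded, which matches the claim that live-purge never reaches them.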
Root cause
The chain has only ONE delete path: a row leaves audit_trail when its id falls inside an archived segment's [from_id, to_id] and live-purge fires on that segment. So a row is cleaned up if and only if it ends up inside a segment.
User rows end up inside segments via transientPurgePass(): it scans rows by context_transient IS NOT NULL, buckets them, slices around existing segments, mints one bare segment per uncovered gap.
Lifecycle events are written via ChainArchiver::chainedWrite() with the permanent bucket only (never a transient bucket). So they have context_transient IS NULL and context_transient_hash = '' from the moment they are INSERTed -- the transient-purge content filter excludes them. They never trigger bare-segment minting.
The only way a lifecycle event currently ends up inside a segment is INCIDENTALLY -- when it happens to sit between two transient-purge-eligible user rows in the same week-bucket envelope, the bare-segment slicer sweeps it in. Lifecycle events that sit in a gap with no transient-purge-eligible neighbors stay in audit_trail forever.
The archive pass cannot help either: its scan uses id > max_archived_to_id as a watermark, which assumes every row id below the max is already inside a segment. That assumption fails for any row not picked up by transient-purge -- the watermark moves past it as later segments are archived, trapping it below.
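To make the watermark trap concrete, here is a small sketch under stated assumptions: the function audit_trail_sketch_trapped_ids() is hypothetical, segments are modelled as plain [from_id, to_id] pairs, and the watermark is taken as the largest to_id, as the description of max_archived_to_id suggests. It returns the row ids that the archive scan (id > watermark) can no longer see and that no segment covers.

```php
<?php

declare(strict_types=1);

/**
 * Lists row ids at or below the archive watermark that no segment covers.
 *
 * Hypothetical helper for illustration only; not part of the module.
 *
 * @param int[] $row_ids
 *   Ids of rows still present in the live table.
 * @param array<int, array{0: int, 1: int}> $segments
 *   Archived segment ranges as [from_id, to_id] pairs.
 *
 * @return int[]
 *   Ids that are invisible to an "id > watermark" scan and uncovered.
 */
function audit_trail_sketch_trapped_ids(array $row_ids, array $segments): array {
  if ($segments === []) {
    return [];
  }
  // Assumed watermark: the highest to_id across archived segments.
  $watermark = max(array_column($segments, 1));

  $trapped = [];
  foreach ($row_ids as $id) {
    if ($id > $watermark) {
      // Still above the watermark, so the archive scan would see it.
      continue;
    }
    $covered = FALSE;
    foreach ($segments as [$from_id, $to_id]) {
      if ($id >= $from_id && $id <= $to_id) {
        $covered = TRUE;
        break;
      }
    }
    if (!$covered) {
      $trapped[] = $id;
    }
  }
  return $trapped;
}

$segments = [
  [1441, 1459], [1482, 1489], [1523, 1523], [1569, 1663],
  [1718, 1721], [1749, 1749], [1801, 1801], [1846, 1846],
];
$trapped = audit_trail_sketch_trapped_ids(
  [1507, 1509, 1523, 1547, 1823, 1869, 1870],
  $segments
);
```

In this example the ids below 1846 that fall in gaps are reported as trapped, while 1869 and 1870 are still above the watermark.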
The architectural gap
No cron pass enforces the invariant "every row past some cutoff is inside a segment" unconditionally. Both existing passes have side-conditions:
- transientPurgePass() -- gated on transient-purge being configured AND on rows having transient data. Misses lifecycle events. Misses everything on chains with transient-purge disabled.
- archivePass() -- watermark filter traps rows below the watermark.
Proposed resolution
Restructure cron so that one unconditional pass establishes segment coverage, and every other pass operates on the segments that coverage produced.
New pass: coveragePass(). Runs ALWAYS. Cutoff = the smallest configured retention threshold:
$coverage_cutoff = $resolved['transient_purge_after_us']
?? $resolved['archive_after_us'];
Scan: audit_trail rows past cutoff that are NOT inside any segment (NOT EXISTS against audit_trail_segment). Group by segment_granularity, slice each bucket envelope around existing segments, mint one bare segment per uncovered gap. No per-row side effect.
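The slicing step described above could look roughly like the sketch below. It is an assumption-laden illustration, not the module's helper: audit_trail_sketch_slice_envelope() is a hypothetical name, the bucket envelope is reduced to an inclusive id range [lo, hi], and grouping by segment_granularity and all database access are left out.

```php
<?php

declare(strict_types=1);

/**
 * Subtracts existing segment ranges from a bucket envelope.
 *
 * Hypothetical helper for illustration only; not part of the module.
 *
 * @param int $lo
 *   Lowest uncovered row id in the bucket.
 * @param int $hi
 *   Highest uncovered row id in the bucket.
 * @param array<int, array{0: int, 1: int}> $segments
 *   Existing segment ranges as inclusive [from_id, to_id] pairs.
 *
 * @return array<int, array{0: int, 1: int}>
 *   One [from_id, to_id] pair per uncovered gap, i.e. per bare segment
 *   that coverage would mint.
 */
function audit_trail_sketch_slice_envelope(int $lo, int $hi, array $segments): array {
  // Walk segments in id order so a single cursor is enough.
  usort($segments, fn (array $a, array $b): int => $a[0] <=> $b[0]);

  $gaps = [];
  $cursor = $lo;
  foreach ($segments as [$from_id, $to_id]) {
    if ($to_id < $cursor) {
      // Segment lies entirely before the part still to be covered.
      continue;
    }
    if ($from_id > $hi) {
      // Segment lies entirely after the envelope; nothing more overlaps.
      break;
    }
    if ($from_id > $cursor) {
      $gaps[] = [$cursor, $from_id - 1];
    }
    $cursor = max($cursor, $to_id + 1);
  }
  if ($cursor <= $hi) {
    $gaps[] = [$cursor, $hi];
  }
  return $gaps;
}

$segments = [[1441, 1459], [1482, 1489], [1523, 1523], [1569, 1663]];
$gaps = audit_trail_sketch_slice_envelope(1507, 1547, $segments);
```

Here the envelope [1507, 1547] is split around the existing one-row segment [1523, 1523], so two bare segments would be minted. An envelope that already lies inside a segment yields no gaps.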
Refactor the existing passes to operate on segments rather than raw rows:
- transientPurgePass() -- scan BARE segments (archived_at = 0 AND transient_purged_at = 0 AND to_created < transient_purge_after_cutoff). For each, UPDATE audit_trail SET context_transient = NULL with a context_transient IS NOT NULL filter to skip rows that already have nothing to purge (lifecycle events swept into the segment, already-purged user rows). Stamp transient_purged_at on the segment and write the segment_transient_purged chain event. No more row scanning, no more content-filter row scan, no more bucket-slicing (coverage already did it). Critically: never touches an archived segment -- once the NDJSON is sealed by file_sha256 + archive_hmac, NULLing live context_transient would diverge live data from the signed archive.
- archivePass() -- scan BARE segments past archive_after (archived_at = 0 AND to_created < archive_after_cutoff). For each, write the NDJSON file + segment hashes + segment_archived chain event. No more raw-row scanning, no more watermark filter. The retention ordering transient_purge_after < archive_after guarantees that any bare segment past archive_after has already been processed by transientPurgePass() (if configured); the NDJSON captures the post-purge state, live and archived stay consistent.
- livePurgePass(), filePurgePass() -- unchanged.
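The two segment filters quoted in the list above can be expressed as simple predicates. This is only a sketch: the function names are hypothetical, a segment is modelled as a plain array keyed by the column names from the text, and 0 is assumed to mean "not stamped yet" as the filters imply.

```php
<?php

declare(strict_types=1);

/**
 * TRUE when a bare, not yet purged segment is past the transient cutoff.
 *
 * Hypothetical helper mirroring:
 * archived_at = 0 AND transient_purged_at = 0 AND to_created < cutoff.
 */
function audit_trail_sketch_needs_transient_purge(array $segment, int $cutoff_us): bool {
  return $segment['archived_at'] === 0
    && $segment['transient_purged_at'] === 0
    && $segment['to_created'] < $cutoff_us;
}

/**
 * TRUE when a bare segment is past the archive cutoff.
 *
 * Hypothetical helper mirroring:
 * archived_at = 0 AND to_created < cutoff.
 */
function audit_trail_sketch_needs_archive(array $segment, int $cutoff_us): bool {
  return $segment['archived_at'] === 0
    && $segment['to_created'] < $cutoff_us;
}

// Made-up timestamps. Because transient_purge_after < archive_after,
// the transient cutoff timestamp is the later (larger) of the two.
$transient_cutoff = 800;
$archive_cutoff = 400;

$bare     = ['archived_at' => 0,   'transient_purged_at' => 0,   'to_created' => 100];
$purged   = ['archived_at' => 0,   'transient_purged_at' => 150, 'to_created' => 100];
$archived = ['archived_at' => 200, 'transient_purged_at' => 150, 'to_created' => 100];
$recent   = ['archived_at' => 0,   'transient_purged_at' => 0,   'to_created' => 500];
```

Note how $archived fails both predicates, which reflects the rule that a sealed segment is never touched again, and how $recent qualifies for transient-purge before it qualifies for archiving.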
Cron pipeline order:
$this->safelyRunPass($chain, 'coverage', ...); // always
if ($resolved['transient_purge_after_us'] !== NULL) {
$this->safelyRunPass($chain, 'transient-purge', ...); // conditional
}
$this->safelyRunPass($chain, 'archive', ...); // always
$this->safelyRunPass($chain, 'live-purge', ...);
$this->safelyRunPass($chain, 'orphan-heal', ...);
$this->safelyRunPass($chain, 'file-purge', ...);
Effect on user rows: identical timing. Coverage mints the bare segment, transient-purge NULLs the column on it, archive promotes it to archived, live-purge deletes the rows. Where today this happens via transient-purge bucket-slicing, it now happens via coverage bucket-slicing -- same logic, moved to a clearer home.
Effect on lifecycle events: coverage now wraps them in bare segments based on chain id alone (not content). They flow through the remaining passes like every other segment and eventually get live-purged.
Effect on chains with transient-purge disabled: coverage uses the larger archive_after cutoff. Bare segments get minted later but still get minted. Both user rows and lifecycle events flow through the pipeline.
Remaining tasks
- Add coveragePass() in src/Hook/CronArchiveHook.php. Reuse the bucket-slicing helper from the existing transient-purge pass; extract to a shared method on ChainArchiver.
- Refactor transientPurgePass() to scan segments past transient_purge_after rather than rows; remove the bucket-slicing + bare-segment minting (now in coverage).
- Refactor archivePass() to scan bare segments past archive_after rather than rows; remove the max_archived_to_id watermark.
- Wire the new pipeline order in processChain().
- Kernel test: chain with transient-purge disabled, emit user rows + provoke several archive cycles, assert all five lifecycle event action types eventually leave audit_trail. Chain still verifies clean.
- Kernel test: chain with transient-purge enabled. Same shape. Verify rows get transient-purged on schedule and lifecycle events still flow through.
API changes
None externally observable. Refactor lives entirely inside CronArchiveHook + a shared helper on ChainArchiver.
Data model changes
None.
Issue fork audit_trail-3591838