Problem/Motivation

Every type of segment lifecycle event (segment_archived, segment_transient_purged, segment_live_purged, segment_file_purged, segment_restored) accumulates indefinitely in the live audit_trail table. These events sit in chain-id gaps between the [from_id, to_id] ranges of archived segments and are never deleted.

Observed on a real test chain (audit_trail_entity):

chain                action                       n   min_id   max_id
audit_trail_entity   segment_archived             8   1509     1870
audit_trail_entity   segment_file_purged          3   1823     1872
audit_trail_entity   segment_live_purged          7   1547     1871
audit_trail_entity   segment_transient_purged     8   1507     1869

The chain's archived segments cover [1441,1459] [1482,1489] [1523,1523] [1569,1663] [1718,1721] [1749,1749] [1801,1801] [1846,1846]. Every lifecycle event id sits in a gap between those ranges. None is inside any segment's row range, so none is deleted by live-purge.
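The gap claim can be checked mechanically. The following is a minimal sketch using the segment ranges above and only the min_id / max_id values from the table (the individual ids in between are not listed in this issue). The helper name isIdInsideAnySegment() is hypothetical and not part of the module.

```php
<?php

declare(strict_types=1);

/**
 * Returns TRUE when $id lies inside any inclusive [from_id, to_id] range.
 *
 * Hypothetical helper, for illustration only.
 */
function isIdInsideAnySegment(int $id, array $segments): bool {
  foreach ($segments as [$from_id, $to_id]) {
    if ($id >= $from_id && $id <= $to_id) {
      return TRUE;
    }
  }
  return FALSE;
}

// Archived segment ranges observed on audit_trail_entity.
$observed_segments = [
  [1441, 1459], [1482, 1489], [1523, 1523], [1569, 1663],
  [1718, 1721], [1749, 1749], [1801, 1801], [1846, 1846],
];

// min_id and max_id of each lifecycle action, taken from the table above.
$lifecycle_ids = [1509, 1870, 1823, 1872, 1547, 1871, 1507, 1869];

// Keep only the lifecycle ids that some segment covers.
$covered_ids = array_filter(
  $lifecycle_ids,
  fn (int $id): bool => isIdInsideAnySegment($id, $observed_segments)
);

// No listed lifecycle id is covered, so live-purge never reaches them.
echo count($covered_ids), "\n";
```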

Root cause

The chain has only ONE delete path: a row leaves audit_trail when its id falls inside an archived segment's [from_id, to_id] and live-purge fires on that segment. So a row is cleaned up if and only if it ends up inside a segment.

User rows end up inside segments via transientPurgePass(): it scans rows by context_transient IS NOT NULL, buckets them, slices around existing segments, and mints one bare segment per uncovered gap.

Lifecycle events are written via ChainArchiver::chainedWrite() with the permanent bucket only (never a transient bucket). So they have context_transient IS NULL and context_transient_hash = '' from the moment they are INSERTed -- the transient-purge content filter excludes them. They never trigger bare-segment minting.

The only way a lifecycle event currently ends up inside a segment is INCIDENTALLY -- when it happens to sit between two transient-purge-eligible user rows in the same week-bucket envelope, the bare-segment slicer sweeps it in. Lifecycle events that sit in a gap with no transient-purge-eligible neighbors stay in audit_trail forever.

The archive pass cannot help either: its scan uses id > max_archived_to_id as a watermark, which assumes every row id below the max is already inside a segment. That assumption fails for any row not picked up by transient-purge -- the watermark moves past it as later segments are archived, trapping it below.
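To illustrate the trap, here is a small in-memory sketch of the watermark filter. It is not the module's query: the function names are hypothetical, the two segments and ids 1507 and 1509 come from the observation above, and id 1530 is an invented row added only to show what the scan still sees.

```php
<?php

declare(strict_types=1);

/**
 * Mimics the "id > max_archived_to_id" scan: returns the ids it would visit.
 */
function archiveScanCandidates(array $row_ids, array $segments): array {
  $max_archived_to_id = 0;
  foreach ($segments as [$from_id, $to_id]) {
    $max_archived_to_id = max($max_archived_to_id, $to_id);
  }
  return array_values(array_filter(
    $row_ids,
    fn (int $id): bool => $id > $max_archived_to_id
  ));
}

/**
 * Returns the ids that are not inside any [from_id, to_id] range.
 */
function rowsOutsideSegments(array $row_ids, array $segments): array {
  return array_values(array_filter(
    $row_ids,
    function (int $id) use ($segments): bool {
      foreach ($segments as [$from_id, $to_id]) {
        if ($id >= $from_id && $id <= $to_id) {
          return FALSE;
        }
      }
      return TRUE;
    }
  ));
}

$archived = [[1482, 1489], [1523, 1523]];
// 1507 and 1509 are lifecycle events; 1530 is an invented later row.
$live_ids = [1507, 1509, 1530];

$candidates = archiveScanCandidates($live_ids, $archived);
// Uncovered rows the watermark scan can no longer see.
$trapped = array_values(
  array_diff(rowsOutsideSegments($live_ids, $archived), $candidates)
);

print_r($candidates); // [1530]
print_r($trapped);    // [1507, 1509]
```

Rows 1507 and 1509 are outside every segment yet below the watermark (1523), so neither the archive scan nor live-purge ever reaches them.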

The architectural gap

No cron pass enforces the invariant "every row past some cutoff is inside a segment" unconditionally. Both existing passes have side-conditions:

  • transientPurgePass() -- gated on transient-purge being configured AND on rows having transient data. Misses lifecycle events. Misses everything on chains with transient-purge disabled.
  • archivePass() -- watermark filter traps rows below the watermark.

Proposed resolution

Restructure cron so that one unconditional pass establishes segment coverage, and every other pass operates on the segments that coverage produced.

New pass: coveragePass(). Runs ALWAYS. Cutoff = the smallest configured retention threshold:

$coverage_cutoff = $resolved['transient_purge_after_us']
                     ?? $resolved['archive_after_us'];

Scan: audit_trail rows past cutoff that are NOT inside any segment (NOT EXISTS against audit_trail_segment). Group by segment_granularity, slice each bucket envelope around existing segments, mint one bare segment per uncovered gap. No per-row side effect.
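The slicing step might look roughly like the sketch below. It assumes, for simplicity, that a bucket envelope and the existing segments are both expressed as inclusive id ranges; the real pass groups by segment_granularity and may derive the envelope differently. The function name sliceEnvelopeAroundSegments() is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Returns the inclusive id ranges of an envelope not covered by a segment.
 *
 * Each returned range would become one bare segment.
 */
function sliceEnvelopeAroundSegments(int $env_from, int $env_to, array $segments): array {
  // Walk the existing segments in ascending from_id order.
  usort($segments, fn (array $a, array $b): int => $a[0] <=> $b[0]);

  $gaps = [];
  $cursor = $env_from;
  foreach ($segments as [$from_id, $to_id]) {
    if ($to_id < $cursor) {
      // Segment ends before the part of the envelope still to be sliced.
      continue;
    }
    if ($from_id > $env_to) {
      // Segment starts after the envelope; later ones do too.
      break;
    }
    if ($from_id > $cursor) {
      $gaps[] = [$cursor, $from_id - 1];
    }
    $cursor = max($cursor, $to_id + 1);
  }
  if ($cursor <= $env_to) {
    $gaps[] = [$cursor, $env_to];
  }
  return $gaps;
}

// Illustrative envelope [1500, 1570] with two of the observed segments.
$gaps = sliceEnvelopeAroundSegments(1500, 1570, [[1569, 1663], [1523, 1523]]);
print_r($gaps); // [[1500, 1522], [1524, 1568]]
```

The decision is made on ids alone, so lifecycle events at 1507, 1509 and 1547 would land inside one of the two minted gaps regardless of their content.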

Refactor the existing passes to operate on segments rather than raw rows:

  • transientPurgePass() -- scan BARE segments (archived_at = 0 AND transient_purged_at = 0 AND to_created < transient_purge_after_cutoff). For each, UPDATE audit_trail SET context_transient = NULL with a context_transient IS NOT NULL filter to skip rows that already have nothing to purge (lifecycle events swept into the segment, already-purged user rows). Stamp transient_purged_at on the segment and write the segment_transient_purged chain event. No more raw-row scanning, no more content filter, no more bucket-slicing (coverage already did it). Critically: never touches an archived segment -- once the NDJSON is sealed by file_sha256 + archive_hmac, NULLing live context_transient would diverge live data from the signed archive.
  • archivePass() -- scan BARE segments past archive_after (archived_at = 0 AND to_created < archive_after_cutoff). For each, write the NDJSON file + segment hashes + segment_archived chain event. No more raw-row scanning, no more watermark filter. The retention ordering transient_purge_after < archive_after guarantees that any bare segment past archive_after has already been processed by transientPurgePass() (if configured), so the NDJSON captures the post-purge state and live and archived data stay consistent.
  • livePurgePass(), filePurgePass() -- unchanged.
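The two segment filters described above can be sketched in memory as follows. The column names (archived_at, transient_purged_at, to_created) are taken from the text; the function names, the array shape and the sample timestamps are assumptions made for illustration, and the real passes would express these as queries against audit_trail_segment.

```php
<?php

declare(strict_types=1);

/**
 * Bare, not yet transient-purged segments older than the cutoff.
 */
function segmentsForTransientPurge(array $segments, int $cutoff): array {
  return array_values(array_filter(
    $segments,
    fn (array $s): bool => $s['archived_at'] === 0
      && $s['transient_purged_at'] === 0
      && $s['to_created'] < $cutoff
  ));
}

/**
 * Bare segments older than the archive cutoff.
 */
function segmentsForArchive(array $segments, int $cutoff): array {
  return array_values(array_filter(
    $segments,
    fn (array $s): bool => $s['archived_at'] === 0
      && $s['to_created'] < $cutoff
  ));
}

// Invented sample segments with small integer timestamps.
$sample_segments = [
  ['id' => 1, 'archived_at' => 0, 'transient_purged_at' => 0, 'to_created' => 100],
  ['id' => 2, 'archived_at' => 0, 'transient_purged_at' => 50, 'to_created' => 100],
  ['id' => 3, 'archived_at' => 70, 'transient_purged_at' => 50, 'to_created' => 80],
  ['id' => 4, 'archived_at' => 0, 'transient_purged_at' => 0, 'to_created' => 900],
];

// A shorter retention period gives a later cutoff timestamp, so the
// transient-purge cutoff (500) is larger than the archive cutoff (300).
$purge_ids = array_column(segmentsForTransientPurge($sample_segments, 500), 'id');
$archive_ids = array_column(segmentsForArchive($sample_segments, 300), 'id');

print_r($purge_ids);   // [1]
print_r($archive_ids); // [1, 2]
```

Segment 3 is already archived and is selected by neither filter, which matches the rule that an archived segment is never transient-purged. Segment 1 matches both filters; in the pipeline, transient-purge runs first and stamps it before archive picks it up.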

Cron pipeline order:

$this->safelyRunPass($chain, 'coverage', ...);         // always
if ($resolved['transient_purge_after_us'] !== NULL) {
  $this->safelyRunPass($chain, 'transient-purge', ...); // conditional
}
$this->safelyRunPass($chain, 'archive', ...);          // always
$this->safelyRunPass($chain, 'live-purge', ...);
$this->safelyRunPass($chain, 'orphan-heal', ...);
$this->safelyRunPass($chain, 'file-purge', ...);

Effect on user rows: identical timing. Coverage mints the bare segment, transient-purge NULLs the column on it, archive promotes it to archived, live-purge deletes the rows. Where today this happens via transient-purge bucket-slicing, it now happens via coverage bucket-slicing -- same logic, moved to a clearer home.

Effect on lifecycle events: coverage now wraps them in bare segments based on chain id alone (not content). They flow through the remaining passes like every other segment and eventually get live-purged.

Effect on chains with transient-purge disabled: coverage uses the larger archive_after cutoff. Bare segments get minted later but still get minted. Both user rows and lifecycle events flow through the pipeline.

Remaining tasks

  • Add coveragePass() in src/Hook/CronArchiveHook.php. Reuse the bucket-slicing helper from the existing transient-purge pass; extract to a shared method on ChainArchiver.
  • Refactor transientPurgePass() to scan segments past transient_purge_after rather than rows; remove the bucket-slicing + bare-segment minting (now in coverage).
  • Refactor archivePass() to scan bare segments past archive_after rather than rows; remove the max_archived_to_id watermark.
  • Wire the new pipeline order in processChain().
  • Kernel test: chain with transient-purge disabled, emit user rows + provoke several archive cycles, assert all five lifecycle event action types eventually leave audit_trail. Chain still verifies clean.
  • Kernel test: chain with transient-purge enabled. Same shape. Verify rows get transient-purged on schedule and lifecycle events still flow through.

API changes

None externally observable. Refactor lives entirely inside CronArchiveHook + a shared helper on ChainArchiver.

Data model changes

None.

Comments

mably created an issue. See original summary.

Title: Segment lifecycle events accumulate indefinitely on the chain (no purge path) » Cron archive watermark skips lifecycle events written between segments

Title: Cron archive watermark skips lifecycle events written between segments » Cron has no unconditional segment-coverage pass; lifecycle events accumulate indefinitely

Status: Active » Needs review

  • mably committed 8d056282 on 1.x
    fix: #3591838 Cron has no unconditional segment-coverage pass; lifecycle...

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.