Problem/Motivation

In Drupal 11.3.x, HookCollectorKeyValueWritePass::setMultiple() writes hook implementation data to key_value('hook_data') as individual SQL MERGE statements without a database transaction or lock. In multi-process environments (multiple FPM workers, multiple application pods), concurrent container compilations allow ModuleHandler::getHookImplementationList() to read partial data from key_value mid-write. The partial data is then cached to cache.bootstrap (CID hook_data), permanently poisoning hook dispatch until manual cache clear.

This is a regression from Drupal 11.2, where hook implementations were compiled directly into the serialized service container as EventDispatcher listeners. In 11.2, concurrent container compilations were benign — each process produced a complete, valid container. In 11.3, the two-phase write (container to cache_container, then hook data to key_value) creates a window where hook data can be read in a partial state.

Affected modules' hooks are silently skipped — no error is logged, no exception is thrown, and no user-facing feedback indicates that hooks were not called. The only symptom is that the hook's side effects do not occur.

Steps to reproduce

This bug requires concurrent container compilation, which is common in production hosting but difficult to reproduce in single-process environments.

Environment requirements:

  • Drupal 11.3.x (confirmed on 11.3.3 and 11.3.5; all relevant core files are byte-for-byte identical)
  • Multiple PHP-FPM workers and/or multiple application pods sharing one database
  • A trigger that invalidates cache_container while web traffic is active (e.g., cron, deployment, drush cr from a separate process)

Reproduction sequence:

  1. Install Drupal 11.3.x with a contributed or custom module that implements hook_entity_insert or any other hook.
  2. Verify the hook works (e.g., save a node, confirm the hook fires).
  3. Invalidate cache_container (e.g., truncate the cache_container table, or trigger a cron job that forces container recompilation) while concurrent web requests are being served.
  4. After the rebuild, save a node again. The hook may no longer fire — it has been silently dropped from hook_data.

Diagnostic confirmation (run before clearing caches):

```php
$hooks = \Drupal::keyValue('hook_data')->get('hook_list');
// Check for your module's hook — it will be absent from the array
// despite the module being installed and loaded:
$module_list = \Drupal::moduleHandler()->getModuleList();
// Module is PRESENT in getModuleList() but ABSENT from hook_list
```

Proposed resolution

Possible fix - Option A (minimal): Wrap the key_value writes in HookCollectorKeyValueWritePass::process() in a database transaction:

```php
// In HookCollectorKeyValueWritePass::process()
$connection = \Drupal::database();
$transaction = $connection->startTransaction();
try {
  $this->keyValueFactory->get('hook_data')->setMultiple($hook_data);
  // Existing cache delete follows.
  // The transaction commits when $transaction goes out of scope.
}
catch (\Exception $e) {
  $transaction->rollBack();
  throw $e;
}
```

This ensures that concurrent readers see either the complete old data or the complete new data, never a partial write.

Possible fix - Option B (comprehensive): Add a lock around DrupalKernel::compileContainer() to prevent concurrent container compilations entirely:

```php
// In DrupalKernel::compileContainer()
$lock = \Drupal::lock();
if (!$lock->acquire('container_rebuild', 30)) {
  // Another process is compiling; wait or use existing container.
  $lock->wait('container_rebuild', 10);
  return;
}
try {
  // Existing compilation logic.
}
finally {
  $lock->release('container_rebuild');
}
```

Comments

cytherion created an issue. See original summary.

nicxvan’s picture

Version: 11.3.x-dev » main

Changes are applied in main first.

nod_’s picture

Per drupal.org rules, LLM use must be disclosed in the issue queue. Thank you.

berdir’s picture

Has this actually been reproduced/seen in production or is this just based on (AI) code analysis?

The hook implementations are in a single key within hook data. What's possible is that hooks are there and the order information is not, or maybe the include info is inconsistent. But all of these should be in a single key and consistent on their own, and since it's all based on the code, the different webheads should all build exactly the same thing.

Container build locking is being explored, but it's obviously not as simple as described here, as \Drupal::lock() is a service and wouldn't work when there is no container yet.

nicxvan’s picture

Status: Active » Postponed (maintainer needs more info)

Yes, can you please share a bit more about where your analysis came from?

The symptom sounds like it's actually #3207813: ModuleHandler skips all hook implementations when invoked before the module files have been loaded.

The analysis, while plausible on its face, has some pieces that are just not correct, so I'm curious where it came from.

The solutions proposed also would need some discussion, I think it would be more complex.

Postponing for now pending the questions posted above.

catch’s picture

> The hook implementations are in a single key within hook data.

Yes this makes the bug as described in the issue summary physically impossible.

It's true that hooks are written in a setMultiple(); it's true the setMultiple() doesn't have a transaction; it's not true that hooks are individually written in setMultiple(). There are two things written: one is the hooks, one is the ordering information, and there's no way for an individual hook to go missing due to the lack of a transaction.

It would be possible to add a transaction in the database key/value setMultiple() implementation (not in the calling code, which is supposed to be implementation agnostic) or possibly to try to use upsert instead but it wouldn't fix missing hooks.
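To illustrate the atomicity argument, here is a minimal, self-contained PDO/SQLite sketch; the table name and columns are simplified stand-ins, not core's actual key_value schema. A failure midway through a multi-row write is rolled back, so a reader can never observe the first row without the second:

```php
<?php

// Simplified stand-in for a key/value table; NOT core's real schema.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE key_value (name TEXT PRIMARY KEY, value TEXT)');
$insert = $pdo->prepare('INSERT INTO key_value (name, value) VALUES (?, ?)');

$pdo->beginTransaction();
try {
  $insert->execute(['hook_list', 'serialized-hook-data']);
  // Second write fails: a duplicate primary key simulates a mid-write error.
  $insert->execute(['hook_list', 'duplicate-key-triggers-failure']);
  $pdo->commit();
}
catch (\Exception $e) {
  // Roll back so the first row is never visible without the second.
  $pdo->rollBack();
}

$count = (int) $pdo->query('SELECT COUNT(*) FROM key_value')->fetchColumn();
echo "rows after failed write: $count\n";
```

With the transaction, the failed batch leaves zero rows; without beginTransaction()/rollBack(), the first row would persist on its own.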

cytherion’s picture

Ok, let me try to address these questions

Regarding whether this was observed in production or derived from AI code analysis:

This bug was first observed in production on Acquia Cloud Next after updating from 11.2.2 to 11.3.3. The discovery involved an extended back-and-forth with Acquia support (ticket 01234922) as we worked to figure out why hook implementations were intermittently absent after container rebuilds. The Acquia engineers confirmed that no locking exists at the platform level for container compilation and that application-level locking is Drupal's responsibility.

To observe the failure in progress, we added a register_shutdown_function() in settings.php on the affected site that sampled key_value('hook_data'), ModuleHandler::getModuleList(), and cache.bootstrap on each request and injected the results into the application logs. This generated very large raw log files. The consistent signature we observed during active failures was exactly as described in the attached analysis (detailed_technical_analysis.md): the affected module was present in getModuleList() but absent from hook_list in key_value('hook_data'). No exception was thrown and nothing was logged by Drupal itself. The production timeline in the attached document was reconstructed from those logs combined with the Acquia infrastructure scheduler logs.

On the imprecision in the issue summary:
On re-reading the issue summary in light of catch's comment #6, I acknowledge it incorrectly described hooks as being written individually per hook. The more precise claim — which is what the attached analysis describes — is that the four keys written by setMultiple() (hook_list, includes, group_includes, packed_order_operations) are written non-atomically, and a concurrent getMultiple() read can return an inconsistent set across those four keys. That inconsistent result is then promoted to cache.bootstrap without any integrity check, poisoning it indefinitely.

The issue referenced in #5 is similar on its face, that is, it has the same observable symptom. However, #3207813 seems to be an early bootstrap problem, whereas our issue happens after the modules are loaded and hook discovery has completed. This is described in the attached technical analysis.

On AI use:
Yes, AI was used extensively in our workflow on this issue — in drafting and synthesising the write-up, in helping analyse the code paths, and in structuring the findings. The production evidence, the Acquia support engagement, and the diagnostic instrumentation are all ours. We had spent weeks accumulating a very large volume of logs, Acquia communications, and diagnostic output, and used AI to help distil that into something a maintainer who hadn't lived through it might find legible. The write mechanism was described at the wrong level of granularity in the summary — that's the cost of compressing weeks of investigation into a few paragraphs. The core observation is real (not AI hallucination) and the attached analysis describes the actual mechanism we observed more precisely.

I am happy to provide sanitised excerpts from the diagnostic logs or share additional detail from the Acquia support ticket if that would help move this forward. But the summary provided in the detailed_technical_analysis.md contains pretty much what we have... distilled into a readable form.

AI was used to help me draft this response.

cytherion’s picture

One additional note:

We were unable to reproduce this locally, which seems consistent with the nature of the race condition: it requires concurrent container compilations across multiple processes. That's a standard characteristic of the Acquia Cloud Next Kubernetes environment (multiple pods, multiple FPM workers), as confirmed by Acquia, but it does not occur in a single-process local development setup. This is also why the steps to reproduce in the issue summary are difficult to follow outside of a cloud hosting environment — the conditions that trigger the race are essentially built into production infrastructure and cannot be easily simulated locally.

I understand that this complicates reproducing the issue. We were only able to fix our sites by downgrading back to 11.2.8, but this is only a temporary solution as we will need to upgrade at some point soon. I was hoping you all, having far more experience with Drupal core than I, would have some suggestions.

AI was used in drafting this comment

berdir’s picture

I don't doubt you're seeing real issues, but the .md file also has sentences like "The full write sequence for a typical site with 20+ hooks involves 20+ individual SQL statements executed sequentially.", which is objectively false. There are always exactly 4 keys on D11 and 2 on D12. And it says further down "the last modules written are the most likely to be missing", which also makes no sense. It's a single key, and the structure is per hook not per module.

> hook_list, includes, group_includes, packed_order_operations) are written non-atomically, and a concurrent getMultiple() read can return an inconsistent set across those four keys

I get that in theory, but inconsistent how? All data is derived from the data in the files. All concurrent requests should come to exactly the same conclusion and write the same data. There's a stampede of container rebuilds and room for improvement, but I'm assuming that a single record written can not be partially there.

The explained behavior mentions specific hooks missing, which doesn't match inconsistencies between those 4 keys. At best that could mean that a hook is present in an include file, but the include info is not there, so it fails to execute the hook. But it knows about it. But that doesn't seem to be what you are seeing.

berdir’s picture

What I'd try is adding logging to HookCollectorKeyValueWritePass. Just before writing the data, check whether you are missing an expected hook (is it always the same, or does that vary?). If you detect a problem, log the context, backtrace and so on, and see if you can spot anything unusual. Maybe even log the file content/hash of that file at that point. Could be a weird infrastructure issue where the code doesn't exist in some cases, or something like that?
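A pure-PHP sketch of that pre-write check; the function name is illustrative, not core code, and the hook_list shape mirrors the hook name => [implementation => module] structure discussed in this issue:

```php
<?php

/**
 * Returns the hooks for which $module has no implementation in $hook_list.
 *
 * $hook_list shape: hook name => [implementation callable => providing module].
 * Illustrative helper, not core code.
 */
function find_missing_hook_implementations(string $module, array $expected_hooks, array $hook_list): array {
  $missing = [];
  foreach ($expected_hooks as $hook) {
    $modules = $hook_list[$hook] ?? [];
    if (!in_array($module, $modules, TRUE)) {
      $missing[] = $hook;
    }
  }
  return $missing;
}

// Example mirroring the corrupted state reported later in this thread.
$hook_list = [
  'entity_insert' => [
    'pathauto_entity_insert' => 'pathauto',
    'focal_point_entity_insert' => 'focal_point',
  ],
  'entity_update' => [
    'pathauto_entity_update' => 'pathauto',
  ],
];
$missing = find_missing_hook_implementations(
  'arhuserv_content_push',
  ['entity_insert', 'entity_update', 'entity_delete'],
  $hook_list,
);
print_r($missing);
```

Calling something like this just before the key_value write (and logging a backtrace when it returns a non-empty array) would pin down whether the data was already wrong at write time.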

Also, according to your description, you downgraded and did not verify whether adding the transaction wrapper would resolve this, so that's actually just an assumption then?

cytherion’s picture

#10 Yes, the transaction wrapper was actually going to be the next thing to try (sorry, I should have mentioned that), but, to make a long story short, I was out of time and needed a solution. I'm going to circle back to this issue soon because we will have to update (still running 11.2.8), and I will try your suggestion (adding logging to HookCollectorKeyValueWritePass).

#9 regarding "The full write sequence for a typical site with 20+ hooks involves 20+ individual SQL statements executed sequentially."

Let me explain (it made sense to me when I wrote it): to get the problem to manifest on the hosting platform, we would need to get the hook to fire 20+ times by triggering as many saves (some concurrent). Each save has multiple SQL statements associated with it.

#9 regarding, "the last modules written are the most likely to be missing"

This is what we were seeing: the last two modules (the ones that are critical to our application) were missing when the corruption happened. Truncating the cache_container table restores function and the modules return. FYI: our application is a Drupal multisite in which one of the sites is the content publishing hub and RESTfully publishes content to the other 20+ sites. The custom modules that went missing are the ones responsible for that functionality.

#9 "I'm assuming that a single record written can not be partially there..." Ok, I'll consult the longer analysis document and the logs and post the relevant information... tomorrow (it's well after COB here).

Thanks for your responses.

nicxvan’s picture

Thank you for disclosing the use of AI.

There is another issue about modules not being picked up in certain situations for multisites, but that doesn't quite sound like the actual cause either.

Here are some other questions for diagnosis.
Are you tracking logs for each site, or just some of the multisites?
Are the modules with the issue in sites/all or in individual site folders or somewhere else?
Hook logs, as you mentioned, are very, very verbose. Are you sure the sites missing hooks are the ones you are logging from that show the modules loaded? Is this part of your analysis something you're relying on AI for, or did you manually review the output and confirm?
You say the hooks not running are last, are they in .module files, or .inc files?
Are they ordered last due to alphabetical ordering, module weight or are you using hook_module_implements_alter or order parameters to reorder?

What hooks are they?
Are they core hooks, contrib, or custom, are they theme layer or something else?

key_value and cache.bootstrap are persistent; can you provide a sanitized dump of each cache entry from one of the poisoned requests?

berdir’s picture

Are the affected modules only installed on some domains? If so, that suggests a conflict related to that somehow.

catch’s picture

As above I don't see how it can be the cause of the bug you're experiencing, but I've opened #3588385: Add a transaction around the database key value store setMultiple(), and it should be straightforward to get an MR up there. Once there is one, it would be good if you could apply it on production to rule this in or out.

Are the hooks that are going missing procedural or OOP hooks?

Also if they're procedural, are the hooks that are going missing in include files / in hook_hook_info().

Also a follow-up question on #7:

> The more precise claim — which is what the attached analysis describes — is that the four keys written by setMultiple() (hook_list, includes, group_includes, packed_order_operations) are written non-atomically, and a concurrent getMultiple() read can return an inconsistent set across those four keys.

This is different from what you've described elsewhere in the issue and in the detailed report .md file, which reads more like this:

> the affected module was present in getModuleList() but absent from hook_list in key_value('hook_data').

```php
// Check if a module's hooks are missing from hook_data
$hooks = \Drupal::keyValue('hook_data')->get('hook_list');
foreach (['entity_insert', 'entity_update'] as $hook) {
  $found = array_filter($hooks[$hook] ?? [], fn($m) => $m === 'your_module');
  echo $hook . ': ' . (empty($found) ? 'MISSING' : 'PRESENT') . PHP_EOL;
}

// Confirm the module is still installed and loaded
$list = \Drupal::moduleHandler()->getModuleList();
echo isset($list['your_module']) ? 'Module PRESENT in getModuleList()' : 'Module MISSING';

// Check bootstrap cache age (how long ago the poisoned data was cached)
$c = \Drupal::cache('bootstrap')->get('hook_data');
echo $c ? 'hook_data cached at: ' . date('Y-m-d H:i:s', (int)$c->created) : 'NOT IN CACHE';
```

The signature of this bug: module is **PRESENT** in `getModuleList()` but **ABSENT** from `hook_list` in `key_value('hook_data')`. The module is installed and loaded, but its hooks are not registered — they were lost during a non-atomic write.

What neither of these describes is the module being present in one or more of the four keys in hook_data and absent from one or more of the others.

hook_list, includes, group_includes, packed_order_operations

If the database key value (lack of) transaction is the issue, what I'd expect to see is that some of these entries are set, and some of these entries are entirely missing - e.g. not in the cache object at all.

If it is in fact the case that those four keys in hook_data are inconsistent with each other, either some are empty or somehow have different data, then a dump of the cache item would probably help.

catch’s picture

Given you mentioned multisite, I wonder whether the issue might be #2985199: Extensions in Multisite Directories Not Registered When Rebuilding Cache

cytherion’s picture

Attached file: diagnostic-implementation.md (12.2 KB)

Thank you all for taking the time. I’ll address the questions from #12, #13, and #14 with specifics and raw log evidence. I'm doing my best not to bombard you with too much unnecessary detail. Please let me know if you need more.

The attached document (diagnostic-implementation.md) covers how/where the diagnostic code was implemented and includes log entries of its output. It is included for reference if desired.

Answers to Nicxvan’s questions (#12)

Are you tracking logs for each site, or just some of the multisites?

Only arhuserv.umd.edu (our content publishing hub). It is one site in an approximately 23-site multisite setup.

The diagnostic code is in arhuserv’s settings.php (each multisite site has its own settings.php, following Drupal’s sites/{domain}/settings.php structure).

There is no sites/all folder; modules common to all sites are in docroot/modules/custom.

Only arhuserv has the custom modules being discussed (arhuserv_content_push and arhuserv_hook_post_action), and they are in sites/arhuserv.umd.edu/modules; the other 23+ sites don’t have the modules in question.

Are the modules with the issue in sites/all or in individual site folders or somewhere else?

The are in the individual site's folder.

Are you sure the sites missing hooks are the ones you are logging from?

Yes — the diagnostic code logs the site URI in every entry, and the corruption detection checks hook_list for the specific implementations belonging to arhuserv_content_push and arhuserv_hook_post_action. Every corruption entry is from https://arhuserv.umd.edu requests.

For example, here is a sample log entry from the corruption detection code (from production, during an active failure):

Apr  6 17:55:12 drupal-75675c8-7rpr9 umdarhuweb: https://arhuserv.umd.edu|1775498112|arhuserv_5619|129.2.x.x|https://arhuserv.umd.edu/batch?_format=json&id=410571&op=do|https://arhuserv.umd.edu/batch?id=410571&op=start|5||HOOK_DATA CORRUPTION | Instance: drupal-75675c8-7rpr9 | Missing: entity_insert, entity_update, entity_delete | URI: /batch?id=410571&op=do_nojs&op=do&_format=json | Method: POST | Container age: -8 s | hash: 6d39306d | entity_insert(5): admin_toolbar_tools, comment, editor, focal_point, pathauto | entity_update(4): admin_toolbar_tools, editor, focal_point, pathauto | entity_delete(5): admin_toolbar_tools, editor, form_mode_control, pathauto, smart_date_recur request_id="v-ba9fefbe-31e1-11f1-8429-5f536dced8cd"

Are the hooks in .module files or .inc files?

.module files. No .inc files, no hook_hook_info().

  • arhuserv_content_push.module contains: arhuserv_content_push_entity_insert(), arhuserv_content_push_entity_update(), arhuserv_content_push_entity_delete(), arhuserv_content_push_media_presave()
  • arhuserv_hook_post_action.module contains: arhuserv_hook_post_action_entity_insert(), arhuserv_hook_post_action_entity_update(), arhuserv_hook_post_action_entity_delete()

Are they ordered last due to alphabetical ordering, module weight, or hook_module_implements_alter?

Module weight. arhuserv_content_push has weight 100 (set via the modules_weight contrib module, which writes to core.extension). arhuserv_hook_post_action has the default weight 0. Neither of our modules uses hook_module_implements_alter or #[Hook] order parameters.

What hooks are they? Core/contrib/custom?

arhuserv_content_push implements core hooks: entity_insert, entity_update, entity_delete, media_presave. arhuserv_hook_post_action implements custom hooks. All procedural implementations in .module files.

Sanitized dump of key_value and cache.bootstrap from a poisoned request:

I don’t have a full serialized dump of the hook_data cache object — the diagnostic code logged the contents of hook_list rather than the raw serialized blob. Here is what the diagnostic captured from key_value('hook_data')->get('hook_list') during an active failure (drush session on production, Mar 30, before any cache clear):

# Poisoned state — drush eval on production during active failure (Mar 30 2026)
# key_value('hook_data')->get('hook_list')['entity_insert']:

Array
(
    [pathauto_entity_insert] => pathauto
    [admin_toolbar_tools_entity_insert] => admin_toolbar_tools
    [Drupal\comment\Hook\CommentHooks::entityInsert] => comment
    [Drupal\editor\Hook\EditorHooks::entityInsert] => editor
    [focal_point_entity_insert] => focal_point
)

# key_value('hook_data')->get('hook_list')['entity_update']:

Array
(
    [pathauto_entity_update] => pathauto
    [admin_toolbar_tools_entity_update] => admin_toolbar_tools
    [Drupal\editor\Hook\EditorHooks::entityUpdate] => editor
    [focal_point_entity_update] => focal_point
)

# ModuleHandler::getModuleList() at the same time:
arhuserv_content_push => PRESENT
arhuserv_hook_post_action => PRESENT

# cache.bootstrap('hook_data'):
CACHED (created: 2026-03-30T12:23:25-04:00)

This was captured via drush @prod.arhuserv eval while the system was actively failing.

Note:

  • arhuserv_content_push_entity_insert, arhuserv_content_push_entity_update, arhuserv_content_push_entity_delete — all absent
  • arhuserv_hook_post_action_entity_insert, arhuserv_hook_post_action_entity_update, arhuserv_hook_post_action_entity_delete — all absent
  • Both modules PRESENT in getModuleList()

The hook_list key itself exists and contains data — it’s not empty or missing. It just doesn’t include our two modules’ implementations.

To Berdir’s question (#13)

Are the affected modules only installed on some domains?

Yes. arhuserv_content_push and arhuserv_hook_post_action are installed ONLY on arhuserv.umd.edu — none of the other 23 sites use them. The codebase is shared across all sites on the same Kubernetes pods, but the module list (from core.extension) differs per site (each site has its own database).

Regarding your earlier suggestion about adding diagnostic code to HookCollectorKeyValueWritePass

After reviewing the documentation from the incident, I already have a patch for this (patches/aais-1005-hook-write-pass-instrumentation.patch) — it logs whether arhuserv_content_push is present in hook_list immediately BEFORE the write to key_value, and verifies the read-back immediately AFTER. It was tested locally but never deployed to production (we ran out of time and the downgrade to 11.2.8 happened). I included a description of the patch toward the end of the attached document diagnostic-implementation.md. I will deploy it when upgrading back to 11.3.x.

To Catch’s questions (#14)

Are the hooks procedural or OOP?

Procedural. Standard implementations, e.g. function arhuserv_content_push_entity_insert(EntityInterface $entity) in the .module file.

We did test an OOP conversion (#[Hook('entity_insert')] attribute on the content push module) on staging as part of this investigation. The problem still occurred with OOP hooks. That branch was not merged.

Are they in include files / hook_hook_info()?

No. No include files, no hook_hook_info(). All hooks are directly in the .module file.

Regarding the inconsistency between my descriptions:

Let me be precise about what we actually observed:

The module is absent from hook_list — the single key. Not an inconsistency between the four keys (that was speculative and should have been noted as such). Our diagnostic code read key_value('hook_data')->get('hook_list') and checked for the module’s implementations. They were absent from that single structure.

I cannot tell you whether the other three keys (includes, group_includes, packed_order_operations) were also missing references to our modules, because the diagnostic code only checked hook_list. This is a gap in our diagnostics that I will address when I circle back to the Drupal update.
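A sketch of a check covering all four keys, for when I re-instrument (pure PHP, illustrative only; the serialize()-based containment test is a crude stand-in for walking the real structures):

```php
<?php

/**
 * Reports, for each hook_data key, whether the key exists and whether
 * its value references a given module name anywhere. Illustrative only.
 */
function audit_hook_data(array $hook_data, string $module): array {
  $keys = ['hook_list', 'includes', 'group_includes', 'packed_order_operations'];
  $report = [];
  foreach ($keys as $key) {
    if (!array_key_exists($key, $hook_data)) {
      $report[$key] = 'KEY MISSING';
      continue;
    }
    // Crude containment test: does the serialized value mention the module?
    $report[$key] = str_contains(serialize($hook_data[$key]), $module)
      ? 'references module'
      : 'NO reference to module';
  }
  return $report;
}

// Example: hook_list present but missing the module, includes key absent.
$report = audit_hook_data(
  [
    'hook_list' => ['entity_insert' => ['pathauto_entity_insert' => 'pathauto']],
    'group_includes' => [],
    'packed_order_operations' => [],
  ],
  'arhuserv_content_push',
);
print_r($report);
```

If the four keys really do go inconsistent, this kind of per-key report (fed from key_value('hook_data')->getMultiple() on a live request) should show the mismatch directly.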

What the log evidence actually shows:
arhuserv_content_push and arhuserv_hook_post_action are our custom modules.

Healthy state (normal operation):

entity_insert(7): admin_toolbar_tools, arhuserv_content_push, arhuserv_hook_post_action, comment, editor, focal_point, pathauto
entity_update(6): admin_toolbar_tools, arhuserv_content_push, arhuserv_hook_post_action, editor, focal_point, pathauto
entity_delete(7): admin_toolbar_tools, arhuserv_content_push, arhuserv_hook_post_action, editor, form_mode_control, pathauto, smart_date_recur

Corrupted state (every single corruption event shows the same signature):

entity_insert(5): admin_toolbar_tools, comment, editor, focal_point, pathauto
entity_update(4): admin_toolbar_tools, editor, focal_point, pathauto
entity_delete(5): admin_toolbar_tools, editor, form_mode_control, pathauto, smart_date_recur

The difference is exactly two modules removed: arhuserv_content_push and arhuserv_hook_post_action. Always both, never one without the other, and we didn’t observe any other module affected.

Regarding #3588385 (transaction MR): I will test this when I upgrade back to 11.3.x.


Catch: I'm looking into #2985199 (#15). Thanks for the heads up.


There was some use of AI in drafting this response.

nicxvan’s picture

Can you provide a list of installed modules please as an attachment?

HookCollectorPass uses the container parameter $container->getParameter('container.modules') to get the modules. If that got out of sync, that would be the only way I could see a module being available but not getting scanned. Do any of your modules affect that parameter?
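That comparison could be sketched as plain array math (illustrative helper; obtaining the two inputs on a live site, e.g. via drush php:eval, is left out):

```php
<?php

/**
 * Diffs two module lists keyed by machine name, e.g. the container's
 * 'container.modules' parameter vs. ModuleHandler::getModuleList().
 * Pure-PHP sketch; not core code.
 */
function diff_module_lists(array $container_modules, array $loaded_modules): array {
  return [
    'in_container_only' => array_values(array_diff(array_keys($container_modules), array_keys($loaded_modules))),
    'loaded_only' => array_values(array_diff(array_keys($loaded_modules), array_keys($container_modules))),
  ];
}

// Example: a module loaded by the module handler but absent from the
// container parameter would show up under 'loaded_only'.
$diff = diff_module_lists(
  ['node' => [], 'pathauto' => []],
  ['node' => [], 'pathauto' => [], 'arhuserv_content_push' => []],
);
print_r($diff);
```

A non-empty 'loaded_only' on a poisoned request would point at the container parameter being stale rather than the key_value write.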

Are you using any hooks like hook_system_info_alter or hook_requirements to ensure those modules are installed or the weight is correct?

cytherion’s picture

Attached file: 3587958_Local-11.3.3-installed-modules.txt (20.54 KB)

To Nicxvan's questions (#17)

Can you provide a list of installed modules please as an attachment?

Yes, file is attached (3587958_Local-11.3.3-installed-modules.txt)

...Do any of your modules affect that parameter?

No. None of the custom modules contain:

  • A ServiceProvider class (no *ServiceProvider.php files exist)
  • A CompilerPass implementation
  • Any use of ServiceModifierInterface
  • Any call to setParameter() or reference to container.modules
  • Any ContainerBuilder usage

Are any of your custom modules using hook_system_info_alter, hook_requirements, or weight-setting hooks?

No, the two custom modules in question (arhuserv_content_push and arhuserv_hook_post_action) do not use hook_requirements.

However, when considering ALL of the custom modules installed on the site, yes there are two:

scheduler — (scheduler.install) reports server time info at runtime (status report only, no module installation checks). This is a fork of the scheduler contrib module.

media_entity_flickr — (media_entity_flickr.install) checks that the icon directory is writable at install time. This is a fork of the abandoned contrib module of the same name.

There was some use of AI in researching this response.

catch’s picture

A way to rule in or out #2985199: Extensions in Multisite Directories Not Registered When Rebuilding Cache would be to move the modules in the site-specific directory into modules/custom so that they're discovered on every site; they'll show up on the extensions page but otherwise do nothing, since they won't be installed. You might need to manually tweak $settings['deployment_identifier'] so that the module move gets picked up immediately.