Closed (fixed)
Project:
Drupal core
Version:
10.3.x-dev
Component:
phpunit
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Issue tags:
Reporter:
Created:
28 Sep 2024 at 17:48 UTC
Updated:
12 Feb 2025 at 21:09 UTC
Jump to comment: Most recent
Comments
Comment #3
spokjeSo MR!9676 (https://git.drupalcode.org/issue/drupal-3477586/-/jobs/2891429) shows that this test fails ~26/1500 runs.
Comment #4
spokjeAt first glance, I can think of two ways the placeholder is not replaced:
$block_contentin\Drupal::state()goes wrong/causes a race condition\Drupal\layout_builder\EventSubscriber\BlockComponentRenderArray::onBuildRenderdoes something unexpected and we end up in theif ($is_content_empty && $is_placeholder_ready)where we set the placeholder again, with similar content as the original one.Let's assert both.
Comment #5
spokjeSo looking at all the test failures being:
I think we can safely say it's #2.
Comment #6
spokje$content = $block->build()in\Drupal\layout_builder\EventSubscriber\BlockComponentRenderArray::onBuildRenderends up in\Drupal\layout_builder\Plugin\Block\ExtraFieldBlock::build.The only way I can see that returning empty would be if we land in this if-branch:
if (!isset($extra_fields['display'][$this->fieldName])).Which in turn would mean
\Drupal\Core\Entity\EntityFieldManager::getExtraFieldswould return an array without$extra_fields['display'][$this->fieldName], which would mean in\Drupal\Core\Entity\EntityFieldManager::getExtraFields,$this->extraFields[$entity_type_id][$bundle]would not be set.Comment #7
spokjeThis is where it ends for me, I really have no idea what's happening to cause the random failure.
Comment #8
spokjeOk, so there seems to be, what looks like, a delay for setting a value in State.
A lot of the currently randomly failing tests are using State to trigger some kind of behaviour change in the middle of a test.
Which, if the above statement is true, would make them also randomly fail, which seems to happen.
Time for some nitty-gritty proof why I think this is this case, dug up from the killing fields of test runs I had on this particular issue:
The testruns were all conducted on an isolated
\Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholderwhich was stripped down to the bare minimum.Testruns consisted of running
\Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholderand only that test at least 250x, most of the times 1500x.- Suspected
$this->getSession()->reload();not to work correctly, so if the random failure happened, tried to reload again: https://git.drupalcode.org/project/drupal/-/commit/71cbbeb1a96caf62d23ed.... Random still happened: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3807484- Replaced
$this->getSession()->reload();with adrupalGeton the current page: https://git.drupalcode.org/project/drupal/-/commit/a33805b31010aade71155.... Random still happened: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3807601- If random happened, retry
: https://git.drupalcode.org/project/drupal/-/commit/3b2b0bc960f62dcccc4c9.... That worked: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3817118 1500x without failure.
Since I've ruled out
$this->getSession()->reload();in the first two attempts mentioned here, the problem seems to lay in the State somewhere.- Tried
assertEqualson the state inLayoutBuilderBlocksTest: https://git.drupalcode.org/project/drupal/-/commit/5575da079ce25f4e95546.... Random still happened, but never on theassertEquals: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3817198.So the state is set correct in the test itself.
- Tried
assertEqualson the state inBlockComponentRenderArray.php: https://git.drupalcode.org/project/drupal/-/commit/889df588a31c494f88c51.... Randoms were gone.My theory here is that the (very) small delay caused by the added lines in
BlockComponentRenderArray.php, give the State enough time to set the value fully.At this point I'm in need of Big Brains that can explain why there could be a delay in setting the State, or just straight up tell me the above conclusions are rubbish.
I've got the feeling that somehow the use of the
CacheCollectorinState, introduced in #2575105: Use cache collector for state has something to do with it, but as said, that's just a feeling.Comment #9
spokjeComment #10
godotislateLooking through comments on #2575105: Use cache collector for state, I think one thing to explore would be to replace all the uses of state with keyvalue instead, when testing block content, such as all the lines that look like
\Drupal::state()->set('block_test.content', ...\Drupal::state()->get('block_test.content')and see if random failures for LayoutBuilderBlocksTest go away.
These lines could also be swapped for keyvalue:
\Drupal::state()->set('block_test.attributes', ...\Drupal::state()->get('block_test.attributes')Comment #11
godotislate#157 from #2575105: Use cache collector for state might be relevant here:
Comment #13
godotislatePushed up draft MR 10689 with just the state -> keyvalue changes mentioned in #10. Only can report that the the build did not fail.
Comment #14
spokjeThanks @godotislate.
There's a Slack thread about this here: https://drupal.slack.com/archives/C079NQPQUEN/p1735214972315009
Comment #18
spokjeMR !10692 (which is basically
\Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTesttrimmed down to only containtestBlockPlaceholder()and some pipeline changes to make it, and only it, run 2500 times) gives us 69 fails out of the 2500 runs.Comment #21
berdirWe *did* have random fails in that issue originally as well. But then it's been quiet for quite some time until they got a lot more frequent. Maybe in combination with the wait for terminate stuff and making something else more efficient?
I'm fairly certain that the actual key value write isn't delayed. We also immediately do a cache invalidation. The only thing that's delayed is updating the combined cache. And we do a lot of checks to prevent conflicts and incorrect caches there but apparently something seems to go wrong. It relies a lot on cache write time, but that's in microseconds, so that's very unlikely to happen at the same time.
One thing that however doesn't rely on microseconds and just seconds is cache expiration. \Drupal\Core\Cache\DatabaseBackend::invalidateAll() sets the expiration to request time - 1. I wonder if there's something weird going on there that cache reads in the test have a request time that's still ahead of that? What happens if we set that to -10 instead? Or maybe even just a fixed 1?
The thing is that CacheCollector is basically the _only_ thing in core that uses the invalidate API.
Comment #22
spokjeMR !10694 shows that changing
\Drupal::state()->set()/get()toDrupal::keyValue()->set()/get();as suggested in the Slack thread passes a 2500 times run without errors.Since this involves making changes in
core/modules/block/tests/modules/block_test/src/Plugin/Block/TestCacheBlock.php, this means more tests are affected.@godotislate was already miles ahead of me and created MR !10689, which is the MR to be committed.
Comment #25
spokje@berdir:
\Drupal\Core\Cache\DatabaseBackend::invalidateAll()set the expiration to request time - 10 => MR!1069553 failures out of 2500 runs.\Drupal\Core\Cache\DatabaseBackend::invalidateAll()set the expiration to 1 => MR!1069682 failures out of 2500 runs.So roughly the same as the unchanged 2500 run.
I'm going to set this to NR for #22.
Comment #30
godotislatePer discussion on Slack, we can go forward with the state to keyvalue swap for the relevant block tests here. There should be a follow up though to investigate further about what is the cause of the intermittent failures state failures, if only to establish that it's a test-only condition.
The other random test failure issues linked in the meta #2829040: [meta] Known intermittent, random, and environment-specific test failures should be looked at as well to see whether there are usages of state that can be replaced with key value instead.
Comment #33
dwwExcellent work here!
After checking out the issue fork branch, there are still a ton of references to
\Drupal::state()incore/modules/block/tests/src/*. Do we want to fix everything? I'm not sure why we're converting some tokeyValueand leaving others asstate.Comment #34
godotislatePer conversation with catch on Slack, the overall goal is to change everything in tests over to key value, except for tests specifically testing state, because writing to key value will be cheaper than writing to state.
The references changed in the MR were ones specific to
LayoutBuilderBlocksTestand everything that shared those references. I think it's just a question of how to scope changing over all the state usage across all the tests.Comment #35
catchI think I found the root cause for the random failures in #3496257: Race conditions in CacheCollector/State (again) (an actual race condition in
State) but we as above we should go ahead here too.Comment #36
spokjeComment #38
nicxvan commentedI have been following along most of these issues and read through / followed the slack conversations as well as the root cause.
I know one of the issues requested expanding scope to convert more but it was discussed that each issue should tackle one failing test since it takes over 1000 test runs to confirm.
I have confirmed that the only changes are going from state to keyValue.
Further, while this changes more than just one test it is because the layout test relies on block as well so those can affect it.
Looks good to me and it would be great to put one of the most common test failures to rest.
If we do still want to expand scope and convert more we can, but we will get less confirmation since we'll be limited to the number of test runs we can do.
Comment #39
catchThis looks good but I'm relying on the random test failure to prove that #3496257: Race conditions in CacheCollector/State (again) fixes it, so it would be useful to not commit this for 24-48 hours just to get an extra few test runs on that issue.
Comment #40
spokjeI can see the usefulness, but I also see numerous MRs failing on this (and others).
We _could_ commit this and in your test-MR reverse the changes made here, which leaves you with exactly the same failing MR as before?
Comment #47
catch#40 is a good point, and hopefully got enough data on the other issue now anyway.
Committed/pushed to 11.x and cherry-picked back through to 10.3.x, thanks!
Comment #50
quietone commentedFollowup asked for in #30 has been made, #3506148: [meta] Replace all uses of state in tests with direct key value