Problem/Motivation

    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholder
    Behat\Mink\Exception\ResponseTextException: The text "Placeholder for the
    "The block label" block" appears in the text of this page, but it should
    not.
I just ran into this and GitLab CI notes that this test "Failed 15 times in 11.x in the last 14 days" so this one probably needs an issue to investigate.

@longwave and @quietone in 2829040-202 and 203.

Steps to reproduce

Proposed resolution

Replace use of:

  • \Drupal::state()->set('block_test.content', ...
  • \Drupal::state()->get('block_test.content')
  • \Drupal::state()->set('block_test.attributes', ...
  • \Drupal::state()->get('block_test.attributes')

with

  • \Drupal::keyvalue('block_test')->set('content', ...
  • \Drupal::keyvalue('block_test')->get('content')
  • \Drupal::keyvalue('block_test')->set('attributes', ...
  • \Drupal::keyvalue('block_test')->get('attributes')

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

Issue fork drupal-3477586

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

spokje created an issue. See original summary.

spokje’s picture

So MR!9676 (https://git.drupalcode.org/issue/drupal-3477586/-/jobs/2891429) shows that this test fails ~26/1500 runs.

spokje’s picture

At first glance, I can think of two ways the placeholder is not replaced:

  1. Setting the $block_content in \Drupal::state() goes wrong/causes a race condition
  2. The "magic" in \Drupal\layout_builder\EventSubscriber\BlockComponentRenderArray::onBuildRender does something unexpected and we end up in the if ($is_content_empty && $is_placeholder_ready) where we set the placeholder again, with similar content as the original one.

Let's assert both.

spokje’s picture

So looking at all the test failures being:

Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholder
    Behat\Mink\Exception\ResponseTextException: The text "is_content_empty &&
    is_placeholder_ready Placeholder for the "The block label" block" appears
    in the text of this page, but it should not.

I think we can safely say it's #2.

spokje’s picture

$content = $block->build() in \Drupal\layout_builder\EventSubscriber\BlockComponentRenderArray::onBuildRender ends up in \Drupal\layout_builder\Plugin\Block\ExtraFieldBlock::build.

The only way I can see that returning empty would be if we land in this if-branch:
if (!isset($extra_fields['display'][$this->fieldName])).

Which in turn would mean \Drupal\Core\Entity\EntityFieldManager::getExtraFields would return an array without $extra_fields['display'][$this->fieldName], which would mean in \Drupal\Core\Entity\EntityFieldManager::getExtraFields, $this->extraFields[$entity_type_id][$bundle] would not be set.

spokje’s picture

Assigned: spokje » Unassigned

This is where it ends for me, I really have no idea what's happening to cause the random failure.

spokje’s picture

Assigned: Unassigned » spokje
Status: Active » Needs work

Ok, so there seems to be, what looks like, a delay for setting a value in State.

A lot of the currently randomly failing tests are using State to trigger some kind of behaviour change in the middle of a test.
Which, if the above statement is true, would make them also randomly fail, which seems to happen.

Time for some nitty-gritty proof why I think this is this case, dug up from the killing fields of test runs I had on this particular issue:

The testruns were all conducted on an isolated \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholder which was stripped down to the bare minimum.
Testruns consisted of running \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest::testBlockPlaceholder and only that test at least 250x, most of the times 1500x.
- Suspected $this->getSession()->reload(); not to work correctly, so if the random failure happened, tried to reload again: https://git.drupalcode.org/project/drupal/-/commit/71cbbeb1a96caf62d23ed.... Random still happened: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3807484
- Replaced $this->getSession()->reload(); with a drupalGet on the current page: https://git.drupalcode.org/project/drupal/-/commit/a33805b31010aade71155.... Random still happened: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3807601
- If random happened, retry

\Drupal::state()->set('block_test.content', $block_content); $this->getSession()->reload();

: https://git.drupalcode.org/project/drupal/-/commit/3b2b0bc960f62dcccc4c9.... That worked: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3817118 1500x without failure.
Since I've ruled out $this->getSession()->reload(); in the first two attempts mentioned here, the problem seems to lay in the State somewhere.
- Tried assertEquals on the state in LayoutBuilderBlocksTest: https://git.drupalcode.org/project/drupal/-/commit/5575da079ce25f4e95546.... Random still happened, but never on the assertEquals: https://git.drupalcode.org/issue/drupal-3477586/-/jobs/3817198.
So the state is set correct in the test itself.
- Tried assertEquals on the state in BlockComponentRenderArray.php: https://git.drupalcode.org/project/drupal/-/commit/889df588a31c494f88c51.... Randoms were gone.

My theory here is that the (very) small delay caused by the added lines in BlockComponentRenderArray.php, give the State enough time to set the value fully.

At this point I'm in need of Big Brains that can explain why there could be a delay in setting the State, or just straight up tell me the above conclusions are rubbish.

I've got the feeling that somehow the use of the CacheCollector inState, introduced in #2575105: Use cache collector for state has something to do with it, but as said, that's just a feeling.

spokje’s picture

Assigned: spokje » Unassigned
godotislate’s picture

Looking through comments on #2575105: Use cache collector for state, I think one thing to explore would be to replace all the uses of state with keyvalue instead, when testing block content, such as all the lines that look like

  • \Drupal::state()->set('block_test.content', ...
  • \Drupal::state()->get('block_test.content')

and see if random failures for LayoutBuilderBlocksTest go away.

These lines could also be swapped for keyvalue:

  • \Drupal::state()->set('block_test.attributes', ...
  • \Drupal::state()->get('block_test.attributes')
godotislate’s picture

#157 from #2575105: Use cache collector for state might be relevant here:

however we're seeing other test failures in runs which aren't consistent. I think the issue is that any test using state to track.. well.. state between requests, now needs to either switch to WaitTerminateTestTrait or to use the key value service directly without the state wrapper - because the cache will be updated in terminate.

godotislate’s picture

Pushed up draft MR 10689 with just the state -> keyvalue changes mentioned in #10. Only can report that the the build did not fail.

spokje’s picture

Thanks @godotislate.

There's a Slack thread about this here: https://drupal.slack.com/archives/C079NQPQUEN/p1735214972315009

spokje changed the visibility of the branch 11.x to hidden.

spokje changed the visibility of the branch 3477586-random-test-failure to hidden.

spokje’s picture

MR !10692 (which is basically \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTest trimmed down to only contain testBlockPlaceholder() and some pipeline changes to make it, and only it, run 2500 times) gives us 69 fails out of the 2500 runs.

berdir’s picture

We *did* have random fails in that issue originally as well. But then it's been quiet for quite some time until they got a lot more frequent. Maybe in combination with the wait for terminate stuff and making something else more efficient?

I'm fairly certain that the actual key value write isn't delayed. We also immediately do a cache invalidation. The only thing that's delayed is updating the combined cache. And we do a lot of checks to prevent conflicts and incorrect caches there but apparently something seems to go wrong. It relies a lot on cache write time, but that's in microseconds, so that's very unlikely to happen at the same time.

One thing that however doesn't rely on microseconds and just seconds is cache expiration. \Drupal\Core\Cache\DatabaseBackend::invalidateAll() sets the expiration to request time - 1. I wonder if there's something weird going on there that cache reads in the test have a request time that's still ahead of that? What happens if we set that to -10 instead? Or maybe even just a fixed 1?

The thing is that CacheCollector is basically the _only_ thing in core that uses the invalidate API.

spokje’s picture

MR !10694 shows that changing \Drupal::state()->set()/get() to Drupal::keyValue()->set()/get(); as suggested in the Slack thread passes a 2500 times run without errors.

Since this involves making changes in core/modules/block/tests/modules/block_test/src/Plugin/Block/TestCacheBlock.php, this means more tests are affected.

@godotislate was already miles ahead of me and created MR !10689, which is the MR to be committed.

spokje’s picture

Status: Needs work » Needs review

@berdir:

\Drupal\Core\Cache\DatabaseBackend::invalidateAll() set the expiration to request time - 10 => MR!1069553 failures out of 2500 runs.

\Drupal\Core\Cache\DatabaseBackend::invalidateAll() set the expiration to 1 => MR!1069682 failures out of 2500 runs.

So roughly the same as the unchanged 2500 run.

I'm going to set this to NR for #22.

godotislate changed the visibility of the branch 3477586-2500x-LayoutBuilderBlocksTest-testBlockPlaceholder-replaced to hidden.

godotislate changed the visibility of the branch 3477586-2500x-LayoutBuilderBlocksTest-testBlockPlaceholder-as-is to hidden.

godotislate’s picture

Issue summary: View changes
Issue tags: +Needs followup

Per discussion on Slack, we can go forward with the state to keyvalue swap for the relevant block tests here. There should be a follow up though to investigate further about what is the cause of the intermittent failures state failures, if only to establish that it's a test-only condition.

The other random test failure issues linked in the meta #2829040: [meta] Known intermittent, random, and environment-specific test failures should be looked at as well to see whether there are usages of state that can be replaced with key value instead.

catch made their first commit to this issue’s fork.

dww’s picture

Issue tags: +Bug Smash Initiative

Excellent work here!

After checking out the issue fork branch, there are still a ton of references to \Drupal::state() in core/modules/block/tests/src/*. Do we want to fix everything? I'm not sure why we're converting some to keyValue and leaving others as state.

godotislate’s picture

Per conversation with catch on Slack, the overall goal is to change everything in tests over to key value, except for tests specifically testing state, because writing to key value will be cheaper than writing to state.

The references changed in the MR were ones specific to LayoutBuilderBlocksTest and everything that shared those references. I think it's just a question of how to scope changing over all the state usage across all the tests.

catch’s picture

I think I found the root cause for the random failures in #3496257: Race conditions in CacheCollector/State (again) (an actual race condition in State) but we as above we should go ahead here too.

spokje’s picture

spokje changed the visibility of the branch 3477586-2500x-LayoutBuilderBlocksTest-testBlockPlaceholder-invalidate-cache-minus-10 to hidden.

nicxvan’s picture

Status: Needs review » Reviewed & tested by the community

I have been following along most of these issues and read through / followed the slack conversations as well as the root cause.

I know one of the issues requested expanding scope to convert more but it was discussed that each issue should tackle one failing test since it takes over 1000 test runs to confirm.

I have confirmed that the only changes are going from state to keyValue.
Further, while this changes more than just one test it is because the layout test relies on block as well so those can affect it.

Looks good to me and it would be great to put one of the most common test failures to rest.

If we do still want to expand scope and convert more we can, but we will get less confirmation since we'll be limited to the number of test runs we can do.

catch’s picture

This looks good but I'm relying on the random test failure to prove that #3496257: Race conditions in CacheCollector/State (again) fixes it, so it would be useful to not commit this for 24-48 hours just to get an extra few test runs on that issue.

spokje’s picture

so it would be useful to not commit this for 24-48 hours just to get an extra few test runs on that issue.

I can see the usefulness, but I also see numerous MRs failing on this (and others).

We _could_ commit this and in your test-MR reverse the changes made here, which leaves you with exactly the same failing MR as before?

  • catch committed c3ba4cff on 10.4.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...

  • catch committed 73a1a827 on 10.5.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...

  • catch committed 967198ce on 11.0.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...

  • catch committed 29972a4e on 11.1.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...

  • catch committed 3df2a166 on 11.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...

  • catch committed 80be5979 on 10.3.x
    Issue #3477586 by spokje, catch, godotislate, berdir, dww, nicxvan: [...
catch’s picture

Version: 11.x-dev » 10.3.x-dev
Status: Reviewed & tested by the community » Fixed

#40 is a good point, and hopefully got enough data on the other issue now anyway.

Committed/pushed to 11.x and cherry-picked back through to 10.3.x, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

quietone’s picture