Problem/Motivation

Originally reported by @xjm on May 14, who pointed me to https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54877/... and asked to investigate.

  1. The 9.5.x-dev test with PHP 7.3 & MariaDB 10.3.22 has "build successful" builds which, if you open them, say "no failures" (see https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54779/), but if you look at the console output (https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54779/... and https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54873/... for example), you'll see that *some* tests are timing out.
  2. Note that as 9.5.x-dev test with PHP 7.3 & MariaDB 10.3.22 shows, there *have* been successful builds since this ticket was created: on May 16 and 18 the builds worked just fine.
  3. Also note that https://www.drupal.org/pift-ci-job/2386056 has only existed since April 30. This suggests that this is a new DrupalCI image.

Steps to reproduce

N/A — only happens on DrupalCI, impossible to reproduce locally.

Proposed resolution

TBD

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

Release notes snippet

TBD

Comment  File                   Size        Author
#40      3281863-alt-40.patch   1.58 KB     alexpott
#36      3281863-alt-36.patch   4.6 KB      alexpott
#34      3281863-alt-33.patch   3.98 KB     alexpott
#30      3281863-alt-30.patch   1012 bytes  alexpott
#29      3281863-alt-29.patch   3.23 KB     alexpott
#24      3281863-alt-22.patch   3.23 KB     alexpott
#21      3281863-21.patch       5.31 KB     alexpott
#20      3281863-20.patch       4.63 KB     alexpott
#19      3281863-18.patch       3.33 KB     alexpott
#15      3281863-15.patch       3.31 KB     alexpott
#14      3281863-14.patch       3.31 KB     alexpott

Comments

Wim Leers created an issue. See original summary.

Wim Leers’s picture

Issue summary: View changes

My analysis:

  1. Because these are new testing images, I suspect the problem lies with either their configuration or the minimum hardware requirements: either performance is too slow, or too much load (resource contention?) exists on these machines, which could explain why tests sometimes pass.
  2. Errors like ReferenceError: jQuery is not defined seem to confirm this: that is indicative of a static asset having failed to load.
  3. In some builds, that error is completely absent.
  4. A strong confirmation of that is the fact that I can locally (2018 MacBook Pro) execute the test suite in the same amount of time on both PHP 7.3 and PHP 8.0: yarn test:nightwatch modules/toolbar/tests/src/Nightwatch/Tests/toolbarTest.js (~92 seconds) and yarn test:nightwatch modules/toolbar/tests/src/Nightwatch/Tests/toolbarApiTest.js (~85 seconds).
    Note that if I start doing other things while the test is running, e.g. browsing drupal.org, then I do see these tests failing occasionally. This confirms that Nightwatch tests are extremely susceptible to resource contention.

The only possible conclusion I can reach is: tests are failing only because they're running in a resource-constrained environment. Otherwise they should consistently fail, with explicit errors.

Wim Leers credited xjm.

Wim Leers credited nod_.

Wim Leers’s picture

Discussed in detail with @nod_.

He had a hunch … which turned out to be helpful! :D 

 

https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54779/...

→ 

00:48:02.370 ✔ Expected element <.user-role-form .machine-name-value> to be visible in 2000ms (551ms)
00:48:03.839    NoSuchElementError: An error occurred while running .click() command on <input[name="5uxusq8bknx[access toolbar]"]>: 
00:48:09.753    {"sessionId":"39e5e5e812e16a510a7cbc8e8217b96b","status":0,"value":[]}
00:48:09.753        at /var/www/html/core/tests/Drupal/Nightwatch/Commands/drupalCreateRole.js:44:24
00:48:09.753        at new Promise (<anonymous>)
00:48:09.753        at map (/var/www/html/core/tests/Drupal/Nightwatch/Commands/drupalCreateRole.js:43:15)
00:48:09.753        at Array.map (<anonymous>)
00:48:09.753        at Object.<anonymous> (/var/www/html/core/tests/Drupal/Nightwatch/Commands/drupalCreateRole.js:41:23)
00:48:09.753        at processTicksAndRejections (node:internal/process/task_queues:78:11)
00:48:09.753 ✔ User "user" was created successfully (62ms)
00:48:12.307 ✔ Passed [equal]: The user "user" was logged in.
00:48:12.929 ✖ Timed out while waiting for element <#toolbar-administration> to be present for 10000 milliseconds. - expected "found" but got: "not found" (10332ms)
00:48:23.373     at Object.before (/var/www/html/core/modules/toolbar/tests/src/Nightwatch/Tests/toolbarTest.js:39:8)
00:48:23.373     at new Promise (<anonymous>)

You can see that it failed to set the access toolbar permission and then silently continued anyway, then crashed after 10 seconds of waiting for a toolbar to appear … but the user does not have access to it, so that makes total sense.

 

https://dispatcher.drupalci.org/job/drupal8_core_regression_tests/54877/...

→ 

00:34:45.804 [Tests/Toolbar Api Test] Test Suite
00:34:45.804 ===================================
00:34:45.804 - Connecting to chromedriver-jenkins-drupal8-core-regression-tests-54877 on port 9515...
00:34:45.824 
00:34:45.824   Using: chrome (98.0.4758.102) on Linux platform.
00:34:46.013 
00:34:46.013 ℹ Connected to chromedriver-jenkins-drupal8-core-regression-tests-54877 on port 9515 (208ms).
00:34:46.013 ✔ Element <[name="modules[breakpoint][enable]"]> was visible after 37 milliseconds.
00:34:49.795 ✔ Element <.system-modules> was visible after 67 milliseconds.
00:34:50.129 ✔ Element <[name="modules[toolbar][enable]"]> was visible after 47 milliseconds.
00:34:51.683 ✖ Timed out while waiting for element <.system-modules> to be present for 10000 milliseconds. - expected "visible" but got: "not found" (10300ms)
00:35:02.491     at Object.call (/var/www/html/core/tests/Drupal/Nightwatch/Commands/drupalInstallModule.js:21:8)
00:35:02.491     at Object.drupalLoginAsAdmin (/var/www/html/core/tests/Drupal/Nightwatch/Commands/drupalLoginAsAdmin.js:35:14)
00:35:02.491     at processImmediate (node:internal/timers:466:21) 
00:35:02.491 
00:35:02.491 - Running Drupal.Toolbar.models:
00:35:02.493 
00:35:02.493  Error while running .executeScript() protocol action: An error occurred while executing user supplied JavaScript. – javascript error: Cannot destructure property 'models' of ...
00:35:02.776 
00:35:02.776 ✖ Failed [deepEqual]: (Keys to check match) - expected "error,message" but got: "hasMenuModel,hasToolbarModel,menuModelType,toolbarModelActiveTab,toolbarModelActiveTray,toolbarModelAreSubtreesLoaded,toolbarModelHeight,toolbarModelIsFixed,toolbarModelIsOriented,toolbarModelIsTrayToggleVisible,toolbarModelIsViewportOverflowConstrained,toolbarModelLocked,toolbarModelOffsetsBottom,toolbarModelOffsetsLeft,toolbarModelOffsetsRight,toolbarModelOffsetsTop,toolbarModelOrientation,toolbarModelSubtrees,toolbarModelType" (12ms)
00:35:02.792     at NightwatchAPI.<anonymous> (/var/www/html/core/modules/toolbar/tests/src/Nightwatch/Tests/toolbarApiTest.js:117:24)
00:35:02.792     at processTicksAndRejections (node:internal/process/task_queues:96:5) 

You can see that the checkbox to enable the Toolbar module never appeared, so the test failed to check it and install the module, then silently continued anyway, then eventually crashed: of course executing Toolbar JS is not going to work if Toolbar JS is not loaded, because the module was never installed.

Conclusion: machine performance is too low and/or machine load is too high. That in turn causes very fundamental things (installing the Toolbar module) to fail, which then of course makes all Toolbar tests fail.

nod_’s picture

As a follow-up, we should also somehow kill the test if the setup fails, instead of going ahead.
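Such a guard could look roughly like this (a sketch only; the `ensure` helper and the step names are illustrative, not actual Nightwatch or Drupal APIs — in a real Nightwatch test the abort would be `browser.end()` plus a thrown error):

```javascript
// Sketch: abort the whole test run as soon as a setup step fails, instead
// of letting later assertions produce confusing downstream errors.
function ensure(log) {
  return (stepName, ok) => {
    log.push([stepName, ok]);
    if (!ok) {
      // In a real Nightwatch test this would end the session so that no
      // further commands in the chain execute.
      throw new Error(`Setup step "${stepName}" failed; aborting test.`);
    }
  };
}

const log = [];
const check = ensure(log);
let aborted = false;
try {
  check('drupalCreateRole', true);     // role creation succeeded
  check('drupalInstallModule', false); // module install failed: abort here
  check('openToolbar', true);          // never reached
} catch (e) {
  aborted = true;
}
console.log(aborted, log.length); // true 2
```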

Mixologic’s picture

So, a few things from the infra perspective: containers/images shouldn't really have any bearing on underlying hardware/resource contention, *unless* the container itself is somehow hamstrung from accessing the full resources. When the Nightwatch tests are running, there is nothing else at all happening on the machine, and thus no contention at all. Nightwatch tests run serially if I remember correctly, so it's not an issue of parallelism either.

In this case it could be that the MariaDB container isn't properly configured to take advantage of the massive amounts of memory/disk/CPU available. The configurations for those containers were just what I received from a community member, so I don't know if there is more performance to be gained there, or if there is some other issue.

The image itself is not new; it was just recently added as another daily test. The image was added in March 2020.

So I'm not sure how we can test whether MariaDB is underperforming/underconfigured.

hestenet’s picture

For reference, the tests are running on a C4 8XL AWS instance (or a very comparable one in the pool), which has 32 vCPUs and 64 GB of RAM. And as @mixologic says, the Nightwatch tests run serially, so they shouldn't be constrained by the instance itself.

Wim Leers’s picture

Thanks for the context, @mixologic and @hestenet!

The image dating back to >2 years ago and (I'm assuming) not having caused issues until now is a strong indication that the image itself is fine.

The machine sounds about 4 times more powerful than my measly laptop, and with no other work happening concurrently with the Nightwatch tests, my resource contention theory seems very unlikely.

Unfortunately … that leaves me in a place where I have no idea how to dig deeper 😔

hestenet’s picture

We think it was maybe @daffie who helped get the MariaDB container set up. Perhaps they would have some ideas here.

daffie’s picture

MariaDB is saying that they fixed some PHP PDO stuff in version 10.3.27 and we are using version 10.3.22. See: https://mariadb.com/kb/en/mariadb-10327-release-notes/
Maybe update MariaDB to a newer version and see if it fixes the problem.

alexpott’s picture

It's not just MariaDB though... see https://www.drupal.org/pift-ci-job/2388848 - the same failure, with MySQL.

It's something that has changed in 9.4.x and is not in 9.3.x, because it does not seem to be happening there at all. You can see this if you look at the build history on https://www.drupal.org/node/3060/qa and compare https://www.drupal.org/pift-ci-job/2388394 with https://www.drupal.org/pift-ci-job/2388902: these are both PHP 8.0 and MySQL 5.7, so exactly the same build environment.

Wim Leers’s picture

Same failures on https://www.drupal.org/pift-ci-job/2388963, which is PHP 8.1 & MySQL 5.7 for 10.0.x

xjm’s picture

#15 looks promising, but there are two Toolbar tests with different fails in its result set.

We should consider just deploying that fix and then filing followups for the different fails.

Wim Leers’s picture

Title: Nightwatch tests failing >50% of test runs on PHP 7.3 in 9.4.x and 9.5.x HEAD » Nightwatch tests failing >50% of test runs on PHP 7.3 in 9.4.x and 9.5.x, as well as PHP 8.1 on 10.0.x
alexpott’s picture

Wim Leers credited huzooka.

Wim Leers’s picture

@huzooka just pointed something out:

00:02:36.768 ----------------   Starting yarn_install   ----------------
00:02:36.779 Directory created at /var/lib/drupalci/workspace/jenkins-drupal_patches-130111/ancillary/yarn_install
00:02:36.779 Executing yarn install for core nodejs dev dependencies.
00:02:36.779 sudo -u www-data yarn install --no-progress --non-interactive --cwd /var/www/html/core
00:02:37.162 yarn install v1.22.17
00:02:37.312 [1/5] Validating package.json...
00:02:37.321 [2/5] Resolving packages...
00:02:37.771 [3/5] Fetching packages...
00:02:48.472 [4/5] Linking dependencies...
00:02:53.079 [5/5] Building fresh packages...
00:02:53.880 Done in 16.72s.
00:02:53.949 Yarn install success
00:02:53.949 sudo -u www-data yarn list --no-progress --non-interactive --cwd /var/www/html/core > /var/lib/drupalci/workdir/yarn_install/yarn_list.txt
00:02:55.006 sudo -u www-data yarn --no-progress --non-interactive --cwd /var/www/html/core licenses list > /var/lib/drupalci/workdir/yarn_install/yarn_licenses.txt
00:02:56.439 ---------------- Finished yarn_install in 19.671 seconds ---------------- 

Maybe it's the fact that we're doing a plain yarn install that is the problem?

See https://classic.yarnpkg.com/en/docs/cli/install — we should do yarn install --frozen-lockfile (the yarn equivalent of npm ci) if we want to ensure that the dependencies in the yarn.lock file are actually respected!

Related reading: https://newbedev.com/how-to-install-packages-based-on-the-lock-file-with.... Also, this is being proposed as the default behavior for yarn install: https://github.com/yarnpkg/yarn/issues/4147 — which I suspect all of us thought was already the default behavior…

Maybe that solves it? 🤞 If we don't do that here, we should have a follow-up for it for sure.
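Concretely, the DrupalCI yarn step would change along these lines (a sketch of the proposed flag only, not the actual drupalci.yml change):

```shell
# Current DrupalCI step: a plain install, which may silently reconcile
# package.json and yarn.lock if they have drifted apart.
sudo -u www-data yarn install --no-progress --non-interactive --cwd /var/www/html/core

# Proposed: --frozen-lockfile makes the install fail instead of updating
# yarn.lock, so the tested dependency tree is exactly what the lock file
# pins (yarn 1.x's equivalent of `npm ci`).
sudo -u www-data yarn install --frozen-lockfile --no-progress --non-interactive --cwd /var/www/html/core
```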

alexpott’s picture

@Wim Leers the problem is all in core/tests/Drupal/Nightwatch/Commands/drupalInstallModule.js - that's proven by #21, which does the module install a different way. But looking at this code, it's not correct: it waits for something that is already on the page when it presses the submit button. We can improve this considerably.

I think the attached patch will fix the problem.

alexpott’s picture

@Wim Leers yarn.lock is not currently updated by running a yarn install - so that's not the issue here. I'm not sure about adding --frozen-lockfile; yarn respects the lock file in nearly all cases anyway.

huzooka’s picture

What about chromedriver?

alexpott’s picture

I'm pretty sure that the nightwatch tests on DrupalCI are not using the chromedriver in core/node_modules... fwiw. See the DrupalCI logs...

15:46:41 Starting chromedriver container ...
15:46:41 Adding image: drupalci/chromedriver:production
15:46:42 Pulling from drupalci/chromedriver
15:46:42 Digest: sha256:2294a42e164243ae06e617b9c14039613dc69320cc269067389c79c437440fbf
15:46:42 Status: Image is up to date for drupalci/chromedriver:production
15:46:42 
15:46:42 Container chromedriver-jenkins-drupal-patches-130303 created from image drupalci/chromedriver:production with ID 9bad60a6
15:48:40 - Connecting to chromedriver-jenkins-drupal-patches-130303 on port 9515...
15:48:40 
15:48:40 ℹ Connected to chromedriver-jenkins-drupal-patches-130303 on port 9515 (412ms).
15:48:40   Using: chrome (98.0.4758.102) on Linux platform.

This is the chromedriver being used by Nightwatch - the version brought down by yarn is ignored.

alexpott’s picture

+++ b/core/tests/Drupal/Nightwatch/Commands/drupalInstallModule.js
@@ -17,8 +17,12 @@ exports.command = function drupalInstallModule(module, callback) {
-      // Wait for the install message to show up.
-      .waitForElementVisible('.system-modules', 10000);
+      // Wait for the checkbox for the module to be disabled as a sign that the
+      // module has been enabled.
+      .waitForElementVisible(
+        `[name="modules[${module}][enable]"]:disabled`,
+        10000,
+      );

The errors are happening because .waitForElementVisible('.system-modules', 10000); does not work: on really fast computers that assert is able to complete before the browser has refreshed after clicking submit. So we have the opposite of @Wim Leers's performance suspicions. If we wait for something that can only be true after the module install has completed, then this will work. The reason for the different errors is that sometimes the breakpoint install was being interrupted and sometimes the toolbar was, so we'd see slightly different errors after this has occurred. Fun stuff. This will also resolve probable random errors in core/tests/Drupal/Nightwatch/Tests/touchDetectionTest.js, as that uses drupalInstallModule() too.
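The corrected waiting strategy can be sketched in miniature: wait for a condition that only becomes true after the install round-trip, so a fast assertion cannot pass against the stale page. The `browser` object below is a tiny mock standing in for Nightwatch's chainable API; the function name is illustrative, not the actual core command:

```javascript
// Sketch of the corrected wait: instead of '.system-modules' (already
// present before submit, so the wait can pass against the stale page),
// wait for something observable only on the refreshed page — an enabled
// module's checkbox is rendered disabled.
function drupalInstallModuleFixed(browser, module) {
  return browser
    .click(`[name="modules[${module}][enable]"]`)
    .submitForm('form.system-modules')
    .waitForElementPresent(`[name="modules[${module}][enable]"]:disabled`, 10000);
}

// Tiny chainable mock that records the command sequence produced above.
function makeMockBrowser(log) {
  const handler = (name) => (...args) => {
    log.push([name, args[0]]);
    return mock;
  };
  const mock = {
    click: handler('click'),
    submitForm: handler('submitForm'),
    waitForElementPresent: handler('waitForElementPresent'),
  };
  return mock;
}

const calls = [];
drupalInstallModuleFixed(makeMockBrowser(calls), 'toolbar');
console.log(calls.map(([name]) => name).join(' -> '));
// click -> submitForm -> waitForElementPresent
```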

alexpott’s picture

And here's just the fix on itself without the changes to drupalci.yml.

Wim Leers’s picture

So we have the opposite of @Wim Leers performance suspicions.

🤣 🥳 👏👏👏👏

nod_’s picture

ooh I see, very nice find, thanks :)

alexpott’s picture

#29 shows that we still have the problem with #30. The more specific assert narrows it down to drupalInstallModule() in both https://www.drupal.org/pift-ci-job/2389713 and https://www.drupal.org/pift-ci-job/2389714

So options are to do #21 and use a PHP site setup script to do the module enabling and we should deprecate drupalInstallModule() because it is broken. This was introduced in #3255809: Add nightwatch tests for toolbar - so we still have the question as to why 9.3.x does not seem to be as affected.

alexpott’s picture

Let's do two things before we fall back to PHP. Let's make all the data input, clicking, and waiting specific to form.system-modules. I'm pretty sure that is not an issue here, but it helps make drupalInstallModule() more robust. And let's do a little pause after pressing submit. I don't think there is much harm in that here, and if it results in it being fixed that feels acceptable.

nod_’s picture

+++ b/core/tests/Drupal/Nightwatch/Commands/drupalInstallModule.js
@@ -13,12 +13,25 @@ exports.command = function drupalInstallModule(module, callback) {
+      .click('form.system-modules input[data-drupal-selector="edit-submit"]')

don't remember why I didn't use submit form here: https://nightwatchjs.org/api/submitForm.html

alexpott’s picture

alexpott’s picture

Okay submitForm() got a sea of green - yay... triggering the tests again. If it is another sea of green I think we should proceed here.

alexpott’s picture

So 9.3.x is not failing because for some reason core/modules/toolbar/tests/src/Nightwatch/Tests/toolbarApiTest.js does not use drupalInstallModule on that branch. In fact drupalInstallModule does not exist on 9.3.x so this is all down to how #3255809: Add nightwatch tests for toolbar was committed.

Edit: the above is incorrect. Still can't explain 9.3.x vs 9.4.x.

alexpott’s picture

So one difference between 9.4.x and 9.3.x is #3272727: Nightwatch's drupalModuleInstall() doesn't handle test modules or modules w underscores in machine name - but that was 7 weeks ago and I don't think this has been happening for 7 weeks.

alexpott’s picture

Anyway, #36 is a sea of green after 2 goes, so the problem is fixed - yay!

+++ b/core/tests/Drupal/Nightwatch/Commands/drupalInstallModule.js
@@ -13,12 +13,35 @@ exports.command = function drupalInstallModule(module, callback) {
+      .click(`form.system-modules [name="modules[${module}][enable]"]`)
+      .submitForm('form.system-modules')
+      // Ensure module enabling has a chance to start before asserting that it
+      // is finished.
+      // .pause(50)
+      // Wait for the checkbox for the module to be disabled as a sign that the
+      // module has been enabled.
+      .waitForElementPresent(
+        `form.system-modules [name="modules[${module}][enable]"]:disabled`,

So the fix seems to be changing to submitForm() and waiting for something specific that actually changes when the module is enabled.

Here's a patch that has just the necessary stuff to fix this.

Wim Leers’s picture

Status: Needs review » Reviewed & tested by the community

I'm in awe of this epic showcase of "oh but if you really know what you're doing, it is possible to debug Nightwatch tests on DrupalCI" that we got served here. 🤯

Next time I need to do something like this I'm gonna google "nightwatch drupal module install alexpott" 😁

I kind of want to uncredit everyone except @alexpott on this one, because he's basically come in and saved the release.

nod_’s picture

Rtbc+1

  • catch committed d69212e on 10.0.x
    Issue #3281863 by alexpott, Wim Leers, nod_, hestenet, xjm, huzooka,...
  • catch committed 9b82c5c on 9.4.x
    Issue #3281863 by alexpott, Wim Leers, nod_, hestenet, xjm, huzooka,...
  • catch committed cc55493 on 9.5.x
    Issue #3281863 by alexpott, Wim Leers, nod_, hestenet, xjm, huzooka,...
catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed/pushed to 10.0.x, cherry-picked to 9.5.x and 9.4.x, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.