Problem/Motivation
THINGS WE *THOUGHT* WERE CAUSING THE PROBLEM BUT WERE NOT
Test patches within this issue have ruled out several causes as they demonstrate:
- The failures are not specific to a database type, but the nature of the failures can differ.
- The failures occur even when running at single concurrency
- The failures regardless of how long a wait is provided for the process to complete
Details
This was discovered while working on #3055648: Frequent random fail in \Drupal\Tests\media_library\FunctionalJavascript\MediaLibraryTest. There are a few random failures that persist regardless of how the test is refactored. They are all related to the file upload widget failing in some manner, typically on the transition from uploading an image to rebuilding to display form fields for the image's Alternative text. In the Media Library tests, these errors often first reveal themselves after an image is uploaded. The form that appears post-upload should include hidden fields with height and width values, but these values are occasionally empty. (when this occurs, a validation message about an incorrect primitive value appears, which is incorrectly attributed to the alt text field -- this is actually due to the empty width/height hidden fields)
@tedbow helped with a test that ruled out this being an issue related only to the image widget. The test in
article-100x-image-uploads.patch uploads and removes and image on the Article content type 100 times. These 100 attempts are made on the usual node add form, as well as a version where the node add form is inside a dialog. These tests do not fail.
Also attached is a test - media-widget-100-consecutive-uploads.patch - that consistently reproduces the problem (on testbot, the problem never occurs on localhost). There are 100 instances of a file being uploaded then removed via the Media Library widget. On MySql and Postgres, the tests nearly always fail within the iteration 47-48. On SQlite, the iterations are far less predictable, sometimes failing on the first iteration, and never getting further than 48 iterations before failing. It's unlikely that the tests in #3055648: Frequent random fail in \Drupal\Tests\media_library\FunctionalJavascript\MediaLibraryTest will reliably pass until the symptoms demonstrated in this test are addressed.
In the MySql and Postgres tests, the failures always happen during the same step: the test has added a file to the media upload field, but the JS-triggered upload process does not occur. This is a screenshot of when the test fails -- it is expecting the upload process to happen then see a form with the fields associated with this media.

On the SQLite tests, the failures can happen in different places, but most them happen after uploading the media, filling out the necessary fields then clicking "Save and select". The test then asserts that the text "Save and select" is longer present, but in some instances it is.

In other instances, the test fails because it can't find a remove button for the just-uploaded media. This screenshot taken during the fail shows that the widget was never rebuilt:

There are a few other assorted fails in SQlite, screenshots of each can be accessed by viewing the test on the dispatcher and downloading the test artifacts.
Proposed resolution
Not sure yet, but I'm hoping that the non-SQlite tests always failing at iteration 47-48 provide insight into why this is happening.
Remaining tasks
Determine cause of problem, fix.
User interface changes
N/A
API changes
N/A
Data model changes
N/A
Release notes snippet
N/A
| Comment | File | Size | Author |
|---|---|---|---|
| #68 | 500x_WidgetUploadTest.patch | 2.84 KB | spokje |
| #65 | WidgetUploadTest-no_#3350972-1000x-should_fail.patch | 2.26 KB | spokje |
| #65 | WidgetUploadTest-1000x.patch | 1.76 KB | spokje |
| #57 | WidgetUploadTest-50x-3066447-57.patch | 1.74 KB | lauriii |
| #53 | 3066447-cke5randomfail-debug-53.patch | 2.83 KB | wim leers |
Comments
Comment #2
bnjmnmComment #3
bnjmnmComment #4
bnjmnmComment #5
phenaproxima"It fails only on testbot"...this is a magic phrase that made me summon the sorcerer known as @Mixologic.
@bnjmnm, @oknate and I discussed this issue with him on Slack. TL;DR: the best lead we have, at the moment, points to this being a potential race condition. The major difference between running the tests on localhost with PHPUnit, and running them on testbot, is that testbot runs tests in parallel and PHPUnit doesn't. So, as @Mixologic said:
I would suggest, as a next step, that this test be run locally with
run-tests.sh, which is what runs the tests in parallel. If it fails locally with that, we can be even more sure that our upload woes are caused by a race condition.Comment #6
phenaproximaMore info from @Mixologic:
Comment #7
bnjmnmPer the suggestions in #5 and #6, this is the 100x media library patch with single concurrency and unique filenames.
With this change, all tests fail at the 45th or 46th iteration on all databases tested (MySql, SQlite, Postgres). The only difference is SQlite and Postgres fail at the step immediately following upload, waiting for the presence of the alternative text field. The MySql test fails while looking for the media remove button -- because the media never got added to the form.
(and probably worth mentioning that although 40+ uploads in a short period of time is highly unlikely to occur on a real site, addressing the failures surfaced by these tests will likely address the random Media Library test fails being worked on in #3055648: Frequent random fail in \Drupal\Tests\media_library\FunctionalJavascript\MediaLibraryTest
Comment #8
xjmComment #10
oknateReroll of #7
Comment #11
oknateComment #12
oknateOops, patch failed to attach last time. Here's the patch. It updates the test to use "Save" button and removes an unused variable.
Comment #13
oknateComment #14
oknateTesting reloading the page between uploads. I'm curious if this will help. Given that this fails sometimes on Sqlite on the first iteration, I'm not sure if this will help. But if helps a little, perhaps it's part of a multifaceted solution.
Comment #15
oknateInteresting: reloading the page between iterations doesn't help. So it's probably a back-end issue, not the media library JavaScript. Given that it works under 40 times more consistently, perhaps just breaking up the tests so we never hit more than forty uploads would help.
Comment #16
oknateThis time, instead of using the same container, I'm getting a new one. Also, I'm using
$this->container->get('file_system')->copyinstead ofcopy.Comment #17
oknateTrying longer wait time, replacing 10000 with 30000 to see if it makes any difference.
Comment #18
oknateI noticed that $file_system->copy returns a filepath, so using that. Also, I added the timestamp as a prefix in uniqid(). Let's see if returning the actually file path from copy helps.
Comment #19
oknateSeeing if deleting files makes a difference.
Also, I noticed there are duplicate files, perhaps there are finite limits on how many images can be uploaded on testbot? when you hit 80 uploads, it gives up.
https://www.drupal.org/files/issues/2019-10-31/duplicate-images.png
It uploads them in the files directory, then copies it to files/type-four-dir
Comment #20
oknateWeird, it seems to have made a slight difference (Postgres failed on 47 instead of 45). This cleans up the temporary file at the end of each iteration.
Comment #21
bnjmnmAdding a different sort of test to help with troubleshooting (thank you for being another set of eyes on this @oknate!)
This is an abbreviated version of \Drupal\Tests\media_library\FunctionalJavascript\MediaLibraryTest::testWidgetUpload, where only the first two uploads are performed. There's also a database check after media is saved to confirm it was written to the db.
In one version, the first uploaded media is removed from the field before the second is uploaded. In the other, no removal takes place. At least in my burner issue, the SQLite errors are more likely to occur when there's a remove between the two uploads (testbot may prove me wrong....).
Either way, for the sake of narrowing things down I wanted to supplement the current failing patch with another example that surfaces the issue in a different way.
Comment #22
oknateredacted. I realize now my comment was not valid.
Comment #23
oknateAdding a check on the file_managed table to see if it increments. I tested locally with sqlite and it worked up to 49, then I stopped the test. So this should give a little more info about how far along the process it goes before failing before the alt text failure we've been seeing.
Comment #24
oknateTurning on db log and checking for any messages from file_managed_file_save_upload().
Comment #25
oknateDang, had a really dumb error on #24, updating.
The output from this test, where there are no log messages means that it's most likely failing before file_managed_file_save_upload(), or perhaps it's failing in that function where there is no logging.
Comment #26
oknateAdding more logging to file_managed_file_save_upload, and a fallback error message if there are no logs.
Update, this tells us it's exiting at the beginning of file_managed_file_save_upload(), here, even when it works correctly, so we don't need to look at that function any more, so next, let's look at what calls it
ManagedFile::valueCallback().This is very interesting, because when I run this locally, it is not exiting at the beginning of file_managed_file_save_upload().
It gets passed this part:
Comment #27
oknateClue:
More on #26, on my local, here's the output of
$all_files[$upload_name]right at the beginning offile_managed_file_save_uploadThis is very interesting because this is empty and exiting the function on testbot here:
But on my local it is not empty. So this is a helpful clue as to what's different on testbot vs local. (Remember it doesn't fail locally.)
There must be an alternate way the images are attached that doesn't use file_managed_file_save_upload.
Comment #28
oknateAdding more logging output to ManagedFile::valueCallback to try to figure out what's happening.
I get the same output on my local (in reverse cron order), see ManagedFile::valueCallback (in the patch)
Comment #29
oknateClearing out messages when a run was successful and adding some spacing.
Comment #30
oknateforgot execute() on truncate:
Comment #31
oknateTrying
$this->resetAll()to see if cache clearing makes any difference. Also, I changed my mind about clearing out previous log messages. I think they're helpful.Ok, here's another clue:
FileUploadForm::uploadButtonSubmitisn't hit when it fails, but is hit when it succeeds.Comment #32
oknateAdding more logging.
Comment #33
oknateAdding a bit more logging. I want to see if updateFormCallback is called on the last iteration.
Comment #34
oknateOops, serialize was blowing up. Removing for now. Although it would be good to get some info about the upload.
Another clue:
::updateFormCallback is not called on the last iteration.
Comment #35
oknateRemoving the 'type' filter on the watchdog messages. Perhaps something else will come up. Hopefully it won't be too much output.
Comment #36
oknateAdding more logging around the backtrace leading up to FileUploadForm::uploadButtonSubmit, since we know it's not called on the failed iteration, but that the BE is being hit as it calls ManagedFile::valueCallback.
Comment #37
oknatefixing the failure I introduced in the last patch and adding file size output.
Comment #38
oknateComment #39
oknateadding logging of form_state values near where it's failing.
Update:
At least based on the logging I've added, it seems to fail after FormBuilder calls $this->formSubmitter->doSubmitForm for the "Apply filters" button.
Last iteration before it fails:
Last failing iteration:
Comment #40
oknatetesting disabling the "Apply filters" button since it looks like it's submitting around where it's failing.
Comment #41
oknateignore this one, was missing the change I was trying to add.
Comment #42
oknateremoving the exposed filters to test if that makes a difference.
Comment #43
oknateAdding logging info about the file in NegotiationMiddleware::handle
Update:
this seems to indicate that that when it fails it fails very early in the stack, because NegotiationMiddleware::handle isn't hit the second time, like the other iterations.
Comment #44
oknateUpdate: on my local the temporary file worked, but it doesn't here. So we should remove that change or figure out how to get it to work.
Comment #45
catchComment #46
catchMarking #3103492: Random fail in WidgetUploadTest as duplicate.
This is failing very, very often. Should we mark the test as skipped until there's a viable patch here?
Comment #48
catchComment #49
phenaproximaComment #52
xjmPer @bnjmnm, this is now also the cause of many, many random fails in CKEditor 5 tests, e.g.:
https://www.drupal.org/pift-ci-job/2238883
This has more than doubled our random fail rate following beta1.
Comment #53
wim leersI was actually monitoring this too, and was pleasantly surprised that we still don't have any random fails on MySQL: https://www.drupal.org/pift-ci-job/2238957. Never any random fails on the contrib project either!
It can’t be a concurrency problem either, because these areAha, core runs these with concurrency 15! So that could be it … but only on PostgreSQL then. Could be some locking/conflict thing?FunctionalJavascripttests, and they run with concurrency level1both in contrib and core.But … we didn’t have tests against PostgreSQL in the contrib module. So that is likely to be slower somehow? 🤔 The random fail does not seem to be random at all, but ALMOST consistent, on
PostgreSQL, with the vast majority of test runs crashing like this:That is the last line of this chunk of code:
… there's not even any JS/AJAX involved there! So let's add some debug output.
Comment #54
bnjmnmComment #55
longwaveI downloaded the artifacts from the Postgres run at https://dispatcher.drupalci.org/job/drupal_patches/103942/artifact/jenki...
There are 45 files output with an
<img>tag in them; of these 30 look correct (e.g.<img data-entity-uuid="4ee93afa-6380-47c1-89da-603a2915c88c" data-entity-type="file" src="/subdirectory/sites/simpletest/19350594/files/inline-images/image-test.png" alt="</em> Kittens & llamas are cute" />) but the other 15 are just empty<img />tags with no attributes.Example: https://dispatcher.drupalci.org/job/drupal_patches/103942/artifact/jenki...
Comment #56
lauriiiOpened new issue for #52 since it was likely unrelated. Also reverting the title change.
Comment #57
lauriiiComment #58
wim leers#55: that is consistent with the findings in #3250587: \Drupal\Tests\ckeditor5\FunctionalJavascript\CKEditor5Test::testEditorFileReferenceIntegration fails on PostgreSQL!
Comment #60
alexpottDrupal\Tests\ckeditor5\FunctionalJavascript\MediaLibraryTest is failing pretty consistently on SQLite - see https://www.drupal.org/pift-ci-job/2268909 - given SQLite is the fastest test environment we have this suggests there's a timing issue in this test.
Comment #61
anybodyHi @all here,
I am facing this Media Library Browser race condition issue in UI: #3174361: Media Library modal opens randomly on AJAX requests
TL;DR: Opening / Closing (paragraphs) modals very fast a few times leads to unexpected calls to MediaLibraryWidget::openMediaLibrary() from request parameters that should not exist. So the Media Library Modal is randomly opened on race conditions, if just the Media Library Widget is present on the page (but never clicked!).
There are strange things going on with the #ajax handling and the subsequent requests, when clicking fast, as I documented in the issue. As there are several similarities and both seem to be caused by a race condition, I thought the issues might be related or have the same cause. Perhaps finding one reason also fixes the other problem. Could someone with a deeper understanding of the AJAX / Commands handling in core have a look and give an opinion?
I'm sorry, if I'm wrong, and they're unrelated.
Comment #64
spokjeComment #65
spokje1500x is too much in one go, whenever a patch like this hits the 50+ minutes, "wonky" things happen.
Let's see if we can keep it below that in a 1000x patch
Comment #66
spokjeFails on the test runs with and without #3350972: [random test failure] Drupal\Tests\layout_builder\FunctionalJavascript\LayoutBuilderUiTest::testReloadWithNoSections(), so that's not the solution for this one.
Comment #68
spokjeI believe this random test failure doesn't appear anymore.
To prove this, we need about 5000-8000 runs failure-free runs of the suspicious test.
Let's try that here. I use 500 runs in one patch, since any time the repeated runs of this test get's over 45 minutes, things get "wonky" and unreliable. The 500x run should finish within 30-35 minutes.
Comment #69
catchThe fails are
we can just ignore that IMO, looks solid otherwise.
Comment #70
spokjeAgreed, this "somehow" happens (randomly) when running single Functional(Javascript)Tests a lot of times.
Seen this happen on a lot of my fix-the-bloody-random-failure-and-run-it-500+-times patches and they are (at the very least) not related to the test that is the SUT.
As far as I'm concerned: This random test failure is gone (for now).
Unsure what should happen with this issue.
I've seen older similiar issues being closed as cannot reproduce. If needed they can be found in the search and reopened (See for example #2980750: [random test failure] Random failure in CommentStatisticsTest)
Comment #71
longwaveYep, let's close as cannot reproduce, and this can be reopened if it starts appearing again.
Thanks for looking into this.