Problem/Motivation
in core/lib/Drupal/Core/Config/DatabaseStorage.php
public function readMultiple(array $names) {
$list = [];
try {
$list = $this->connection->query('SELECT name, data FROM {' . $this->connection->escapeTable($this->table) . '} WHERE collection = :collection AND name IN ( :names[] )', [':collection' => $this->collection, ':names[]' => $names], $this->options)->fetchAllKeyed();
foreach ($list as &$data) {
$data = $this->decode($data);
}
}
catch (\Exception $e) {
// If we attempt a read without actually having the database or the table
// available, just return an empty array so the caller can handle it.
}
return $list;
}
if there is any connection failure, exception will be catched, and an empty value will be returned.
According to the comment, the caller should handle it
while in
core/lib/Drupal/Core/Config/CachedStorage.php -> readMultiple
$list = $this->storage->readMultiple($names_to_get);
// Cache configuration objects that were loaded from the storage, cache
// missing configuration objects as an explicit FALSE.
$items = [];
foreach ($names_to_get as $name) {
$data = isset($list[$name]) ? $list[$name] : FALSE;
$data_to_return[$name] = $data;
$items[$cache_keys_map[$name]] = ['data' => $data];
}
$this->cache->setMultiple($items);
looks like it just set the FALSE to the cache backend
This is here because config could be fetched before the database is setup and it should succeed - and tables might only be created for some config upon first write - so lack of a table is not a failure.
However the catch is not just catching table not exists failures but also connection failures. So it can happen that the connection is lost and instead of an exception the code assumes no table and therefore gives a successful empty response even if there is configuration data in the database. This then gets injected into cache so even once the connection to the database is restored, the configuration is always returned empty
Steps to reproduce
Can be reproduced in a test by creating an invalid table with missing columns that will force an exception when calling the method. (Note: due to SQLite quoting this won’t work with a SQLite database)
Proposed resolution
Do not set empty config to the cache backend when there is an exception on getting the data, unless that exception is because the table does not exist. Instead, throw the exception to end the request.
We can detect table not exists failures by subsequently checking for the table in the exception. If it is not there we have confirmed it is safe to return empty. If it is indeed there then we can assume the query failed in some other way that needs rethrowing. In a connection loss instance this check for table will itself detect the loss of connection and throw (e.g. MySQL has gone away)
MR used
MR 2839 = 11.x
Remaining tasks
Review
User interface changes
NA
API changes
NA
Data model changes
NA
Release notes snippet
NA
| Comment | File | Size | Author |
|---|
Issue fork drupal-3181013
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
Pan Lee commentedComment #3
Pan Lee commentedComment #4
alexpottI agree that this looks a bit odd.
I think we can fix this by checking if the table exists.
Comment #8
driskell commentedDue to https://www.drupal.org/project/drupal/issues/3267207 this patch won't work properly with MySQL installations, as the tableExists will return false if the connection is broken.
Comment #9
driskell commentedPotentially this needs to wait for #2797141
Comment #11
driskell commentedAdded patch for 9.4+ where there is a new method "exists" exhibiting the same issue that has caused me two outages so far and at least one "weird missing button issue"
Comment #12
driskell commentedResolved comment changes in the patch so they are maintained properly.
Comment #15
driskell commentedComment #16
driskell commentedComment #18
kristen pol@Driskell Thanks for updating this issue. It wasn't clear what changes you have made as there is no interdiff on the patch changes and you have switched to an MR. So I have diff'ed the changes manually between the latest MR and patch in #4. There are some minor changes listed below that are needed. Thanks.
I see that the code was changed as follows:
existsmethodexistsmethodThe comment changes that were made in #4 have been reverted. They should be the following in all places:
Also, the test comments are now more than 80 characters and should be reverted back to what they were or reformatted.
Comment #19
driskell commentedHi Kristen
The patch in #4 is for 9.2
The MR and patches I added are for 9.4/9.5 which has new exists method so I thought an interdiff would confuse as it’s somewhat a new patch for 9.4+ as opposed to a change to the 9.2 one.
I did apply the code patch manually though and yes you’re right the comments were changed in the 9.2 patch. I’ll resolve that in an update to the MR soon and I will hide the patches in preference of the MR.
Thank you.
Jason
Comment #20
driskell commentedI've updated the MR with the comment changes so it matches the 9.2 patch, and hidden the patches where I didn't include the comment changes.
Thanks for the review! And apologies for missing the comment change!
Comment #21
driskell commentedGot PHPCS working and fixed up some comments.
Comment #22
mxr576Comment #23
driskell commented@mxr576 I've applied your requested changes.
I have also added further checks as I have two open problem reports on my side over the last few months:
* listAll we traced to returning empty array during a network glitch and that also cached an empty list somewhere and broke something
* getAllCollectionNames we traced as well to doing the same thing
Further to adding the checks to the above two so they behave like read/readMultiple/write/exists, I've also added the same check to deleteAll in anticipation that it can cause a similar outcome (leaving config behind but caching have no config, and causing a split between cache and what's in storage accessible directly)
Comment #24
driskell commentedComment #25
driskell commentedFixed line length issues of test comments from what @kirsten-pol mentioned earlier
Comment #26
smustgrave commentedThis issue is being reviewed by the kind folks in Slack, #needs-review-queue-initiative. We are working to keep the size of Needs Review queue [2700+ issues] to around 400 (1 month or less), following Review a patch or merge request as a guide.
Reviewing MR #2839. Hiding files to avoid confusion.
Code wise everything looks fine
But when I ran the tests locally without the fix to DataStorage.php they still passed, when they should fail to show the bug.
Thanks!
Comment #27
driskell commented@smustgrave
I just ran the tests without the fix and received a failure. I then applied the fix and it worked again.
However, from a hunch, I hooked up SQLite as the test database, and yes it did not fail!
Bit of debugging later...
https://www.sqlite.org/lang_keywords.html
> If a keyword in double quotes (ex: "key" or "glob") is used in a context where it cannot be resolved to an identifier but where a string literal is allowed, then the token is understood to be a string literal instead of an identifier.
The query that is meant to fail because of a missing column actually now resolves to a "does this string equal this string" and is accepted as a string expression!
> SELECT 1 FROM "test57957719"."config" WHERE "collection" = :collection AND "name" = :name LIMIT 0, 1
Will have a think about a way to make the test fail properly on SQLite as well.
Comment #28
driskell commentedOK, tracked the quoting to double quotes back to https://www.drupal.org/project/drupal/issues/3126658
It seems not to have taken into account SQLite.
I checked and the ANSI quoting in Pgsql and MySQL should be fine as that forces double quote as identifier and it will never be string literal.
In SQLLite it actually is string literal if the column does not exist.
I don't know but... I am wondering if the above needs to be a separate issue as that's in a different module. So I'll get a ticket raised.
I've noticed there's some duplication of tests in the MR anyway so I'll push a cleaned up version - and I'll set the other ticket for SQLLite as blocking this one.
Comment #29
driskell commentedComment #30
driskell commentedComment #31
driskell commented@smustgrave
I'm guessing the impact to all SQLite queries of a change to that would prevent this from being a bug fix in 9.5.
However, perhaps there just needs to be an acceptance that this test won't function properly with SQLite until the other issue is fixed so maybe there's a decision to be made here - in that the test will properly work on MySQL so the automated tests will work fine. It only will cause a problem for running locally with SQLite. Just I feel this patch has saved me many issues and is hugely beneficial - our incidents relating to cache corruption in this area are eliminated so far.
So perhaps the SQLite issue does not postpone this? So I'm U-turning as usual and returning to Needs review to get some thoughts.
Comment #32
andypostThe workaround looks incomplete
tableExists()making a query on exception, but when other process creating table it could be wrong assumptionComment #33
driskell commented@andypost
At the moment it fails silently and corrupts cache during normal day to day running. The scenario of a writer creating the table and causing a valid read to then throw exception is true, but I think that would only occur once, when the table doesn't yet exist, and only in the perfect race between read and the very first ever write to the table, and it would recover completely on the second request and never happen again. Unfortunately it's difficult to quantify the real impact of that scenario but it seems to be small, and much smaller than the current issue's impact.
Technically speaking I think the correct fix is to detect the specific error of the table not existing. However, this might be complex across the different drivers and versions supported to detect the specific error message, and would require driver changes. If that was opted into it would be better I think as a post-task after this to get those driver implementation for throwing a specific exception for table missing, for use in this code, so we're never handling arbitrary exceptions and only handling the exception we intend to: Table does not exist. Arguably that would be easier to test too and wouldn't need the quoting change for SQLite.
What are your thoughts?
Comment #34
driskell commentedNot seen it before but found an issue that was adding the table exception checks that would also resolve this: https://www.drupal.org/project/drupal/issues/2371709
Comment #35
catchThis is fine - the worst that happens is a cache miss, and it's how we handle it in other subsystems.
Comment #36
needs-review-queue-bot commentedThe Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".
Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #37
longwaveI have a busy site that is experiencing occasional strange bugs where fields suddenly no longer exist (causing subsequent fatal errors in different places each time), but this is fixed on cache rebuild, and I am wondering if this is the culprit.
I think the fix here looks good for the problem at hand but I wonder if there is a way of refactoring this so we explicitly create the table separately and then the service itself doesn't have to be concerned with catching exceptions at all; it can just pass them up to the caller. This should be a followup, though.]
Comment #38
ericgsmith commentedComment #40
ericgsmith commentedI suspect we also hit this issue recently. The symptoms were very similar to #37 - config status showed a list of config as in sync only - clearing the cache went back to showing no differences.
Fixed CS issue reported in #36 and opened new MR to 11.x - setting back to needs review.
Comment #41
ericgsmith commentedAh - setting back to needs work. As was indicated in #26 the test is failing on SQLite.
Have triggered the additional pipelines for PostgreSQL and SQLite and it indeed is still failing with SQLite https://git.drupalcode.org/issue/drupal-3181013/-/pipelines/47013
Comment #42
driskell commentedI am putting this back into Needs Review because a failing test is impossible for SQLite due to a known issue (see #3349286). The test system online is MySQL so reproduces it correctly. You cannot test this locally with a SQLite database, it won't fail.
As for the other issues, #2371709 should be a follow on, it's a bit more involved. And hopefully at some point #3349286 can be looked at to fix the SQLite issue. But I don't feel either of these block this fix. The first improved it. The second is purely relating to local tests.
Comment #43
driskell commentedComment #44
ericgsmith commentedHaving also been meaning to come back to say we can almost certainly confirm this issue is what is responsible for the config shown as overridden.
Further investigation showed multiple errors in the logs that mysql had gone away over about 1 minute. We can also see higher than expected numbers of evictions in memcache during this period from a combination of high traffic and our memcache cache bin size being lower than what the site needed. Immediately after the mysql gone away logs we start seeing warning of missing fields that imply the field does not exist, and as mentioned in #40 this was visible via drush config status and resolved after a cache clear.
We are now running this patch in production now along side increasing the cache bin size of memcache.
Comment #45
driskell commentedI've updated the MR to include a skip for SQLLite and linked it to the issue about the double quoting that prevents the test from working.
Comment #46
driskell commentedSorry - that last pushed commit was to fix the URL to be correct issue
Comment #47
needs-review-queue-bot commentedThe Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".
This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #48
driskell commentedResolve review queue bot issue
Comment #49
smustgrave commentedCan the 9.5.x MR be closed since D9 is EOL please
The other MR seems to have some test failures potentially.
Comment #50
driskell commented@smustgrave I rebased it to 11.x and updated the MR. Not sure how to update the other one so I kept my existing.
It skips the test for SQLite since it will fail due to the linked issue. But it successfully fails without the fix on MariaDB / MySQL and passes after the patch.
Comment #51
driskell commentedNot sure the test failures are related.
Comment #52
smustgrave commentedGlad switching targets didn't cause issues, usually changing that causes issues.
Checking the issue summary
Isn't a full solution, can the proposed solution be documented helps reviewers/committers.
Added missing sections too, left TBD for sections that may apply but not familiar with issue to know myself. If not applicable just change to NA.
Comment #53
driskell commented@smustgrave I updated issue summary. It was proposed resolution that was taken so I just dropped the not sure bit and I’ve updated other parts to reflect final status. Thanks for checking
Comment #54
driskell commentedAdded more detailed description of problem and solution to IS
Comment #55
smustgrave commentedThanks issue summary reads much better now!
Comment #60
catchCommitted/pushed to 11.x, cherry-picked to 10.2.x and 10.1.x, thanks!