Race conditions in the twig template cache [#2429659]

Reference: https://www.drupal.org/core/beta-changes
Issue category	Bug because fatal errors can cause sites to be unreachable until a manual cache clear is performed
Issue priority	Critical because it's easy to replicate the error situation
Prioritized changes	Bug fixes are prioritized changes in the beta phase
Disruption	The change is not disruptive

Comment	File	Size	Author
#119	race_conditions_in_the-2429659-119.patch	1.22 KB	geerlingguy
#114	race_conditions_in_the-2429659-114.patch	2.48 KB	geerlingguy
#108	race_conditions_in_the-2429659-108.patch	2.46 KB	star-szr
#108	race_conditions_in_the-2429659-108-testonly.patch	1.25 KB	star-szr
#98	drupal-error-first-page-load.jpg	7.45 KB	geerlingguy
#89	2429659_89.patch	1.22 KB	chx
#88	38461dd565579de4d7e93cbc2d9ac60601804d7fbd95e49d31cbf6ae8400643f.php_.txt	335.25 KB	geerlingguy
#85	twig-cache-file-4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php_.txt	1.09 KB	geerlingguy
#74	interdiff.txt	7.39 KB	chx
#74	2429659_74.patch	15.15 KB	chx
#71	2429659_71.patch	13.88 KB	chx
#69	2429659_69.patch	13.88 KB	chx
#66	2429659_66.patch	12.24 KB	chx
#62	2429659_62.patch	11.76 KB	chx
#59	2429659_59.patch	10.11 KB	chx
#52	interdiff-44-52.txt	2.43 KB	mpdonadio
#52	race_conditions_in_the-2429659-52.patch	5.28 KB	mpdonadio
#44	interdiff-40-44.txt	1.68 KB	mpdonadio
#44	race_conditions_in_the-2429659-44.patch	4.57 KB	mpdonadio
#40	race_conditions_in_the-2429659-40.patch	4.27 KB	berdir
#35	race_conditions_in_the-2429659-35.patch	4.26 KB	rteijeiro
#35	interdiff.txt	930 bytes	rteijeiro
#30	race_conditions_in_the-2429659-30-interdiff.txt	2.82 KB	berdir
#30	race_conditions_in_the-2429659-30.patch	4.32 KB	berdir
#24	race_conditions_in_the-2429659-24.patch	1.33 KB	mpdonadio
#22	race_conditions_in_the-2429659-22-interdiff.txt	881 bytes	berdir
#22	race_conditions_in_the-2429659-22.patch	2.21 KB	berdir
#20	race_conditions_in_the-2429659-20.patch	1.35 KB	mpdonadio
#3	evil-test-2429659-1.patch	2.05 KB	berdir

Comment #1

German

Switzerland

commented 19 February 2015 at 18:15

Note: This is on a shared file system, so that could make this more likely, but I'm quite sure that I've seen this locally as well.

Comment #2

dawehner

German

commented 19 February 2015 at 18:19

For me this happened in some scenarios when the stream wrappers haven't been setup properly.

Comment #3

berdir

German

Switzerland

commented 19 February 2015 at 21:40

Status:

Active

» Needs review

Status	File	Size
new	evil-test-2429659-1.patch	2.05 KB

This might be the most evil test I've ever written.

Unfortunately, the fix I thought of doesn't work because the file storage uses include_once, so it won't load the same file again and still fails. Maybe that is also related to the actual issue; I don't know what is really going on.

Comment #4

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 19 February 2015 at 21:51

I must say, that I was looking at the TestBot queue to see how big it was and couldn't help but read this issue...

Comment #5

19 February 2015 at 21:57

Status:

Needs review

» Needs work

The last submitted patch, 3: evil-test-2429659-1.patch, failed testing.

Comment #6

star-szr

he/him

English

commented 20 February 2015 at 13:15

Issue tags:

+Twig

Comment #7

berdir

German

Switzerland

commented 21 February 2015 at 19:18

Talked about this with @Cottser; it looks like the only useful thing that we can do here is throw an exception instead of having a fatal error. That still means the user gets an error page, but at least it's not a WSOD.

Comment #8

berdir

German

Switzerland

commented 22 February 2015 at 21:05

#2296009: Use APC Classloader by default (when available) just had a random fail with this, apparently caused by using APC. I think this means that those errors happen because of the apc classloader, although I don't know why yet.

Comment #9

fabianx commented 22 February 2015 at 21:29

Twig uses a double caching logic to avoid having to check the filesystem again and again for things that are in cache.

Therefore a race condition during cache clear could be expected.

I think the class_exists($cls, FALSE) is the proper fix.

Comment #10

berdir

German

Switzerland

commented 22 February 2015 at 21:32

Where exactly? And what should we do if it returns FALSE?

Comment #11

catch

he/him

English

commented 23 February 2015 at 14:51

HEAD just had a random fail with this as well.

Drupal core - 8.0.x fail: https://qa.drupal.org/8.0.x-status

Overall Summary
==============================================

FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,171 pass(es), 1 fail(s), and 0 exception(s).

Individual Environment Summaries
==============================================

-- [[SimpleTest]]: [PHP 5.4 MySQL] --

* Drupal\update\Tests\UpdateCoreTest (852 pass(es), 1 fail(s), and 0 exception(s))
   - [fail] [PHP Fatal error] " Class '__TwigTemplate_834b8992aafa7a111df3eb478be6ddcbc43ccfb5d5020e08270ddf79a80c5e7c' not found" in TwigEnvironment.php on line 141 of Unknown.

Comment #12

xjm

she/her

English

commented 23 February 2015 at 14:57

Issue tags:

+Random test failure

Comment #13

fabianx commented 25 February 2015 at 00:27

The caching logic happens in isFresh(), and that test for a cache miss is likely wrong, but this should not affect core as it only happens when auto_reload => TRUE, which we don't do for performance reasons.

That means somehow either ->save() to storage fails or ->load() loads something else.

That is tricky for sure, hardening we should do anyway, so as proposed already:

-        if (!$this->storage()->load($cache_filename)) {
+        if (!$this->storage()->load($cache_filename) || !class_exists($cls, FALSE)) {

at the minimum.

Then we could again do a check afterwards and throw an Exception, though that is really fatal if a newly dumped file cannot be loaded as class ...

Comment #14

berdir

German

Switzerland

commented 26 February 2015 at 21:20

Yeah, but the class exists check won't help, because the php storage won't ever load the file again. We would have to combine it with the non-cache version of using eval?

Comment #15

fabianx commented 26 February 2015 at 21:58

Why not?

If the class does not exist and we store and retrieve it then, it should exist afterwards or not?

Comment #16

berdir

German

Switzerland

commented 26 February 2015 at 22:05

Because FileStorage::load() does an include_once.

If the first load() call loads the file but that for some reason does *not* contain the class (only then we get into the || !class_exists() case), then calling load() again will *not* try to include that file again.

I quickly tried to switch to an include there, but Drupal dieded hard and I didn't try to fix it.
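The include_once behaviour described here can be reproduced outside Drupal. The following is only a minimal sketch: the temporary file stands in for a compiled template in PhpStorage, and the class name `DemoTemplate` is made up for the illustration.

```php
<?php
// Hypothetical stand-in for a compiled template file in PhpStorage.
$path = tempnam(sys_get_temp_dir(), 'tpl');

// First version of the file: it loads fine but does not define the class.
file_put_contents($path, '<?php // truncated or stale contents');
$first = include_once $path;

// The file is rewritten with the real class afterwards.
file_put_contents($path, '<?php class DemoTemplate {}');

// include_once sees that the path was already included, returns TRUE and
// does not execute the new contents, so the class is still missing.
$second = include_once $path;
$after_include_once = class_exists('DemoTemplate', FALSE);

// A plain include executes the file again and defines the class.
include $path;
$after_include = class_exists('DemoTemplate', FALSE);

unlink($path);
```

Note that a plain include of a file that declares an already-loaded class is a fatal error, so any switch away from include_once would need a class_exists() guard in front of it.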

Comment #17

berdir

German

Switzerland

commented 9 March 2015 at 11:59

While doing stress tests on #2336627-16: Deadlock on cache_config (DatabaseBackend::setMultiple()), I noticed this happening again, so this might be more common/easier to reproduce than I thought.

The only additional information I have right now is that I also got a few corresponding watchdog messages:

 ID  Date          Type  Severity  Message
 41  09/Mar 12:50  php             Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of core/lib/Drupal/Component/PhpStorage/FileStorage.php).
 40  09/Mar 12:50  php             Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of core/lib/Drupal/Component/PhpStorage/FileStorage.php).
 39  09/Mar 12:50  php             Warning: file_put_contents(sites/default/files/php/twig/1#06#1e#9975f4ab1194ed306eb5bad0ce7095521d1c9621e71adf689e5016f60a8f/.htaccess): failed to open stream: Permission denied in Drupal\Component\PhpStorage\FileStorage->ensureDirectory() (line 136 of core/lib/Drupal/Component/PhpStorage/FileStorage.php).
 38  09/Mar 12:50  php             Warning: file_put_contents(sites/default/files/php/twig/1#61#f8#2e0aac09f5d1126d7ded09d615cd03eb2a460783f2d84db6dec1de74ecfa/.htaccess): failed to open stream: No such file or directory in Drupal\Component\PhpStorage\FileStorage->ensureDirectory() (line 136 of core/lib/Drupal/Component/PhpStorage/FileStorage.php).

So definitely looks like some sort of conflict in the phpstorage, where multiple processes try to write the file?

Comment #18

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 9 March 2015 at 13:25

Would a `flock($handle, LOCK_EX);` be appropriate here? Since the lock is needed to write the cache and not read it, it won't be a huge performance hit?

Comment #19

berdir

German

Switzerland

commented 9 March 2015 at 13:32

I was thinking something like that, yes, but it would probably have to be abstracted through PhpStorage? So we could have a lockForWrite() method that we can call immediately after failing to include the file, or possibly even automatically in load() (maybe with an option to disable it), and the lock would be removed again once the file was written.

Comment #20

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 9 March 2015 at 14:50

Status:

Needs work

» Needs review

Status	File	Size
new	race_conditions_in_the-2429659-20.patch	1.35 KB

Hmm. PhpStorage doesn't use handles, so flock() can't be used directly unless we also add a lockfile to the Twig cache.

Maybe this is all that's needed?

Comment #21

berdir

German

Switzerland

commented 11 March 2015 at 13:23

I tried that, it gets rid of the watchdog messages, but I still get some class not found fatal errors.

Comment #22

berdir

German

Switzerland

commented 12 March 2015 at 00:08

Status	File	Size
new	race_conditions_in_the-2429659-22.patch	2.21 KB
new	race_conditions_in_the-2429659-22-interdiff.txt	881 bytes

Ok, this fixes the fatal errors for me, although I still get some warnings.

 60  12/Mar 01:01     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).     
 61  12/Mar 01:01     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).     
 62  12/Mar 01:02     php     Warning: file_put_contents(/home/berdir/Projekte/d8/sites/default/files/php/twig/1#bc#f7#30392e18d13056f9a940f0627ea8e84e3406b2877b6391fdc398a09c6201/.htaccess): failed to open stream: No  
 63  12/Mar 01:02     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).

Comment #23

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 12 March 2015 at 01:06

We could @mkdir to make the warnings go away, but it kinda sounds like there are still two processes in PhpStorage::save() at the same time.

Do we need to add an optional lock to PhpStorage, or make a LockablePhpStorage, either of whose constructors would accept a LockBackendInterface, and then protect save() with it? Or a lock in TwigEnvironment?

Comment #24

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 12 March 2015 at 01:24

Status	File	Size
new	race_conditions_in_the-2429659-24.patch	1.33 KB

Quick and dirty locking in FileStorage::save() to see what implodes. No other changes from HEAD.

Comment #25

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 12 March 2015 at 01:31

OK, some additional testing with increased concurrency in ab reveals that Bad Things happen with a lock there, and I still got the Twig errors...

Comment #26

12 March 2015 at 01:41

Status:

Needs review

» Needs work

The last submitted patch, 24: race_conditions_in_the-2429659-24.patch, failed testing.

Comment #27

geerlingguy commented 22 March 2015 at 01:51

Cottser sent me this way from #2234229: PHP Fatal Error: Class __TwigTemplate not found in TwigEnvironment.php... I had the same strange issue on an infrastructure with a replicated file system:

In setting up the Dramble cluster of D8 servers, I was running into the same issue. A while back, it had to do with file permissions issues in the shared GlusterFS mount, but I had resolved that error. After a few hours' debugging, I found that this WSOD and the exact same error can be caused by time drift on multiple servers with a shared mount.

It was the strangest thing, too. If I pointed the load balancer at just one of the servers, everything worked perfectly. If I pointed it at more than one server, then after accessing the second server, the Twig error would start popping up in the logs for all servers, and they'd all WSOD. Then if I pointed the balancer at one server again, that server would load Drupal just fine.

In my case, since I was using Raspberry Pis (without built-in clocks) on a local network, the time from system boot had about +/- 3 seconds of drift, and apparently that was enough to cause this strange error. I even tried having APC on with stat=0, APC on with stat=1, and APC off, and none of those changes made a difference.

Comment #28

fabianx commented 22 March 2015 at 09:24

#27: Your best bet is to put the twig template cache on /tmp, by setting up the phpstorage configuration in settings.php.

The template cache is also DB backed, so a global cache clear will clear all templates as the mtime is stored in the database.

Comment #29

berdir

German

Switzerland

commented 22 March 2015 at 09:35

@Fabianx: That's not correct?

Yes there's the auto refresh feature, but you definitely don't want to have that enabled on production, because that does a cache_get() for each template file.

Comment #30

berdir

German

Switzerland

commented 22 March 2015 at 10:38

Status:

Needs work

» Needs review

Status	File	Size
new	race_conditions_in_the-2429659-30.patch	4.32 KB
new	race_conditions_in_the-2429659-30-interdiff.txt	2.82 KB

I think there's not much we can do about the mkdir, see https://bugs.php.net/bug.php?id=35326 and #392100: Warning: mkdir(): File exists in imagecache_build_derivative() for somewhat similar issues.

Here's a combined patch with #22, the test from #3 (had to make a small change because the filename is not validated) and the @ for mkdir().

I think it would be great to get it in like that, which should take care of most issues.
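As an illustration of the mkdir() race and the @ suppression mentioned above, here is a minimal sketch. It is not the code from the patch: the helper name `ensure_directory_exists()` is hypothetical, and it assumes that "the directory exists afterwards" is the only thing the caller cares about.

```php
<?php
/**
 * Hypothetical helper: create a directory while tolerating a concurrent
 * process that creates the same directory first.
 */
function ensure_directory_exists(string $directory, int $mode = 0777): bool {
  if (is_dir($directory)) {
    return TRUE;
  }
  // Another process may create the directory between the is_dir() check
  // above and this call; mkdir() then fails with "File exists". Suppress
  // that warning and decide success by checking is_dir() again.
  if (@mkdir($directory, $mode, TRUE)) {
    return TRUE;
  }
  clearstatcache(TRUE, $directory);
  return is_dir($directory);
}
```

The point of the second is_dir() check is that a failed mkdir() is only an error if the directory is still missing afterwards.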

Comment #31

geerlingguy commented 22 March 2015 at 12:15

Your best bet is to put the twig template cache on /tmp, by setting up the phpstorage configuration in settings.php.

But doesn't the twig cache need to be on a shared filesystem if you have multiple webservers? Otherwise, each server would have to regenerate its own twig cache on each request... at least that's how it seems this would work. I could turn off the Twig cache entirely, but that would result in abysmal performance on the Raspberry Pis.

Comment #32

berdir

German

Switzerland

commented 22 March 2015 at 12:21

They would have to, but that's not the problem. The problem is distributing invalidations/cache clears. Just like ChainedFast cache backends, we'd need a shared flag/storage that they can check or you need a custom process that on cache clear, triggers a delete of those files on all servers.

Comment #33

wim leers

Ghent 🇧🇪🇪🇺

commented 22 March 2015 at 12:21

Issue tags:

+D8 cacheability, +Performance

Sounds like we need these additional tags.

Comment #34

fabianx commented 22 March 2015 at 15:37

#29: Right, that's not active, I was mistaken there.

Still putting it on /tmp/drupal-cache/ is good.

Need to clear /tmp/drupal-cache then on code deploys, but with code deploy on multiple webservers being usually by jenkins, that is not a big deal either ...

Comment #35

rteijeiro commented 22 March 2015 at 20:02

Status	File	Size
new	interdiff.txt	930 bytes
new	race_conditions_in_the-2429659-35.patch	4.26 KB

Fixed a couple of nitpicks.

Comment #36

stefan.r commented 6 April 2015 at 14:26

I just had this issue as well on a vagrant box, this patch fixed it :)

Comment #37

berdir

German

Switzerland

commented 6 April 2015 at 17:56

Anything left to do here? It has tests, I can't think of a different fix, it should also help to fix random testbot fails that sometimes happen.

Comment #38

fabianx commented 6 April 2015 at 17:58

Status:

Needs review

» Reviewed & tested by the community

Nope, let's get this in.

This is RTBC.

Comment #39

6 April 2015 at 18:08

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 35: race_conditions_in_the-2429659-35.patch, failed testing.

Comment #40

berdir

German

Switzerland

commented 6 April 2015 at 18:11

Status:

Needs work

» Reviewed & tested by the community

Status	File	Size
new	race_conditions_in_the-2429659-40.patch	4.27 KB

Simple conflict in the use statements in the test after the String/SafeMarkup change.

Comment #41

6 April 2015 at 18:29

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 40: race_conditions_in_the-2429659-40.patch, failed testing.

Comment #42

berdir

German

Switzerland

commented 6 April 2015 at 20:31

Uhm, did something change with the file storage tests that this now fails?

Comment #43

star-szr

he/him

English

commented 8 April 2015 at 01:55

Indeed, git bisect points to #2453399: Use VFS for FileStorage tests.

Comment #44

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 16 April 2015 at 13:56

Status	File	Size
new	race_conditions_in_the-2429659-44.patch	4.57 KB
new	interdiff-40-44.txt	1.68 KB

Maybe this? Used `parse_url` so the tests wouldn't have to rely on the file_system service just to check the scheme.

Comment #45

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 16 April 2015 at 13:56

Status:

Needs work

» Needs review

Grrr.

Comment #46

berdir

German

Switzerland

commented 20 April 2015 at 00:21

Tricky, but it seems to be working. Does anyone want to RTBC this? This is causing a lot of fatal errors; in one case, it even caused a permanent error that we had to fix with a manual cache clear. With the patch, this should basically be self-healing...

Comment #47

fabianx commented 20 April 2015 at 09:11

+++ b/core/modules/system/src/Tests/Theme/TwigEnvironmentTest.php
@@ -44,6 +45,19 @@ public function testInlineTemplate() {
+    $element['test'] = array(
+      '#type' => 'inline_template',
+      '#template' => $name,
+    );

I had to look three times, but yes due to how ChainedFileLoader works, you should indeed be able to embed not only inline templates, but also real templates ...

However is this still true? (We fixed some error messages and changed that part.)

Let's check that first ...

Comment #48

fabianx commented 20 April 2015 at 09:15

Status:

Needs review

» Reviewed & tested by the community

This just gives back the string maintenance-html.twig, but that is fine (as that is a valid twig template).

Therefore - even though that is probably not what berdir wanted originally - as it successfully tests the condition of a wrong template in the cache => RTBC.

If we wanted to we could change:

$template = 'Hello World - {{ 1+0 }} ';

or something like that ...

But leave the rest the same.

Comment #49

catch

he/him

English

commented 20 April 2015 at 11:06

Status:

Reviewed & tested by the community

» Needs review

+++ b/core/lib/Drupal/Core/Template/TwigEnvironment.php
@@ -127,10 +127,20 @@ public function loadTemplate($name, $index = NULL) {
+          eval('?' . '>' . $compiled_source);

I really don't like adding the eval() here even in a rare case. Is it worth looking at the switch to include Berdir mentioned in #16?

Comment #50

catch

he/him

English

commented 20 April 2015 at 11:42

Also if we have to have this protection for reads, what indication is there that the lock on write is doing anything?

Comment #51

catch

he/him

English

commented 20 April 2015 at 11:51

Status:

Needs review

» Needs work

```
+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -53,14 +53,15 @@ public function load($name) {
+      if (!file_exists($htaccess_path) && file_put_contents($htaccess_path, static::htaccessLines(), $flags)) {
```
file_put_contents() doesn't care if you write to an already existing file, so what does the LOCK_EX actually help with here?

It'll stop another process from writing to the file at the same time, but then it's going to write once the lock is released - it won't stop the write from happening at all. So potentially we'll actually spend longer writing to the same file?

I'm not sure what the implication actually is, but that needs inline documentation (if it's actually useful here).

+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -168,7 +170,7 @@ protected function createDirectory($directory, $mode = 0777, $is_backwards_recur
+      if ($status = @mkdir($directory)) {

Needs a comment.

+++ b/core/lib/Drupal/Core/Template/TwigEnvironment.php
@@ -127,10 +127,20 @@ public function loadTemplate($name, $index = NULL) {
+          // have failed load load the class. In that case, execute the code

"failed to load".

Comment #52

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 23:15

Status:

Needs work

» Needs review

Status	File	Size
new	race_conditions_in_the-2429659-52.patch	5.28 KB
new	interdiff-44-52.txt	2.43 KB

#51 has been addressed, but as I wrote out the comment for (1), I wondered if file_put_contents($path, $code, LOCK_EX); is really sufficient. It will grab an exclusive lock for the .htaccess (if needed), and the file itself, but two processes could still be in the function at the same time making the directory. I'm wondering if the ::save() method needs to be refactored to use flock() directly to make the entire method process exclusive?

Comment #53

24 April 2015 at 07:07

Status:

Needs review

» Needs work

The last submitted patch, 52: race_conditions_in_the-2429659-52.patch, failed testing.

Comment #54

catch

he/him

English

commented 24 April 2015 at 10:01

From everything I could find, LOCK_EX is only a write lock, it's not a read lock at all, i.e. http://php.net/manual/en/function.flock.php

If it was a read lock, we wouldn't need the fallback to eval() later on in the patch.

If it's only a write lock, then I don't see how it helps the race condition - it might even make it worse since you'd have two subsequent writes to the file, during which it can be partially read, rather than two concurrent ones which may finish faster.

Comment #55

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 24 April 2015 at 13:43

I am going to try to get confirmation about LOCK_EX; maybe I can raise @ircmaxell. My understanding from the comments is that it is an exclusive lock, which means only one process can have the file at any time; IOW it would block both readers and writers to the file.

Comment #56

catch

he/him

English

commented 24 April 2015 at 13:49

There's a stackoverflow thread here: http://stackoverflow.com/questions/4899737/should-lock-ex-on-both-read-w... which says it's only an advisory lock - so it'd work if all calling code used LOCK_EX - but file_get_contents() can't. Also it looks like ircmaxell answered that thread..
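The advisory nature of the lock can be shown in a few lines. This is a sketch under assumptions: it runs writer and reader in one process for brevity, uses a throwaway temporary file, and describes POSIX behaviour (the PHP manual notes that flock() is mandatory on Windows).

```php
<?php
$path = tempnam(sys_get_temp_dir(), 'lock');
file_put_contents($path, 'old contents');

// "Writer": take an exclusive advisory lock on the file and hold it.
$writer = fopen($path, 'c+');
$locked = flock($writer, LOCK_EX);

// "Reader" that never calls flock(): on POSIX systems the read is not
// blocked, because nothing forces the reader to honour the lock.
$seen_by_reader = file_get_contents($path);

flock($writer, LOCK_UN);
fclose($writer);
unlink($path);
```

The same applies to include and include_once, which read the file without taking any lock, so a LOCK_EX on the writing side alone does not keep a reader from seeing a half-written file.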

Comment #57

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 24 April 2015 at 14:07

Blerg. Do you think refactoring this to create a class AtomicFileStorage implements PhpStorageInterface would be worth it?

Comment #58

chx commented 29 April 2015 at 05:34

My solution would be to remove exists from PhpStorageInterface as it is pointless and rather pass an optional callable to PhpStorageInterface::load which it can run after including to determine whether the loading was successful. The caller has logic already to deal with load failures.

Comment #59

chx commented 29 April 2015 at 05:33

Status:

Needs work

» Needs review

Status	File	Size
new	2429659_59.patch	10.11 KB

This patch removes exists as it is unusable and adds loadClass instead.

Comment #60

dawehner

German

commented 29 April 2015 at 05:48

+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -33,14 +33,30 @@ public function __construct(array $configuration) {
+    while (!($exists = class_exists($class_name, FALSE)) && $this->doLoadClass($name, TRUE) && !($exists = class_exists($class_name, FALSE)) && $retry--) {
+      usleep(mt_rand($wait, 2 * $wait));
+    }

It would be great to explain why we are waiting here ...
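For readers following along, here is one plausible reading of such a retry loop, written as a standalone sketch rather than the patch's actual code. The function name `include_until_class_exists()` and its defaults are hypothetical, and the stated reason for waiting (giving a concurrent writer time to finish) is an assumption.

```php
<?php
/**
 * Hypothetical helper: include a file until it defines the expected class.
 *
 * Returns TRUE once the class exists, FALSE if the file cannot be included
 * or the class is still missing after $retry additional attempts.
 */
function include_until_class_exists(string $path, string $class_name, int $retry = 5, int $wait = 1000): bool {
  for ($attempt = 0; $attempt <= $retry; $attempt++) {
    if ($attempt > 0) {
      // Random back-off between $wait and 2 * $wait microseconds, so that
      // competing processes do not all retry at the same moment.
      usleep(mt_rand($wait, 2 * $wait));
    }
    if (class_exists($class_name, FALSE)) {
      return TRUE;
    }
    // Plain include instead of include_once, so a rewritten file is read
    // again on the next attempt.
    if ((@include $path) === FALSE) {
      return FALSE;
    }
  }
  return class_exists($class_name, FALSE);
}
```

The class_exists() check comes before every include so that a file declaring an already-loaded class is never included a second time.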

Comment #61

29 April 2015 at 05:51

Status:

Needs review

» Needs work

The last submitted patch, 59: 2429659_59.patch, failed testing.

Comment #62

chx commented 29 April 2015 at 06:17

Status:

Needs work

» Needs review

Status	File	Size
new	2429659_62.patch	11.76 KB

Sure.

Comment #63

dawehner

German

commented 29 April 2015 at 06:21

Thank you chx. I'm sorry that I hadn't seen the bit on the interface.

Comment #65

29 April 2015 at 06:47

Status:

Needs review

» Needs work

The last submitted patch, 62: 2429659_62.patch, failed testing.

Comment #66

chx commented 29 April 2015 at 07:00

Issue summary:	View changes
Status:	Needs work	» Needs review

Status	File	Size
new	2429659_66.patch	12.24 KB

5 files were hidden/shown/deleted

Status	File	Size
hidden	interdiff-40-44.txt	1.68 KB
hidden	race_conditions_in_the-2429659-52.patch	5.28 KB
hidden	interdiff-44-52.txt	2.43 KB
hidden	2429659_59.patch	10.11 KB
hidden	2429659_62.patch	11.76 KB

Last fail: @covers exists when exists no longer, well, exists.

Comment #67

berdir

German

Switzerland

commented 29 April 2015 at 10:14

This looks quite interesting.

Not sure if it is possible to restore my test, or test it in a different way somehow?

But I'll repeat my manual tests (mixing ab -c 10 with drush cr calls)

Comment #68

chx commented 29 April 2015 at 15:16

If it were fundamentally broken nothing would work since this loads DrupalKernel. If we badly want to test this then we could test the implementation which I never like by creating two files:

// This goes to a known fixed path.
class foobar {}

// Put this in phpstorage.
if ($GLOBALS['test_retry']--) return;
include_once 'known/fixed/path/foobar.php';

and then test

$GLOBALS['test_retry'] = 5;
for ($i = 0; $i < 9 && !class_exists('foobar', FALSE); $i++) {
  include "x.php";
}
// assert $i is 6.

Comment #69

chx commented 30 April 2015 at 16:45

Status	File	Size
new	2429659_69.patch	13.88 KB

1 file was hidden/shown/deleted

Status	File	Size
hidden	2429659_66.patch	12.24 KB

We could use a success callable which opens the door for success checking including arrays, functions and whatever else. Even if we want to go back to the previous version, there are small improvements in this one compared to previous.

Comment #70

30 April 2015 at 17:07

Status:

Needs review

» Needs work

The last submitted patch, 69: 2429659_69.patch, failed testing.

Comment #71

chx commented 30 April 2015 at 17:12

Status:

Needs work

» Needs review

Status	File	Size
new	2429659_71.patch	13.88 KB

1 file was hidden/shown/deleted

Status	File	Size
hidden	2429659_69.patch	13.88 KB

Only 22439 fails? I need to work harder.

Comment #72

chx commented 30 April 2015 at 18:01

Issue summary:

View changes

Comment #73

dawehner

German

commented 4 May 2015 at 14:01

Regarding test coverage, I think any non-race-condition problem will be visible immediately, given how often we render something using a template, so that part is fine for me :)

+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -33,18 +33,55 @@ public function __construct(array $configuration) {
+    if (!isset($success)) {
+      return $this->doIncludeOnce($name);
+    }
...
+    while ($this->doInclude($name) && !($return = $success()) && $retry--) {

It is not at all obvious why we have both doIncludeOnce and doInclude ... do you mind explaining this somewhere? I assume there is some intention behind it?

Comment #74

chx commented 4 May 2015 at 16:21

Status	File	Size
new	2429659_74.patch	15.15 KB
new	interdiff.txt	7.39 KB

Added code and doxygen to make load and loadClass safe by default but overridable. Much nicer, and it makes it very visible why doIncludeOnce is necessary:

    if (!isset($success)) {
      return $force ? $this->doInclude($name) : $this->doIncludeOnce($name);
    }

Comment #75

4 May 2015 at 16:50

Status:

Needs review

» Needs work

The last submitted patch, 74: 2429659_74.patch, failed testing.

Comment #76

8 May 2015 at 14:04

Berdir queued 74: 2429659_74.patch for re-testing.

Comment #77

8 May 2015 at 14:57

The last submitted patch, 74: 2429659_74.patch, failed testing.

Comment #78

berdir

German

Switzerland

commented 8 May 2015 at 15:41

The last patch doesn't solve the problem for me. When stress-testing this according to #17, I still sometimes get the fatal error.

Note that on HEAD, it's more likely that you will run into serialization exceptions, see the issue in #17.

I also get watchdog entries like this:

1173055  08/Mai 16:59     php    Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 213 of .../core/lib/Drupal/Component/PhpStorage/FileStorage.php). 
 1173056  08/Mai 16:59     php    Warning: file_put_contents(.../sites/default/files/php/twig/1#dc#a6#d781522ea6a6ab18c0e6450635eab7eb38b52f46063d71ba1fff69b068ca/.htaccess): failed to open stream:

I'm not sure, but I think the patch also resulted in apache getting stuck completely :)

Going to try with the patch from #52 on production now, that seems to work fine locally.
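
The "mkdir(): File exists" warning above is typical of two requests racing to create the same directory. A minimal sketch of a race-tolerant directory creation follows; it assumes nothing about the real createDirectory() method, and ensure_directory_sketch() is a hypothetical helper name.

```php
<?php
/**
 * Creates a directory, tolerating a concurrent request creating it first.
 *
 * Hypothetical helper for illustration; not the FileStorage implementation.
 */
function ensure_directory_sketch(string $directory, int $mode = 0777): bool {
  if (is_dir($directory)) {
    return TRUE;
  }
  // Suppress the "File exists" warning: another process may win the race
  // between the is_dir() check above and this mkdir() call.
  if (@mkdir($directory, $mode, TRUE)) {
    return TRUE;
  }
  // mkdir() failed. That is fine as long as the directory exists now.
  clearstatcache(TRUE, $directory);
  return is_dir($directory);
}
```

The point of the final is_dir() check is that losing the race is treated as success, so only a directory that is really missing counts as a failure.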

Comment #79

chx commented 10 May 2015 at 04:35

Something is odd here: we write the file first to a random location (str_shuffle combined with microtime) and then do a rename. And yet somehow we manage to read a partially written file? How come?

Comment #80

geerlingguy commented 15 May 2015 at 17:35

So (mostly for my own benefit), I can reproduce the error:

2015/05/08 21:59:34 [error] 2192#0: *23 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_3df18936b0043763da1aa91635787628747bdf2bc4c34a6457aea4fc3f8fa5ea' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

On my Raspberry Pi Dramble stack (4 webservers using a shared Gluster mount).

Steps to reproduce:

Set up Drupal 8 HEAD (or beta10), have all servers' clocks in sync (in my case, 4 servers behind Nginx as a load balancer, with Nginx + PHP-fpm on each webserver, with all Drupal system defaults, e.g. Twig cache stored in normal files folder—files folder mounted via GlusterFS shared on all the webservers).
Load up the user login page (notice that it works fine on any of the servers).
Set the date on each of the webservers to be about a minute off from each other.
Reload the user login page, get WSOD, see above error.

Notes:

If I switch the Nginx load balancer to point at only one of the webservers (any one of them), everything works fine again.
If I reset all the clocks to be the exact same time (e.g. $ ansible all -m shell -a "date --set='10:25:55'" -s), everything works fine again.
Interestingly, every once in a while, the login page (or the 'create new account' and 'reset password' pages, which I've also tested) will actually load, randomly, from one of the four webservers behind the balancer. But if I run through a bunch of refreshes again, the same server WSODs with the above error again.

I'm at the DrupalCon LA sprints, and I will be testing the patch here and seeing if I can help get this resolved (or at least more thoroughly tested) today.

Comment #81

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 15 May 2015 at 18:08

@geerlingguy, are you NTP synced, or just jam syncing?

Comment #82

geerlingguy commented 15 May 2015 at 19:57

I tested the patch in #74, and while it doesn't solve the WSOD issue on the Pis with time drift, it doesn't cause any regression either, and seems like a good approach towards improving the situation. Still doing more testing, too, and I'll try to work with chx via IRC if I can some time this afternoon.

Comment #83

geerlingguy commented 15 May 2015 at 20:18

@mpdonadio - to test time drift, I set each of the four webservers' clocks explicitly to a value +/- 5 seconds from the actual time (using, e.g. sudo date --set='21:52:48'). Then, to bring them back in sync, I'm jam syncing from my Mac's current time to the stack.

Since the servers have no RTC, and are not connected to the Internet (and don't have NTP), they don't have any (simple) ability to keep their time in sync otherwise. I've thought about setting up one of the servers as an NTP server, with an RTC set up through GPIO, so the other servers could sync to the master clock, but haven't yet set that up (see: https://github.com/geerlingguy/raspberry-pi-dramble/issues/43).

Granted, this is (or at least should be) a pretty rare use case—having a multi-server setup with the twig cache on a shared filesystem with servers that have out-of-sync clocks. So we might not want to bikeshed on that. I'd just like to see if I can figure out exactly where the issue is coming from, since the twig file exists on all the servers, and it seems like a semi-random failure (some requests go through, some are 500s with the above error).

Comment #84

chx commented 15 May 2015 at 21:24

Yes, if you can figure this out, the contents of the file being read would be invaluable. See #79

Comment #85

geerlingguy commented 15 May 2015 at 21:44

Status	File	Size
new	twig-cache-file-4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php_.txt	1.09 KB

@chx - I've attached the file that's present at /mnt/gluster/files/php/twig/1#36#e9#218a72f912db7dd7fe18458418b9145115f045ba90b5100e5d5fe8b5716d/4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php (that file is available at that mount, which is mounted at the point /var/www/drupal/sites/default/files). The watchdog message that appears when that file is trying to be loaded is:

2015/05/09 21:58:35 [error] 4073#0: *3059 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_36e9218a72f912db7dd7fe18458418b9145115f045ba90b5100e5d5fe8b5716d' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

The file is definitely present on the system, it just won't be loaded for some reason, when the clock drifts from other servers handling active requests.

Comment #86

geerlingguy commented 15 May 2015 at 22:27

I'm also trying to reproduce the OP's bug on the Dramble stack, and can't quite do so—steps followed:

Start a cache clear on one of the webservers: drush cr
(While the cache clear is ongoing): Refresh page over and over. Page loads slowly, but loads after a few seconds with a 200.
(Also while the cache clear is ongoing): Run wrk -t4 -c24 -d10 http://pidramble.com/?nocache=true (this loads the page continuously). Page loads slowly, but all requests still go through, with 0 errors.

I tried the above with both beta10 and with HEAD, and I can't reproduce the original issue, except when I explicitly set the Pi clocks to drift. The patch in #74 seems like it's a good approach here for any edge cases, but I am finding it hard to reproduce the actual issue with the cache rebuild race condition.

Comment #87

geerlingguy commented 15 May 2015 at 23:00

I also added the debug code inside MTimeProtectedFileStorage.php:

  public function load($name) {
    file_put_contents('/tmp/log.txt', $name, FILE_APPEND);
    if (($filename = $this->checkFile($name)) !== FALSE) {
      // Inline parent::load() to avoid an expensive getFullPath() call.
      return (@include_once $filename) !== FALSE;
    }
    return FALSE;
  }

Contents of /tmp/log.txt:

service_container_prod.php

And with that call to file_put_contents() (and no other changes), I'm now getting a new error message in the logs:

2015/05/09 15:56:50 [error] 5110#0: *15 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class 'Drupal\Core\Template\Loader\FilesystemLoader' not found in /mnt/gluster/files/php/service_container/service_container_prod/38461dd565579de4d7e93cbc2d9ac60601804d7fbd95e49d31cbf6ae8400643f.php on line 6859" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login?oijsdfasd HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

I will attach the loaded service container file as mentioned in the above error message in the next comment.

Comment #88

geerlingguy commented 15 May 2015 at 23:01

Status	File	Size
new	38461dd565579de4d7e93cbc2d9ac60601804d7fbd95e49d31cbf6ae8400643f.php_.txt	335.25 KB

(And a note: sorry for oversharing the data here... I just want to make sure I put as much debug info into this issue as possible since it's rare I get more than a few minutes to dig into this bug, and it's probably not a bug that's quick/easy to reproduce!)

Attached is the service container mentioned in the previous comment.

Comment #89

chx commented 16 May 2015 at 02:16

Status:

Needs work

» Needs review

Status	File	Size
new	2429659_89.patch	1.22 KB

This patch implements logic similar to DrupalKernel to make sure the error can't exist.
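
A minimal sketch of this kind of fallback, assuming it amounts to compiling the template in memory whenever the stored file fails to define the expected class; the attached patch remains the authority on what was actually changed. The function name and both callables are hypothetical stand-ins, not Drupal or Twig APIs.

```php
<?php
/**
 * Makes sure a compiled template class is defined before it is instantiated.
 *
 * Hypothetical sketch: $load_from_storage and $compile stand in for the PHP
 * storage include and the template compiler.
 */
function ensure_template_class_sketch(string $class, callable $load_from_storage, callable $compile): bool {
  if (class_exists($class, FALSE)) {
    return TRUE;
  }
  // Normal path: include the cached file, which should define the class.
  $load_from_storage();
  if (class_exists($class, FALSE)) {
    return TRUE;
  }
  // The cached file was missing, empty or partially written. Compile the
  // template again and evaluate the source in memory, so that this request
  // does not depend on the state of the file system.
  $source = $compile();
  eval('?>' . $source);
  return class_exists($class, FALSE);
}
```

With such a fallback the request that loses the race pays for one extra compilation instead of ending in a "Class not found" fatal error.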

Comment #90

pbuyle commented 28 June 2015 at 17:11

I had the same issue on a fresh Drupal 8 install made with http://www.drupalvm.com/, the patch in #89 fixed it.

Comment #91

geerlingguy commented 29 June 2015 at 03:47

@pbuyle @chx - Strangely, I just started seeing Twig errors on initial site install (they cleared up after a second page refresh) today, reliably reproducible on a fresh Drupal 8 HEAD install on either Ubuntu 14.04 or 12.04 with PHP 5.5. I'm testing this patch now (using Drupal VM).

Comment #92

geerlingguy commented 29 June 2015 at 04:22

Status:

Needs review

» Reviewed & tested by the community

To reproduce (if you don't want to set up a cluster of servers with a Gluster filesystem):

Build an instance of Drupal VM with the default settings (will install D8 HEAD)
Load the default URL for the D8 site http://drupaltest.dev/

Observe a twig error:

No front page content has been created yet.

Fatal error: Class '__TwigTemplate_94bd2225e3fafe16fc2afc51cd67340aa54d733f8325c53771de007781bb0398' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141

Refresh the page, and observe that the site loads correctly (with the default standard profile home page layout).

After applying the patch in #89, and completely rebuilding the VM from scratch (or reinstalling the site), the first page load is fine.

I was able to reproduce 3/3 tries, and used both Ubuntu 12.04 and 14.04, with PHP 5.5 and 5.6 with each of the OSes. The patch was also successful in not resulting in an error on first page load 3/3 tries.

Thumbs up from me; this is an annoying bugger, and I think I've now spent at least 12 hours of my life on it! Thanks so much for the patch, @chx :)

Comment #93

fabianx commented 29 June 2015 at 08:22

RTBC + 1, even if chx' approach in the other issue is looking promising, this should help to at least avoid the race condition for now.

Can we open a (postponed) follow-up based on #2513326: Performance: create a PHP storage backend directly backed by cache to enable that for twig templates, too?

Comment #94

alexpott

he/they

English

🇪🇺🌍

commented 30 June 2015 at 12:08

I hesitate to ask, but is this in any way testable?

Comment #95

3 July 2015 at 06:39

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 89: 2429659_89.patch, failed testing.

Comment #96

geerlingguy commented 3 July 2015 at 20:46

Status:

Needs work

» Needs review

@alexpott - I can reproduce 100% of the time in two different scenarios (one that's simple enough for anyone to reproduce, anywhere, the other requiring a set of servers that have their filesystems shared via GlusterFS... so the first way is probably the simplest for easy reproduction):

Follow the Drupal VM Quick Start Guide to install an instance of Drupal VM.
After about 5-10 minutes (depending on connection speed), provisioning will be complete, with D8 HEAD installed.
Fire up your web browser and load http://drupalvm.dev/. Observe attached image and logged PHP error in /var/log/apache2/error.log (pasted below).

[Fri Jul 03 20:36:26.943786 2015] [:error] [pid 23023] [client 192.168.88.1:63018] Uncaught PHP Exception Drupal\\Core\\Database\\DatabaseExceptionWrapper: "SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction: INSERT INTO {cache_config} (cid, expire, created, tags, checksum, data, serialized) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array\n(\n    [:db_insert_placeholder_0] => bartik.settings\n    [:db_insert_placeholder_1] => -1\n    [:db_insert_placeholder_2] => 1435955786.931\n    [:db_insert_placeholder_3] => \n    [:db_insert_placeholder_4] => 0\n    [:db_insert_placeholder_5] => b:0;\n    [:db_insert_placeholder_6] => 1\n)\n" at /var/www/drupal/core/lib/Drupal/Core/Database/Connection.php line 609

Side note: Judging by the error message above, this could be related to #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()) as well. Regardless, it seems the patch in #89 fixes at least this particular issue.

To reproduce, vagrant destroy -f && vagrant up, then load the page again. If I add the patch to the drush make example that's included with Drupal VM, then provision a new instance of the VM, the error is 100% reliably fixed! I also tested it on the Raspberry Pi #Dramble cluster, and while it's harder to reliably reproduce in that scenario, with the patch applied I never saw the error, but without it, I did see the error from time to time when the clocks were more than a few seconds out of sync (since they're not synced via NTP, they drift after a few days).

Here's the exact Drupal make file I was using with Drupal VM (could work with other setups, just make sure PHP/webserver has been cleanly restarted so nothing is in memory already prior to first page load, and make sure you don't install Drupal via web UI, since that may pre-build some of the twig caches):

---
api: 2
core: "8.x"

projects:

  drupal:
    type: "core"
    download:
      branch: "8.0.x"
    patch:
      - "https://www.drupal.org/files/issues/2429659_89.patch"

  devel: "1.x-dev"

Nudging testbot again...

Comment #97

3 July 2015 at 20:47

geerlingguy queued 89: 2429659_89.patch for re-testing.

Comment #98

geerlingguy commented 3 July 2015 at 20:49

Status	File	Size
new	drupal-error-first-page-load.jpg	7.45 KB

Forgot to attach the first page load error image (this is what the first visitor to a new site sees):

Drupal - error on first page load

Comment #99

fabianx commented 4 July 2015 at 14:28

Status:

Needs review

» Reviewed & tested by the community

I don't think it's possible to test the race condition, but the manual testing should be good enough.

Comment #100

Anonymous (not verified) commented 4 July 2015 at 14:41

a long time ago, in a galaxy far, far away... #850782: allow testing lock code via async http calls

well, 2010. who can remember wtf was happening back then these days?

Comment #101

lauriii

he/him

Finnish

Finland

commented 4 July 2015 at 14:52

Issue summary:

View changes

Added beta evaluation

Comment #102

berdir

German

Switzerland

commented 4 July 2015 at 15:00

Hm. Actually, that's more or less the same code as I had from the beginning. And that had tests. See for example in #52 which I think is the last place they existed.

@beejebus: I don't see how that would help here. This is not predictable, it happens when you have many requests that are trying to build the same twig templates, possibly on a slow file system. That's very different from a predictable lock where you know the second request will hit the lock.

Comment #103

Anonymous (not verified) commented 4 July 2015 at 15:13

@berdir - the only predictable thing here is that we'll never catch these bugs by sending one request at a time.

we have many sections of code that make assertions about safety when running across multiple concurrent requests, and test literally 0% of them that way. hard to figure out why 'it might not catch everything with our test suite' is used to defend 'we will never catch any of this class of bugs with our current test suite'.

Comment #104

fabianx commented 4 July 2015 at 15:35

Status:

Reviewed & tested by the community

» Needs work

Yes, right, sorry berdir I totally forgot that #3 had a test, which at least tests the error condition part of it.

Let's merge the test of #3 into this patch.

Comment #105

berdir

German

Switzerland

commented 4 July 2015 at 17:25

The test in #52 might be easier to merge, I had to update it at some point.

Comment #106

geerlingguy commented 4 July 2015 at 22:25

Just to give a little more of a push for this: I fired up a fresh instance of the entire Dramble stack (I re-imaged all 6 Pi microSD cards, then reprovisioned everything, and installed Drupal 8 HEAD from this morning), and now it seems about every 5th or 6th page load (when logged in) results in a WSOD on first load, then if I refresh, the page loads fine.

I currently have the Nginx balancer set to go round-robin, with no IP pinning, so each request goes to a new web node behind the balancer. If I turn IP pinning on, then requests work correctly. But in the real world, most requests would be distributed amongst the backend servers, resulting in users hitting the WSOD pages on initial page load.

The error in the Nginx log on the backend server is as follows:

2015/07/04 22:22:59 [error] 2290#0: *52 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_813b6bb6bcccc2ae6f6de69fbc5bf6c3ebe61f1c97eb2e81de9e4d330bb17c61' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /admin/reports/dblog?page=2 HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com", referrer: "http://pidramble.com/admin/reports/dblog?page=1"

Comment #107

geerlingguy commented 8 July 2015 at 18:58

I definitely still agree with 'Major' priority... might even argue for Critical, since now I can reproduce this error in three different environments. Running Debian 7.8 with PHP 7.0.0 beta1 compiled, I get errors like the following every 3-5 page loads:

Error: Class '__TwigTemplate_408ac94a722c70a8d9e243d9eebb745aeff646c4fc11d22dd91885dbb5553bce' not found in Drupal\Core\Template\TwigEnvironment->loadTemplate() (line 141 of core/lib/Drupal/Core/Template/TwigEnvironment.php).

This is on a Raspberry Pi with slow disk writes (since Drupal is running on a microSD card)... maybe the slow disk access is causing the template to not be found? The compiled directory/file is most definitely present at sites/default/files/php/twig/1#40#8a#c94a722c70a8d9e243d9eebb745aeff646c4fc11d22dd91885dbb5553bce, but the .php file is empty...

Though the Pi is an exceptional use case, I have definitely seen weird/slow disk writes in many cloud environments, and I can imagine more people will run into these issues as well.

Update: It seems that it may have been a weird case of FPM having some split-brained behavior. A full restart of the server seems to have cleared up the errors this time; maybe different child processes had different/broken cached file directory information or something.

Comment #108

star-szr

he/him

English

commented 9 July 2015 at 02:51

Status:

Needs work

» Needs review

Status	File	Size
new	race_conditions_in_the-2429659-108-testonly.patch	1.25 KB
new	race_conditions_in_the-2429659-108.patch	2.46 KB

I can't get the test-only to fail, but here's something.

Comment #109

berdir

German

Switzerland

commented 9 July 2015 at 08:21

Would be very interesting to see if #2527478: Resolve infinite stampede in mtime protected PHP storage improves the frequency of this happening.

I still think we should get this committed to avoid the fatals *if* it happens. But I hope that the other issue means it happens much less frequently.

Comment #110

geerlingguy commented 9 July 2015 at 19:20

Status:

Needs review

» Reviewed & tested by the community

Still RTBC in my book—and that test probably won't ever fail unless you can make testbots do multiple concurrent requests (#2527478: Resolve infinite stampede in mtime protected PHP storage should help). Let's stop the bleeding...

Comment #111

fabianx commented 11 July 2015 at 13:56

Status:

Reviewed & tested by the community

» Needs review

+++ b/core/modules/system/src/Tests/Theme/TwigEnvironmentTest.php
@@ -46,6 +47,19 @@ public function testInlineTemplate() {
+    $cache_file = $environment->getCacheFilename($name);
...
+      '#template' => $name,

Don't we add something by now to inline template names in the InlineLoader?

I think if we want to fool the cache, we need to do the same {# #} comment adding here ...

Comment #112

star-szr

he/him

English

commented 11 July 2015 at 17:40

Yup that is probably needed to calculate the cache filename properly: https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21Template%...

Doing a quick test of this:
$cache_file = $environment->getCacheFilename('{# inline_template_start #}' . $name);

Seemed to result in an infinite loop for me. Not sure why…

Edit: And also combined with doing an actual inline template like @Fabianx suggested earlier: Hello World - {{ 1+0 }}.

Comment #113

fabianx commented 11 July 2015 at 20:44

Hmm, maybe let's use a real template then instead of an inline template, or call twig_render_template directly?

Comment #114

geerlingguy commented 27 July 2015 at 21:51

Status	File	Size
new	race_conditions_in_the-2429659-114.patch	2.48 KB

Reroll of #108; test file changed.

Note that this patch is still necessary to get things working with Drupal 8 on my little Raspberry Pi cluster, using GlusterFS for a shared files directory mount. I seem to have encountered another race condition over in #2540912: Installation fails with files directory on glusterfs: "Warning: mkdir(): File exists", but it looks like that one is resolved by #2497243-188: Replace Symfony container with a Drupal one, stored in cache.

Comment #115

catch

he/him

English

commented 28 July 2015 at 13:10

Priority:

Major

» Critical

I'd been assuming this was self-healing, but had missed berdir's comment in https://www.drupal.org/node/2429659#comment-9845943 where he pointed out the site was down permanently until they manually intervened. Since berdir's site is more or less the only 8.x site with any traffic at the moment, we should assume that people will run into this more often.

I'm bumping this to critical. However, this is also pretty much there except for test coverage. Given it's very hard to test, and we have months of manual testing on patched installs, let's add the regression test in a follow-up issue.

Comment #116

fabianx commented 28 July 2015 at 13:16

Status:

Needs review

» Reviewed & tested by the community

RTBC then, we can fix the test in a follow-up. (and should create one for it)

Comment #117

lauriii

he/him

Finnish

Finland

commented 28 July 2015 at 13:16

I've seen this problem too so agree that this is quite critical

Comment #118

catch

he/him

English

commented 28 July 2015 at 13:21

Status:

Reviewed & tested by the community

» Needs work

Should we not remove the test and put it back in the follow-up?

Or is it worth having the test there to avoid a further regression which might actually make the test fail? If so I could go for that, but I think it needs a @todo in the test to point out that we've not been able to get it to fail.

Comment #119

geerlingguy commented 28 July 2015 at 16:30

Status:

Needs work

» Needs review

Status	File	Size
new	race_conditions_in_the-2429659-119.patch	1.22 KB

Patch with bugfix-only is attached. I'll split the test into a separate patch for a follow-up issue, to be opened momentarily...

Comment #120

geerlingguy commented 28 July 2015 at 16:35

Issue summary:

View changes

Updated IS, since it's been a while...

Comment #121

geerlingguy commented 28 July 2015 at 16:41

Added related issue #2541440: Add tests for race conditions in Twig template cache for the tests. I'll supply the first patch, with the broken out test, over there.

Can we get an RTBC on the above reroll without the tests?

Comment #122

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 28 July 2015 at 16:46

I kinda think we should add a @todo linking to the issue mentioning, and also mention that there is no coverage for that code.

Comment #123

dawehner

German

commented 28 July 2015 at 16:56

I kinda think we should add a @todo linking to the issue mentioning, and also mention that there is no coverage for that code.

+1 for that idea. Yeah, race conditions can be hard to test, that is for sure.

Thank you @geerlingguy for updating the issue summary.

Comment #124

damien tournoud commented 28 July 2015 at 17:23

From the look of this, the main bug is actually in \Drupal\Component\PhpStorage\FileStorage, which has more race conditions than lines of code.

The main problem is the implementation of PhpStorageInterface::save(), which is basically a wrapper around file_put_contents() and as a consequence has two major problems:

First, it is not protected against partial writes. By default file_put_contents() doesn't perform any kind of locking, which means that readers are going to see partial content. Because the readers are using @include, all the errors are ignored, and partial content is most likely going to appear as a cache miss (but really, who knows? we could also totally load a partial template, or execute some arbitrary code);
Second, it doesn't call fsync() on the file, which means that the write is not guaranteed to be immediately committed (and as a consequence visible from the other nodes in a distributed filesystem).

We need to implement a standard "write to a temporary file, flush and rename" protocol here.

We could also use a fixed name for the temporary file, and lock it with LOCK_EX, which might or might not help with stampedes. (Nowadays, most distributed filesystems have locking primitives, but it might be only local to the node.)
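
The "write to a temporary file, flush and rename" protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not a proposed FileStorage implementation: atomic_write_sketch() is a hypothetical helper, the temporary file is assumed to live on the same file system as the target, and fsync() is only called where PHP provides it (8.1 and later).

```php
<?php
/**
 * Writes a file so that readers see either the old or the new content.
 *
 * Hypothetical helper for illustration; not the FileStorage implementation.
 */
function atomic_write_sketch(string $path, string $data): bool {
  // The temporary file must be on the same file system as the target,
  // otherwise rename() is not atomic.
  $temp = tempnam(dirname($path), '.tmp');
  if ($temp === FALSE) {
    return FALSE;
  }
  $handle = fopen($temp, 'wb');
  if ($handle === FALSE) {
    @unlink($temp);
    return FALSE;
  }
  // Write everything and flush PHP's buffers to the operating system.
  $ok = fwrite($handle, $data) === strlen($data) && fflush($handle);
  // Ask the operating system to commit the data where fsync() is available.
  if ($ok && function_exists('fsync')) {
    $ok = fsync($handle);
  }
  fclose($handle);
  // Publish the complete file under its final name in one step.
  if (!$ok || !rename($temp, $path)) {
    @unlink($temp);
    return FALSE;
  }
  return TRUE;
}
```

Whether rename() and fsync() give the same guarantees on a distributed file system such as GlusterFS is an open question that this sketch does not answer.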

Comment #125

berdir

German

Switzerland

commented 28 July 2015 at 18:03

@znerol tried to improve that in #2527478: Resolve infinite stampede in mtime protected PHP storage I think. A review from you would be great there.

There were also various attempts by @chx and @mpdonadio earlier in this issue to solve it in the php storage but none really worked so this eventually circled back to this workaround, which is also very close to what I originally wrote in my early patches.

Log in or register to post comments

Comment #126

amateescu commented 28 July 2015 at 18:05

Edited: cross-post with the first part of #125.

Comment #127

geerlingguy commented 28 July 2015 at 18:22

So, should I stick a @todo mentioning the other issue in the loadTemplate() method, or somewhere else?

Comment #128

fabianx commented 28 July 2015 at 18:39

#127: Let's add the @todo to the test, noting that the current test is non-functional and will be fixed in that other issue.

Comment #129

geerlingguy commented 28 July 2015 at 18:53

@Fabianx - latest patch (#119) removes the test entirely as per #118 and above comments. Should we add that test back in for now, and leave the @todo with that test, or keep the test out for now, and put the todo with the changed function?

Comment #130

fabianx commented 28 July 2015 at 19:19

Status:

Needs review

» Reviewed & tested by the community

Oh, I missed #119. The code does not need a comment as the test is major.

Comment #131

alexpott

he/they

English

🇪🇺🌍

commented 29 July 2015 at 10:37

Status:

Reviewed & tested by the community

» Fixed

The patch attached improves the situation so I'm going ahead before beta 13 since this is not disruptive at all. I agree with @Damien Tournoud and @znerol - the real issue will be solved in #2527478: Resolve infinite stampede in mtime protected PHP storage which looks promising.

Committed 7595aa7 and pushed to 8.0.x. Thanks!

Comment #132

29 July 2015 at 10:37

alexpott committed 7595aa7 on 8.0.x

Issue #2429659 by chx, Berdir, mpdonadio, geerlingguy, Cottser,...

Comment #133

12 August 2015 at 10:44

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Comment #134

star-szr

he/him

English

commented 13 September 2015 at 16:08

I'm working on #2555243: Upgrade path / plan to Twig 2.x aka 2.0 which involves porting some of our custom code in TwigEnvironment into a new Twig cache class.

I'm wondering if #2527478: Resolve infinite stampede in mtime protected PHP storage maybe makes this change no longer necessary. I'm a bit stuck because I don't see a way right now to incorporate this "just in case" handling into the cache class I'm writing so I'd like to ask for some testing to see if this issue can be reproduced with the current code.

If @geerlingguy or @Berdir or anyone else who was able to reproduce this can test the following patch that would be extremely helpful:

#2426563-93: Ignore: Patch testing issue
https://www.drupal.org/files/issues/upgrade_twig_test_1.x-cache4.patch

I tried it myself using @geerlingguy's Drupal VM steps in #96 on Drupal 8 HEAD with the patch and couldn't reproduce any errors. However I also tested on Drupal 8 commit b158c354b15abee86b58c2342c985456392d44a8 (based on the timestamp of #96) and was still not able to reproduce the error so I don't consider my testing to be definitive. I also tried with an older version of Drupal VM (3d5b33be5c974f4e26c96947d4f82239231c5301) just in case but still no luck breaking it.

Comment #135

star-szr

he/him

English

commented 13 September 2015 at 16:31

Actually I just got it to break using Drupal 8 b158c354b15abee86b58c2342c985456392d44a8 and the latest Drupal VM. Yay! I'll try with the patch on D8 HEAD a few more times.

Comment #136

star-szr

he/him

English

commented 14 September 2015 at 15:03

Development has now moved to a proper issue of its own (#2568171: Upgrade to Twig 1.22 and implement our own cache class) and we have some plans now for re-implementing this fallback in our cache class. Still would be good to have it tested manually once we have some code because we don't have automated tests for it.
