Problem/Motivation

When you are clearing caches while a site is being visited, then you sometimes get fatal errors like this:

Class '__TwigTemplate_e420be0e76e7b05d16c9c1bd35cee0103be31106e96972ab01f86f0a60b76811' not found

Happens in core/lib/Drupal/Core/Template/TwigEnvironment.php:141

I'm not quite sure why this happens. Was writing a few theories, but every time I look into them closer, I find out something new.

So it seems to be able to load the file, but then the class isn't found. Maybe the file is still being written? but then the file system should protect us from that?

It sometimes only happens once, the second request is then fine. But I have a few of those fatal errors in our logs. This error is also reproducible in some cluster/shared filesystem circumstances, like when running drupal with a shared files directory with slow disk access, or after initial install or cache clears in certain circumstances.

Proposed resolution

  1. Remove the exists method. It's broken.
  2. Augment load with a success callback. This way it can check whether the loading resulted in what the caller asked for and retry if not. Possible use cases: loading classes, functions, variables.
  3. Add a loadClass method wrapping load as this is the most common case.

Remaining tasks

RTBC the patch (which has a lot of manual testing at this point, but is decidedly tricky to write automated tests for).

User interface changes

N/A

API changes

N/A

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue category Bug because fatal errors can cause sites to be unreachable until manual cache clear is performed
Issue priority Critical because its easy to replicate error situation
Prioritized changes Bug fixes are prioritized changes in the beta phase/td>
Disruption The change is not disruptive
CommentFileSizeAuthor
#119 race_conditions_in_the-2429659-119.patch1.22 KBgeerlingguy
#114 race_conditions_in_the-2429659-114.patch2.48 KBgeerlingguy
#108 race_conditions_in_the-2429659-108.patch2.46 KBstar-szr
#108 race_conditions_in_the-2429659-108-testonly.patch1.25 KBstar-szr
#98 drupal-error-first-page-load.jpg7.45 KBgeerlingguy
#89 2429659_89.patch1.22 KBchx
#88 38461dd565579de4d7e93cbc2d9ac60601804d7fbd95e49d31cbf6ae8400643f.php_.txt335.25 KBgeerlingguy
#85 twig-cache-file-4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php_.txt1.09 KBgeerlingguy
#74 interdiff.txt7.39 KBchx
#74 2429659_74.patch15.15 KBchx
#71 2429659_71.patch13.88 KBchx
#69 2429659_69.patch13.88 KBchx
#66 2429659_66.patch12.24 KBchx
#62 2429659_62.patch11.76 KBchx
#59 2429659_59.patch10.11 KBchx
#52 interdiff-44-52.txt2.43 KBmpdonadio
#52 race_conditions_in_the-2429659-52.patch5.28 KBmpdonadio
#44 interdiff-40-44.txt1.68 KBmpdonadio
#44 race_conditions_in_the-2429659-44.patch4.57 KBmpdonadio
#40 race_conditions_in_the-2429659-40.patch4.27 KBBerdir
#35 race_conditions_in_the-2429659-35.patch4.26 KBrteijeiro
#35 interdiff.txt930 bytesrteijeiro
#30 race_conditions_in_the-2429659-30-interdiff.txt2.82 KBBerdir
#30 race_conditions_in_the-2429659-30.patch4.32 KBBerdir
#24 race_conditions_in_the-2429659-24.patch1.33 KBmpdonadio
#22 race_conditions_in_the-2429659-22-interdiff.txt881 bytesBerdir
#22 race_conditions_in_the-2429659-22.patch2.21 KBBerdir
#20 race_conditions_in_the-2429659-20.patch1.35 KBmpdonadio
#3 evil-test-2429659-1.patch2.05 KBBerdir
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Berdir’s picture

Note: This is a on a shared file system, so that could make this more likely, but I'm quite sure that I've seen this locally as well

dawehner’s picture

For me this happened in some scenarios when the stream wrappers haven't been setup properly.

Berdir’s picture

Status: Active » Needs review
FileSize
2.05 KB

This might be the most evil test I've ever written.

Unfortunately, the fix I thought of doesn't work because the file storage uses include_once, so it won't load the same file again and still fails. Maybe is also related to the actual issue, I don't know what is really going on.

mpdonadio’s picture

I must say, that I was looking at the TestBot queue to see how big it was and couldn't help but read this issue...

Status: Needs review » Needs work

The last submitted patch, 3: evil-test-2429659-1.patch, failed testing.

star-szr’s picture

Issue tags: +Twig
Berdir’s picture

Talked about this with @Cottser, looks like the only useful thing that we can do here is throw an exception instead of having a fatal error. That still means the user gets an error page, but at least it's not an WSOD.

Berdir’s picture

#2296009: Use APC Classloader by default (when available) just had a random fail with this, apparently caused by using APC. I think this means that those errors happen because of the apc classloader, although I don't know why yet.

Fabianx’s picture

Twig uses a double caching logic to avoid having to check the filesystem again and again for things that are in cache.

Therefore a race condition during cache clear could be expected.

I think the class_exists($cls, FALSE) is the proper fix.

Berdir’s picture

Where exactly? And what should we do if it returns FALSE?

catch’s picture

HEAD just had a random fail with this as well.

Drupal core - 8.0.x fail: https://qa.drupal.org/8.0.x-status

Overall Summary
==============================================

FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,171 pass(es), 1 fail(s), and 0 exception(s).

Individual Environment Summaries
==============================================

-- [[SimpleTest]]: [PHP 5.4 MySQL] --

* Drupal\update\Tests\UpdateCoreTest (852 pass(es), 1 fail(s), and 0 exception(s))
   - [fail] [PHP Fatal error] " Class '__TwigTemplate_834b8992aafa7a111df3eb478be6ddcbc43ccfb5d5020e08270ddf79a80c5e7c' not found" in TwigEnvironment.php on line 141 of Unknown.

xjm’s picture

Issue tags: +Random test failure
Fabianx’s picture

The caching logic happens in isFresh() and likely that test for a cache miss is wrong, but this should not affect core as its only happening when auto_reload => TRUE, which we don't do for performance reasons.

That means somehow either ->save() to storage fails or ->load() loads something else.

That is tricky for sure, hardening we should do anyway, so as proposed already:

-        if (!$this->storage()->load($cache_filename)) {
+        if (!$this->storage()->load($cache_filename) || !lclass_exists($cls, FALSE)) {

at the minimum.

Then we could again do a check afterwards and throw an Exception, though that is really fatal if a newly dumped file cannot be loaded as class ...

Berdir’s picture

Yeah, but the class exists check won't help, because the php storage won't ever load the file again. We would have to combine it with the non-cache version of using eval?

Fabianx’s picture

Why not?

If the class does not exist and we store and retrieve it then, it should exist afterwards or not?

Berdir’s picture

Because FileStorage::load() does an include_once.

If the first load() call loads the file but that for some reason does *not* contain the class (only then we get into the || !class_exists() case), then calling load() again will *not* try to include that file again.

I quickly tried to switch to an include there, but Drupal dieded hard and I didn't try to fix it.

Berdir’s picture

While doing stress tests on #2336627-16: Deadlock on cache_config (DatabaseBackend::setMultiple()), I noticed this happening again. so this might be more common/easier to reproduce than I thought.

The only additional information I have right now is that I also got a few corresponding watchdog messages:

 ID  Date          Type      Severity  Message                                                                                                                                                             
 41  09/Mar 12:50  php                 Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of                                                            
                                       core/lib/Drupal/Component/PhpStorage/FileStorage.php).                                                                                     
 40  09/Mar 12:50  php                 Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of                                                            
                                       core/lib/Drupal/Component/PhpStorage/FileStorage.php).                                                                                     
 39  09/Mar 12:50  php                 Warning: file_put_contents(sites/default/files/php/twig/1#06#1e#9975f4ab1194ed306eb5bad0ce7095521d1c9621e71adf689e5016f60a8f/.htaccess):   
                                       failed to open stream: Permission denied in Drupal\Component\PhpStorage\FileStorage->ensureDirectory() (line 136 of                                                 
                                       core/lib/Drupal/Component/PhpStorage/FileStorage.php).                                                                                     
 38  09/Mar 12:50  php                 Warning: file_put_contents(sites/default/files/php/twig/1#61#f8#2e0aac09f5d1126d7ded09d615cd03eb2a460783f2d84db6dec1de74ecfa/.htaccess):   
                                       failed to open stream: No such file or directory in Drupal\Component\PhpStorage\FileStorage->ensureDirectory() (line 136 of                                         
                                       core/lib/Drupal/Component/PhpStorage/FileStorage.php). 

So definitely looks like some sort of conflict in the phpstorage, where multiple processes try to write the file?

mpdonadio’s picture

Would a `flock($handle, LOCK_EX);` be appropriate here? Since the lock is needed to write the cache and not read it, it won't be a huge performance hit?

Berdir’s picture

I was thinking something like that yes, but it would probably have to be abstracted through PhpStorage? So we could have a lockForWrite() method that we can call immediately after failing to include the fail, or possibly even automatically in load (maybe with an option to disable it), that is removed again when the file was written.

mpdonadio’s picture

Status: Needs work » Needs review
FileSize
1.35 KB

Hmm. PhpStorage doesn't use handles, so flock() can't be used directly unless we also add a lockfile to the Twig cache.

Maybe just this is all that needed?

Berdir’s picture

I tried that, it gets rid of the watchdog messages, but I still get some class not found fatal errors.

Berdir’s picture

Ok, this fixes the fatal errors for me, although I still get some warnings.

 60  12/Mar 01:01     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).     
 61  12/Mar 01:01     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).     
 62  12/Mar 01:02     php     Warning: file_put_contents(/home/berdir/Projekte/d8/sites/default/files/php/twig/1#bc#f7#30392e18d13056f9a940f0627ea8e84e3406b2877b6391fdc398a09c6201/.htaccess): failed to open stream: No  
 63  12/Mar 01:02     php     Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 171 of /home/berdir/Projekte/d8/core/lib/Drupal/Component/PhpStorage/FileStorage.php).     
mpdonadio’s picture

We could @mkdir to make the warnings go away, but it kinda sounds like there are still two processes in PhpStorage::save() at the same time.

Do we need to add an optional lock to PhpStorage or make a LockablePhpStorage, either of which whose constructor accepts a LockBackendInterface, and then protect save with it? Or a lock in TwigEnvironment?

mpdonadio’s picture

Quick and dirty locking in FileStorage::save() to see what implodes. No other changes from HEAD.

mpdonadio’s picture

OK, some additional testing with increased concurrency in ab reveals that Bad Things happen with a lock there, and I still got the Twig errors...

Status: Needs review » Needs work

The last submitted patch, 24: race_conditions_in_the-2429659-24.patch, failed testing.

geerlingguy’s picture

Cottser sent me this way from #2234229: PHP Fatal Error: Class __TwigTemplate not found in TwigEnvironment.php... I had the same strange issue on an infrastructure with a replicated file system:

In setting up the Dramble cluster of D8 servers, I was running into the same issue. A while back, it had to do with file permissions issues in the shared GlusterFS mount, but I had resolved that error. After a few hours' debugging, I found that this WSOD and the exact same error can be caused by time drift on multiple servers with a shared mount.

It was the strangest thing, too. If I pointed the load balancer at just one of the servers, everything worked perfectly. If I pointed it at more than one server, then after accessing the second server, the Twig error would start popping up in the logs for all servers, and they'd all WSOD. Then if I pointed the balancer at one server again, that server would load Drupal just fine.

In my case, since I was using Raspberry Pis (without built-in clocks) on a local network, the time from system boot had about +/- 3 seconds of drift, and apparently that was enough to cause this strange error. I even tried having APC on with stat=0, APC on with stat=1, and APC off, and none of those changes made a difference.

Fabianx’s picture

#27: Your best bet is to put the twig template cache on /tmp, by setting up the phpstorage configuration in settings.php.

The template cache is also DB backed, so a global cache clear will clear all templates as the mtime is stored in the database.

Berdir’s picture

@Fabianx: That's not correct?

Yes there's the auto refresh feature, but you definitely don't want to have that enabled on production, because that does a cache_get() for each template file.

Berdir’s picture

Status: Needs work » Needs review
FileSize
4.32 KB
2.82 KB

I think there's not much we can do about the mkdir, see https://bugs.php.net/bug.php?id=35326 and #392100: Warning: mkdir(): File exists in imagecache_build_derivative() for somewhat similar issues.

Here's a combined patch with #22, the test from #3 (had to make a small change because the filename is not validated) and the @ for mkdir().

I think it would be great to get it in like that, which should take care of most issues.

geerlingguy’s picture

Your best bet is to put the twig template cache on /tmp, by setting up the phpstorage configuration in settings.php.

But doesn't the twig cache need to be on a shared filesystem if you have multiple webservers? Otherwise, each server would have to regenerate its own twig cache on each request... at least that's how it seems this would work. I could turn off the Twig cache entirely, but that would result in abysmal performance on the Raspberry Pis.

Berdir’s picture

They would have to, but that's not the problem. The problem is distributing invalidations/cache clears. Just like ChainedFast cache backends, we'd need a shared flag/storage that they can check or you need a custom process that on cache clear, triggers a delete of those files on all servers.

Wim Leers’s picture

Issue tags: +D8 cacheability, +Performance

Sounds like we need these additional tags.

Fabianx’s picture

#29: Right, thats not active, I was mistaken there.

Still putting it on /tmp/drupal-cache/ is good.

Need to clear /tmp/drupal-cache then on code deploys, but with code deploy on multiple webservers being usually by jenkins, that is not a big deal either ...

rteijeiro’s picture

Fixed a couple of nitpicks.

stefan.r’s picture

I just had this issue as well on a vagrant box, this patch fixed it :)

Berdir’s picture

Anything left to do here? It has tests, I can't think of a different fix, it should also help to fix random testbot fails that sometimes happen.

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

Nope, lets get this in.

This is RTBC.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 35: race_conditions_in_the-2429659-35.patch, failed testing.

Berdir’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
4.27 KB

Simple conflict in the use statements in the test after the String/SafeMarkup change.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 40: race_conditions_in_the-2429659-40.patch, failed testing.

Berdir’s picture

Uhm, did something change with the file storage tests that this now fails?

star-szr’s picture

Indeed, git bisect points to #2453399: Use VFS for FileStorage tests.

mpdonadio’s picture

Maybe this? Used `parse_url` so the tests wouldn't have to rely on the file_system service just to check the scheme.

mpdonadio’s picture

Status: Needs work » Needs review

Grrr.

Berdir’s picture

Tricky, but seems to be working. Anyone wants to RTBC this? This is causing a lot of fatal errors, in one case, it even caused a permanent error that we had to fix with a manual cache clear. This should basically be self-healing...

Fabianx’s picture

+++ b/core/modules/system/src/Tests/Theme/TwigEnvironmentTest.php
@@ -44,6 +45,19 @@ public function testInlineTemplate() {
+    $element['test'] = array(
+      '#type' => 'inline_template',
+      '#template' => $name,
+    );

I had to look three times, but yes due to how ChainedFileLoader works, you should indeed be able to embed not only inline templates, but also real templates ...

However is this still true? (We fixed some error messages and changed that part.)

Let's check that first ...

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

This gives just back the string maintenance-html.twig, but that is fine (as that is a valid twig template).

Therefore - even though that is probably not what berdir wanted originally - as it successfully tests the condition of a wrong template in the cache => RTBC.

If we wanted to we could change:

$template = 'Hello World - {{ 1+0 }} ';

or something like that ...

But leave the rest the same.

catch’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/lib/Drupal/Core/Template/TwigEnvironment.php
@@ -127,10 +127,20 @@ public function loadTemplate($name, $index = NULL) {
+          eval('?' . '>' . $compiled_source);

I really don't like adding the eval() here even in a rare case. Is it worth looking at the switch to include Berdir mentioned in #16?

catch’s picture

Also if we have to have this protection for reads, what indication is there that the lock on write is doing anything?

catch’s picture

Status: Needs review » Needs work
  1. +++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
    @@ -53,14 +53,15 @@ public function load($name) {
    +      if (!file_exists($htaccess_path) && file_put_contents($htaccess_path, static::htaccessLines(), $flags)) {
    

    file_put_contents() doesn't care if you write to an already existing file, so what does the LOCK_EX actually help with here?

    It'll stop another process from writing to the file at the same time, but then it's going to write once the lock is released - it won't stop the write from happening at all. So potentially we'll actually spend longer writing to the same file?

    I'm not sure what the implication actually is, but that needs inline documentation (if it's actually useful here).

  2. +++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
    @@ -168,7 +170,7 @@ protected function createDirectory($directory, $mode = 0777, $is_backwards_recur
    +      if ($status = @mkdir($directory)) {
    

    Needs a comment.

  3. +++ b/core/lib/Drupal/Core/Template/TwigEnvironment.php
    @@ -127,10 +127,20 @@ public function loadTemplate($name, $index = NULL) {
    +          // have failed load load the class. In that case, execute the code
    

    "failed to load".

mpdonadio’s picture

Status: Needs work » Needs review
FileSize
5.28 KB
2.43 KB

#51 has been addressed, but as I wrote out the comment for (1), I wondered if file_put_contents($path, $code, LOCK_EX); is really sufficient. It will grab an exclusive lock for the .htaccess (if needed), and the file itself, but two processes could still be in the function at the same time making the directory. I'm wondering if the ::save() method needs to be refactored to use flock() directly to make the entire method process exclusive?

Status: Needs review » Needs work

The last submitted patch, 52: race_conditions_in_the-2429659-52.patch, failed testing.

catch’s picture

From everything I could fine, LOCK_EX is only a write lock, it's not a read lock at all, i.e. http://php.net/manual/en/function.flock.php

If it was a read lock, we wouldn't need the fallback to eval() later on in the patch.

If it's only a write lock, then I don't see how it helps the race condition - it might even make it worse since you'd have two, subsequent writes to the file, where it can be partially read from, rather than two concurrent ones which may finish faster.

mpdonadio’s picture

I am going to try to get confirmation about LOCK_EX; maybe I can raise @ircmaxell. My understanding from the comments is that it is an exclusive lock, which means only one process can have the file at any time; IOW it would block both readers and writers to the file.

catch’s picture

There's a stackoverflow thread here: http://stackoverflow.com/questions/4899737/should-lock-ex-on-both-read-w... which says it's only an advisory lock - so it'd work if all calling code used LOCK_EX - but file_get_contents() can't. Also it looks like ircmaxell answered that thread..

mpdonadio’s picture

Blerg. Do you think refactoring this to create a class AtomicFileStorage implements PhpStorageInterface would be worth it?

chx’s picture

My solution would be to remove exists from PhpStorageInterface as it is pointless and rather pass an optional callable to PhpStorageInterface::load which it can run after including to determine whether the loading was successful. The caller has logic already to deal with load failures.

chx’s picture

Status: Needs work » Needs review
FileSize
10.11 KB

This patch removes exists as it is unusable and adds loadClass instead.

dawehner’s picture

+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -33,14 +33,30 @@ public function __construct(array $configuration) {
+    while (!($exists = class_exists($class_name, FALSE)) && $this->doLoadClass($name, TRUE) && !($exists = class_exists($class_name, FALSE)) && $retry--) {
+      usleep(mt_rand($wait, 2 * $wait));
+    }

It would be great to explain why we are waiting here ...

Status: Needs review » Needs work

The last submitted patch, 59: 2429659_59.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
11.76 KB

Sure.

dawehner’s picture

Thank you chx. I'm sorry that I haven't see the bit on the interface.

Status: Needs review » Needs work

The last submitted patch, 62: 2429659_62.patch, failed testing.

chx’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
12.24 KB

Last fail: @covers exists when exists no longer , well, exists.

Berdir’s picture

This looks quite interesting.

Not sure if it is possible to restore my test, or test it in a different way somehow?

But I'll repeat my manual tests (mixing ab -c 10 with drush cr calls)

chx’s picture

If it were fundamentally broken nothing would work since this loads DrupalKernel. If we badly want to test this then we could test the implementation which I never like by creating two files:

// This goes to a known fixed path.
class foobar();
// Put this in phpstorage.
if ($GLOBALS['test_retry']--) return;
include_once 'know/fixed/path/foobar.php';

and then test

$GLOBALS['test_retry'] = 5;
for ($i = 0; $i < 9 && !class_exists('foobar', FALSE); $i++) {
  include "x.php";
}
// assert $i is 6.
chx’s picture

FileSize
13.88 KB

We could use a success callable which opens the door for success checking including arrays, functions and whatever else. Even if we want to go back to the previous version, there are small improvements in this one compared to previous.

Status: Needs review » Needs work

The last submitted patch, 69: 2429659_69.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
13.88 KB

Only 22439 fails? I need to work harder.

chx’s picture

Issue summary: View changes
dawehner’s picture

Regarding test coverage, I think non race condition problem will be visible immediately, given how often we render something using a template,
so that part is fine for me :)

+++ b/core/lib/Drupal/Component/PhpStorage/FileStorage.php
@@ -33,18 +33,55 @@ public function __construct(array $configuration) {
+    if (!isset($success)) {
+      return $this->doIncludeOnce($name);
+    }
...
+    while ($this->doInclude($name) && !($return = $success()) && $retry--) {

It seems entirely not obvious why we have both doIncludeOnce and doInclude ... do you mind explaining this somewhere? I assume there is some intention behind it?

chx’s picture

FileSize
15.15 KB
7.39 KB

Added code and doxygen to make load and loadClass safe by default but overridable. Much nicer and very visible of why doIncludeOnce is necessary:

    if (!isset($success)) {
      return $force ? $this->doInclude($name) : $this->doIncludeOnce($name);
    }

Status: Needs review » Needs work

The last submitted patch, 74: 2429659_74.patch, failed testing.

Berdir queued 74: 2429659_74.patch for re-testing.

The last submitted patch, 74: 2429659_74.patch, failed testing.

Berdir’s picture

The last patch doesn't solve the problem for me. When stress-testing this according to #17, then I still sometimes get the fatal error.

Note that on HEAD, it's more likely that you will run into serialization exceptions, see the issue in #17.

I also get watchdog entries like this:

1173055  08/Mai 16:59     php    Warning: mkdir(): File exists in Drupal\Component\PhpStorage\FileStorage->createDirectory() (line 213 of .../core/lib/Drupal/Component/PhpStorage/FileStorage.php). 
 1173056  08/Mai 16:59     php    Warning: file_put_contents(.../sites/default/files/php/twig/1#dc#a6#d781522ea6a6ab18c0e6450635eab7eb38b52f46063d71ba1fff69b068ca/.htaccess): failed to open stream:

I'm not sure, but I think the patch also resulted in apache getting stuck completely :)

Going to try with the patch from #52 on production now, that seems to work fine locally.

chx’s picture

Something is odd here: we write the file first to a random location (str_shuffle combined with microtime) and then do a rename. And yet somehow we manage to read a partially written file? How come?

geerlingguy’s picture

So (mostly for my own benefit), I can reproduce the error:

2015/05/08 21:59:34 [error] 2192#0: *23 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_3df18936b0043763da1aa91635787628747bdf2bc4c34a6457aea4fc3f8fa5ea' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

On my Raspberry Pi Dramble stack (4 webservers using a shared Gluster mount).

Steps to reproduce:

  1. Set up Drupal 8 HEAD (or beta10), have all servers' clocks in sync (in my case, 4 servers behind Nginx as a load balancer, with Nginx + PHP-fpm on each webserver, with all Drupal system defaults, e.g. Twig cache stored in normal files folder—files folder mounted via GlusterFS shared on all the webservers).
  2. Load up the user login page (notice that it works fine on any of the servers).
  3. Set the date on each of the webservers to be about a minute off from each other.
  4. Reload the user login page, get WSOD, see above error.

Notes:

  1. If I switch the Nginx load balancer to point at only one of the webservers (any one of them), everything works fine again.
  2. If I reset all the clocks to be the exact same time (e.g. $ ansible all -m shell -a "date --set='10:25:55'" -s, everything works fine again.
  3. Interestingly, every once in a while, the login page (or I've also tested the 'create new account' and 'reset password' pages) will actually load, randomly, from one of the four webservers behind the balancer. But if I run through a bunch of refreshes again, the same server WSODs with the above error again).

I'm at the DrupalCon LA sprints, and I will be testing the patch here and seeing if I can help get this resolved (or at least more thoroughly tested) today.

mpdonadio’s picture

@geerlingguy, are you NTP synced, or just jam syncing?

geerlingguy’s picture

I tested the patch in #74, and while it doesn't solve the WSOD issue on the Pis with time drift, it doesn't cause any regression either, and seems like a good approach towards improving the situation. Still doing more testing, too, and I'll try to wrk with chx via IRC if I can some time this afternoon.

geerlingguy’s picture

@mpdonadio - to test time drift, I set each of the four webservers' clocks explicitly to a value +/- 5 seconds from the actual time (using, e.g. sudo date --set='21:52:48'). Then, to bring them back in sync, I'm jam syncing from my Mac's current time to the stack.

Since the servers have no RTC, and are not connected to the Internet (and don't have NTP), they don't have any (simple) ability to keep their time in sync otherwise. I've thought about setting up one of the servers as an NTP server, with an RTC set up through GPIO, so the other servers could sync to the master clock, but haven't yet set that up (see: https://github.com/geerlingguy/raspberry-pi-dramble/issues/43).

Granted, this is (or at least should be) a pretty rare use case—having a multi-server setup with the twig cache on a shared filesystem with servers that have out-of-sync clocks. So we might not want to bikeshed on that. I'd just like to see if I can figure out exactly where the issue is coming from, since the twig file exists on all the servers, and it seems like a semi-random failure (some requests go through, some are 500s with the above error).

chx’s picture

Yes, if you can figure this out, the ocntents of the file being read would be invaluable. See #79

geerlingguy’s picture

@chx - I've attached the file that's present at /mnt/gluster/files/php/twig/1#36#e9#218a72f912db7dd7fe18458418b9145115f045ba90b5100e5d5fe8b5716d/4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php (that file is available at that mount, which is mounted at the point /var/www/drupal/sites/default/files). The watchdog message that appears when that file is trying to be loaded is:

2015/05/09 21:58:35 [error] 4073#0: *3059 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_36e9218a72f912db7dd7fe18458418b9145115f045ba90b5100e5d5fe8b5716d' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

The file is definitely present on the system, it just won't be loaded for some reason, when the clock drifts from other servers handling active requests.

geerlingguy’s picture

I'm also trying to reproduce the OP's bug on the Dramble stack, and can't quite do so—steps followed:

  1. Start a cache clear on one of the webservers: drush cr
  2. (While the cache clear is ongoing): Refresh page over and over. Page loads slowly, but loads after a few seconds with a 200.
  3. (Also while the cache clear is ongoing): Run wrk -t4 -c24 -d10 http://pidramble.com/?nocache=true (this loads the page continuously). Page loads slowly, but all requests still go through, with 0 errors.

I tried the above with both beta10 and with HEAD, and I can't reproduce the original issue, except when I explicitly set the Pi clocks to drift. The patch in #74 seems like it's a good approach here for any edge cases, but I am finding it hard to reproduce the actual issue with the cache rebuild race condition.

geerlingguy’s picture

I also added the debug code inside MTimeProtectedFileStorage.php:

  public function load($name) {
    file_put_contents('/tmp/log.txt', $name, FILE_APPEND);
    if (($filename = $this->checkFile($name)) !== FALSE) {
      // Inline parent::load() to avoid an expensive getFullPath() call.
      return (@include_once $filename) !== FALSE;
    }
    return FALSE;
  }

Contents of /tmp/log.txt:

service_container_prod.php

And with that call to file_put_contents() (and no other changes), I'm now getting a new error message in the logs:

2015/05/09 15:56:50 [error] 5110#0: *15 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class 'Drupal\Core\Template\Loader\FilesystemLoader' not found in /mnt/gluster/files/php/service_container/service_container_prod/38461dd565579de4d7e93cbc2d9ac60601804d7fbd95e49d31cbf6ae8400643f.php on line 6859" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /user/login?oijsdfasd HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com"

I will attach the loaded service container file as mentioned in the above error message in the next comment.

geerlingguy’s picture

(And a note: sorry for oversharing the data here... I just want to make sure I put as much debug info into this issue as possible since it's rare I get more than a few minutes to dig into this bug, and it's probably not a bug that's quick/easy to reproduce!)

Attached is the service container mentioned in the previous comment.

chx’s picture

Status: Needs work » Needs review
FileSize
1.22 KB

This patch implements logic similar to DrupalKernel to make sure the error can't exist.

pbuyle’s picture

I had the same issue on a fresh Drupal 8 install made with http://www.drupalvm.com/, the patch in #89 fixed it.

geerlingguy’s picture

@pbuyle @chx - Strangely, I just started seeing Twig errors on initial site install (they cleared up after a second page refresh) today, reliably reproducible on a fresh Drupal 8 HEAD install on either Ubuntu 14.04 or 12.04 with PHP 5.5. I'm testing this patch now (using Drupal VM).

geerlingguy’s picture

Status: Needs review » Reviewed & tested by the community

To reproduce (if you don't want to set up a cluster of servers with a Gluster filesystem):

  1. Build an instance of Drupal VM with the default settings (will install D8 HEAD)
  2. Load the default URL for the D8 site http://drupaltest.dev/
  3. Observe a twig error:
    No front page content has been created yet.
    
    Fatal error: Class '__TwigTemplate_94bd2225e3fafe16fc2afc51cd67340aa54d733f8325c53771de007781bb0398' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141
    
  4. Refresh the page, and observe that the site loads correctly (with the default standard profile home page layout).

After applying the patch in #89, and completely rebuilding the VM from scratch (or reinstalling the site, the first page load is fine.

I was able to reproduce 3/3 tries, and used both Ubuntu 12.04 and 14.04, with PHP 5.5 and 5.6 with each of the OSes. The patch was also successful in not resulting in an error on first page load 3/3 tries.

Thumbs up from me; this is an annoying bugger, and I think I've now spent at least 12 hours of my life on it! Thanks so much for the patch, @chx :)

Fabianx’s picture

RTBC + 1, even if chx' approach in the other issue is looking promising, this should help to at least avoid the race condition for now.

Can we open a (postponed) follow-up based on #2513326: Performance: create a PHP storage backend directly backed by cache to enable that for twig templates, too.

alexpott’s picture

I hesitate to ask but is this in anyway testable?

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 89: 2429659_89.patch, failed testing.

geerlingguy’s picture

Status: Needs work » Needs review

@alexpott - I can reproduce 100% of the time in two different scenarios (one that's simple enough for anyone to reproduce, anywhere, the other requiring a set of servers that have their filesystems shared via GlusterFS... so the first way is probably the simplest for easy reproduction):

  1. Follow the Drupal VM Quick Start Guide to install an instance of Drupal VM.
  2. After about 5-10 minutes (depending on connection speed), provisioning will be complete, with D8 HEAD installed.
  3. Fire up your web browser and load http://drupalvm.dev/. Observe attached image and logged PHP error in /var/log/apache2/error.log (pasted below).
[Fri Jul 03 20:36:26.943786 2015] [:error] [pid 23023] [client 192.168.88.1:63018] Uncaught PHP Exception Drupal\\Core\\Database\\DatabaseExceptionWrapper: "SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction: INSERT INTO {cache_config} (cid, expire, created, tags, checksum, data, serialized) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array\n(\n    [:db_insert_placeholder_0] => bartik.settings\n    [:db_insert_placeholder_1] => -1\n    [:db_insert_placeholder_2] => 1435955786.931\n    [:db_insert_placeholder_3] => \n    [:db_insert_placeholder_4] => 0\n    [:db_insert_placeholder_5] => b:0;\n    [:db_insert_placeholder_6] => 1\n)\n" at /var/www/drupal/core/lib/Drupal/Core/Database/Connection.php line 609

Side note: Judging by the error message above, this could be related to #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()) as well. Regardless, it seems the patch in #89 fixes at least this particular issue.

To reproduce, vagrant destroy -f && vagrant up, then load the page again. If I add the patch to the drush make example that's included with Drupal VM, then provision a new instance of the VM, the error is 100% reliably fixed! I also tested it on the Raspberry Pi #Dramble cluster, and while it's harder to reliably reproduce in that scenario, with the patch applied I never saw the error, but without it, I did see the error from time to time when the clocks were more than a few seconds out of sync (since they're not synced via NTP, they drift after a few days).

Here's the exact Drupal make file I was using with Drupal VM (could work with other setups, just make sure PHP/webserver has been cleanly restarted so nothing is in memory already prior to first page load, and make sure you don't install Drupal via web UI, since that may pre-build some of the twig caches):

---
api: 2
core: "8.x"

projects:

  drupal:
    type: "core"
    download:
      branch: "8.0.x"
    patch:
      - "https://www.drupal.org/files/issues/2429659_89.patch"

  devel: "1.x-dev"

Nudging testbot again...

geerlingguy queued 89: 2429659_89.patch for re-testing.

geerlingguy’s picture

Forgot to attach the first page load error image (this is what the first visitor to a new site sees):

Drupal - error on first page load

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

I don't think its possible to test the race condition, but the manual testing should be good enough.

Anonymous’s picture

a long time ago, in a galaxy far, far away... #850782: allow testing lock code via async http calls

well, 2010. who can remember wtf was happening back then these days?

lauriii’s picture

Issue summary: View changes

Added beta evaluation

Berdir’s picture

Hm. Actually, that's more or less the same code as I had from the beginning. And that had tests. See for example in #52 which I think is the last place they existed.

@beejebus: I don't see how that would help here. This is not predictable, it happens when you have many requests that are trying to build the same twig templates, possibly on a slow file system. That's very different from a predictable lock where you know the second request will hit the lock.

Anonymous’s picture

@berdir - the only predictable thing here is that we'll never catch these bugs by sending one request at a time.

we have many sections of code that make assertions about safety when running across multiple concurrent requests, and test literally 0% of them that way. hard to figure out why 'it might not catch everything with our test suite' is used to defend 'we will never catch any of this class of bugs with our current test suite'.

Fabianx’s picture

Status: Reviewed & tested by the community » Needs work

Yes, right, sorry berdir I totally forgot that #3 had a test, which at least tests the error condition part of it.

Lets merge the test of #3 into this patch.

Berdir’s picture

The test in #52 might be easier to merge, I had to update it at some point.

geerlingguy’s picture

Just to give a little more of a push for this; I fired up a fresh instance of the entire Dramble stack (I re-imaged all the 6 Pi microSD cards, then reprovisioned everything, and installed Drupal 8 HEAD from this morning), and now it seems about every 5th or 6th page load (when logged in) results in a WSOD on first load, then if I refresh, the page loads fine.

I currently have the Nginx balancer set to go round-robin, with no IP pinning, so each request goes to a new web node behind the balancer. If I turn IP pinning on, then requests work correctly. But in the real world, most requests would be distributed amongst the backend servers, resulting in users hitting the WSOD pages on initial page load.

The error in the Nginx log on the backend server is as follows:

2015/07/04 22:22:59 [error] 2290#0: *52 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Class '__TwigTemplate_813b6bb6bcccc2ae6f6de69fbc5bf6c3ebe61f1c97eb2e81de9e4d330bb17c61' not found in /var/www/drupal/core/lib/Drupal/Core/Template/TwigEnvironment.php on line 141" while reading response header from upstream, client: 10.0.1.60, server: 10.0.1.61, request: "GET /admin/reports/dblog?page=2 HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "pidramble.com", referrer: "http://pidramble.com/admin/reports/dblog?page=1"
geerlingguy’s picture

I definitely still agree with 'Major' priority... might even argue for Critical, since now I can reproduce this error in three different environments—running Debian 7.8 with PHP 7.0.0 beta1 compiled, I get errors like the following on every 3-5 page loads:

Error: Class '__TwigTemplate_408ac94a722c70a8d9e243d9eebb745aeff646c4fc11d22dd91885dbb5553bce' not found in Drupal\Core\Template\TwigEnvironment->loadTemplate() (line 141 of core/lib/Drupal/Core/Template/TwigEnvironment.php).

This is on a Raspberry Pi with slow disk writes (since Drupal is running on a microSD card)... maybe the slow disk access is causing the template to not be found? The compiled directory/file is most definitely present at sites/default/files/php/twig/1#40#8a#c94a722c70a8d9e243d9eebb745aeff646c4fc11d22dd91885dbb5553bce, but the .php file is empty...

Though the Pi is an exceptional use case, I have definitely seen weird/slow disk writes in many cloud environments, and I can imagine more people will run into these issues as well.

Update: It seems that it may have been a weird case of FPM having some split-brained behavior. A full restart of the server seems to have cleared up the errors this time; maybe different child processes had different/broken cached file directory information or something.

star-szr’s picture

Status: Needs work » Needs review
FileSize
1.25 KB
2.46 KB

I can't get the test-only to fail, but here's something.

Berdir’s picture

Would be very interesting to see if #2527478: Resolve infinite stampede in mtime protected PHP storage improves the frequency of this happening.

I still think we should get this committed to avoid the fatals *if* it happens. But I hope that the other issue means it happens much less frequently.

geerlingguy’s picture

Status: Needs review » Reviewed & tested by the community

Still RTBC in my book—and that test probably won't ever fail unless you can make testbots do multiple concurrent requests (#2527478: Resolve infinite stampede in mtime protected PHP storage should help). Let's stop the bleeding...

Fabianx’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/modules/system/src/Tests/Theme/TwigEnvironmentTest.php
@@ -46,6 +47,19 @@ public function testInlineTemplate() {
+    $cache_file = $environment->getCacheFilename($name);
...
+      '#template' => $name,

Don't we add something by now to inline template names in the InlineLoader?

I think if we want to fool the cache, we need to do the same {# #} comment adding here ...

star-szr’s picture

Yup that is probably needed to calculate the cache filename properly: https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21Template%...

Doing a quick test of this:
$cache_file = $environment->getCacheFilename('{# inline_template_start #}' . $name);

Seemed to result in an infinite loop for me. Not sure why…

Edit: And also combined with doing an actual inline template like @Fabianx suggested earlier: Hello World - {{ 1+0 }}.

Fabianx’s picture

Hmm, maybe lets use a real template then instead of an inline template or call twig_render_template directly?

geerlingguy’s picture

Reroll of #108; test file changed.

Note that this patch is still necessary to get things working with Drupal 8 on my little Raspberry Pi cluster, using GlusterFS for a shared files directory mount. I seem to have encountered another race condition over in #2540912: Installation fails with files directory on glusterfs: "Warning: mkdir(): File exists", but it looks like that one is resolved by #2497243-188: Replace Symfony container with a Drupal one, stored in cache.

catch’s picture

Priority: Major » Critical

I'd been assuming this was self-healing, but had missed berdir's comment in https://www.drupal.org/node/2429659#comment-9845943 where he pointed out the site was down permanently until they manually intervened. Since berdir's site is more or less the only 8.x site with any traffic at the moment, we should assume that people will run into this more often.

I'm bumping this to critical. However also this is pretty much there except for test coverage. Given it's very hard to test, and we have months of manual testing on patched installs, let's add the regression test in a follow-up issue.

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

RTBC then, we can fix the test in a follow-up. (and should create one for it)

lauriii’s picture

I've seen this problem too so agree that this is quite critical

catch’s picture

Status: Reviewed & tested by the community » Needs work

Should we not remove the test and put it back in the follow-up?

Or is it worth having the test there to avoid a further regression which might actually make the test fail? If so I could go for that, but I think it needs a @todo in the test to point out that we've not been able to get it to fail.

geerlingguy’s picture

Status: Needs work » Needs review
FileSize
1.22 KB

Patch with bugfix-only is attached. I'll split the test into a separate patch for a follow-up issue, to be opened momentarily...

geerlingguy’s picture

Issue summary: View changes

Updated IS, since it's been a while...

geerlingguy’s picture

Added related issue #2541440: Add tests for race conditions in Twig template cache for the tests. I'll supply the first patch, with the broken out test, over there.

Can we get an RTBC on the above reroll without the tests?

mpdonadio’s picture

I kinda think we should add a @todo linking to the issue mentioning, and also mention that there is no coverage for that code.

dawehner’s picture

I kinda think we should add a @todo linking to the issue mentioning, and also mention that there is no coverage for that code.

+1 for that idea. Yeah race conditions can be hard to test, this is for sur.

Thank you @geerlingguy for updating the issue summary.

Damien Tournoud’s picture

From the look of this, the main bug is actually in \Drupal\Component\PhpStorage\FileStorage, this has more race conditions than lines of code.

The main problem is the implementation of PhpStorageInterface::save(), which is basically a wrapper around file_put_contents() and as a consequence has two major problems:

  • First, it is not protected against partial writes. By default file_put_contents() doesn't perform any kind of locking, which means that readers are going to see partial content. Because the readers are using @include, all the errors are ignored, and partial content is most likely going to appear as a cache miss (but really, who knows? we could also totally load a partial template, or execute some arbitrary code);
  • Second, it doesn't call fsync() on the file, which means that the write is not guaranteed to be immediately committed (and as a consequence visible from the other nodes in a distributed filesystem).

We need to implement a standard "write to a temporary file, flush and rename" protocol here.

We could also use a fixed name for the temporary file, and lock it with LOCK_EX, which might or might not help with stampedes. (Nowadays, most distributed filesystems have locking primitives, but it might be only local to the node.)

Berdir’s picture

@znerol tried to improve that in #2527478: Resolve infinite stampede in mtime protected PHP storage I think. A review from you would be great there.

There were also various attempts by @chx and @mpdonadio earlier in this issue to solve it in the php storage but none really worked so this eventually circled back to this workaround, which is also very close to what I originally wrote in my early patches.

amateescu’s picture

Edited: cross-post with the first part of #125.

geerlingguy’s picture

So, should I stick a @todo mentioning the other issue in the loadTemplate() method, or somewhere else?

Fabianx’s picture

#127: Lets add the @todo to the test that the current test is non-functional and will be fixed in that other issue.

geerlingguy’s picture

@Fabianx - latest patch (#119) removes the test entirely as per #118 and above comments. Should we add that test back in for now, and leave the @todo with that test, or keep the test out for now, and put the todo with the changed function?

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

Oh, I missed #119. The code does not need a comment as the test is major.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

The patch attached improves the situation so I'm going ahead before beta 13 since this is not disruptive at all. I agree with @Damien Tournoud and @znerol - the real issue will be solved in #2527478: Resolve infinite stampede in mtime protected PHP storage which looks promising.

Committed 7595aa7 and pushed to 8.0.x. Thanks!

  • alexpott committed 7595aa7 on 8.0.x
    Issue #2429659 by chx, Berdir, mpdonadio, geerlingguy, Cottser,...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

star-szr’s picture

I'm working on #2555243: Upgrade path / plan to Twig 2.x aka 2.0 which involves porting some of our custom code in TwigEnvironment into a new Twig cache class.

I'm wondering if #2527478: Resolve infinite stampede in mtime protected PHP storage maybe makes this change no longer necessary. I'm a bit stuck because I don't see a way right now to incorporate this "just in case" handling into the cache class I'm writing so I'd like to ask for some testing to see if this issue can be reproduced with the current code.

If @geerlingguy or @Berdir or anyone else who was able to reproduce this can test the following patch that would be extremely helpful:

#2426563-93: Ignore: Patch testing issue
https://www.drupal.org/files/issues/upgrade_twig_test_1.x-cache4.patch

I tried it myself using @geerlingguy's Drupal VM steps in #96 on Drupal 8 HEAD with the patch and couldn't reproduce any errors. However I also tested on Drupal 8 commit b158c354b15abee86b58c2342c985456392d44a8 (based on the timestamp of #96) and was still not able to reproduce the error so I don't consider my testing to be definitive. I also tried with an older version of Drupal VM (3d5b33be5c974f4e26c96947d4f82239231c5301) just in case but still no luck breaking it.

star-szr’s picture

Actually I just got it to break using Drupal 8 b158c354b15abee86b58c2342c985456392d44a8 and the latest Drupal VM. Yay! I'll try with the patch on D8 HEAD a few more times.

star-szr’s picture

Development has now moved to a proper issue of its own (#2568171: Upgrade to Twig 1.22 and implement our own cache class) and we have some plans now for re-implementing this fallback in our cache class. Still would be good to have it tested manually once we have some code because we don't have automated tests for it.