Problem/Motivation
When you are clearing caches while a site is being visited, then you sometimes get fatal errors like this:
Class '__TwigTemplate_e420be0e76e7b05d16c9c1bd35cee0103be31106e96972ab01f86f0a60b76811' not found
Happens in core/lib/Drupal/Core/Template/TwigEnvironment.php:141
I'm not quite sure why this happens. Was writing a few theories, but every time I look into them closer, I find out something new.
So it seems to be able to load the file, but then the class isn't found. Maybe the file is still being written? but then the file system should protect us from that?
It sometimes only happens once, the second request is then fine. But I have a few of those fatal errors in our logs. This error is also reproducible in some cluster/shared filesystem circumstances, like when running drupal with a shared files directory with slow disk access, or after initial install or cache clears in certain circumstances.
Proposed resolution
- Remove the exists method. It's broken.
- Augment load with a success callback. This way it can check whether the loading resulted in what the caller asked for and retry if not. Possible use cases: loading classes, functions, variables.
- Add a loadClass method wrapping load as this is the most common case.
Remaining tasks
RTBC the patch (which has a lot of manual testing at this point, but is decidedly tricky to write automated tests for).
User interface changes
N/A
API changes
N/A
Beta phase evaluation
Issue category | Bug because fatal errors can cause sites to be unreachable until manual cache clear is performed |
---|---|
Issue priority | Critical because its easy to replicate error situation |
Prioritized changes | Bug fixes are prioritized changes in the beta phase/td> |
Disruption | The change is not disruptive |
Comments
Comment #1
BerdirNote: This is a on a shared file system, so that could make this more likely, but I'm quite sure that I've seen this locally as well
Comment #2
dawehnerFor me this happened in some scenarios when the stream wrappers haven't been setup properly.
Comment #3
BerdirThis might be the most evil test I've ever written.
Unfortunately, the fix I thought of doesn't work because the file storage uses include_once, so it won't load the same file again and still fails. Maybe is also related to the actual issue, I don't know what is really going on.
Comment #4
mpdonadioI must say, that I was looking at the TestBot queue to see how big it was and couldn't help but read this issue...
Comment #6
star-szrComment #7
BerdirTalked about this with @Cottser, looks like the only useful thing that we can do here is throw an exception instead of having a fatal error. That still means the user gets an error page, but at least it's not an WSOD.
Comment #8
Berdir#2296009: Use APC Classloader by default (when available) just had a random fail with this, apparently caused by using APC. I think this means that those errors happen because of the apc classloader, although I don't know why yet.
Comment #9
Fabianx CreditAttribution: Fabianx commentedTwig uses a double caching logic to avoid having to check the filesystem again and again for things that are in cache.
Therefore a race condition during cache clear could be expected.
I think the class_exists($cls, FALSE) is the proper fix.
Comment #10
BerdirWhere exactly? And what should we do if it returns FALSE?
Comment #11
catchHEAD just had a random fail with this as well.
Comment #12
xjmComment #13
Fabianx CreditAttribution: Fabianx commentedThe caching logic happens in isFresh() and likely that test for a cache miss is wrong, but this should not affect core as its only happening when auto_reload => TRUE, which we don't do for performance reasons.
That means somehow either ->save() to storage fails or ->load() loads something else.
That is tricky for sure, hardening we should do anyway, so as proposed already:
at the minimum.
Then we could again do a check afterwards and throw an Exception, though that is really fatal if a newly dumped file cannot be loaded as class ...
Comment #14
BerdirYeah, but the class exists check won't help, because the php storage won't ever load the file again. We would have to combine it with the non-cache version of using eval?
Comment #15
Fabianx CreditAttribution: Fabianx commentedWhy not?
If the class does not exist and we store and retrieve it then, it should exist afterwards or not?
Comment #16
BerdirBecause FileStorage::load() does an include_once.
If the first load() call loads the file but that for some reason does *not* contain the class (only then we get into the || !class_exists() case), then calling load() again will *not* try to include that file again.
I quickly tried to switch to an include there, but Drupal dieded hard and I didn't try to fix it.
Comment #17
BerdirWhile doing stress tests on #2336627-16: Deadlock on cache_config (DatabaseBackend::setMultiple()), I noticed this happening again. so this might be more common/easier to reproduce than I thought.
The only additional information I have right now is that I also got a few corresponding watchdog messages:
So definitely looks like some sort of conflict in the phpstorage, where multiple processes try to write the file?
Comment #18
mpdonadioWould a `flock($handle, LOCK_EX);` be appropriate here? Since the lock is needed to write the cache and not read it, it won't be a huge performance hit?
Comment #19
BerdirI was thinking something like that yes, but it would probably have to be abstracted through PhpStorage? So we could have a lockForWrite() method that we can call immediately after failing to include the fail, or possibly even automatically in load (maybe with an option to disable it), that is removed again when the file was written.
Comment #20
mpdonadioHmm. PhpStorage doesn't use handles, so flock() can't be used directly unless we also add a lockfile to the Twig cache.
Maybe just this is all that needed?
Comment #21
BerdirI tried that, it gets rid of the watchdog messages, but I still get some class not found fatal errors.
Comment #22
BerdirOk, this fixes the fatal errors for me, although I still get some warnings.
Comment #23
mpdonadioWe could @mkdir to make the warnings go away, but it kinda sounds like there are still two processes in PhpStorage::save() at the same time.
Do we need to add an optional lock to PhpStorage or make a LockablePhpStorage, either of which whose constructor accepts a LockBackendInterface, and then protect save with it? Or a lock in TwigEnvironment?
Comment #24
mpdonadioQuick and dirty locking in FileStorage::save() to see what implodes. No other changes from HEAD.
Comment #25
mpdonadioOK, some additional testing with increased concurrency in ab reveals that Bad Things happen with a lock there, and I still got the Twig errors...
Comment #27
geerlingguy CreditAttribution: geerlingguy commentedCottser sent me this way from #2234229: PHP Fatal Error: Class __TwigTemplate not found in TwigEnvironment.php... I had the same strange issue on an infrastructure with a replicated file system:
Comment #28
Fabianx CreditAttribution: Fabianx commented#27: Your best bet is to put the twig template cache on /tmp, by setting up the phpstorage configuration in settings.php.
The template cache is also DB backed, so a global cache clear will clear all templates as the mtime is stored in the database.
Comment #29
Berdir@Fabianx: That's not correct?
Yes there's the auto refresh feature, but you definitely don't want to have that enabled on production, because that does a cache_get() for each template file.
Comment #30
BerdirI think there's not much we can do about the mkdir, see https://bugs.php.net/bug.php?id=35326 and #392100: Warning: mkdir(): File exists in imagecache_build_derivative() for somewhat similar issues.
Here's a combined patch with #22, the test from #3 (had to make a small change because the filename is not validated) and the @ for mkdir().
I think it would be great to get it in like that, which should take care of most issues.
Comment #31
geerlingguy CreditAttribution: geerlingguy commentedBut doesn't the twig cache need to be on a shared filesystem if you have multiple webservers? Otherwise, each server would have to regenerate its own twig cache on each request... at least that's how it seems this would work. I could turn off the Twig cache entirely, but that would result in abysmal performance on the Raspberry Pis.
Comment #32
BerdirThey would have to, but that's not the problem. The problem is distributing invalidations/cache clears. Just like ChainedFast cache backends, we'd need a shared flag/storage that they can check or you need a custom process that on cache clear, triggers a delete of those files on all servers.
Comment #33
Wim LeersSounds like we need these additional tags.
Comment #34
Fabianx CreditAttribution: Fabianx commented#29: Right, thats not active, I was mistaken there.
Still putting it on /tmp/drupal-cache/ is good.
Need to clear /tmp/drupal-cache then on code deploys, but with code deploy on multiple webservers being usually by jenkins, that is not a big deal either ...
Comment #35
rteijeiro CreditAttribution: rteijeiro commentedFixed a couple of nitpicks.
Comment #36
stefan.r CreditAttribution: stefan.r commentedI just had this issue as well on a vagrant box, this patch fixed it :)
Comment #37
BerdirAnything left to do here? It has tests, I can't think of a different fix, it should also help to fix random testbot fails that sometimes happen.
Comment #38
Fabianx CreditAttribution: Fabianx commentedNope, lets get this in.
This is RTBC.
Comment #40
BerdirSimple conflict in the use statements in the test after the String/SafeMarkup change.
Comment #42
BerdirUhm, did something change with the file storage tests that this now fails?
Comment #43
star-szrIndeed, git bisect points to #2453399: Use VFS for FileStorage tests.
Comment #44
mpdonadioMaybe this? Used `parse_url` so the tests wouldn't have to rely on the file_system service just to check the scheme.
Comment #45
mpdonadioGrrr.
Comment #46
BerdirTricky, but seems to be working. Anyone wants to RTBC this? This is causing a lot of fatal errors, in one case, it even caused a permanent error that we had to fix with a manual cache clear. This should basically be self-healing...
Comment #47
Fabianx CreditAttribution: Fabianx commentedI had to look three times, but yes due to how ChainedFileLoader works, you should indeed be able to embed not only inline templates, but also real templates ...
However is this still true? (We fixed some error messages and changed that part.)
Let's check that first ...
Comment #48
Fabianx CreditAttribution: Fabianx commentedThis gives just back the string maintenance-html.twig, but that is fine (as that is a valid twig template).
Therefore - even though that is probably not what berdir wanted originally - as it successfully tests the condition of a wrong template in the cache => RTBC.
If we wanted to we could change:
or something like that ...
But leave the rest the same.
Comment #49
catchI really don't like adding the eval() here even in a rare case. Is it worth looking at the switch to include Berdir mentioned in #16?
Comment #50
catchAlso if we have to have this protection for reads, what indication is there that the lock on write is doing anything?
Comment #51
catchfile_put_contents() doesn't care if you write to an already existing file, so what does the LOCK_EX actually help with here?
It'll stop another process from writing to the file at the same time, but then it's going to write once the lock is released - it won't stop the write from happening at all. So potentially we'll actually spend longer writing to the same file?
I'm not sure what the implication actually is, but that needs inline documentation (if it's actually useful here).
Needs a comment.
"failed to load".
Comment #52
mpdonadio#51 has been addressed, but as I wrote out the comment for (1), I wondered if
file_put_contents($path, $code, LOCK_EX);
is really sufficient. It will grab an exclusive lock for the .htaccess (if needed), and the file itself, but two processes could still be in the function at the same time making the directory. I'm wondering if the ::save() method needs to be refactored to use flock() directly to make the entire method process exclusive?Comment #54
catchFrom everything I could fine, LOCK_EX is only a write lock, it's not a read lock at all, i.e. http://php.net/manual/en/function.flock.php
If it was a read lock, we wouldn't need the fallback to eval() later on in the patch.
If it's only a write lock, then I don't see how it helps the race condition - it might even make it worse since you'd have two, subsequent writes to the file, where it can be partially read from, rather than two concurrent ones which may finish faster.
Comment #55
mpdonadioI am going to try to get confirmation about LOCK_EX; maybe I can raise @ircmaxell. My understanding from the comments is that it is an exclusive lock, which means only one process can have the file at any time; IOW it would block both readers and writers to the file.
Comment #56
catchThere's a stackoverflow thread here: http://stackoverflow.com/questions/4899737/should-lock-ex-on-both-read-w... which says it's only an advisory lock - so it'd work if all calling code used LOCK_EX - but file_get_contents() can't. Also it looks like ircmaxell answered that thread..
Comment #57
mpdonadioBlerg. Do you think refactoring this to create a
class AtomicFileStorage implements PhpStorageInterface
would be worth it?
Comment #58
chx CreditAttribution: chx commentedMy solution would be to remove exists from PhpStorageInterface as it is pointless and rather pass an optional callable to PhpStorageInterface::load which it can run after including to determine whether the loading was successful. The caller has logic already to deal with load failures.
Comment #59
chx CreditAttribution: chx commentedThis patch removes exists as it is unusable and adds loadClass instead.
Comment #60
dawehnerIt would be great to explain why we are waiting here ...
Comment #62
chx CreditAttribution: chx commentedSure.
Comment #63
dawehnerThank you chx. I'm sorry that I haven't see the bit on the interface.
Comment #66
chx CreditAttribution: chx commentedLast fail: @covers exists when exists no longer , well, exists.
Comment #67
BerdirThis looks quite interesting.
Not sure if it is possible to restore my test, or test it in a different way somehow?
But I'll repeat my manual tests (mixing ab -c 10 with drush cr calls)
Comment #68
chx CreditAttribution: chx commentedIf it were fundamentally broken nothing would work since this loads DrupalKernel. If we badly want to test this then we could test the implementation which I never like by creating two files:
and then test
Comment #69
chx CreditAttribution: chx commentedWe could use a success callable which opens the door for success checking including arrays, functions and whatever else. Even if we want to go back to the previous version, there are small improvements in this one compared to previous.
Comment #71
chx CreditAttribution: chx commentedOnly 22439 fails? I need to work harder.
Comment #72
chx CreditAttribution: chx commentedComment #73
dawehnerRegarding test coverage, I think non race condition problem will be visible immediately, given how often we render something using a template,
so that part is fine for me :)
It seems entirely not obvious why we have both doIncludeOnce and doInclude ... do you mind explaining this somewhere? I assume there is some intention behind it?
Comment #74
chx CreditAttribution: chx commentedAdded code and doxygen to make load and loadClass safe by default but overridable. Much nicer and very visible of why doIncludeOnce is necessary:
Comment #78
BerdirThe last patch doesn't solve the problem for me. When stress-testing this according to #17, then I still sometimes get the fatal error.
Note that on HEAD, it's more likely that you will run into serialization exceptions, see the issue in #17.
I also get watchdog entries like this:
I'm not sure, but I think the patch also resulted in apache getting stuck completely :)
Going to try with the patch from #52 on production now, that seems to work fine locally.
Comment #79
chx CreditAttribution: chx commentedSomething is odd here: we write the file first to a random location (str_shuffle combined with microtime) and then do a rename. And yet somehow we manage to read a partially written file? How come?
Comment #80
geerlingguy CreditAttribution: geerlingguy commentedSo (mostly for my own benefit), I can reproduce the error:
On my Raspberry Pi Dramble stack (4 webservers using a shared Gluster mount).
Steps to reproduce:
Notes:
$ ansible all -m shell -a "date --set='10:25:55'" -s
, everything works fine again.I'm at the DrupalCon LA sprints, and I will be testing the patch here and seeing if I can help get this resolved (or at least more thoroughly tested) today.
Comment #81
mpdonadio@geerlingguy, are you NTP synced, or just jam syncing?
Comment #82
geerlingguy CreditAttribution: geerlingguy commentedI tested the patch in #74, and while it doesn't solve the WSOD issue on the Pis with time drift, it doesn't cause any regression either, and seems like a good approach towards improving the situation. Still doing more testing, too, and I'll try to wrk with chx via IRC if I can some time this afternoon.
Comment #83
geerlingguy CreditAttribution: geerlingguy commented@mpdonadio - to test time drift, I set each of the four webservers' clocks explicitly to a value +/- 5 seconds from the actual time (using, e.g.
sudo date --set='21:52:48'
). Then, to bring them back in sync, I'm jam syncing from my Mac's current time to the stack.Since the servers have no RTC, and are not connected to the Internet (and don't have NTP), they don't have any (simple) ability to keep their time in sync otherwise. I've thought about setting up one of the servers as an NTP server, with an RTC set up through GPIO, so the other servers could sync to the master clock, but haven't yet set that up (see: https://github.com/geerlingguy/raspberry-pi-dramble/issues/43).
Granted, this is (or at least should be) a pretty rare use case—having a multi-server setup with the twig cache on a shared filesystem with servers that have out-of-sync clocks. So we might not want to bikeshed on that. I'd just like to see if I can figure out exactly where the issue is coming from, since the twig file exists on all the servers, and it seems like a semi-random failure (some requests go through, some are 500s with the above error).
Comment #84
chx CreditAttribution: chx commentedYes, if you can figure this out, the ocntents of the file being read would be invaluable. See #79
Comment #85
geerlingguy CreditAttribution: geerlingguy commented@chx - I've attached the file that's present at
/mnt/gluster/files/php/twig/1#36#e9#218a72f912db7dd7fe18458418b9145115f045ba90b5100e5d5fe8b5716d/4cc1fc2e1dc455a9a45bbe4ad00241000181573fffb63d719af245daa0036c36.php
(that file is available at that mount, which is mounted at the point/var/www/drupal/sites/default/files
). The watchdog message that appears when that file is trying to be loaded is:The file is definitely present on the system, it just won't be loaded for some reason, when the clock drifts from other servers handling active requests.
Comment #86
geerlingguy CreditAttribution: geerlingguy commentedI'm also trying to reproduce the OP's bug on the Dramble stack, and can't quite do so—steps followed:
drush cr
wrk -t4 -c24 -d10 http://pidramble.com/?nocache=true
(this loads the page continuously). Page loads slowly, but all requests still go through, with 0 errors.I tried the above with both beta10 and with HEAD, and I can't reproduce the original issue, except when I explicitly set the Pi clocks to drift. The patch in #74 seems like it's a good approach here for any edge cases, but I am finding it hard to reproduce the actual issue with the cache rebuild race condition.
Comment #87
geerlingguy CreditAttribution: geerlingguy commentedI also added the debug code inside
MTimeProtectedFileStorage.php
:Contents of
/tmp/log.txt
:And with that call to file_put_contents() (and no other changes), I'm now getting a new error message in the logs:
I will attach the loaded service container file as mentioned in the above error message in the next comment.
Comment #88
geerlingguy CreditAttribution: geerlingguy commented(And a note: sorry for oversharing the data here... I just want to make sure I put as much debug info into this issue as possible since it's rare I get more than a few minutes to dig into this bug, and it's probably not a bug that's quick/easy to reproduce!)
Attached is the service container mentioned in the previous comment.
Comment #89
chx CreditAttribution: chx commentedThis patch implements logic similar to
DrupalKernel
to make sure the error can't exist.Comment #90
pbuyle CreditAttribution: pbuyle at Floe design + technologies commentedI had the same issue on a fresh Drupal 8 install made with http://www.drupalvm.com/, the patch in #89 fixed it.
Comment #91
geerlingguy CreditAttribution: geerlingguy commented@pbuyle @chx - Strangely, I just started seeing Twig errors on initial site install (they cleared up after a second page refresh) today, reliably reproducible on a fresh Drupal 8 HEAD install on either Ubuntu 14.04 or 12.04 with PHP 5.5. I'm testing this patch now (using Drupal VM).
Comment #92
geerlingguy CreditAttribution: geerlingguy commentedTo reproduce (if you don't want to set up a cluster of servers with a Gluster filesystem):
http://drupaltest.dev/
After applying the patch in #89, and completely rebuilding the VM from scratch (or reinstalling the site, the first page load is fine.
I was able to reproduce 3/3 tries, and used both Ubuntu 12.04 and 14.04, with PHP 5.5 and 5.6 with each of the OSes. The patch was also successful in not resulting in an error on first page load 3/3 tries.
Thumbs up from me; this is an annoying bugger, and I think I've now spent at least 12 hours of my life on it! Thanks so much for the patch, @chx :)
Comment #93
Fabianx CreditAttribution: Fabianx as a volunteer commentedRTBC + 1, even if chx' approach in the other issue is looking promising, this should help to at least avoid the race condition for now.
Can we open a (postponed) follow-up based on #2513326: Performance: create a PHP storage backend directly backed by cache to enable that for twig templates, too.
Comment #94
alexpottI hesitate to ask but is this in anyway testable?
Comment #96
geerlingguy CreditAttribution: geerlingguy commented@alexpott - I can reproduce 100% of the time in two different scenarios (one that's simple enough for anyone to reproduce, anywhere, the other requiring a set of servers that have their filesystems shared via GlusterFS... so the first way is probably the simplest for easy reproduction):
/var/log/apache2/error.log
(pasted below).Side note: Judging by the error message above, this could be related to #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()) as well. Regardless, it seems the patch in #89 fixes at least this particular issue.
To reproduce,
vagrant destroy -f && vagrant up
, then load the page again. If I add the patch to the drush make example that's included with Drupal VM, then provision a new instance of the VM, the error is 100% reliably fixed! I also tested it on the Raspberry Pi #Dramble cluster, and while it's harder to reliably reproduce in that scenario, with the patch applied I never saw the error, but without it, I did see the error from time to time when the clocks were more than a few seconds out of sync (since they're not synced via NTP, they drift after a few days).Here's the exact Drupal make file I was using with Drupal VM (could work with other setups, just make sure PHP/webserver has been cleanly restarted so nothing is in memory already prior to first page load, and make sure you don't install Drupal via web UI, since that may pre-build some of the twig caches):
Nudging testbot again...
Comment #98
geerlingguy CreditAttribution: geerlingguy commentedForgot to attach the first page load error image (this is what the first visitor to a new site sees):
Comment #99
Fabianx CreditAttribution: Fabianx as a volunteer commentedI don't think its possible to test the race condition, but the manual testing should be good enough.
Comment #100
Anonymous (not verified) CreditAttribution: Anonymous commenteda long time ago, in a galaxy far, far away... #850782: allow testing lock code via async http calls
well, 2010. who can remember wtf was happening back then these days?
Comment #101
lauriiiAdded beta evaluation
Comment #102
BerdirHm. Actually, that's more or less the same code as I had from the beginning. And that had tests. See for example in #52 which I think is the last place they existed.
@beejebus: I don't see how that would help here. This is not predictable, it happens when you have many requests that are trying to build the same twig templates, possibly on a slow file system. That's very different from a predictable lock where you know the second request will hit the lock.
Comment #103
Anonymous (not verified) CreditAttribution: Anonymous commented@berdir - the only predictable thing here is that we'll never catch these bugs by sending one request at a time.
we have many sections of code that make assertions about safety when running across multiple concurrent requests, and test literally 0% of them that way. hard to figure out why 'it might not catch everything with our test suite' is used to defend 'we will never catch any of this class of bugs with our current test suite'.
Comment #104
Fabianx CreditAttribution: Fabianx as a volunteer commentedYes, right, sorry berdir I totally forgot that #3 had a test, which at least tests the error condition part of it.
Lets merge the test of #3 into this patch.
Comment #105
BerdirThe test in #52 might be easier to merge, I had to update it at some point.
Comment #106
geerlingguy CreditAttribution: geerlingguy commentedJust to give a little more of a push for this; I fired up a fresh instance of the entire Dramble stack (I re-imaged all the 6 Pi microSD cards, then reprovisioned everything, and installed Drupal 8 HEAD from this morning), and now it seems about every 5th or 6th page load (when logged in) results in a WSOD on first load, then if I refresh, the page loads fine.
I currently have the Nginx balancer set to go round-robin, with no IP pinning, so each request goes to a new web node behind the balancer. If I turn IP pinning on, then requests work correctly. But in the real world, most requests would be distributed amongst the backend servers, resulting in users hitting the WSOD pages on initial page load.
The error in the Nginx log on the backend server is as follows:
Comment #107
geerlingguy CreditAttribution: geerlingguy commentedI definitely still agree with 'Major' priority... might even argue for Critical, since now I can reproduce this error in three different environments—running Debian 7.8 with PHP 7.0.0 beta1 compiled, I get errors like the following on every 3-5 page loads:
This is on a Raspberry Pi with slow disk writes (since Drupal is running on a microSD card)... maybe the slow disk access is causing the template to not be found? The compiled directory/file is most definitely present at
sites/default/files/php/twig/1#40#8a#c94a722c70a8d9e243d9eebb745aeff646c4fc11d22dd91885dbb5553bce
, but the .php file is empty...Though the Pi is an exceptional use case, I have definitely seen weird/slow disk writes in many cloud environments, and I can imagine more people will run into these issues as well.
Update: It seems that it may have been a weird case of FPM having some split-brained behavior. A full restart of the server seems to have cleared up the errors this time; maybe different child processes had different/broken cached file directory information or something.
Comment #108
star-szrI can't get the test-only to fail, but here's something.
Comment #109
BerdirWould be very interesting to see if #2527478: Resolve infinite stampede in mtime protected PHP storage improves the frequency of this happening.
I still think we should get this committed to avoid the fatals *if* it happens. But I hope that the other issue means it happens much less frequently.
Comment #110
geerlingguy CreditAttribution: geerlingguy commentedStill RTBC in my book—and that test probably won't ever fail unless you can make testbots do multiple concurrent requests (#2527478: Resolve infinite stampede in mtime protected PHP storage should help). Let's stop the bleeding...
Comment #111
Fabianx CreditAttribution: Fabianx as a volunteer commentedDon't we add something by now to inline template names in the InlineLoader?
I think if we want to fool the cache, we need to do the same {# #} comment adding here ...
Comment #112
star-szrYup that is probably needed to calculate the cache filename properly: https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21Template%...
Doing a quick test of this:
$cache_file = $environment->getCacheFilename('{# inline_template_start #}' . $name);
Seemed to result in an infinite loop for me. Not sure why…
Edit: And also combined with doing an actual inline template like @Fabianx suggested earlier:
Hello World - {{ 1+0 }}
.Comment #113
Fabianx CreditAttribution: Fabianx as a volunteer commentedHmm, maybe lets use a real template then instead of an inline template or call twig_render_template directly?
Comment #114
geerlingguy CreditAttribution: geerlingguy commentedReroll of #108; test file changed.
Note that this patch is still necessary to get things working with Drupal 8 on my little Raspberry Pi cluster, using GlusterFS for a shared files directory mount. I seem to have encountered another race condition over in #2540912: Installation fails with files directory on glusterfs: "Warning: mkdir(): File exists", but it looks like that one is resolved by #2497243-188: Replace Symfony container with a Drupal one, stored in cache.
Comment #115
catchI'd been assuming this was self-healing, but had missed berdir's comment in https://www.drupal.org/node/2429659#comment-9845943 where he pointed out the site was down permanently until they manually intervened. Since berdir's site is more or less the only 8.x site with any traffic at the moment, we should assume that people will run into this more often.
I'm bumping this to critical. However also this is pretty much there except for test coverage. Given it's very hard to test, and we have months of manual testing on patched installs, let's add the regression test in a follow-up issue.
Comment #116
Fabianx CreditAttribution: Fabianx as a volunteer commentedRTBC then, we can fix the test in a follow-up. (and should create one for it)
Comment #117
lauriiiI've seen this problem too so agree that this is quite critical
Comment #118
catchShould we not remove the test and put it back in the follow-up?
Or is it worth having the test there to avoid a further regression which might actually make the test fail? If so I could go for that, but I think it needs a @todo in the test to point out that we've not been able to get it to fail.
Comment #119
geerlingguy CreditAttribution: geerlingguy at Acquia commentedPatch with bugfix-only is attached. I'll split the test into a separate patch for a follow-up issue, to be opened momentarily...
Comment #120
geerlingguy CreditAttribution: geerlingguy at Acquia commentedUpdated IS, since it's been a while...
Comment #121
geerlingguy CreditAttribution: geerlingguy at Acquia commentedAdded related issue #2541440: Add tests for race conditions in Twig template cache for the tests. I'll supply the first patch, with the broken out test, over there.
Can we get an RTBC on the above reroll without the tests?
Comment #122
mpdonadioI kinda think we should add a @todo linking to the issue mentioning, and also mention that there is no coverage for that code.
Comment #123
dawehner+1 for that idea. Yeah race conditions can be hard to test, this is for sur.
Thank you @geerlingguy for updating the issue summary.
Comment #124
Damien Tournoud CreditAttribution: Damien Tournoud commentedFrom the look of this, the main bug is actually in
\Drupal\Component\PhpStorage\FileStorage
, this has more race conditions than lines of code.The main problem is the implementation of
PhpStorageInterface::save()
, which is basically a wrapper aroundfile_put_contents()
and as a consequence has two major problems:file_put_contents()
doesn't perform any kind of locking, which means that readers are going to see partial content. Because the readers are using@include
, all the errors are ignored, and partial content is most likely going to appear as a cache miss (but really, who knows? we could also totally load a partial template, or execute some arbitrary code);We need to implement a standard "write to a temporary file, flush and rename" protocol here.
We could also use a fixed name for the temporary file, and lock it with
LOCK_EX
, which might or might not help with stampedes. (Nowadays, most distributed filesystems have locking primitives, but it might be only local to the node.)Comment #125
Berdir@znerol tried to improve that in #2527478: Resolve infinite stampede in mtime protected PHP storage I think. A review from you would be great there.
There were also various attempts by @chx and @mpdonadio earlier in this issue to solve it in the php storage but none really worked so this eventually circled back to this workaround, which is also very close to what I originally wrote in my early patches.
Comment #126
amateescu CreditAttribution: amateescu for Pfizer, Inc. commentedEdited: cross-post with the first part of #125.
Comment #127
geerlingguy CreditAttribution: geerlingguy at Acquia commentedSo, should I stick a @todo mentioning the other issue in the
loadTemplate()
method, or somewhere else?Comment #128
Fabianx CreditAttribution: Fabianx as a volunteer commented#127: Lets add the @todo to the test that the current test is non-functional and will be fixed in that other issue.
Comment #129
geerlingguy CreditAttribution: geerlingguy at Acquia commented@Fabianx - latest patch (#119) removes the test entirely as per #118 and above comments. Should we add that test back in for now, and leave the @todo with that test, or keep the test out for now, and put the todo with the changed function?
Comment #130
Fabianx CreditAttribution: Fabianx as a volunteer commentedOh, I missed #119. The code does not need a comment as the test is major.
Comment #131
alexpottThe patch attached improves the situation so I'm going ahead before beta 13 since this is not disruptive at all. I agree with @Damien Tournoud and @znerol - the real issue will be solved in #2527478: Resolve infinite stampede in mtime protected PHP storage which looks promising.
Committed 7595aa7 and pushed to 8.0.x. Thanks!
Comment #134
star-szrI'm working on #2555243: Upgrade path / plan to Twig 2.x aka 2.0 which involves porting some of our custom code in TwigEnvironment into a new Twig cache class.
I'm wondering if #2527478: Resolve infinite stampede in mtime protected PHP storage maybe makes this change no longer necessary. I'm a bit stuck because I don't see a way right now to incorporate this "just in case" handling into the cache class I'm writing so I'd like to ask for some testing to see if this issue can be reproduced with the current code.
If @geerlingguy or @Berdir or anyone else who was able to reproduce this can test the following patch that would be extremely helpful:
#2426563-93: Ignore: Patch testing issue
https://www.drupal.org/files/issues/upgrade_twig_test_1.x-cache4.patch
I tried it myself using @geerlingguy's Drupal VM steps in #96 on Drupal 8 HEAD with the patch and couldn't reproduce any errors. However I also tested on Drupal 8 commit
b158c354b15abee86b58c2342c985456392d44a8
(based on the timestamp of #96) and was still not able to reproduce the error so I don't consider my testing to be definitive. I also tried with an older version of Drupal VM (3d5b33be5c974f4e26c96947d4f82239231c5301) just in case but still no luck breaking it.Comment #135
star-szrActually I just got it to break using Drupal 8 b158c354b15abee86b58c2342c985456392d44a8 and the latest Drupal VM. Yay! I'll try with the patch on D8 HEAD a few more times.
Comment #136
star-szrDevelopment has now moved to a proper issue of its own (#2568171: Upgrade to Twig 1.22 and implement our own cache class) and we have some plans now for re-implementing this fallback in our cache class. Still would be good to have it tested manually once we have some code because we don't have automated tests for it.