Problem/Motivation
fastcgi_finish_request() as called by Symfony\HttpFoundation\Response.php ensures that the request is finished as soon as possible.
However given #2579775: Missing Content-Length header blocks response until all registered shutdown functions execute and the Kernel terminate events run this is not even true.
This means that e.g. batch or other things that use shutdown functions lead to the behavior that the next request is started while the shutdown functions do run, which essentially is a race condition.
As only certain environments use php-fpm, but others use Apache, this bug only occurs sometimes, which in general is pretty bad for debugging.
Proposed resolution
Current:
Batch is the only subsystem in Drupal core to rely on shutdown functions. It does this in case there are fatal errors during batch processing so that it can show error messages, which is not the default case. Therefore we can avoid relying on the shutdown function for normal batch operation.
- Create a service to register shutdown functions, which runs after sending the response, but before shutting the connection down.
- Convert batchs usage of register_shutdown_function to this new service
Another possibility would be to:
- Override send() in our local inheritance of Response.php (HtmlResponse, CacheableResponse, JSONResponse, etc.)
- Move the fastcgi_finish_request() to an early Kernel Exit subscriber instead, which can be advised to not do so for this request
- File an upstream bug to do the same upstream
or only do it upstream.
Remaining tasks
- Discuss
- Create test-case to reproduce
User interface changes
- None
API changes
- None
Data model changes
- None
Background information
What happens is this: batch step 1 starts - batch step 2 starts - batch step 1 updates the {batch} table in shutdown. One symptom is that the single config import form gives you a broken error message "An error occurred while processing with arguments:" but the import actually succeeds. Other batch functionality might be more broken.
This happens because of the fastcgi_finish_request() call in Symfony\Component\HttpFoundation\Response::send(). If it is commented out then the error does not occur. This allows the next request to start before shutdown functions occur.
After adding microtime(TRUE) debugging to index.php and _batch_shutdown() the following can be seen:
request start: 1486641213.3378
before update: 1486641216.4444
after update: 1486641216.4568
request start: 1486641215.0022 (!!!!!!)
before update: 1486641217.5272
after update: 1486641217.5371
Comment | File | Size | Author |
---|---|---|---|
#38 | interdiff-2851111-34-38.txt | 466 bytes | dagmar |
#38 | 2851111-38.patch | 3.97 KB | dagmar |
#34 | 2851111-34.patch | 3.96 KB | tstoeckler |
#27 | 2851111-27.patch | 2.71 KB | dawehner |
#20 | 2851111-20-batch-fastcgi.patch | 3.43 KB | tstoeckler |
Comments
Comment #2
david_garcia CreditAttribution: david_garcia commentedComment #3
Fabianx CreditAttribution: Fabianx as a volunteer commentedI am making this critical on the grounds that it can potentially lead to data corruption and to get it to the attention of the core committers to decide if it is critical or not.
Comment #4
david_garcia CreditAttribution: david_garcia commented@fabianx: rather than "before sending the response", should this not be "before the connection to the client is closed", that is, before symfony does it's fastcgi_finish_request.
Because we want the response to be sent ASAP. The user can have the feeling the page is loaded, even if the connection is still open. This is not true, though, depending on the type of response and will only work for HTML pages. Redirects, Ajax and other get totally blocked until the request has 100% finished and connection is closed.
I understand that the race condition you mention only happens in environments where fastcgi_finish_request() does work.
Comment #5
Fabianx CreditAttribution: Fabianx as a volunteer commented#4: You are completely right. I will fix that in the IS.
Comment #6
catchI have a feeling there's already an issue open for this.
I'd be tempted to override send() to get rid of the fastcgi_finish_request(), unless someone can prove it has a genuine benefit for page serving which seems unlikely compared to BigPipe etc.
Comment #7
xjm? (Emphasis added)
Comment #8
tim.plunkettI was very confused by this, because I didn't write the single config import form to use a batch.
But this issue added it: #2486467: Single config import form needs to use the config validation events
Not super helpful, sorry.
Comment #9
tedbowThis is patch that is mix of the 2 proposed solutions.
Here is patch that
@todo need to pass ShutDownRegistry service in the Response.php constructor.
I have manually tested batch_test page and does bypass fastcgi_finish_request().
Sorry if this totally off track just getting a handle on how this works.
Comment #10
tedbowInstead of using the ShutdownRegistry service in the Response class this passes an optional $prevent_early_flush bool to the constructor. The classes the instantiate a response object that extends this will use the Shutdown service if needed. Updated the 2 instances of using the HTMLResponse that I saw.
If we want to push the change up stream to Symfony we shouldn't rely on our service in the Response.
Comment #11
Fabianx CreditAttribution: Fabianx as a volunteer commented#6: The early termination of a request is usually very helpful to do background work that is not affecting the other requests. It is just not very helpful for batch things.
I think we should have the deprecated drupal_register_shutdown_function() call the service as we do usually. (there is a @todo in the code stating that, so just iterating my support for that).
I think it is a good idea to add an option to avoid the early flush.
Can we open an upstream issue and solicit opinions from Symfony about this topic, too?
Edit:
I also think avoiding the early flush should be the default for registered shutdown functions with an option to opt-in to have this shutdown function be a "true" background callback.
Comment #12
catch@Fabianx well let's not forget #2564921: In PHP-FPM environment, enabling a module in using a 'configure' route leads to an error page / #2589967: Rebuild routes immediately when modules are installed. We have to be certain that background work isn't needed before the next request by the same user, and core itself was not doing that correctly.
Comment #13
catchAlso we have a background API, it's the queue API, so if things don't need to be done before the request finishes that's fine. Then we can support waiting queues or add something like #1189464: Add an 'instant' queue runner, but if this is a critical bug, let's fix the bug, not add new features based on the bug.
Comment #14
tedbowre @Fabianx
Ok I changed the default to TRUE for $prevent_early_flush parameter.
So now the only way preventEarlyFlush() would return TRUE now is if no functions were registered or if only functions were registered that explicitly set $prevent_early_flush to FALSE. If any 1 function is registered with $prevent_early_flush TRUE OR without a value for $prevent_early_flush then preventEarlyFlush() would return FALSE.
Also renamed the class from ShutDownRegistry from ShutdownRegistry(caps change) b/c shutdown is 1 word.
Deprecated drupal_register_shutdown_function() and replaced logic with calls to the new service.
Updated _drupal_shutdown_function() to call \Drupal\Core\Utility\ShutdownRegistry::getCallbacks()
Also I had misread drupal_register_shutdown_function() specifically
Updated the function signature to
public function registerShutdownCallable(callable $callable, $arguments = [], $prevent_early_flush = TRUE)
$arguments is new here.
So now only batch.inc is using registerShutdownCallable(). I haven't replaced other uses to see if we get any errors. Calls to drupal_register_shutdown_function() will still work but will not be able to set $prevent_early_flush.
Comment #16
tedbowI think I missed @catch's point
It seems like simplest way to do this would be to extend Response and check drupal_register_shutdown_function() directly to see if any functions were registered.
Here is patch that does that.
Not doing an interdiff because missing most changes from previous patch.
Comment #17
tedbowWhoops uploaded the patch 2x
Comment #19
catchThat looks like a good compromise to me, thanks!
Comment #20
tstoecklerThat is not quite enough as in the JS-version of the Batch
_batch_do()
returns aJsonResponse
directly which still goes through the upstream logic in Symfony with #16. So we need a specializedJsonRepsonse
as well.Comment #21
hchonov"depending on based on" -> one of both should be removed :).
I think this should be return $this? Or why are we not using just inheritdoc?
True -> TRUE and also "otherwise" is missing.
Also in the doc of the trait or the send method I would like to see at least one sentence giving the batch processing as an example why this is needed :).
Comment #22
alexpottDiscussed with @catch, @cilefen, @cottser, @laurii, and @xjm and we agreed that this was critical because of data integrity. There is a chance that a user delete batch might get halfway through and leave the site in an inconsistent state - with only some of a users content deleted or changed to anonymous. Also many site owners don't choose whether they are using fastcgi so can't fix it by making different choices.
I think we need a CR to tell people about the new trait and the issue if you extend a Symfony response.
Comment #23
dawehnerCan the issue summary try to explain why this wasn't explored? I guess one disadvantage would be that the current shutdown function solution even works in the case of running into php timeout limits?
I'm really curious why noone has filed an upstream bug: https://github.com/symfony/symfony/pull/21959
Comment #24
tstoecklerThanks @alexpott and @dawehner. It took me enough time to debug through this that I didn't really bother to consider the larger implications of this.
I tend to agree with the first part of #23 (and I guess some earlier comments, but didn't read the entire issue, sorry) that Batch API really is misusing the shutdown function system. If it is doing stuff in a shutdown function that depends on the subsequent request not having been started that just feels wrong.
If we do go with the current approach, then we should really hope that the Symfony PR goes in, because otherwise we would have IMO to (in a follow-up) create subclasses for all Symfony Response classes and replace all usages of them within Drupal. Otherwise we're just kicking the can down the road, i.e. if someone somehow uses any other type of Response with the Batch system they will hit this same very cryptic and hard to debug problem.
How do others feel?
Comment #25
dawehnerWell, there has to be a reason why its not simply doing stuff on the end of its controller function.
Yeah the subclasses feel quite hacky.
@tstoeckler
Do you mind commenting on the symfony issue with a more detailed reason why we kinda need that?
Comment #26
catchBatch could possibly continue to register the shutdown function, but also implement a late request listener which does the same thing, and sets something to prevent the shutdown from running (or make it re-entrant and just let it run twice).
The reason it's using shutdown is so it runs even if there's a fatal error, a late request listener is not going to help with that at all. However when there is such an error, there's not going to be a successful subsequent batch request, so it might be OK like that.
Comment #27
dawehnerI tried to experiment with the result of the latest discussion.
Comment #29
tstoecklerHmm... why do we actually need a response subscriber when we know exactly for which responses we need this behavior?
Here's something I cooked up in the flight back from Seville, so I think the tag still counts ;-)
Comment #30
catchCouple of nits on the comments, but that looks exactly like #26, thanks!
FastCGI and fastcgi_finish_request()
s/functions/function/
Without significantly changing batch itself (or at least its error handling), while it's not the nicest patch it feels OK. We should possibly open a follow-up to discuss wider alternatives.
Comment #31
Munavijayalakshmi CreditAttribution: Munavijayalakshmi at Valuebound commentedComment #32
Munavijayalakshmi CreditAttribution: Munavijayalakshmi at Valuebound commentedMade changes as per the comment #30. Applying the patch, please review.
Comment #33
dawehnerNice! I really like that. Its a small change to fix the problem.
Comment #34
tstoecklerOh, I had a little kernel test for the new function lying around, apparently that didn't make it into the patch... #fail
Comment #36
tstoecklerComment #37
tstoecklerI just realized this does not do what I wanted it to do: It sets
$batch
to a boolean instead of fetching the batch and checking the function, due to missingdependenciesparentheses.Not sure, but I think this cannot actually be tested due to the nature of the Batch API and the necessity to trigger a fatal.
Edit: Brain fail -> wrong word
Comment #38
dagmarIf this help, I have never been able to run an upgrade test in my current environment without this patch, this for sure make the batch system work at least with php-fpm.
Updated the patch to not make $batch a Boolean.
Comment #40
Nebel54We had problems with race conditions when large views objects get serialized into the batch table. We started to write a patch which looked pretty much the same as #38. We are using #38 now and it works perfectly.
I also fired manually a fatal error in my batch. The update function in _batch_shutdown() gets fired as it is supposed to.
Comment #41
catchComment #42
catchI've updated the issue summary to include the solution in the patch.
I can't think of a good way to test this patch as a whole, so happy with the RTBC here, sent the patch for a re-test against 8.5.x though.
Comment #44
catchCommitted/pushed to 8.5.x and cherry-picked to 8.4.x. Thanks!
Comment #46
catchNo change record needed because there's no API change with the approach we went with.
Comment #47
prestonso CreditAttribution: prestonso commentedTagging for inclusion in 8.4.0-beta1 release notes.
Comment #49
Gábor HojtsyCan someone help with a two line summary for #2895685: Create release notes (a.k.a CHANGELOG) for Drupal 8.4.0-alpha1 and Drupal 8.4.0? After staring at this issue for quite a while, I would not say I can intelligently summarize the problem/solution, sorry :/
Comment #50
Nebel54I propose:
[Batch processes are broken due to a race condition when using fastcgi.](https://www.drupal.org/node/2851111)
Comment #51
Gábor HojtsyWell, they are not anymore ;) We are supposed to list what were problems before and how we solved them. See other entries in #2895685: Create release notes (a.k.a CHANGELOG) for Drupal 8.4.0-alpha1 and Drupal 8.4.0. I managed to gather that there was something broken with fastcgi and batches. The idea with the release notes is to give some useful summary instead of just an index of issues.
Comment #52
Nebel54Ok, then how about that?
A race condition occured in the Batch API when using fastcgi. The Batch API now ensures that (the current batch state is written completely to the database before starting the next batch)[https://www.drupal.org/node/2851111].
Comment #53
Gábor HojtsySuperb, exactly what we needed, thanks!