Problem/Motivation

fastcgi_finish_request() as called by Symfony\HttpFoundation\Response.php ensures that the request is finished as soon as possible.

However given #2579775: Missing Content-Length header blocks response until all registered shutdown functions execute and the Kernel terminate events run this is not even true.

This means that e.g. batch or other things that use shutdown functions lead to the behavior that the next request is started while the shutdown functions do run, which essentially is a race condition.

As only certain environments use php-fpm, but others use Apache, this bug only occurs sometimes, which in general is pretty bad for debugging.

Proposed resolution

Current:
Batch is the only subsystem in Drupal core to rely on shutdown functions. It does this in case there are fatal errors during batch processing so that it can show error messages, which is not the default case. Therefore we can avoid relying on the shutdown function for normal batch operation.

- Create a service to register shutdown functions, which runs after sending the response, but before shutting the connection down.
- Convert batchs usage of register_shutdown_function to this new service

Another possibility would be to:

- Override send() in our local inheritance of Response.php (HtmlResponse, CacheableResponse, JSONResponse, etc.)
- Move the fastcgi_finish_request() to an early Kernel Exit subscriber instead, which can be advised to not do so for this request
- File an upstream bug to do the same upstream

or only do it upstream.

Remaining tasks

- Discuss
- Create test-case to reproduce

User interface changes

- None

API changes

- None

Data model changes

- None

Background information

What happens is this: batch step 1 starts - batch step 2 starts - batch step 1 updates the {batch} table in shutdown. One symptom is that the single config import form gives you a broken error message "An error occurred while processing with arguments:" but the import actually succeeds. Other batch functionality might be more broken.

This happens because of the fastcgi_finish_request() call in Symfony\Component\HttpFoundation\Response::send(). If it is commented out then the error does not occur. This allows the next request to start before shutdown functions occur.

After adding microtime(TRUE) debugging to index.php and _batch_shutdown() the following can be seen:

request start: 1486641213.3378
before update: 1486641216.4444
after update:  1486641216.4568
request start: 1486641215.0022 (!!!!!!)
before update: 1486641217.5272
after update:  1486641217.5371
Members fund testing for the Drupal project. Drupal Association Learn more

Comments

Fabianx created an issue. See original summary.

Fabianx’s picture

Priority: Major » Critical
Issue summary: View changes

I am making this critical on the grounds that it can potentially lead to data corruption and to get it to the attention of the core committers to decide if it is critical or not.

david_garcia’s picture

Create a service to register shutdown functions, which run before sending the response

@fabianx: rather than "before sending the response", should this not be "before the connection to the client is closed", that is, before symfony does it's fastcgi_finish_request.

Because we want the response to be sent ASAP. The user can have the feeling the page is loaded, even if the connection is still open. This is not true, though, depending on the type of response and will only work for HTML pages. Redirects, Ajax and other get totally blocked until the request has 100% finished and connection is closed.

I understand that the race condition you mention only happens in environments where fastcgi_finish_request() does work.

Fabianx’s picture

Issue summary: View changes

#4: You are completely right. I will fix that in the IS.

catch’s picture

I have a feeling there's already an issue open for this.

I'd be tempted to override send() to get rid of the fastcgi_finish_request(), unless someone can prove it has a genuine benefit for page serving which seems unlikely compared to BigPipe etc.

xjm’s picture

What happens is this: batch step 1 starts - batch step 2 starts - batch step 1 updates the {batch} table in shutdown. One symptom is that the single config import form gives you a broken error message "An error occurred while processing with arguments:" but the import actually succeeds. Other batch functionality might be more broken.

? (Emphasis added)

tim.plunkett’s picture

I was very confused by this, because I didn't write the single config import form to use a batch.

But this issue added it: #2486467: Single config import form needs to use the config validation events

Not super helpful, sorry.

tedbow’s picture

Status: Active » Needs review
FileSize
3.93 KB

This is patch that is mix of the 2 proposed solutions.

Here is patch that

  1. Overrides send() in our local inheritance of Response.php (used now only in HtmlResponse but probably should be used in CacheableResponse, JSONResponse, etc.)
  2. Creates a ShutDownRegistry service to register shutdown functions. When registering a shutdown you can set $prevent_early_flush.
  3. The inheritance of Response.php used the ShutDownRegistry to check preventEarlyFlush().
  4. Replace call to drupal_register_shutdown_function in batch.inc to use ShutDownRegistry(all other calls would be replaced and drupal_register_shutdown_function deprecated)

@todo need to pass ShutDownRegistry service in the Response.php constructor.

I have manually tested batch_test page and does bypass fastcgi_finish_request().

Sorry if this totally off track just getting a handle on how this works.

tedbow’s picture

Instead of using the ShutdownRegistry service in the Response class this passes an optional $prevent_early_flush bool to the constructor. The classes the instantiate a response object that extends this will use the Shutdown service if needed. Updated the 2 instances of using the HTMLResponse that I saw.

If we want to push the change up stream to Symfony we shouldn't rely on our service in the Response.

Fabianx’s picture

#6: The early termination of a request is usually very helpful to do background work that is not affecting the other requests. It is just not very helpful for batch things.

I think we should have the deprecated drupal_register_shutdown_function() call the service as we do usually. (there is a @todo in the code stating that, so just iterating my support for that).

I think it is a good idea to add an option to avoid the early flush.

Can we open an upstream issue and solicit opinions from Symfony about this topic, too?

Edit:

I also think avoiding the early flush should be the default for registered shutdown functions with an option to opt-in to have this shutdown function be a "true" background callback.

catch’s picture

@Fabianx well let's not forget #2564921: In PHP-FPM environment, enabling a module in using a 'configure' route leads to an error page / #2589967: Rebuild routes immediately when modules are installed. We have to be certain that background work isn't needed before the next request by the same user, and core itself was not doing that correctly.

catch’s picture

Also we have a background API, it's the queue API, so if things don't need to be done before the request finishes that's fine. Then we can support waiting queues or add something like #1189464: Add a browser-based queue runner, but if this is a critical bug, let's fix the bug, not add new features based on the bug.

tedbow’s picture

FileSize
10.07 KB
12.57 KB

re @Fabianx

I also think avoiding the early flush should be the default for registered shutdown functions with an option to opt-in to have this shutdown function be a "true" background callback.

Ok I changed the default to TRUE for $prevent_early_flush parameter.
So now the only way preventEarlyFlush() would return TRUE now is if no functions were registered or if only functions were registered that explicitly set $prevent_early_flush to FALSE. If any 1 function is registered with $prevent_early_flush TRUE OR without a value for $prevent_early_flush then preventEarlyFlush() would return FALSE.

Also renamed the class from ShutDownRegistry from ShutdownRegistry(caps change) b/c shutdown is 1 word.

Deprecated drupal_register_shutdown_function() and replaced logic with calls to the new service.

Updated _drupal_shutdown_function() to call \Drupal\Core\Utility\ShutdownRegistry::getCallbacks()

Also I had misread drupal_register_shutdown_function() specifically

* @param ...
 *   Additional arguments to pass to the shutdown function.

Updated the function signature to
public function registerShutdownCallable(callable $callable, $arguments = [], $prevent_early_flush = TRUE)
$arguments is new here.

So now only batch.inc is using registerShutdownCallable(). I haven't replaced other uses to see if we get any errors. Calls to drupal_register_shutdown_function() will still work but will not be able to set $prevent_early_flush.

Status: Needs review » Needs work

The last submitted patch, 14: 2851111-14.patch, failed testing.

tedbow’s picture

Status: Needs work » Needs review
FileSize
1.72 KB
1.72 KB

I think I missed @catch's point

but if this is a critical bug, let's fix the bug, not add new features based on the bug.

It seems like simplest way to do this would be to extend Response and check drupal_register_shutdown_function() directly to see if any functions were registered.

Here is patch that does that.

Not doing an interdiff because missing most changes from previous patch.

tedbow’s picture

Whoops uploaded the patch 2x

The last submitted patch, 16: 2851111-16.patch, failed testing.

catch’s picture

That looks like a good compromise to me, thanks!

tstoeckler’s picture

That is not quite enough as in the JS-version of the Batch _batch_do() returns a JsonResponse directly which still goes through the upstream logic in Symfony with #16. So we need a specialized JsonRepsonse as well.

hchonov’s picture

  1. +++ a/core/lib/Drupal/Core/Render/ResponseTrait.php
    @@ -0,0 +1,48 @@
    + * Determines whether fastcgi_finish_request should be called depending on
    + * based on whether any shutdown functions are registered.
    

    "depending on based on" -> one of both should be removed :).

  2. +++ a/core/lib/Drupal/Core/Render/ResponseTrait.php
    @@ -0,0 +1,48 @@
    +   * @return Response
    

    I think this should be return $this? Or why are we not using just inheritdoc?

  3. +++ a/core/lib/Drupal/Core/Render/ResponseTrait.php
    @@ -0,0 +1,48 @@
    +   *   True if any shutdown functions are registered.
    

    True -> TRUE and also "otherwise" is missing.

Also in the doc of the trait or the send method I would like to see at least one sentence giving the batch processing as an example why this is needed :).

alexpott’s picture

Discussed with @catch, @cilefen, @cottser, @laurii, and @xjm and we agreed that this was critical because of data integrity. There is a chance that a user delete batch might get halfway through and leave the site in an inconsistent state - with only some of a users content deleted or changed to anonymous. Also many site owners don't choose whether they are using fastcgi so can't fix it by making different choices.

I think we need a CR to tell people about the new trait and the issue if you extend a Symfony response.

dawehner’s picture

- Create a service to register shutdown functions, which runs after sending the response, but before shutting the connection down.
- Convert batchs usage of register_shutdown_function to this new service

Can the issue summary try to explain why this wasn't explored? I guess one disadvantage would be that the current shutdown function solution even works in the case of running into php timeout limits?

I'm really curious why noone has filed an upstream bug: https://github.com/symfony/symfony/pull/21959

tstoeckler’s picture

Thanks @alexpott and @dawehner. It took me enough time to debug through this that I didn't really bother to consider the larger implications of this.

I tend to agree with the first part of #23 (and I guess some earlier comments, but didn't read the entire issue, sorry) that Batch API really is misusing the shutdown function system. If it is doing stuff in a shutdown function that depends on the subsequent request not having been started that just feels wrong.

If we do go with the current approach, then we should really hope that the Symfony PR goes in, because otherwise we would have IMO to (in a follow-up) create subclasses for all Symfony Response classes and replace all usages of them within Drupal. Otherwise we're just kicking the can down the road, i.e. if someone somehow uses any other type of Response with the Batch system they will hit this same very cryptic and hard to debug problem.

How do others feel?

dawehner’s picture

I tend to agree with the first part of #23 (and I guess some earlier comments, but didn't read the entire issue, sorry) that Batch API really is misusing the shutdown function system. If it is doing stuff in a shutdown function that depends on the subsequent request not having been started that just feels wrong.

Well, there has to be a reason why its not simply doing stuff on the end of its controller function.

If we do go with the current approach, then we should really hope that the Symfony PR goes in, because otherwise we would have IMO to (in a follow-up) create subclasses for all Symfony Response classes and replace all usages of them within Drupal. Otherwise we're just kicking the can down the road, i.e. if someone somehow uses any other type of Response with the Batch system they will hit this same very cryptic and hard to debug problem.

Yeah the subclasses feel quite hacky.

@tstoeckler
Do you mind commenting on the symfony issue with a more detailed reason why we kinda need that?

catch’s picture

Batch could possibly continue to register the shutdown function, but also implement a late request listener which does the same thing, and sets something to prevent the shutdown from running (or make it re-entrant and just let it run twice).

The reason it's using shutdown is so it runs even if there's a fatal error, a late request listener is not going to help with that at all. However when there is such an error, there's not going to be a successful subsequent batch request, so it might be OK like that.

dawehner’s picture

I tried to experiment with the result of the latest discussion.

Status: Needs review » Needs work

The last submitted patch, 27: 2851111-27.patch, failed testing.

tstoeckler’s picture

Status: Needs work » Needs review
Issue tags: +DevDaysSeville
FileSize
2.73 KB

Hmm... why do we actually need a response subscriber when we know exactly for which responses we need this behavior?

Here's something I cooked up in the flight back from Seville, so I think the tag still counts ;-)

catch’s picture

Couple of nits on the comments, but that looks exactly like #26, thanks!

  1. +++ b/core/includes/batch.inc
    @@ -47,8 +47,21 @@ function _batch_page(Request $request) {
    +  // to handle this case. Because with FastCGI shutdown functions are called
    

    FastCGI and fastcgi_finish_request()

  2. +++ b/core/includes/batch.inc
    @@ -47,8 +47,21 @@ function _batch_page(Request $request) {
    +  // shutdown functions would lead to race conditions between consecutive
    

    s/functions/function/

Without significantly changing batch itself (or at least its error handling), while it's not the nicest patch it feels OK. We should possibly open a follow-up to discuss wider alternatives.

Munavijayalakshmi’s picture

Assigned: Unassigned » Munavijayalakshmi
Munavijayalakshmi’s picture

Assigned: Munavijayalakshmi » Unassigned
FileSize
2.77 KB
1.6 KB

Made changes as per the comment #30. Applying the patch, please review.

dawehner’s picture

Nice! I really like that. Its a small change to fix the problem.

tstoeckler’s picture

Oh, I had a little kernel test for the new function lying around, apparently that didn't make it into the patch... #fail

Status: Needs review » Needs work

The last submitted patch, 34: 2851111-34.patch, failed testing.

tstoeckler’s picture

Status: Needs work » Needs review
tstoeckler’s picture

Status: Needs review » Needs work
+++ b/core/includes/batch.inc
@@ -509,7 +551,7 @@ function _batch_finished() {
+  if ($batch = batch_get() && _batch_needs_update()) {

I just realized this does not do what I wanted it to do: It sets $batch to a boolean instead of fetching the batch and checking the function, due to missing dependenciesparentheses.

Not sure, but I think this cannot actually be tested due to the nature of the Batch API and the necessity to trigger a fatal.

Edit: Brain fail -> wrong word

dagmar’s picture

Status: Needs work » Needs review
FileSize
3.97 KB
466 bytes

Not sure, but I think this cannot actually be tested due to the nature of the Batch API and the necessity to trigger a fatal.

If this help, I have never been able to run an upgrade test in my current environment without this patch, this for sure make the batch system work at least with php-fpm.

Updated the patch to not make $batch a Boolean.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Nebel54’s picture

Status: Needs review » Reviewed & tested by the community

We had problems with race conditions when large views objects get serialized into the batch table. We started to write a patch which looked pretty much the same as #38. We are using #38 now and it works perfectly.

I also fired manually a fatal error in my batch. The update function in _batch_shutdown() gets fired as it is supposed to.

catch’s picture

Issue summary: View changes
catch’s picture

I've updated the issue summary to include the solution in the patch.

I can't think of a good way to test this patch as a whole, so happy with the RTBC here, sent the patch for a re-test against 8.5.x though.

  • catch committed 69bf50f on 8.5.x
    Issue #2851111 by tedbow, tstoeckler, Munavijayalakshmi, dagmar,...
catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed/pushed to 8.5.x and cherry-picked to 8.4.x. Thanks!

  • catch committed d00c20f on 8.4.x
    Issue #2851111 by tedbow, tstoeckler, Munavijayalakshmi, dagmar,...
catch’s picture

Issue tags: -Needs change record

No change record needed because there's no API change with the approach we went with.

prestonso’s picture

Issue tags: +8.4.0 release notes

Tagging for inclusion in 8.4.0-beta1 release notes.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Gábor Hojtsy’s picture

Can someone help with a two line summary for #2895685: Create release notes (a.k.a CHANGELOG) for Drupal 8.4.0-alpha1 and Drupal 8.4.0? After staring at this issue for quite a while, I would not say I can intelligently summarize the problem/solution, sorry :/

Nebel54’s picture

I propose:
[Batch processes are broken due to a race condition when using fastcgi.](https://www.drupal.org/node/2851111)

Gábor Hojtsy’s picture

Well, they are not anymore ;) We are supposed to list what were problems before and how we solved them. See other entries in #2895685: Create release notes (a.k.a CHANGELOG) for Drupal 8.4.0-alpha1 and Drupal 8.4.0. I managed to gather that there was something broken with fastcgi and batches. The idea with the release notes is to give some useful summary instead of just an index of issues.

Nebel54’s picture

Ok, then how about that?

A race condition occured in the Batch API when using fastcgi. The Batch API now ensures that (the current batch state is written completely to the database before starting the next batch)[https://www.drupal.org/node/2851111].

Gábor Hojtsy’s picture

Superb, exactly what we needed, thanks!