Problem/Motivation

This is a public follow-up for https://security.drupal.org/node/180370 following the 2 week waiting period for both maintainers and security team members to change the issue status away from "Needs public followup".

Acquia Search has a potential to become a contributor to temporary denial of service conditions on cache clear due to excessive locking, which can lead to exhaustion of PHP processes. On cold caches, multiple blocking HTTP requests are made from the Drupal application to Acquia hosted services, each of which have nondeterministic timing in their responses. Acquia Search in particular has the potential to lock a PHP process for up to 90 seconds.

In \Drupal\acquia_search\AcquiaSearchApiClient::getSearchIndexes we see the following code:

    $lock = 'acquia_search_get_search_indexes';
    if (!$this->lock->acquire($lock)) {
      $this->lock->wait($lock);
      // Throw an exception after X amount of seconds.
      if (!$this->lock->acquire($lock)) {
        throw new \Exception("Couldn't acquire lock for $lock in less than 30 seconds.");
      }
    }

Picking this apart, the first call to $this->lock->acquire($lock)) has the potential to block the process for up to 30 seconds. Next, $this->lock->wait($lock) has the potential to block the process for an additional 30 seconds. Finally, the second call to $this->lock->acquire($lock) can block for another 30 seconds. In total, it seems that this code has the latent potential to halt a PHP process for up to 90 seconds.

We have seen this exception thrown recently on deployments right after drush cr steps.

Steps to reproduce

With a large site under heavy load, flush caches. Sometimes, witness that PHP processes are exhausted and that Couldn't acquire lock for acquia_search_get_search_indexes in less than 30 seconds. exceptions appear in the error logs.

Proposed resolution

Greatly reduce lock timing from the default 30 seconds to something more sane. What exactly is the API performance timing of the Acquia services that the HTTP calls are made to? Responses in 1 second? 3 seconds? Surely 30 second responses aren't to be expected from the Acquia services.

- or -

Only lock on one process and on subsequent requests, return immediately, perhaps making a call to \Drupal::messenger() with a user-friendly error message. I.e. "Acquia Search is not available right now, please try again in a few seconds."

Remaining tasks

Discuss, open MR, commit, profit.

User interface changes

Perhaps a new end-user facing error message.

API changes

TBD

Data model changes

None

CommentFileSizeAuthor
#5 acquia_serach-3466306.patch1002 byteschandu7929
Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

Luke.Leber created an issue. See original summary.

rajeshreeputra made their first commit to this issue’s fork.

chandu7929 made their first commit to this issue’s fork.

chandu7929’s picture

StatusFileSize
new1002 bytes
chandu7929’s picture

Status: Active » Needs review

I have verified the attached patch which fixes the issue. Needs review.

"patches": {
            "drupal/acquia_search": {
                "Acquia Search can lock threads excessively leading to denial of service on cache clear": "https://www.drupal.org/files/issues/2024-09-10/acquia_serach-3466306.patch"
            }
        }
luke.leber’s picture

Status: Needs review » Needs work

I'm not sure that any refactoring effort that still utilizes locks to make multiple request processes wait is a real solution to this problem.

There are two things that can potentially help:

1. Greatly reduce lock time (a naive approach here could simply cause WSOD conditions for website visitors, so this probably isn't a viable solution).

2. Gracefully handle scenarios where there is *a single* process waiting for a response from the Acquia infrastructure and degrade the search experience until that process succeeds.

It's unclear how Acquia manages community contributions to their modules, but these are my recommendations FWIW.

Setting back to NW. If that's not welcomed in this context, let me know and I won't interfere further.

Thanks.

luke.leber’s picture

See #8 for an example of what's meant by the comments in #7. For clarity, here's the updated approach outside of a diff.

    $lock = 'acquia_search_get_search_indexes';

    // Another process has this lock. Tying up this process is wasteful.
    if (!$this->lock->lockMayBeAvailable($lock)) {
      return FALSE;
    }

    // Attempt to acquire the lock for 1 second. This should be more than
    // enough time, because if it takes longer than 1 second, it's likely
    // that another process has already acquired it since we checked if it
    // was available a few nanoseconds ago.  Safe concurrency gets weird!
    if (!$this->lock->acquire($lock, 1)) {
      return FALSE;
    }

    // The current process owns the lock now, so make the blocking calls.
    $indexes = $this->searchRequest($path, $query_string);

    // Be sure to release the lock when we're finished.
    $this->lock->release($lock);

    if (empty($indexes) && !is_array($indexes)) {
      // When API is not reachable, cache it for 1 minute.
      $this->cache->set($cid, FALSE, $now + 60, ['acquia_search_indexes']);
      return FALSE;
    }

Out of scope nit - might be better to use $lock_name = 'acquia_search_get_search_indexes' for clarity here.

Can lock timeouts be less than 1 second? If so, might consider using 0.1 seconds here. Timing on database locking -- which is probably a worst case scenario, is dependent on a VERY simple update query that looks like... UPDATE semaphore SET expire = 'some timestamp' WHERE name = 'whatever_the_lock_name_is' AND value = 'whatever_the_lock_id_is'; or a roughly equivalent INSERT query. If this takes > 0.1 seconds to execute, there are much bigger issues in the drupal site!

There seems to be another bug in the cacheability of the "unreachable API" condition... $this->cache->set($cid, FALSE, $now + 60, ['acquia_search_indexes']);. If one looks at where this cache item is checked, further up in ::getSearchIndexes, one will note that a non-strict comparison is used (if ($cache = $this->cache->get($cid)) { ... }) , so therefore setting FALSE in the cache doesn't really do anything. To make this work, a strict comparison should be used at the top of the method, like this: if (($cache = $this->cache->get($cid)) !== NULL) { ... }

luke.leber’s picture

You're 100% correct in that the critical section needs to be wrapped in a try block for safety here, @benjifisher. Thanks for catching that!

luke.leber’s picture

Our TAM suggested that the engineers on this issue might want to connect in order to iron out easier steps to reproduce the issue, so that a potential fix can be confirmed.

In order to test this issue without needing an Acquia account at all (meaning any Drupal 10 / 11 vanilla installation can work for this!), make sure the following preconditions are met:

  1. You will need four or five different authenticated users in incognito tabs (attempting to test this with a single user will be thwarted by PHP session locking!). Incognito tabs can work for this.
  2. A patch to mock a slow response from Acquia's search service (follows).
    diff --git a/src/AcquiaSearchApiClient.php b/src/AcquiaSearchApiClient.php
    index 4bb268e..bc34604 100644
    --- a/src/AcquiaSearchApiClient.php
    +++ b/src/AcquiaSearchApiClient.php
    @@ -211,6 +211,19 @@ class AcquiaSearchApiClient {
        *   Response array or FALSE.
        */
       private function searchRequest(string $path, string $query_string) {
    +
    +    // !!! DO NOT COMMIT THIS !!!
    +    // Simulate a slow response from Acquia's search service.
    +    //
    +    // This is an exaggerated time, but makes it easier to see the underlying
    +    // problem.  This simulates a blocking HTTP call that is hanging.
    +    sleep(30);
    +
    +    // We don't actually want to call the real search service, since the goal
    +    // of what we're doing is to simulate a worst case scenario without any
    +    // real world conditions to contend with.
    +    return FALSE;
    +
         $subscription_data = $this->subscription->getSubscription();
         // Return no results if there is no subscription data.
         if (!$subscription_data || !$this->subscription->isActive()) {
        

Once signed into all of the various user accounts and the simulated enabling criteria patch is in place...

  1. Flush the drupal cache
  2. Visit /admin/config/search/search-api/server/acquia_search_server on one account. This will be the process that obtains the lock. It will hang for 30 seconds before releasing its lock and failing gracefully.
  3. At the same time that the first request is executing, fire off an additional request from each different incognito window to /admin/config/search/search-api/server/acquia_search_server. This simulates a backlog of requests that would ordinarily be generated by users on a production site.
  4. Note that *all* of the processes will be blocked until either they either...

    1. Obtain the lock, spins for an additional 30 seconds, and fails gracefully
    2. - or - fails to acquire the lock and fails with an exception

    In either scenario, the processes will just hang for excessive amounts of time. On a real production site, there are a finite number of php processes available. Eventually resources will be exhausted and end-users will be affected.

If more information, or a live demonstration is desired, please reach out to me and we can set up a call.

Thanks.

japerry made their first commit to this issue’s fork.

japerry’s picture

Status: Needs work » Needs review

I also saw the logic issue with the FALSE cache set and retrieving the cache. I also noticed when using the snippet above (returning false on the search request) that the 'getAvailableCores' method gets called multiple times, meaning the API query to get search indexes was getting called multiple times. (which is probably where the lock issue is coming from)

I've made a few changes to PR41, but would love a review before it gets committed!

luke.leber’s picture

Hey, Jakob.

I'll take a look and see if we can apply the patch for tomorrow's deployment and gather some feedback from the ones generally affected by slowdowns on deployments.

I'm very confident that this will make a world of difference for our staff especially.

Thanks!

luke.leber’s picture

Status: Needs review » Needs work
I'm pleased to report having patched !41 into prod and did not receive any locking exceptions on deploy this morning.

Setting back to NW for test fails, but !41 seems to do the trick, functionally.

japerry changed the visibility of the branch 3466306-couldnt-acquire-lock-fix to hidden.

luke.leber’s picture

Status: Needs work » Reviewed & tested by the community

That's a clever use of drupal_static.

Everything looks good to me here. Our team reviewed the functional bits yesterday fairly comprehensively in a code review.

Thanks!

  • japerry committed d0fbe51e on 3.1.x
    Issue #3466306 by luke.leber, japerry: Acquia Search can lock threads...
japerry’s picture

Version: 3.2.x-dev » 3.1.x-dev
Status: Reviewed & tested by the community » Fixed

Perfect! yah tests look good (both manual and auto) Committed.

japerry’s picture

Note, I'll have to make some changes to get a new fix for 3.2... which should be out as a full release soon.

luke.leber’s picture

Issue credit should be given to Benji Fisher too, as he performed a critical review.

  • japerry committed a9b4d525 on 3.2.x
    Issue #3466306 by luke.leber, japerry: Acquia Search can lock threads...

japerry’s picture

Oh he didn't show up on the issue queue but I see it in the tracker. Done! Also, @luke.leber, if you get a chance, can you test against Acquia Search 3.2.0-rc1? Automated tests are passing, but we don't have great insight into the various configurations customers have. Once thats done, we can make 3.2.0 and not have to maintain two branches.

luke.leber’s picture

I can ask our search specialist to take a look at the changes and give things a test drive. We have quite a bit of other stuff going on, but there may be some extra hours.

Can we get a "cut a 3.2.0 issue" spun up so he can report the feedback on it?

Thanks

japerry’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.