Problem/Motivation
Acquia Search has a potential to become a contributor to temporary denial of service conditions on cache clear due to excessive locking, which can lead to exhaustion of PHP processes. On cold caches, multiple blocking HTTP requests are made from the Drupal application to Acquia hosted services, each of which have nondeterministic timing in their responses. Acquia Search in particular has the potential to lock a PHP process for up to 90 seconds.
In \Drupal\acquia_search\AcquiaSearchApiClient::getSearchIndexes we see the following code:
$lock = 'acquia_search_get_search_indexes';
if (!$this->lock->acquire($lock)) {
$this->lock->wait($lock);
// Throw an exception after X amount of seconds.
if (!$this->lock->acquire($lock)) {
throw new \Exception("Couldn't acquire lock for $lock in less than 30 seconds.");
}
}
Picking this apart, the first call to $this->lock->acquire($lock)) has the potential to block the process for up to 30 seconds. Next, $this->lock->wait($lock) has the potential to block the process for an additional 30 seconds. Finally, the second call to $this->lock->acquire($lock) can block for another 30 seconds. In total, it seems that this code has the latent potential to halt a PHP process for up to 90 seconds.
We have seen this exception thrown recently on deployments right after drush cr steps.
Steps to reproduce
With a large site under heavy load, flush caches. Sometimes, witness that PHP processes are exhausted and that Couldn't acquire lock for acquia_search_get_search_indexes in less than 30 seconds. exceptions appear in the error logs.
Proposed resolution
Greatly reduce lock timing from the default 30 seconds to something more sane. What exactly is the API performance timing of the Acquia services that the HTTP calls are made to? Responses in 1 second? 3 seconds? Surely 30 second responses aren't to be expected from the Acquia services.
- or -
Only lock on one process and on subsequent requests, return immediately, perhaps making a call to \Drupal::messenger() with a user-friendly error message. I.e. "Acquia Search is not available right now, please try again in a few seconds."
Remaining tasks
Discuss, open MR, commit, profit.
User interface changes
Perhaps a new end-user facing error message.
API changes
TBD
Data model changes
None
| Comment | File | Size | Author |
|---|---|---|---|
| #5 | acquia_serach-3466306.patch | 1002 bytes | chandu7929 |
Issue fork acquia_search-3466306
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #5
chandu7929 commentedComment #6
chandu7929 commentedI have verified the attached patch which fixes the issue. Needs review.
Comment #7
luke.leberI'm not sure that any refactoring effort that still utilizes locks to make multiple request processes wait is a real solution to this problem.
There are two things that can potentially help:
1. Greatly reduce lock time (a naive approach here could simply cause WSOD conditions for website visitors, so this probably isn't a viable solution).
2. Gracefully handle scenarios where there is *a single* process waiting for a response from the Acquia infrastructure and degrade the search experience until that process succeeds.
It's unclear how Acquia manages community contributions to their modules, but these are my recommendations FWIW.
Setting back to NW. If that's not welcomed in this context, let me know and I won't interfere further.
Thanks.
Comment #9
luke.leberSee #8 for an example of what's meant by the comments in #7. For clarity, here's the updated approach outside of a diff.
Out of scope nit - might be better to use
$lock_name = 'acquia_search_get_search_indexes'for clarity here.Can lock timeouts be less than 1 second? If so, might consider using 0.1 seconds here. Timing on database locking -- which is probably a worst case scenario, is dependent on a VERY simple update query that looks like...
UPDATE semaphore SET expire = 'some timestamp' WHERE name = 'whatever_the_lock_name_is' AND value = 'whatever_the_lock_id_is';or a roughly equivalent INSERT query. If this takes > 0.1 seconds to execute, there are much bigger issues in the drupal site!There seems to be another bug in the cacheability of the "unreachable API" condition...
$this->cache->set($cid, FALSE, $now + 60, ['acquia_search_indexes']);. If one looks at where this cache item is checked, further up in ::getSearchIndexes, one will note that a non-strict comparison is used (if ($cache = $this->cache->get($cid)) { ... }) , so therefore setting FALSE in the cache doesn't really do anything. To make this work, a strict comparison should be used at the top of the method, like this:if (($cache = $this->cache->get($cid)) !== NULL) { ... }Comment #10
luke.leberYou're 100% correct in that the critical section needs to be wrapped in a
tryblock for safety here, @benjifisher. Thanks for catching that!Comment #11
luke.leberOur TAM suggested that the engineers on this issue might want to connect in order to iron out easier steps to reproduce the issue, so that a potential fix can be confirmed.
In order to test this issue without needing an Acquia account at all (meaning any Drupal 10 / 11 vanilla installation can work for this!), make sure the following preconditions are met:
Once signed into all of the various user accounts and the simulated enabling criteria patch is in place...
Note that *all* of the processes will be blocked until either they either...
In either scenario, the processes will just hang for excessive amounts of time. On a real production site, there are a finite number of php processes available. Eventually resources will be exhausted and end-users will be affected.
If more information, or a live demonstration is desired, please reach out to me and we can set up a call.
Thanks.
Comment #13
japerryI also saw the logic issue with the FALSE cache set and retrieving the cache. I also noticed when using the snippet above (returning false on the search request) that the 'getAvailableCores' method gets called multiple times, meaning the API query to get search indexes was getting called multiple times. (which is probably where the lock issue is coming from)
I've made a few changes to PR41, but would love a review before it gets committed!
Comment #14
luke.leberHey, Jakob.
I'll take a look and see if we can apply the patch for tomorrow's deployment and gather some feedback from the ones generally affected by slowdowns on deployments.
I'm very confident that this will make a world of difference for our staff especially.
Thanks!
Comment #15
luke.leberSetting back to NW for test fails, but !41 seems to do the trick, functionally.
Comment #17
luke.leberThat's a clever use of drupal_static.
Everything looks good to me here. Our team reviewed the functional bits yesterday fairly comprehensively in a code review.
Thanks!
Comment #19
japerryPerfect! yah tests look good (both manual and auto) Committed.
Comment #20
japerryNote, I'll have to make some changes to get a new fix for 3.2... which should be out as a full release soon.
Comment #21
luke.leberIssue credit should be given to Benji Fisher too, as he performed a critical review.
Comment #24
japerryOh he didn't show up on the issue queue but I see it in the tracker. Done! Also, @luke.leber, if you get a chance, can you test against Acquia Search 3.2.0-rc1? Automated tests are passing, but we don't have great insight into the various configurations customers have. Once thats done, we can make 3.2.0 and not have to maintain two branches.
Comment #25
luke.leberI can ask our search specialist to take a look at the changes and give things a test drive. We have quite a bit of other stuff going on, but there may be some extra hours.
Can we get a "cut a 3.2.0 issue" spun up so he can report the feedback on it?
Thanks
Comment #26
japerryDone! #3476239: Acquia Search 3.2.0 release