Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Currently, the letsencrypt queue injects verify tasks for all sites with le certs every week. This is causing issues when there are hundreds of such sites -- it creates a huge task backlog.
Instead, the code should be a bit more nuanced about the verify scheduling, and spread things out as much as possible.
Comment | File | Size | Author |
---|---|---|---|
#22 | 2981309-22-improve-stability.patch | 2.12 KB | dpovshed |
#18 | 2981309-hosting_https-two_weeks.patch | 640 bytes | helmo |
#8 | interdiff_0_8.txt | 2.12 KB | dpovshed |
#8 | 2981309-hosting_https-queue-rework_8.patch | 2.82 KB | dpovshed |
#6 | interdiff_0_6.txt | 1008 bytes | dpovshed |
Comments
Comment #2
bdragon CreditAttribution: bdragon at Tag1 Consulting for Advisor Websites commentedHere's an initial go at a patch.
I'm not completely happy with it, and it has not been tested, but would appreciate feedback.
Comment #3
bdragon CreditAttribution: bdragon at Tag1 Consulting for Advisor Websites commentedSo the theory here is as follows:
* Sites that have been verified within the last month or so shouldn't need to be reverified.
* Sites that have gone the longest without being verified should be verified first.
* We should be using the spread queue to spread the verifies out as much as possible.
Unknowns
* Not sure how the math works out with the spread queue -- maybe I should be leaving the count at all sites instead of eligible sites? -- should be fine as long as verifys actually happen
* Maybe be a bit more conservative on the math for eligible items and consider them ready for verify a few days LESS than the 7*4? Haven't thought out a scheduling simulation for different numbers of sites. Might need to have it set to (renewal window) - (schedule period)? Not sure about the math.
Comment #4
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commentedLooks like a good idea.
The spread queue was a fairly recent addition in #1405904: Hosting cron # of sites per run mismatch between UI and reality
It does however give me a 'PDOException: SQLSTATE[HY093]: Invalid parameter number: number of bound variables does not match number of tokens' though.
Comment #5
dpovshed CreditAttribution: dpovshed as a volunteer and at Drupal Ukraine Community commentedYep, I confirm - the same error
Comment #6
dpovshed CreditAttribution: dpovshed as a volunteer and at Drupal Ukraine Community commentedUpdated patch attached
Comment #7
dpovshed CreditAttribution: dpovshed commented...
Comment #8
dpovshed CreditAttribution: dpovshed commentedThis one is superseding my previous patches.
In addition here SQL request is updated - JOIN was used and the query itself converted to standard Drupal format.
This one seems finally working, let me see how it will proceed large number of sites.
Comment #9
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commentedNow on my dev ... seems to work.
I've committed it on the 2910575-verify-spread branch.
The $queue_only parameter could have a name that's more descriptive, but I've not sure what ... I've added some phpdoc comments to at least explain it a bit.
Comment #11
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commentedPS(note to self): An easy way to debug this function is
drush @hm php-eval "print_r(hosting_letsencrypt_get_sites(true, 2));"
Comment #13
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commentedComment #15
acWould this change effectively show no sites in the queue if all sites had been verified recently?
Previously I always had 50+ sites in the queue at any given time but lately I noticed there are none. Running drush @hm php-eval "print_r(hosting_letsencrypt_get_sites(true, 2));" returns an empty array.
Comment #16
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commented@ac yes I think that was the idea here.
However I came to this issue because I noticed a site where the cert is about to expire in 14 days. The last verify task was 16 days ago so 12 days before the site is scheduled again in the renewal queue.
That is cutting it very close :( ... I'd like to avoid getting nervous from monitoring warnings. What about changing the 4 weeks to 2.
Comment #17
colanWhen I originally wrote this stuff, it was tricky staying between LE's rate limits and not getting sites updated in time. So any suggestions to optimize this would be welcome!
Comment #18
helmo CreditAttribution: helmo as a volunteer and at Initfour websolutions for Aegir Cooperative commentedHere's the patch for that
Comment #20
ergonlogicAs Colan mentioned, when we originally wrote this, we took some shortcuts. Scheduling verify tasks without checking the certs' expiry was expedient, but perhaps a little heavy-handed. We could have the queue task check for imminent expiry (within a week, for example) and only queue the 'verify' task at that point.
Comment #21
colan+1
This would:
Setting back to Active as I'm now convinced this is what we should do (and we don't have a patch for it yet).
Comment #22
dpovshed CreditAttribution: dpovshed commentedWhile the attached patch is not a rework with more modest usage of LetsEncrypt I decided to share it in case someone else with large number of files finds it useful.
The modification does two things:
- prevent the queue to be stuck by firing the same task(s) again and again - if the verification already scheduled it will not be scheduled for 2nd time;
- in case of verification has failed it will not be repeated for one week. This prevents misconfigured item to block the queue.