Currently, the letsencrypt queue injects verify tasks for all sites with le certs every week. This is causing issues when there are hundreds of such sites -- it creates a huge task backlog.

Instead, the code should be a bit more nuanced about the verify scheduling, and spread things out as much as possible.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

bdragon created an issue. See original summary.

bdragon’s picture

Status: Active » Needs review
FileSize
2.59 KB

Here's an initial go at a patch.

I'm not completely happy with it, and it has not been tested, but would appreciate feedback.

bdragon’s picture

So the theory here is as follows:

* Sites that have been verified within the last month or so shouldn't need to be reverified.
* Sites that have gone the longest without being verified should be verified first.
* We should be using the spread queue to spread the verifies out as much as possible.

Unknowns

* Not sure how the math works out with the spread queue -- maybe I should be leaving the count at all sites instead of eligible sites? -- should be fine as long as verifys actually happen
* Maybe be a bit more conservative on the math for eligible items and consider them ready for verify a few days LESS than the 7*4? Haven't thought out a scheduling simulation for different numbers of sites. Might need to have it set to (renewal window) - (schedule period)? Not sure about the math.

helmo’s picture

Looks like a good idea.
The spread queue was a fairly recent addition in #1405904: Hosting cron # of sites per run mismatch between UI and reality

It does however give me a 'PDOException: SQLSTATE[HY093]: Invalid parameter number: number of bound variables does not match number of tokens' though.

dpovshed’s picture

Yep, I confirm - the same error

PDOException: SQLSTATE[HY093]: Invalid parameter number: number of bound variables does not match number of tokens: SELECT s.nid FROM[error]
{hosting_site} s, {hosting_https_site} h WHERE s.nid = h.nid AND h.https_enabled IN(:https_0, :https_1) AND s.status = :enabled AND
s.verified  1
    [:time] => 1527857419
    [:count] => 0
    [:https_0] => 1
    [:https_1] => 2
)
 in hosting_letsencrypt_get_sites() (line 93 of
.../sites/all/modules/hosting_https/submodules/letsencrypt/hosting_letsencrypt.module)
dpovshed’s picture

Updated patch attached

dpovshed’s picture

dpovshed’s picture

FileSize
2.82 KB
2.12 KB

This one is superseding my previous patches.

In addition here SQL request is updated - JOIN was used and the query itself converted to standard Drupal format.

This one seems finally working, let me see how it will proceed large number of sites.

helmo’s picture

Status: Needs review » Reviewed & tested by the community

Now on my dev ... seems to work.

I've committed it on the 2910575-verify-spread branch.

The $queue_only parameter could have a name that's more descriptive, but I've not sure what ... I've added some phpdoc comments to at least explain it a bit.

  • helmo committed 2c73785 on 2981309-verify-spread
    Issue #2981309 by helmo: Add docbook comments
    
  • helmo committed c4e558f on 2981309-verify-spread authored by dpovshed
    Issue #2981309 by dpovshed, bdragon: Letsencrypt queue not scalable
    
helmo’s picture

PS(note to self): An easy way to debug this function is drush @hm php-eval "print_r(hosting_letsencrypt_get_sites(true, 2));"

  • helmo committed 2c73785 on 7.x-3.x, 2981309-verify-spread
    Issue #2981309 by helmo: Add docbook comments
    
  • helmo committed c4e558f on 7.x-3.x, 2981309-verify-spread authored by dpovshed
    Issue #2981309 by dpovshed, bdragon: Letsencrypt queue not scalable
    
helmo’s picture

Status: Reviewed & tested by the community » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

ac’s picture

Would this change effectively show no sites in the queue if all sites had been verified recently?

Previously I always had 50+ sites in the queue at any given time but lately I noticed there are none. Running drush @hm php-eval "print_r(hosting_letsencrypt_get_sites(true, 2));" returns an empty array.

helmo’s picture

Status: Closed (fixed) » Needs work

@ac yes I think that was the idea here.

However I came to this issue because I noticed a site where the cert is about to expire in 14 days. The last verify task was 16 days ago so 12 days before the site is scheduled again in the renewal queue.

That is cutting it very close :( ... I'd like to avoid getting nervous from monitoring warnings. What about changing the 4 weeks to 2.

colan’s picture

When I originally wrote this stuff, it was tricky staying between LE's rate limits and not getting sites updated in time. So any suggestions to optimize this would be welcome!

helmo’s picture

Status: Needs work » Needs review
FileSize
640 bytes

Here's the patch for that

  • helmo committed 33fcb65 on 7.x-3.x
    Issue #2981309 by dpovshed, helmo, bdragon, colan, ac: Reduce 4 to 2...
ergonlogic’s picture

As Colan mentioned, when we originally wrote this, we took some shortcuts. Scheduling verify tasks without checking the certs' expiry was expedient, but perhaps a little heavy-handed. We could have the queue task check for imminent expiry (within a week, for example) and only queue the 'verify' task at that point.

colan’s picture

Status: Needs review » Active

+1

This would:

  • Prevent unnecessary verify tasks.
  • Reduce the probability of hitting LE's rate limits.
  • Keep the queue small so that many sites can successfully be updated per cycle.

Setting back to Active as I'm now convinced this is what we should do (and we don't have a patch for it yet).

dpovshed’s picture

While the attached patch is not a rework with more modest usage of LetsEncrypt I decided to share it in case someone else with large number of files finds it useful.

The modification does two things:

- prevent the queue to be stuck by firing the same task(s) again and again - if the verification already scheduled it will not be scheduled for 2nd time;
- in case of verification has failed it will not be repeated for one week. This prevents misconfigured item to block the queue.