When freenode splits and rejoins, it takes Druplicon quite a while to work its way through all the joins/parts and start catching back up with factoids.

Comments

Morbus Iff’s picture

Title: Druplicon Lags on net split » Remove rate limits after full connection restored
Category: bug » feature

This is normal and by design. Because Druplicon is in 75+ channels with thousands of users, the bot has to rate limit itself after a net split; otherwise it would never be able to rejoin (Freenode would constantly deny it for operating too quickly). For each denial, the bot slows itself down slightly and tries again, until eventually it can join properly. Ideally, once the bot is fully reconnected, it should then un-rate-limit itself.
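The slow-down-per-denial loop described above amounts to a simple linear backoff that keeps widening the send delay until the server stops pushing back. A minimal sketch, assuming a hypothetical `send` callable (all names and constants here are illustrative, not the bot's actual code):

```python
def settle_rate_limit(send, initial_delay=0.5, step=0.25, max_delay=10.0):
    """Retry `send` with an ever-slower rate until the server accepts it.

    `send` is a hypothetical callable: it takes the current inter-command
    delay in seconds and returns True when the server accepts the command,
    False when it denies us for operating too quickly.
    """
    delay = initial_delay
    while not send(delay):
        # Denied: slow down slightly and try again, per the comment above.
        delay = min(delay + step, max_delay)
    return delay  # the rate the server finally tolerated
```

Note that once this loop settles, `delay` stays at the inflated value; the "un-rate-limit" suggestion is about restoring `initial_delay` afterwards.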

jhr’s picture

Priority: Normal » Minor

Does that throttle also occur when Druplicon doesn't go anywhere? What about throttling if Druplicon is being flooded?

I suspect the lag comes from having to store/check all the names against the seen/tell cache. I think one seen lookup turned up someone who hasn't been in the channel for >1 year. Maybe a schedule/option to clear the cache?

Morbus Iff’s picture

When the bot "starts up", it joins one channel every 15 (30? I can't remember) seconds. After it joins a channel, it looks at all the people currently in that channel and statically caches information about them, for use in various situations (not just Seen, but also for logging purposes in those channels that do log). These lookups take a lot of time and a lot of traffic initially, but are then manually maintained (added, removed, etc.) throughout the course of the bot's operation.
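The staggered startup described above can be sketched as a loop that joins one channel per interval and builds the static user cache as it goes. This is a hedged illustration only; `join`, `scan_users`, and the interval are assumptions standing in for whatever the bot and its IRC library actually expose:

```python
import time

JOIN_INTERVAL = 15  # seconds between joins (15 or 30, per the comment above)

def join_all(channels, join, scan_users, interval=JOIN_INTERVAL):
    """Join channels one at a time, caching each channel's current users.

    `join(channel)` sends the JOIN; `scan_users(channel)` returns the
    user list to statically cache. Both are hypothetical callables.
    """
    cache = {}
    for channel in channels:
        join(channel)                          # one JOIN per interval
        cache[channel] = scan_users(channel)   # static cache of current users
        time.sleep(interval)                   # the startup buffer lost on netsplit
    return cache
```

The `time.sleep(interval)` is exactly the buffer the next comment says is missing during a netsplit reconnect, which is why the rejoin floods the server.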

On a netsplit or a flood or any other sort of "where'd all mah channels go?!", the bot attempts to reconnect to every channel again, as quickly as possible - this is a "feature" of the underlying IRC library being used. In Druplicon's case, with 75+ channels, I think that alone is probably enough to piss Freenode off. However, because it is now reconnecting to all these channels, it again has to rescan the user lists, and it no longer has the 15/30 second buffer between joins (as above, on initial connect). This batch of commands pisses Freenode off, we get flooded out, we rate limit, try again, get flooded out again, and so on, until we reach a rate limit at which Freenode allows us to do everything we need to do. That rate limit then follows the bot around for the rest of its operation (i.e., until it's restarted).

Since the internal static cache for users is only ever based on the connected users (vs. all the database entries that the bot has seen over its entire lifetime), removing stale entries in the bot_seen database table wouldn't have any effect.

The way I would approach a fix is to: a) at startup, store the current rate limit in a variable; b) every time we have to rate limit ourselves, store the new value in another variable along with a timestamp; c) once that timestamp hasn't changed for 5 minutes (i.e., we've found a rate limit that Freenode likes), delete the variable and set the rate limit back to the startup value.
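The a/b/c steps above could be sketched as a small state holder that the bot's send loop would consult. All names here are hypothetical; this is a sketch of the proposal, not the bot's implementation:

```python
import time

class RateLimitReset:
    """Sketch of the a/b/c fix: remember the startup rate, track each
    slow-down with a timestamp, and restore the startup rate once the
    limit has been stable for 5 minutes."""

    STABLE_SECS = 5 * 60  # unchanged for 5 min => netsplit has settled

    def __init__(self, startup_delay):
        self.startup_delay = startup_delay    # (a) store the startup rate limit
        self.current_delay = startup_delay
        self.changed_at = None                # (b) timestamp of the last change

    def slow_down(self, new_delay, now=None):
        """(b) Record a new, slower rate limit and when it took effect."""
        self.current_delay = new_delay
        self.changed_at = time.time() if now is None else now

    def maybe_reset(self, now=None):
        """(c) If the limit hasn't changed for 5 minutes, restore startup rate."""
        now = time.time() if now is None else now
        if self.changed_at is not None and now - self.changed_at >= self.STABLE_SECS:
            self.current_delay = self.startup_delay
            self.changed_at = None
        return self.current_delay
```

The `now` parameter is only there to make the sketch testable without real waiting; a real implementation would just use the clock.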