On a large site the tokens table can grow considerably; I suggest purging tokens that haven't been seen for a while (one week?).
Maybe make this conditional on the spam probability, i.e. keep rarely used spam tokens but discard rarely used non-spam ones.
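A minimal sketch of the purge suggested above, assuming each token row records a last-seen timestamp and a spam probability (the table and column names here are hypothetical, not the module's actual schema):

```python
import sqlite3
import time

ONE_WEEK = 7 * 24 * 60 * 60

def purge_stale_tokens(conn, now, max_age=ONE_WEEK, spam_threshold=0.8):
    """Delete tokens not seen within max_age, but keep likely-spam
    tokens (probability >= spam_threshold) regardless of age."""
    cutoff = now - max_age
    cur = conn.execute(
        "DELETE FROM tokens WHERE last_seen < ? AND probability < ?",
        (cutoff, spam_threshold),
    )
    return cur.rowcount  # number of tokens purged

# Demo with an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (token TEXT, probability REAL, last_seen INTEGER)")
now = int(time.time())
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [
        ("viagra", 0.95, now - 2 * ONE_WEEK),  # stale but spammy: kept
        ("hello",  0.10, now - 2 * ONE_WEEK),  # stale and non-spam: purged
        ("drupal", 0.05, now - 1000),          # recently seen: kept
    ],
)
purged = purge_stale_tokens(conn, now)
print(purged)  # → 1
```

Whether keeping stale spam tokens is worth the asymmetry is exactly the question raised in the comments below.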
Comments
Comment #1
Jeremy commented: Throwing away only non-spam or only spam tokens will affect the weighting, and is not recommended. Throwing away useful data points in general seems unwise -- logic could be added to note the frequency of tokens, not just the number of tokens in each posting. Are you experiencing an issue where the tokens table is becoming a bottleneck? Perhaps the bottleneck should instead be reported as a bug, which we can track down, optimize, and fix.
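To illustrate the distinction Jeremy draws between frequency and mere presence of tokens (a sketch of the idea only, not the module's actual tokenizer):

```python
from collections import Counter

def token_presence(post):
    """Record each token once per post (presence only)."""
    return set(post.lower().split())

def token_frequency(post):
    """Record how often each token appears in the post."""
    return Counter(post.lower().split())

post = "buy now buy now limited offer"
print(sorted(token_presence(post)))  # → ['buy', 'limited', 'now', 'offer']
print(token_frequency(post)["buy"])  # → 2
```

A repeated token like "buy" carries more signal than a single occurrence, which presence-only counting discards.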
Comment #2
killes@www.drop.org commented: No, I am not yet experiencing a problem. However, I recall how big the search module's tables got for a site the size of d.o (the d.o of 2-3 years ago). Based on this I am expecting similar problems and want to address them early on.
Comment #3
killes@www.drop.org commented: By the way, we already record the last time a token was "seen", so I assumed we could use that to discard tokens not seen for a while.
Comment #4
Jeremy commented: Search builds complex queries, whereas the token lookups required by the Bayesian filter are really simple and fully indexed. While the table will indeed grow large, I do not expect it to have the same scalability problems as the search module.
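A small sketch of why these lookups scale well (again with a hypothetical table layout): fetching a token's probability is a single equality match that the database can answer straight from an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (token TEXT, probability REAL, last_seen INTEGER)")
conn.execute("CREATE UNIQUE INDEX tokens_token ON tokens (token)")
conn.execute("INSERT INTO tokens VALUES ('drupal', 0.05, 0)")

# The per-token lookup is a simple indexed equality match, so the query
# planner does an index search rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT probability FROM tokens WHERE token = ?",
    ("drupal",),
).fetchone()
print(plan[-1])  # e.g. "SEARCH tokens USING INDEX tokens_token (token=?)"
```

This is in contrast to search, which joins and ranks across several tables per query.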
If you decide to go ahead and implement this anyway, please make it configurable so it can be disabled altogether.
Comment #5
killes@www.drop.org commented: Let's postpone this and see whether it turns out to be an issue at all.