On a large site the tokens table can grow considerably. I suggest purging tokens that haven't been seen for a while (one week, perhaps?).

Maybe make this conditional on the spam probability, i.e. keep rarely used spam tokens but discard rarely used non-spam ones.
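To make the idea concrete, here is a rough sketch of what such a purge could look like as a cron task. The table and column names (spam_tokens, probability, last_seen), the function name, and the variable name are assumptions for illustration, not the module's actual schema or API:

<?php
/**
 * Hypothetical cron task that purges stale tokens. Schema and setting
 * names are assumptions; variable_get(), db_query() and time() are the
 * only real APIs used here.
 */
function spam_purge_stale_tokens() {
  // Tokens unseen for this many seconds become purge candidates
  // (default: one week). A value of 0 disables purging entirely.
  $max_age = variable_get('spam_token_max_age', 7 * 24 * 60 * 60);
  if ($max_age == 0) {
    return;
  }
  $cutoff = time() - $max_age;
  // Only drop tokens that lean toward non-spam; rarely seen spam tokens
  // are kept so the filter does not forget known spam markers.
  db_query("DELETE FROM {spam_tokens} WHERE last_seen < %d AND probability < 0.5", $cutoff);
}
?>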

Comments

Jeremy’s picture

Throwing away only non-spam or only spam tokens will skew the weighting, and is not recommended. Throwing away useful data points in general seems unwise -- logic could instead be added to track the frequency of tokens, not just the number of tokens in each posting. Are you actually experiencing an issue where the tokens table is becoming a bottleneck? Perhaps that bottleneck should be reported as a bug instead, which we can then track down, optimize, and fix.
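To illustrate why one-sided pruning skews the weighting, consider a simplified per-token estimate (the module's actual formula may differ; this is only an illustration):

<?php
// Simplified per-token spam probability: spam sightings divided by
// total sightings. Not necessarily the module's exact weighting.
function token_spam_probability($spam_count, $ham_count) {
  $total = $spam_count + $ham_count;
  return $total > 0 ? $spam_count / $total : 0.5;
}

// A rarely seen token: 2 spam sightings, 8 non-spam sightings.
print token_spam_probability(2, 8); // 0.2 -- leans non-spam.
// Purge the non-spam sightings but keep the spam ones, and the same
// token suddenly looks like a strong spam indicator.
print token_spam_probability(2, 0); // 1.0
?>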

killes@www.drop.org’s picture

No, I am not yet experiencing a problem. However, I recall how big the tables for the search module got for a site the size of d.o (the d.o of 2-3 years ago). Based on that, I expect similar problems here and want to address them early on.

killes@www.drop.org’s picture

By the way, we already record the last time a token was "seen", so I assumed we could use that to discard tokens that haven't been seen for a while.

Jeremy’s picture

Search builds complex queries, whereas the token lookups required by the Bayesian filter are very simple and fully indexed. While the table will indeed grow large, I do not expect it to have the same scalability problems as the search module.
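For comparison, a token lookup is essentially a single equality match that can be served from an index on the token column (again using the hypothetical spam_tokens table and column names from above):

<?php
// One indexed equality lookup per token -- no joins or cross-table
// scoring the way search queries need. Names are assumptions.
$result = db_query("SELECT probability FROM {spam_tokens} WHERE token = '%s'", $token);
$probability = db_result($result);
?>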

If you decide to go ahead and implement this anyway, please make it configurable so it can be disabled altogether.

killes@www.drop.org’s picture

Status: Active » Postponed

Let's postpone this and see if it turns out to be an issue at all.