This is probably only useful with the Bayesian filter, but perhaps it could be made more general.

It's not all that useful to split posts into only spam and ham categories, because it leaves the administrator to review all posts in order to catch the errors. Instead it would be better to add an 'unsure' category in the middle, wide enough that there's little or no need to review the material which lies in either the spam or ham categories. Admin tools should allow for listing only this uncertain material, regardless of whether policy is to publish this or not pending review.

What the threshold scores should be will probably vary somewhat with how much training has been done.

Comments

Jeremy’s picture

Assigned: Unassigned » Jeremy

Agreed, I would like to see a third classification for content we're not sure about.

gnassar’s picture

Doesn't sending spam to the approval queue effectively cover this use case?

ngaur’s picture

No, that doesn't cover it. It still leaves the admin reviewing stuff which is clearly spam. If the admin still has to review everything, what's the point?

I'm coming at this from the point of view of someone who has dealt with tens of thousands of spams in a single day. Requirements will vary.

naught101’s picture

Ngaur, is your idea that "unsure" content items would be published? What about having these three classes:

Not-spam: marked as not spam, published
Unsure: marked as spam/unsure, published anyway
Spam: marked as spam, not published (or deleted, or whatever)

Then in admin/content/spam, there could be a filter to separate Spam and Unsure.

Not entirely sure what you're suggesting to do with "unsure" content items, otherwise...

Jeremy’s picture

Version: 5.x-3.0-beta1 » 6.x-1.x-dev

I like the idea of a three-way split. The idea is that content is filtered into three categories: 1) "this is spam", 2) "this is not spam", and 3) "this may be spam". For example, content with a score greater than 85 may go into group #1, while content with a score less than 60 may go into group #2. Everything else, 60-85, goes into a special greylist queue demanding administrative review...
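To make the example numbers concrete, here is a rough sketch of the split in Python. It is only an illustration: the names `Verdict`, `classify`, `ham_below` and `spam_above` are made up for this sketch and are not part of the module, and the 60/85 defaults are just the example values from this comment.

```python
from enum import Enum


class Verdict(Enum):
    """The three categories discussed in this issue (names are hypothetical)."""
    NOT_SPAM = "not spam"
    UNSURE = "unsure"
    SPAM = "spam"


def classify(score, ham_below=60, spam_above=85):
    """Map a spam score (assumed to be 0-100) onto one of three categories.

    Scores strictly above `spam_above` are spam, scores strictly below
    `ham_below` are not spam, and everything in between (both boundaries
    included) is greylisted for review.
    """
    if ham_below > spam_above:
        raise ValueError("ham_below must not exceed spam_above")
    if score > spam_above:
        return Verdict.SPAM
    if score < ham_below:
        return Verdict.NOT_SPAM
    return Verdict.UNSURE
```

With these defaults a score of 90 is classed as spam, 40 as not spam, and 60, 72 and 85 all land in the greylist. Both thresholds are parameters because they would presumably be per-site settings rather than fixed values.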

The reason to do this is so that an administrator doesn't have to waste time reviewing content that obviously isn't spam. And ideally so they also don't have to waste time reviewing content that clearly is spam. The module could be configured so that greylisted spam goes into a queue, and certain spam is prevented from ever being posted.

It should be configurable whether unsure spam is published or not -- different websites will prefer a different setting depending on the frequency of spam and the type of community. The important thing is that this content is in a special queue for easy administrative review -- this greylisted content is essentially flagged as "the administrator needs to review this".
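As a rough sketch of how the publish setting and the review queue could stay independent of each other, something like the following Python would do. Everything here is hypothetical: `Post`, `is_published`, `review_queue` and the `publish_unsure` flag are names invented for the illustration, not anything the module currently provides.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    verdict: str  # one of "not spam", "unsure", "spam"


def is_published(post, publish_unsure):
    """Decide visibility; only the 'unsure' case depends on site policy."""
    if post.verdict == "not spam":
        return True
    if post.verdict == "unsure":
        return publish_unsure
    return False  # certain spam is never shown


def review_queue(posts):
    """List greylisted posts for the admin, whatever the publish policy is."""
    return [p for p in posts if p.verdict == "unsure"]


posts = [
    Post("Welcome thread", "not spam"),
    Post("Is this a real question?", "unsure"),
    Post("Cheap pills", "spam"),
]
```

Here `review_queue(posts)` returns only the middle post, and it does so whether `publish_unsure` is on or off. That is the point of the greylist: the site chooses what visitors see, but the admin's queue is the same either way.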

Moving to 6.x-1.x-dev where new features will be added.

ngaur’s picture

I'd agree with all of that.

One point, though, is that the appropriate thresholds will vary somewhat between sites, and with how much prior training the Bayesian classifier has had. Some email classifiers (e.g. SpamAssassin) adapt these thresholds automatically as learning proceeds.