The "train as spam" option is in spam.module, but none of the filter modules actually implement it. This should probably either be made to do something or removed.

Additionally -- why is this in spam.module? Is there any filter that can use "training" other than Bayesian? We were just cleaning up that admin options code a while ago; this is probably a good case in which to use it.

Comments

AlexisWilke’s picture

Hmmm... Bayesian and all its derivatives, but otherwise I'm not too sure what that could be used for.

One idea I had for training a new install was to create a module with a (large) default set of words, phrases, etc. that are considered spam by 99% of the population and can be used to train a brand new site. In practice that would be limited to a specific module that runs once (if it uses cron, it could take a few days, but once the training is done, you could just remove that module).

The only other use would be for a user to test whether something is properly detected as spam. Yet I did not see any interface for that purpose... Again, that should probably be a separate module and use the normal procedure, not a specific API call, and a different API should not be called "train as spam" in this case. It should rather be "check for spam", and that check would return TRUE or FALSE and not change anything in the database. That way the user could test things without messing up the current state (which could be very useful for our tests too).
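To make the "check for spam" idea concrete, here is a minimal sketch in Python rather than the module's own code. The class `WordListFilter` and its methods are hypothetical names invented for illustration, not part of spam.module; the only point is that training changes the filter's state while checking does not.

```python
from dataclasses import dataclass, field


@dataclass
class WordListFilter:
    """Hypothetical filter: flags text containing any known spam word."""
    spam_words: set = field(default_factory=set)

    def train_as_spam(self, text: str) -> None:
        # Training mutates the filter's stored state.
        self.spam_words.update(text.lower().split())

    def check_for_spam(self, text: str) -> bool:
        # Read-only: answers True/False and leaves the state untouched.
        return any(word in self.spam_words for word in text.lower().split())


spam_filter = WordListFilter()
spam_filter.train_as_spam("cheap pills")

before = set(spam_filter.spam_words)
is_spam = spam_filter.check_for_spam("Buy cheap watches")
is_ham = spam_filter.check_for_spam("Meeting notes")
```

Because `check_for_spam` never writes anything, it can be called repeatedly from a test without disturbing what the filter has learned.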

Thank you.
Alexis

gnassar’s picture

Title: "Train as spam" not implemented » "Train as not spam" not implemented

Whoa. Just a sec. Big screwup in explaining this on my part. :)

s/as/as_not/g

The function that exists in spam.module, and isn't implemented anywhere, is "Train as *not* spam." Sorry about that! I knew what I was thinking of in my head, but that somehow didn't translate to the keyboard.

You're right, from the other thread -- the filter is trained for spam by "marking as spam" already. "Mark as not spam" just undoes the training. The feature that's needed (and that already has an _operations hook) is "train as not spam," which is obviously much more useful in that there's no other way to achieve the same functionality.

Jeremy’s picture

My laptop died this weekend and I don't have a convenient way to review the code. In any case, this should be invoking both the Bayesian filter, and the URL filter, as both of these filters "auto learn". I have ideas for other filters that would also auto-learn, should I ever find the time.

If this functionality really doesn't work then evidently it was either removed recently, or somehow never ported to this version of the module. I'm pretty sure I did test this functionality recently, however.

gnassar’s picture

Hey, Jeremy --

I didn't actually test whether this worked from the end-user side -- I simply observed that the code ties the "train as not spam" option to the "train_as_not_spam" hook, and a quick grep showed that that hook isn't implemented by any submodule or content include at all. (In fact, a grep for just the word "train" returns a bunch of instances of the word inside comments, the three lines in spam_operations that define the train_as_not_spam hook, and nothing else.)

So it's more like it seems that it can't possibly work, if you know what I mean.

And I'm pretty obsessive about checking commits :), and I don't remember this being removed any time recently.

I'm guessing this was just never ported. (Not that it should be too difficult to implement.) But I'd appreciate another set of eyes on this, to verify that.

For that, but also because I know what a nightmare it would be for me if I ended up laptop-less -- I hope you get your laptop problem resolved! Good luck.

AlexisWilke’s picture

Hmmm... so... for me to know the difference between "train as [not] spam" and "mark as [not] spam" would be enlightening, if you don't mind explaining a bit?

On my end I had to switch servers as the old one started acting up. Although it still works, I just don't keep it running 24/7 since it would shut itself down at random. No data loss, but about 2 weeks spent getting the new server up to speed!

Thank you.
Alexis Wilke

gnassar’s picture

Sure, I can do that.

Basically, the difference between "train as" and "mark as" exists as an end-user semantic and as a different state transition in development, but the core functionality of the two (at least when it comes to the filters) is the same.

From the end-user side, the "mark as" option would appear for spam content, and the "train as" option would appear for non-spam content. Users would be confused by "mark as not spam" showing up for content that is already marked as not spam, but would be equally confused by both of those options showing up in the same list. So there is a separate menu hook for each of them.

From our side: "mark as" means a transition in the state of the content piece; "train as" means there should be no transition. So it serves as a useful sanity check that we're getting the right data in any particular circumstance -- assuming, of course, that we're checking that we're only "marking" content that's already spam, for example. We do this, but not always consistently throughout the module.
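A rough sketch of that distinction, written in Python for brevity and not taken from the module: `Content`, `RecordingFilter`, and both functions are hypothetical names. It assumes, as described above, that both operations feed the filter the same way and differ only in whether the content's state is expected to change.

```python
from dataclasses import dataclass


@dataclass
class Content:
    body: str
    is_spam: bool


class RecordingFilter:
    """Hypothetical stand-in for a learning filter; records what it is fed."""

    def __init__(self):
        self.samples = []

    def learn(self, body, is_spam):
        self.samples.append((body, is_spam))


def mark_as_not_spam(content, flt):
    # "Mark as" is a state transition: only valid for content currently spam.
    if not content.is_spam:
        raise ValueError("mark_as_not_spam called on content that is not spam")
    content.is_spam = False
    flt.learn(content.body, False)


def train_as_not_spam(content, flt):
    # "Train as" must not transition: only valid for content already not spam.
    if content.is_spam:
        raise ValueError("train_as_not_spam called on content that is spam")
    flt.learn(content.body, False)


flt = RecordingFilter()
flagged = Content("hello", is_spam=True)
clean = Content("hi there", is_spam=False)

mark_as_not_spam(flagged, flt)   # transitions spam -> not spam, then trains
train_as_not_spam(clean, flt)    # no transition, trains only
```

The two `ValueError` checks are the sanity check mentioned above: calling the wrong operation for the content's current state fails loudly instead of silently training on unexpected data.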

John Franklin’s picture

It sounds like this would be useful in conjunction with #342133: Request 3 way split - (spam/unsure/ham) to let the user provide guidance to the filter when an item is in the "unsure" state.

My 2¢ -- this is a nice feature, but I wouldn't consider it to be a release blocker.

flaviovs’s picture

Hate to comment on old issues, but I had to post because I'm facing a problem right now related to this issue. It's never too late to add some more $.02:

The bottom line is: "marking" will (re)save the node, which causes many sorts of side effects, such as the node being labeled as "new", path aliases being updated (if you use Pathauto), etc.

OTOH, "train as not spam" should not change state (i.e. should not save the node). When training, the node can and should be treated as a read-only object. You just want to feed the Bayesian filter with "node content" => "not spam flag" so that its internal database is updated to better reflect how you classify such content.

Training the Bayesian filter (and classifier filters in general) is much more than a convenience; it is a required feature in some cases.

For example, imagine an admin who did a manual and careful filtering of 10K nodes on her site so that not a single one of them is spam. If she installs the Spam module to make things automatic, the Bayesian filter will start from zero, meaning that at this point the probability of the next node being spam is 1/2, whereas in such scenarios it would be much better if she could feed her 10K nodes as "not spam" to train the filter beforehand. Of course, she can use e.g. Views Bulk Operations to mark all 10K nodes as "not spam", which (I think) will train the filter, but as said above, this will have many undesired side effects.

EDIT: Of course, "train as spam" is also a very useful feature in spam classifiers. Our poor admin could also have another 10K unpublished nodes that she flagged as spam in her manual fight, and if she could feed the filter with those nodes using a "train as spam" operation, her classifier would improve a lot.
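The scenario above can be illustrated with a toy naive Bayes classifier. This is a sketch in Python under stated assumptions, not the Spam module's Bayesian filter: `TinyBayes`, `Node`, the add-one smoothing, and the sample texts are all made up for illustration. It shows the untrained 1/2 starting point and training from node bodies without ever saving a node.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical read-only node: training can read it but not modify it."""
    body: str


class TinyBayes:
    """Toy naive Bayes with add-one smoothing (an assumption of this sketch)."""

    def __init__(self):
        self.tokens = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, node, is_spam):
        # Only reads node.body; nothing is written back to the node.
        self.tokens[is_spam].update(node.body.lower().split())
        self.docs[is_spam] += 1

    def spam_probability(self, text):
        vocab = len(set(self.tokens[True]) | set(self.tokens[False]))
        spam_total = sum(self.tokens[True].values())
        ham_total = sum(self.tokens[False].values())
        # Smoothed prior odds: equal to 1 when nothing has been trained yet.
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for tok in text.lower().split():
            p_spam = (self.tokens[True][tok] + 1) / (spam_total + vocab + 1)
            p_ham = (self.tokens[False][tok] + 1) / (ham_total + vocab + 1)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
untrained = bayes.spam_probability("helpful tutorial")

# "Train as not spam": feed known-good nodes without saving them.
for node in (Node("thanks for the helpful tutorial"), Node("great tutorial on views")):
    bayes.train(node, is_spam=False)
after_ham = bayes.spam_probability("helpful tutorial")

# "Train as spam": feed nodes the admin already flagged by hand.
for node in (Node("cheap pills online"), Node("cheap watches online")):
    bayes.train(node, is_spam=True)
after_spam = bayes.spam_probability("cheap pills")
```

With no training data the estimate is exactly 1/2; after the ham pass a similar text scores below 1/2, and after the spam pass a spam-like text scores above it. The frozen `Node` makes the read-only requirement explicit: any attempt to assign to it during training would raise an error.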