In Australia (and, I suppose, many other English-speaking parts of the world) we sometimes spell words with the "ise" suffix instead of the "ize" suffix.

I have attached a unified diff, which I hope can be tested by someone who knows more about stemming than I do. (My limited time allowed me to test only "caramelise", "caramelised", "caramelize" and "caramelized", all of which resolve back to the word "caramel" in a node.) All I can say is "it works for me". :)

Comment   File                        Size        Author
#5        335030.patch                536 bytes   jhodgdon
#4        335030.patch                564 bytes   jhodgdon
          porterstemmer.module.diff   1.57 KB     carneeki

Comments

greggles’s picture

That's an interesting idea. This module is largely based on an external porter-stemmer codebase. Maybe you could check that code to see if it has this capability? Perhaps it's time to update this code with a refresh of their latest version.

jhodgdon’s picture

Component: Code » Documentation

The published Porter Stemmer algorithm is apparently only for American English (this is also true of the Porter 2 algorithm). I think we should just update the documentation to state this clearly, rather than trying to modify the algorithm so that it might also work for non-American English. My reasoning is that the algorithm's decision process is quite complex, and I'm concerned that any modifications we made would likely screw up the stemming of other words.

Places to fix documentation:
- Project page - http://drupal.org/project/porterstemmer
- README.txt file

Thoughts? Any other places to fix?

greggles’s picture

That seems like a good solution to me.

Thanks!

jhodgdon’s picture

Status: new
Size: 564 bytes

I fixed the project page. Here's a patch for the README. Which branch(es) should we commit it to, if any?

jhodgdon’s picture

Status: new
Size: 536 bytes

Missing newline. Try this patch.

jhodgdon’s picture

Status: Active » Needs review

greggles’s picture

Looks great to me. I guess commit it to the 5.x and 6.x branches, which are DRUPAL-5 and DRUPAL-6--1.

greggles’s picture

I should add, if you want to commit things to HEAD as well, please do. Otherwise we can just merge everything from DRUPAL-6--1 into HEAD whenever we start working on 7.x compatibility.

jhodgdon’s picture

Status: Needs review » Fixed

Done.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

gpk’s picture

Status: Closed (fixed) » Active

> The published Porter Stemmer algorithm is apparently only for American English (this is true of the Porter 2 algorithm).

Is this true? Martin Porter's definitive Porter 2 page http://snowball.tartarus.org/algorithms/english/stemmer.html doesn't mention being specific to one form of English, nor does the main snowball page http://snowball.tartarus.org/ mention American vs. English among all the languages listed. (And he seems to hail from somewhere this (British) side of the pond!!)

-ise and -ize are both valid British spellings:
http://en.wikipedia.org/wiki/American_and_British_English_spelling_diffe...

Interestingly though the Porter 2 (and the original Porter) algorithm doesn't look for -ise endings, only -ize. Bizarrely, in his actual prose Porter only uses -ise forms on that page, where he discusses -ize...!!?!

Looking at the sample vocabulary and its stemmed equivalent, I see the following:
apologise -> apologis
but
apologize -> apolog

I'm contacting the mailing list to see what Martin or others have to say about this!

jhodgdon’s picture

Status: Active » Closed (fixed)

Sure, contact Porter... I'm going by his algorithm, not any statements he may have made. The algorithm doesn't stem -ise, as you've noted.

The existing Drupal Porter Stemmer module stems every word in Porter's word list correctly (it is fully tested), and can also use the official Snowball project's PECL implementation (if you have it on your system). Neither one stems -ise. Such is life.

But until the algorithm or official implementations are changed, adding this feature to the Drupal module is a non-starter.
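
A word-list check of the kind described above can be sketched as follows. This is only a minimal illustration, not the module's actual test code: `check_stemmer` and `toy_stem` are hypothetical names, and the two-word vocabulary simply reuses the apologise/apologize pair quoted earlier in this thread.

```python
def check_stemmer(stem, vocabulary, expected):
    """Return (word, got, wanted) for every word the stemmer gets wrong."""
    if len(vocabulary) != len(expected):
        raise ValueError("word list and expected-stem list differ in length")
    failures = []
    for word, wanted in zip(vocabulary, expected):
        got = stem(word)
        if got != wanted:
            failures.append((word, got, wanted))
    return failures


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a trailing 'ize' and nothing else."""
    return word[:-3] if word.endswith("ize") else word


# Expected stems taken from the sample vocabulary quoted in this thread.
vocabulary = ["apologise", "apologize"]
expected = ["apologis", "apolog"]

for word, got, wanted in check_stemmer(toy_stem, vocabulary, expected):
    print(f"{word}: got {got!r}, wanted {wanted!r}")
```

A real stemmer would be passed in place of `toy_stem`, with the full word list and expected stems loaded from files; an empty result means every word matched.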

gpk’s picture

Status: Closed (fixed) » Active

Well, having almost composed my message, I found this in the mailing list archives for November 2008 (hoping that no one objects to my posting it here!):

This "ise/ize" debate often re-emerges. The essential point is that "ise" as
an included ending does too much damage to the many words ending "ise", but
for which "ise" is not a suffix: enfranchise, otherwise, paradise, imprecise
and so on. Here is an answer I sent to Xxxxx on 22 Feb 2001.

------------------------------------------

Re: Stemming American English vs. English

Dear Xxxxx,

I don't think you need worry too much about English/American spelling
differences, as far as the Porter stemming algorithm is concerned. The main
difference is that -ize and -ise endings are (as you note) applied
differently in American and English usage, and the algorithm treats -ize as
an ending but not -ise.

Many people have adapted the algorithm by adding -ise to the list of
endings, but on balance I think that is a mistake. There are too many words
ending -ise where -ise should not be removed.

American spelling is much more logical than English, and -ize/-ise usage is
no exception. So in fact the Porter stemmer probably does better with
American English than with English English!

As a matter of fact -ize usage in England used to be much closer to the
American style than it now is. Here are Thackeray's -ize endings from Vanity
Fair (published 1847):

agonized
apologize apologized
authorized
capitalized
characterize
cicatrized
civilized
harmonized
idolizes
particularize
patronize patronized patronizes
proselytizer
realize realized
recognize recognized
tyrannize tyrannized
victimized victimizer

Today many of these words would have to be spelled -ise in England, e.g.
characterise, realise, recognise ....

Hope this helps,

Martin

Another correspondent agreed that removing the -ise ending, in the same way as -ize, actually made things worse.
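
To make the point concrete, here is a deliberately naive sketch. It is not the Porter or Porter 2 algorithm, which applies many more rules and conditions; `strip_suffix` and the two suffix tuples are made up for illustration. It only shows what happens to the words Martin lists when "ise" is treated as a removable ending just like "ize".

```python
def strip_suffix(word, suffixes):
    """Remove the first matching suffix, leaving a stem of at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


IZE_ONLY = ("ize",)
IZE_AND_ISE = ("ize", "ise")

words = [
    "caramelize", "caramelise",           # "ise" really is a suffix here
    "paradise", "otherwise",              # "ise" is part of the word itself
    "enfranchise", "imprecise",
]

for word in words:
    print(f"{word:12} ize only: {strip_suffix(word, IZE_ONLY):12} "
          f"ize+ise: {strip_suffix(word, IZE_AND_ISE)}")
```

With the extra ending, "caramelise" does reduce to "caramel", but "paradise" becomes "parad" and "otherwise" becomes "otherw", which is the kind of damage described in the quoted message.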

So I think that the README and project page info need modifying again, perhaps to say that the stemmer works for both British and American spellings, but that stemming is not an exact science and in practice it works better for American spellings.

gpk’s picture

> adding this feature to the Drupal module is a non-starter
Agreed!

jhodgdon’s picture

Status: Active » Closed (fixed)

jhodgdon’s picture

I've modified the project page and the README (at least in HEAD/branch; not enough of a change to merit releasing a new 6.x version in my opinion).

gpk’s picture

Wow that was quick :-D

Thanks!