It's time to port this module to 8.x!

Per jhodgdon's notes in https://www.drupal.org/node/2528570, this should:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

mark_fullmer created an issue. See original summary.

mark_fullmer’s picture

Here is an "alpha-level" pass at an 8.x port: https://github.com/markfullmer/porterstemmer

What it does

  • Is multilingual aware, only running the Porter 2 (English) Stemmer when it detects an "en" language profile (which is default if none is defined
  • Implements the stemmer as a D8 plugin, which allows other developers to register their own stemming plugins in other languages, if desired.

To Do

  1. Write unit tests.
  2. Consider refactoring logic currently in the StemmerPlugin class to StemmerPluginBase.
  3. Review approach to exploding text on word boundaries, which diverged a bit from the 7.x implementation.
  4. Add Readme & documentation.
  5. Memoize already-stemmed results to optimize long-running indexing.
  6. Add a UI component to allow users to enter custom-defined exceptions that the Porter 2 algorithm doesn't cover (e.g., -an suffixes)
jhodgdon’s picture

I took a look at the code you put on Github.

A few comments and notes:

a) I do not think you would ever want to run multiple stemmers in sequence for the same language. That would never work. Once a word is stemmed, it is stemmed.

b) I also don't see why this one module needs to run other stemmers... each stemming module can just implement hook_search_preprocess itself. So the logic of having a plugin manager and a service and all of that in this module seems to be unnecessary overhead.... or I'm not understanding why you would want this? It seems like if you have this module installed, it implements the hook and will run, and if you don't have this module installed, it will not run. Same for other stemmers. Having a plugin system just doesn't seem necessary at all. ???

c) I don't see this using the PECL library?

d) Normally you wouldn't want to do operations in a class constructor -- such as splitting text into words etc. I would take that out of PorterStemmer::__construct() and put it elsewhere. A constructor should usually just save the input in a class variable, and initialize a few things maybe.

e) Actually, I don't get why this class is storing anything. You could just have two static public methods, something like:
PorterStemmer::splitIntoWords($text)
PorterStemmer::stem($word)

Then the preprocess method could be something like:
$words = PorterStemmer::splitIntoWords($text);
foreach ($words as $word) {
$new_words[] = PorterStemmer::stem($word);
}
return implode(' ', $new_words);

or something like that...

f) I think I would also put all the logic of stemming on the main PorterStemmer class. Why is it in a separate class? Seems like the PorterStemmer class itself does nothing much...

g) I like the idea in the issue summary about using a cache or in-memory list to minimize restemming of already-stemmed words.

h) I haven't looked at the word-boundary stuff, but I think there were reasons in 7.x for doing it the way it was done.

i) There was a feature request in 7.x for this module to add exceptions etc. I personally think that is material for a separate module, which can implement hook_search_preprocess() itself and do whatever it needs to do. It doesn't belong in a module that says it is an implementation of the Porter Stemmer algorithm... that is my opinion anyway. Keep this module simple and have it just do that.

mark_fullmer’s picture

Hi jhodgdon,

I agree with much of your reasoning, and I think most of the divergence above stems (pardon the pun) from scope. If the scope is simply to implement the Porter 2 stemming algorithm, there is no need to implement a plugin type.

I'll create a narrowly-scoped version based on your suggestions

With hook_search_preprocess() being multilingual-aware now, do you think there would be value in a unified "Stemmer" module? To be sure, each of the various stemmer modules -- https://www.drupal.org/project/greekstemmer, https://www.drupal.org/project/danishstemmer, https://www.drupal.org/project/spanishstemmer , etc. -- could do a language check in their own hook_search_preprocess() and only run in a given language. But to have a multilingual, stem-indexed site, you'd need to have a bunch of modules enabled, and there is where I think a plugin-type approach would optimize performance and minimize code.

jhodgdon’s picture

Yes, that has been the scope of this project in the past: Implement the Porter Stemmer algorithm exactly. Nothing more. Nothing less. Here are some issues where adding new features has been discussed and not implemented:
#991008: custom overrides feature
#66461: Problem with words such as "pied", "died", and "coed"
#1627596: Porter stemmer not matching electrician and electrical
#558846: Stemming of month names and days of week

I still think that having each module implement the hook is probably just as efficient, if not more, than having a plugin approach. The code is one line to check, in each module, whether the language is right, and if not, just return. Plugin managers are a LOT more overhead, and a lot more code to maintain and write for each module. Plus, you'd have to get all the other stemming modules to agree on the API etc. We already have an API for the hook. Just let each stemming module implement it.

mark_fullmer’s picture

@jhodgdon: a revised straight-ahead 8.x port is now at https://github.com/markfullmer/porterstemmer

* Removed plugin & plugin type logic
* Restored the 7.x module's approach to word boundaries
* Re-implemented PECL extension option
* Re-implemented stemmer as a static class

I decided to leave the "split into words" logic in the preprocess function, as that feels like logic specific to Drupal's handling of strings, rather than the Porter stemming method itself.

Upon reconsideration, I don't see an in-memory solution to storing already-stemmed word pairs, given that hook_search_preprocess() will fire on each item being indexed. If you have any ideas on this, let me know.

Still to do: unit tests.

jhodgdon’s picture

Sorry for the delay -- I've been away on a cycling trip for the last 3 weeks.

So I took a look at the code on github as of today. I have a few comments:

  1. The approach in the current code of splitting the text in the hook implementation, then stemming word-by-word, seems right to me.
  2. You have defined a service for the stemmer, but in the .module file, you're using the class directly. That's not good -- either get rid of the service or use it properly in the module. Personally I think the service isn't necessary.
  3. Coding standards -- run the Coder module on the code. One thing that jumped out at me was that the name of the stemmer class should be in CamelCase (not "porter2").
  4. The function/class documentation headers need to be updated. For instance, porterstemmer_search_preprocess() still mentions plugins.
  5. I'm not sure about the class implementation. The static variables, in particular, seem very weird to me. Normally you don't have any class member variables in a static method in a class, but just carry information around in local variables. I am not comfortable with this... probably because I used to work in Java, where multithreaded processes were a possibility. They are also possible in PHP apparently, and this code is definitely not thread-safe. I think I would recommend rewriting it without the static variables -- just use local variables and return values.
mark_fullmer’s picture

Hope the cycling trip was enjoyable, jhodgdon. Sounds fun!

I will take care of all of the suggestions you mentioned above, and also add tests. A couple quick notes:

* Yes, the service definition is cruft. Will remove.
* I generally do run my code through Coder. I'll make sure I do this before providing an 8.x patch file.
* The static class implementation was done to accommodate your code request, above: $new_words[] = PorterStemmer::stem($word); . If I understand my PHP OOP correctly, calling a PHP class without instantiating it requires as static approach. See here. I'll re-implement the non-static approach and you can see if that passes muster.

In touch soon!

jhodgdon’s picture

Sorry for the confusion! I think a static *method* is a good idea. What I don't think is a good idea is to have the method store stuff on the class in static variables. That makes the code not thread-safe.

The implementation that is in the 7.x code in includes/standard-stemmer.inc does not use any static variables in any of its functions. I don't see a reason the 8.x class-based version should either.

mark_fullmer’s picture

I think I follow, now. Will implement accordingly!

mark_fullmer’s picture

Status: Active » Needs review

Okay, jhodgdon, your latest round of suggestions has been implemented. See https://github.com/markfullmer/porterstemmer

  • CodeSniffer compliant code
  • PHPUnit tests added
  • Stemmer class implemented with single public static method and no static variables
jhodgdon’s picture

Looks great!

So the only thing remaining, I think, is to add to the tests.

The 7.x version of Porter Stemmer includes unit tests that verify that the entire word list provided by Porter actually works correctly. There is a file testwords.txt that I got from the Porter algorithm that contains the entire list. We should do the same for the 8.x version.

The other test that the 7.x version has is a functional test that verifies if you have nodes with variations in them, you can search them successfully.

jhodgdon’s picture

As a note, when I put in that unit test in the 7.x version with the entire word list, I did find some bugs in the algorithm implementation... it's a good test to run, to make sure everything is working correctly.

mark_fullmer’s picture

Okay!

  • Added entire ~29,000 wordlist to the existing unit test of the stemmer algorithm. All pass.
  • Added new test to match your 7.x test, which verifies that word variations are returned via a POST to /search/node.
    • This test additionally verifies that the stemmer is limited to "en" langcode content.

All code should comply with the Drupal ruleset for phpcs, as well.

jhodgdon’s picture

Looks pretty good. Three comments:

a) What I did for the 7.x test is to slice it into chunks, so that the entire test data set would run. Your current 8.x test only runs the first 1000 words.

b) Why is the word list a .ini file? That seems a little weird. I just left it as plain text and read it in using standard PHP file read functions in the 7.x version.

c) The functional test should not be in the Unit test subdirectory, but src/Tests I think. It is not a PHPUnit test.

Anyway... this is pretty close. Eventually it needs to be a patch... :)

mark_fullmer’s picture

FileSize
548.25 KB
  • 8.x unit test sliced into chunks, following your model
  • Wordlist moved into plain text file, and read via same method as 7.x
  • Functional Test moved to src/Tests
  • 7.x unit tests for PECL extension ported to 8.x

I'm not sure if I've got the proper protocol for a patchfile for switching major versions, but the attached is a diff of the 7.x-1.x-dev against the "8.x" branch. If I've got it wrong, let me know how it should be done, or open an 8.x-1.x-dev branch off of https://github.com/markfullmer/porterstemmer.

Well wishes,
Mark

Status: Needs review » Needs work

The last submitted patch, 16: porterstemmer-2805507-16.patch, failed testing.

mark_fullmer’s picture

Status: Needs work » Needs review
FileSize
527.96 KB

Patch off 7.x-1.x-dev attached.

Status: Needs review » Needs work

The last submitted patch, 18: porterstemmer-2805507-17.patch, failed testing.

jhodgdon’s picture

I'll take a look at this tomorrow, thanks! The tests are going to fail if it is run against 7.x, so don't worry about that.

jhodgdon’s picture

Status: Needs work » Fixed

Sorry for the delay -- the patch looks pretty good, thanks! So, I did the following:

a) Created a new 8.x-1.x branch.

b) Applied the patch.

c) There were a few left-over files that the patch did not remove, relative to the top-level directory:
porterstemmer.info
porterstemmer.test
testwords.txt
includes/standard-stemmer.inc

d) I also removed the CHANGELOG.txt file. Most projects do not have one of those any more.

e) Then I went ahead and committed the patch and pushed the new branch. I have also added this as a Release to the project, but that will take a few minutes to get through.

My next plan is to review the documentation and code style carefully, and if I find any minor things to fix, I'll do that in a separate commit. I'll also enable automatic testing of the new branch.

jhodgdon’s picture

I don't see anything else that needs attention. Great work, and many thanks!

jhodgdon’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.