A search for node_save flexinode returns only one result, whereas there are definitely other hits it should be picking up, such as this or this or even this handbook page.


Comments

robertDouglass’s picture

Title: Not returning all results? » Can't find results with hyphens
Version: 5.0-rc1 » 7.x-dev

Related: http://drupal.org/node/205670

The query terms should be subject to the same handling as the text that gets indexed; otherwise they won't match.

robertDouglass’s picture

Title: Can't find results with hyphens » Can't find results with underscores or hyphens
jhodgdon’s picture

Status: Active » Closed (duplicate)

I just tested this in Drupal 7. I believe it works the same in D6. drupal.org is using solr search, so that's a separate issue, and anyway that issue report was from 2007...

Anyway, to test D7 I put the following into a node:
here-is some-text with-hyphens and_some text_with under_scores.

Then I ran cron to index it.

I was able to find "some-text" and "under_scores".

I was not able to find "some" or "under".

So this is fine, as far as it goes. We have another issue in place about whether we should be splitting words with hyphens (we aren't currently), and I'll add underscores to that issue.
#205670: searching doesn't break on hyphens

jhodgdon’s picture

Title: Can't find results with underscores or hyphens » Search doesn't split on underscores
Status: Closed (duplicate) » Active

Actually, I think I'll leave this open for the underscore part only.

jhodgdon’s picture

Title: Search doesn't split on underscores » Search doesn't split on underscores or hyphens

I decided to mark the hyphen issue #205670: searching doesn't break on hyphens as a duplicate of this one.

I also took a look at the code for this today.

The comment in the code where it decides to remove - and _ rather than split on them says:

// The dot, underscore and dash are simply removed. This allows meaningful
// search behavior with acronyms and URLs.

The URL part of the comment seems relevant here. If we have underscores or hyphens in a URL (or email address), I don't think we would want to split it up into pieces. So probably we should do something special with hyphens and underscores in URLs and email addresses before we get to this line.

jhodgdon’s picture

So what I think needs to be done is to remove . - _ from:
- URLs
- email addresses
- acronyms, hyphenated words, etc. if the left-over piece would be shorter than variable_get('minimum_word_size', 3)
[e.g. e-mail -> email, T.S.A. -> TSA, but word-power is unchanged because the resulting fragments are longer than 3 characters]
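That rule could be sketched roughly like this. This is only an illustrative sketch, not Drupal API: example_simplify_token is a hypothetical name, and a real implementation would use variable_get('minimum_word_size', 3) and Drupal's multibyte string helpers.

```php
// Join on . _ - only when splitting would leave a fragment shorter than
// the minimum word size; otherwise treat the punctuation as a word
// boundary. Hypothetical sketch of the rule proposed above.
function example_simplify_token($token, $min_size = 3) {
  $parts = preg_split('/[._-]+/', $token, -1, PREG_SPLIT_NO_EMPTY);
  foreach ($parts as $part) {
    if (strlen($part) < $min_size) {
      // A short fragment, e.g. "e" in "e-mail" or "T" in "T.S.A.":
      // join everything instead of splitting.
      return str_replace(array('.', '_', '-'), '', $token);
    }
  }
  // All fragments are long enough, e.g. "word-power" -> "word power".
  return implode(' ', $parts);
}

echo example_simplify_token('e-mail');      // prints "email"
echo example_simplify_token('T.S.A.');      // prints "TSA"
echo example_simplify_token('word-power');  // prints "word power"
```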

This would replace

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. No need to use the unicode modifer
  // here because 0-127 ASCII characters can't match higher UTF-8 characters as
  // the leftmost bit of those are 1.
  $text = preg_replace('/[._-]+/', '', $text);

And would be done right before the line that says:

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);

(I believe that would split up any remaining things like word-power with hyphens, underscores, and . in them)

This will be kind of a hairy regular expression to write...
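For reference, a simplified walk-through of what the two quoted core lines do today. The second character class here is a tiny illustrative stand-in for core's much larger PREG_CLASS_SEARCH_EXCLUDE:

```php
// Simplified demonstration of the current core behavior quoted above.
$text = 'word-power T.S.A. foo,bar';

// Step 1: dot, underscore and dash are simply removed.
$text = preg_replace('/[._-]+/', '', $text);
// $text is now "wordpower TSA foo,bar"

// Step 2: remaining punctuation becomes a word boundary.
// NOTE: [,;:!?] is a stand-in for PREG_CLASS_SEARCH_EXCLUDE.
$text = preg_replace('/[,;:!?]+/u', ' ', $text);
// $text is now "wordpower TSA foo bar"
```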

And note that the word boundary preg character class constant is hopefully changing in:
#108100: Need smarter search splitting on underscores, hyphens, apostrophes and other characters
so we should probably wait to write a patch until that lands. Though I'm not sure, since we probably cannot use that class to define this regexp -- it would include .-_ presumably?

jhodgdon’s picture

Title: Search doesn't split on underscores or hyphens » Need smarter search splitting on underscores, hyphens, and apostrophes

As per this issue (now marked as a duplicate), we need to do the same thing as #6 for apostrophes in words too.
#427504: Why does indexing split on apostrophe?

figaro’s picture

Title: Need smarter search splitting on underscores, hyphens, and apostrophes » Need smarter search splitting on underscores, hyphens, apostrophes and other characters

There is also this issue: http://drupal.org/node/138284
(searching for "C++" leads to poor search results)
Also changing title.

robertDouglass’s picture

This is all just cosmetic layer on what really needs to be a configurable set of behaviors. The end solution might look like a filter chain similar to Drupal's filtering system, but there are generally no one-size-fits-all solutions for processing text going into a search index.

To make this system more robust in the short term, we should not simply remove any of these artifacts, but both remove them and replace them with a space, so tell-tale would get expanded to "tell", "tale" and "telltale". To be thoroughly accurate, the scoring in the IDF part of the equation would have to be adjusted so that they don't appear twice as prevalent, but in this way search would discover the content in all the cases that one would want.

This approach, both removing and replacing, makes the regexp part of the solution much easier. We could even forgo the scoring adjustment in a first pass and just live with the slight perversion of the importance of these words, because the really salient thing is discovery: we have cases now of not being able to find content that intuitively should be found.
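The remove-and-replace idea might be sketched like this (illustrative only; example_expand_token is a hypothetical name, not part of core or any module mentioned here):

```php
// Index each hyphenated/underscored token as both its split form and its
// joined form, so "tell-tale" yields "tell", "tale" and "telltale".
function example_expand_token($text) {
  return preg_replace_callback('/\w+(?:[-_]\w+)+/', function ($m) {
    $split  = str_replace(array('-', '_'), ' ', $m[0]);
    $joined = str_replace(array('-', '_'), '', $m[0]);
    return $split . ' ' . $joined;
  }, $text);
}

echo example_expand_token('a tell-tale sign');
// prints "a tell tale telltale sign"
```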

robertDouglass’s picture

Another example:

The text "drupal/url/alias" would generate "drupal", "url", "alias" and "drupalurlalias" in the index. The last one seems problematic, because it doesn't seem like anyone would ever search for that. The correct behavior, though, is to process the search term in the same way that text gets processed during indexing, so searching for "drupal/url/alias" would be the same as searching for "drupal" OR "url" OR "alias" OR "drupalurlalias".
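The symmetry described here could be sketched as follows. Note that example_simplify is a hypothetical stand-in for whatever normalization the indexer applies, not core's actual search_simplify():

```php
// Whatever transformation the indexer applies to content must also be
// applied to the user's keywords, or the two sides will never match.
function example_simplify($text) {
  // Hypothetical normalization: lowercase and split on slashes.
  return strtolower(str_replace('/', ' ', $text));
}

$indexed = example_simplify('drupal/url/alias');  // "drupal url alias"
$keys    = example_simplify('Drupal/URL/Alias');  // "drupal url alias"
// Because both sides go through the same function, the query matches.
```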

jhodgdon’s picture

We do process search keywords the same way as node content. I don't see why we need to put all three options in the search index.

arhak’s picture

Issue tags: +Ancient

tag

Damien Tournoud’s picture

Issue tags: -Ancient

Nothing happens magically. Removing the useless tag.

jhodgdon’s picture

Anyway. While I agree it would be better to do search preprocessing in a filter-like manner, that is a Drupal 8 feature request and not a Drupal 7 fixable bug.

The Drupal 8 feature request issue:
#257007: Convert hook_search_preprocess() into a plugin
Further discussion on options for making this whole thing better in Drupal 8 (see the preprocessing section):
http://drupal.org/node/717654

Meanwhile, let's think about what we can actually do at this point for Drupal 7 to make hyphens, underscores, etc. work reasonably well. I think it's an important issue to fix, and as noted above, it has been around in the search module for ages.

My suggestion of what to do is in #6...

arhak’s picture

@#14 I disapprove of your attitude http://drupal.org/node/7881#comment-2983710

Damien Tournoud’s picture

What about contributing to the discussion about this issue, instead of discussing my "attitude"?

jhodgdon’s picture

arhak: It is not good practice in the Drupal issue queue to introduce new tags like "Ancient". That one in particular comes across as a sarcastic commentary saying that certain bugs have been around for too long, which doesn't really help them get fixed.

Nothing in Drupal gets fixed without someone making a patch for it, and no patch gets accepted without someone else in the community understanding it and approving it. The issue queue is the place for discussing the issue and how to fix it.

So while Damien may not have explained it patiently or politely here, I agree with his sentiment and with his action of removing the tag.

So, back to the issue. I proposed a solution (or the outline of one) in #6. robertDouglass suggested that rather than using that type of heuristic to choose how to split each hyphenated word, we should do all possible splits (which in some ways is easier). And I think Robert and I agree that this is kind of a heuristic/cosmetic solution and that it would be better to fix it "right" in Drupal 8 (but we can't do that for D7 at this point).

So the next thing we need is for someone to make a patch. Any takers? :)

arhak’s picture

@#18 I'm sorry, I was just trying to help by grouping them under a common radar (I wasn't even following this particular issue).
I can't figure out why it would be "sarcastic", nor why it would have to be "bugs" instead of "issues", and they all mutate over time...
Thanks for your patient explanation, and don't worry, I certainly won't do it again.

rjacobs’s picture

At this point is it safe to say that the only way this issue could be addressed for Drupal 6 or 7 would be the creation of a new module that uses hook_search_preprocess?

I'm also in support of the logic outlined in #6.

jhodgdon’s picture

I don't know rjacobs, but I would suggest asking webchick in IRC before starting to work on a patch for Drupal 7, because it might be considered too big of a change to put in at this point. There is no point working on a patch if it's just going to get rejected.

rjacobs’s picture

Thanks jhodgdon,

I have a nonprofit Drupal 6 project right now (using the Open Atrium Distribution) where special logic will be required to allow searching of file attachments that use underscores in the filenames/descriptions (to treat underscores as word boundaries). We'll just use a case-specific implementation of hook_search_preprocess for this, which will be a super-simple version of some of the logic outlined in #6.

However, it seems that if the right regular expressions could be created to implement a robust version of #6, they could easily be captured in a Drupal 6 and 7 module that could be widely used. Such a module could even have some admin options where admins could pick and choose the rules they want to apply (whether or not to treat emails specially, which special characters should create word boundaries, etc.). As another user stated, there are not necessarily one-size-fits-all rules for this, so perhaps a module that exposes some options would make sense?

Or perhaps it makes more sense to try to get that logic worked into an existing module such as Porter-Stemmer?

I'm just thinking out loud, as it seems like a module (or a change to an existing module) is the only way to go for D6, and porting that to D7 could be a nominal effort once the regular expression logic exists. Unfortunately that regex work is over my own head at the moment. Just some thoughts....

jhodgdon’s picture

As one of the maintainers of Porter Stemmer, I can tell you that I'd rather see this as a separate module. For one thing, hyphens and other punctuation are not English-specific, so this punctuation module could be useful for other European languages. Also, Porter Stemmer is the implementation of a very specific algorithm, and I am reluctant to add other functionality to the module.

But your idea of making a flexible punctuation pre-processor module is a good one and, assuming it works out, it would avoid the problems with trying to get such a functionality change approved for Drupal 7 (much less Drupal 6) at this point. Also, if we do get approval, it would be super simple to take your module and add it to core. So, I say, go for it! You can post your module here when you get it working, and then we can take it from there (either get it added to Drupal 7 or add it as a contributed module). I'm not sure if you are interested in becoming a contributed module maintainer or not, but we can talk about that later.

Thanks!

rjacobs’s picture

Yes, I would be interested to explore this; I'll just have to see if I can get some time over the holidays to take a deeper look. I've been contributing patches for modules for a while, but am still just developing my chops as a code-level contributor. I would be interested to venture into module maintainer status if that ends up making sense, but I'll just have to see how it goes.

It seems to me that a search pre-processor like this would be rather simple (from a Drupal concepts point of view), and primarily entails getting the right regex logic and admin menu forms bundled into the right hooks... so it may provide a good excuse to explore the idea of my first contributed module. Either way, I'd need to do some learning on the regex side. Of course, if someone beats me to it, that's alright too.

Cheers,
Ryan

rjacobs’s picture

Ok, I've taken a humble stab at a module that incorporates the pre-processing logic outlined in comments #6, #7 and #10. A Drupal 6 version of this is attached. I know this issue is marked for D7, but all my projects are in D6 at the moment, so I started there. If I'm on the right track I'm under the impression that porting this to D7 will be a pretty minor effort.

So I'm not addressing this exact issue per se, but am responding to jhodgdon's comments in #23 about exploring a solution in module form.

I've been working with Drupal for many years but have only really produced a handful of modules to date, all of which were very case-specific. This is my first attempt at some code that I'm hoping could eventually be part of a contrib effort (I'd be willing/keen to package this as a formal contrib module if that turns out to be relevant).

Some notes:

  • Some "fine-tuning" options are exposed in the search admin settings form (admin/settings/search)
  • I'm identifying URLs (web and email addresses) based on their pattern (ending in something.tld) and whether or not they end with a known TLD. I'm not sure if this could somehow be achieved with a regex check alone.
  • I'm no regex expert, so I'm just hoping I'm getting the patterns correct (I'm pretty sure some could at least be simplified).
  • I've attempted to implement a setting that speaks to comment #10, and would match "word-power" with searches for "word", "power", "word-power" and "wordpower" (by indexing "word-power" as "word power wordpower") but I'm not sure if I'm missing some caveats of creating multiple index entries for the same token. Preliminary testing was successful though.

So before I consider sharing this beyond my own needs/projects, I'm just putting it out there in hopes someone watching this issue might want to have a look, and would give some feedback. I'm always learning, so criticism is surely welcomed!

Best,
Ryan

jhodgdon’s picture

This looks promising. A few comments (some trivial, some more significant):

a) Since top-level domains are now flexible, I don't think identifying URLs by their endings is practical. I would just stick to a regexp pattern. Keep in mind that the top-level domain is no longer restricted to 2 or 3 characters.

b) in PREG_CLASS_SPECIAL_PUNCTUATION, defined as "\'._-" -- I think you need to remove the \ before the apostrophe, because you don't need to escape ' inside " quotes.

c) Spelling/style/punctuation errors: seperate -> separate, somthing -> something, whether-or-not and word-boundaries should not be hyphenated, "ie." should be "e.g.," in your form alter, eMail -> email

d) It looks like you're assuming that [a-zA-Z0-9]+ will match any letters/numbers in words. But you have to consider accented letters, Japanese characters, etc. I think the unicode.inc file has some preg classes you can use to match "letters", more broadly defined.

e) Coding standards - for instance, comments should end in a ".", variable names shouldn't be abbreviated (such as "punctuation_preprocess_adv"), setting labels should be "Sentence case" not "Proper Noun Case", and you haven't followed the Doxygen guidelines. See http://drupal.org/coding-standards, and get the Coder module to make some checks automatically.

f) Rather than form altering the Search Settings page to add your admin settings, use hook_search_admin() -- but you may need to form alter to add your validation function. Oh, in D6, this is hook_search($op = 'admin').

g) You can probably just add your own validation function after the search module's, rather than replacing it.

h) The enable/disable hooks should be in the .install file, not in the .module file.

i) I'm not sure the logic of the "no url processing" part is correct - it seems backwards?

j) I think the "advanced/rigourous" setting should be reworded. It doesn't seem advanced or rigorous. All it means is that you can do "wordpower" as well as "word-power" and have it match. Maybe call it "Match hyphenated to non-hyphenated" or something like that?

rjacobs’s picture

Thanks jhodgdon! That's fantastic feedback!

It looks like it comes down to a re-evaluation of some regexp logic (I was somewhat anticipating this) and some general clean-up. I think I can digest most of this and make some modifications. I suppose my egocentricity shows... along with the fact that I've been working primarily with case-specific mods to English sites.

A couple quick points:

  • In terms of (a), I'll have to give some thought to the construction of a regexp that would capture URLs without the potential for false matches. I have some thoughts, but will need to look at it a bit more deeply. Of course, checking the protocol part won't cut it, as a text URL does not necessarily imply a link.
  • In terms of (i) and the logic of the "no url processing" part: checking that option means bypassing any special processing (that follows) and just doing what core search would do. By the time that part of the code is reached, $text is already a token that contains special punctuation, but it's not yet clear whether it's a URL. So that option says "check for a URL match, and if found, just mimic what core search would do (strip the periods) and stop pre-processing". I suppose the best action there would be to just return the token (URL) unchanged and let the core search_simplify function do its stripping. Of course, unchecking that option means URLs will actually go on for "special" processing, with the periods treated as word boundaries (which I suppose some people might want).

Alright. I've got some project deadlines over the next couple of days, but I hope to squeeze out some additional time to make modifications and post back some notes here soon.

Thanks again!
Ryan

rjacobs’s picture

Some additional follow-up to #26:

Since top-level domains are now flexible, I don't think identifying URLs by their endings is practical

I've been looking into this, and though I wholeheartedly agree in principle, I'm not sure I agree in practice. What I'm finding is that correct patterns for URLs and URIs (or parts thereof) are quite hard to nail down. In this case of punctuation pre-processing, I think we are only interested in the host/domain part of any URLs, as I think the protocol, path, port, etc. parts (if they exist) will be tokenized in an acceptable way by core. However, all the patterns I've found, or have thought of, that could match the domain part of a URL suffer from shortcomings that I think TLD matching would overcome. Consider the simple regexp possibility of:

/[A-Z0-9.-]+\.[A-Z]{2,4}/ui

This will do a decent job of targeting which punctuation-separated strings are domains, but it will also lead to lots of false matches. My biggest concern is that most filenames would also be matched ("file-name.jpg" or "tarball-compressed-file.tar.gz" would be). I would very much like to be able to differentiate between domains and filenames and provide the option to treat their punctuation differently.

It feels like there needs to be some form of non-pattern check on the string ending to do what I envision, either checking against known TLDs or known file extensions. I'm wondering if checking against 90% of TLDs would be "good enough" and possibly "better" than having false matches on things like filenames.

Or maybe I'm not thinking creatively enough about how regular expressions can be employed? I just wanted to reach out and see if anyone has any thoughts.
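One minimal sketch of the TLD-list idea being weighed here. Both the pattern and the tiny TLD list are illustrative (not a real registry), and the function name is hypothetical:

```php
// Treat a token as a domain only if it both fits the generic pattern AND
// ends in a known TLD, so "file-name.jpg" is rejected while "drupal.org"
// is accepted. The $known_tlds list is a tiny illustrative sample.
function example_looks_like_domain($token) {
  $known_tlds = array('com', 'org', 'net', 'edu', 'info');
  if (!preg_match('/^[a-z0-9.-]+\.([a-z]{2,4})$/i', $token, $m)) {
    return FALSE;
  }
  return in_array(strtolower($m[1]), $known_tlds);
}

var_dump(example_looks_like_domain('drupal.org'));     // bool(true)
var_dump(example_looks_like_domain('file-name.jpg'));  // bool(false)
```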

rjacobs’s picture

Ok, I've re-worked some structural pieces and performed some general clean-up. I believe that I've addressed all of the feedback from #26 in one way or another and have added some useful configuration options to hopefully accommodate a wider audience. Some notes:

  • In terms of my points in #28 (about using TLDs to detect URLs) I have maintained the TLD checking functionality, but have exposed a config option to disable it (upon which only a simple pattern check is used). I have also allowed the known list of TLDs to be customized. Again, I'm not sure how else to prevent false matches to file names, and would surely entertain alternative ideas.
  • I've exposed the option for admins to manually configure the list of special punctuation characters to pre-process, with the default being hyphen, period, underscore and apostrophe.
  • To ensure more universal language compatibility, instead of using [a-zA-Z0-9]+ to match any letters/numbers in words, I'm using [^PREG_CLASS_PUNCTUATION\s]+. This should simply include anything that's not punctuation or white space, and should include accented characters, etc.
  • I've performed lots of cleanup all around, and may still tweak some comments and strings if this goes on in some form as a contrib effort.
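The exclusion-based matching from the third bullet can be illustrated like this, using a tiny stand-in for core's much larger PREG_CLASS_PUNCTUATION class from unicode.inc:

```php
// Match "word" characters by exclusion instead of [a-zA-Z0-9], so accented
// and non-Latin letters pass through. $punct is an illustrative stand-in
// for PREG_CLASS_PUNCTUATION.
$punct = preg_quote(',.;:!?\'"-_', '/');
preg_match_all('/[^' . $punct . '\s]+/u', 'café re-écrire 日本語', $matches);
// $matches[0] is array('café', 're', 'écrire', '日本語')
```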

Any feedback is always hugely appreciated! If there is some agreement that this would be useful as a contrib effort, I'll take that forward and will also explore a D7 version. I'm just moving forward slowly as this would be my first such shared effort.

Best,
Ryan

jhodgdon’s picture

Looking better! Here are some comments on your new iteration:

a) I think the PREG you want to use is ^PREG_CLASS_UNICODE_WORD_BOUNDARY (D7) or ^PREG_CLASS_SEARCH_EXCLUDE (D6), to detect letter/number characters across languages.

b)

function _punctuation_preprocess_tasks($matches) {
  $preg_class_special_punctuation = variable_get('punctuation_preprocess_preg_class_special_punctuation', PREG_CLASS_SPECIAL_PUNCTUATION);
  $text = $matches[0];
  // If configured to skip URLs, begin URL detection.
  if (variable_get('punctuation_preprocess_skip_url', TRUE)) {
    $matches = array();

Did you mean to write over the input argument $matches here? That seems wrong, but maybe it's OK. A code comment would be helpful explaining why you want to overwrite the input arg.

c)

    // First see if token fits general URL pattern.
    if (preg_match('/^[A-Z0-9.-]+\.([A-Z]{2,4})$/i', $text, $matches)) {

- Are those really the only characters allowed in domain names still?
- I think you have to escape - when it's used as a literal character in a character class.
- Again, top-level domains are not restricted to being 2-4 characters any more. It seems like the {2,4} part should maybe be governed by a configuration option, rather than hard-wired?
- . is definitely something special for URLs, but you've allowed people to exclude . from the "special characters". If they do that, the URL processing won't work, because the expression used in the first preg won't match URLs. So you might just need to do that as a separate function. I think the configuration options would be clearer also if you had one section about URL processing, and another for generic "special punctuation" processing.

d) Back to the top of the file again:

   $text = preg_replace_callback('/[^'. PREG_CLASS_PUNCTUATION .'\s]+(['. $preg_class_special_punctuation .']+[^'. PREG_CLASS_PUNCTUATION .'\s]+)+/u', '_punctuation_preprocess_tasks', $text);

You are going to want to split on white space as well as punctuation here, to get word boundaries, I think?

e) It doesn't seem like this module needs to have an overall switch to turn punctuation preprocessing on/off. Someone can just disable the module, right? If you do split URL processing into a separate section, then this could be useful, but someone could also just save an empty list of special punctuation characters to indicate not to do this type of processing.

f) I would avoid using terminology like "regular expression group" in descriptions on regular admin configuration screens. Non-programmers will have no idea what you are talking about.

g) You need to explain the acronym TLD on your config screen.

rjacobs’s picture

Thanks again jhodgdon for very thorough feedback!

I recently discovered some very useful regular expression logic used within the cck link module (for link validation). The logic used there seems much more well-rounded than similar examples I've found elsewhere, so I'm thinking to re-purpose a chunk of that code to refine the way URLs are handled in this module.

More soon,
Ryan

rjacobs’s picture

I really like the collaborative way this problem has been defined, along with the way we are exploring a solution. I know I'm learning a lot, and I hope we can produce something of value to the community. Thanks for bearing with me as I learn some of the ins and outs of basic Drupal module development (beyond my own case-specific applications in the past).

I've made another iteration which vastly expands on the way URLs are managed. Instead of passing them through a single preg_match and TLD check, I now have a whole function, with a vast collection of pattern checks, to do the job. Basically I'm re-purposing some code I found for URL validation within the CCK link module. The trick is to slightly modify the validation patterns there to serve URL identification. This is getting into hairy regexp territory, so all I can say is that I'm doing my best... comments appreciated.

Some other comments and questions:

  • I added a temp "debug" option that will show what's going on if you have the devel module enabled. With this you can see how various strings are processed while running searches or doing a re-index via cron. It's been handy for me as I test modifications to the heavy regexp logic from the cck link module.
  • I'm now "daisy-chaining" preg_replace_callbacks together, and calling a callback from inside a callback. Are there any performance issues here?
  • I think I've addressed the notes from #30 (thanks again jhodgdon):
    • From everything I've read, the "-" does not need to be escaped in a character class, it just needs to be either the first or last character. The search admin validation function checks this.
    • I was inadvertently reusing some variable names, which I know is bad practice. I think that's what led to your confusion about overwriting arguments.

I think that the new URL-checking complexity could open the door for lots of new feedback or potential problem cases I have not encountered yet, but I hope it's a bit closer to a proper solution.

Cheers,
Ryan

droplet’s picture

tag

jhodgdon’s picture

You really need to write some tests for this module -- it will help you see if it works or not. :) Anyway, I'll give it a review sometime soon...

jhodgdon’s picture

I gave this a quick review... it looks better... a few notes:
- As I said in #34, you really need to write some tests. The code has become complex enough that I can't see at a glance whether it will work correctly or not any more.
- There's a lot of debugging code in there, which should be removed before an eventual release of course.
- You haven't followed the Drupal coding and documentation standards, which makes some of the code harder to read than it would be otherwise. The Coder module can help you find these problems (I always use the strictest mode possible).
- In the options, is the "canonical" URL option restricted to just www, or would it also allow groups.drupal.org to match drupal.org?
- I think you will eventually need a README or some other documentation to explain how this works as opposed to what the Search module does itself. Some examples there would be helpful (they can also form the basis of your tests).

Like I said, I can't really see whether all of this would work correctly or not, and I suggest writing tests. I have found that thinking up and running the test cases will expose a lot of bugs, plus bring into focus the behavior you expect with the different settings.

rjacobs’s picture

First, just a general clarification for anyone interested in testing/reviewing what we have so far. I have intentionally left some devel output statements in the code (along with the config option to toggle them on/off) as an aid for testing and reviewing. So with the "Enable debugging" option enabled (maybe I should have called it "enable testing output"), you can then see exactly what the pre-processor's input and output is (when searching and indexing), along with some details about what text was identified as a URL, etc. I have been using this for my own testing, and here is an example case:

If I do a search for "word-power from www.drupal.org/some-path", I am shown some messages that basically indicates the following:

  1. The hyphen in "word-power" was treated as a word boundary and so it becomes "word power" (the module can also be configured to produce "word power wordpower" with "exhaustive indexing" enabled)
  2. "from" was not processed at all as it does not contain punctuation
  3. "www.drupal.org/some-path" was identified as a URL and was therefore not processed (with the exception of the "." after "www", which was treated as a word boundary).

So the full pre-processed output would be "word power from www drupal.org/some-path". Of course, this same processing would happen during indexing as well. The end result is that I could have search "hits" on this example string (if it was actually part of my content) with a search for "word", "power", "word-power", "drupal.org" , "www.drupal.org", etc (without pre-processing, the searches for "word", "power" and "drupal.org" would not result in hits).

I have been busy doing plenty of my own testing, but figured things were to a state where others may want to test as well (without reviewing the code directly). This is the reason for the "debug" option.

Tell you what... I'll just try to set up a public example D6 site with this module installed and the "debug/review/test" feature turned on. That way anyone can test their own strings without needing to install anything. I won't be able to do this for another day or so though.

rjacobs’s picture

Also some specific responses to #35:

  • I hope my points in #36 outline how I have been doing my own test cases (and trust me, I have been doing a lot of them :) )
  • I have indeed been using the Coder module; however, I did not see that there were different modes (I have just been using the "normal" setting). If I go for stricter checking I see that I have a few lines with incorrect indents. I'll fix that.
  • In terms of comments, I have been reviewing http://drupal.org/node/1354, but I'll take another look and see what I may be missing, and where I can be more verbose.
  • Currently the "www." part is the only part that can optionally be processed for URLs (to treat the "." as a word boundary). Given that a URL can contain any number of subdomains, and may or may not have a ccTLD, etc., I'd have to add a new pattern check to isolate just the "domain.tld" part alone (as opposed to specific matches for just "www." at the beginning). However, the way the code is structured now, it would be much easier to add in more URL-specific processing.
  • Creating a README file and removing the dpm statements is absolutely planned (I just have not gotten that far yet... and am somewhat waiting to see what others say before considering sharing this widely, and requesting CVS access for that matter).

Also, I should note that I didn't know you were looking at the code alone... based on your other comments I was under the impression that you had tried one of the iterations. Sorry if I was not focusing my responses accordingly in this thread as a result.

jhodgdon’s picture

I'm talking about writing SimpleTest automated tests. Ad-hoc testing only goes so far.

rjacobs’s picture

Fair enough, I suppose the structure is sufficiently complex for more formal testing. I'd be happy to explore some SimpleTest tests and bundle them in with the module. I think having some end-to-end functional tests of actual searches (that formally script some of the hands-on tests I've done on my dev space), along with some unit tests would be in order.

rjacobs’s picture

As promised, I've set up a publicly accessible test/demo form:

http://dev2.lumitycms.org/search

This will let anyone do some basic tests/demos of the current punctuation pre-processor module (with default settings). Simply search for a string, and you'll see an on-screen report of the input/output of search_simplify and the punctuation pre-processor. As far as testing goes, I realize this is just ad-hoc (it's not actually testing how real content is matched in a search), but it does give some easy insight into what the pre-processor is doing without the need to install anything. I'm just hoping this is useful in terms of garnering some feedback from anyone interested in this issue (and generally familiar with Drupal search/index concepts).

I'm quite curious what people may think about the current handling of "short" word fragments (shorter than variable_get('minimum_word_size', 3)). For example, you'll see that the pre-processor turns "ryan's" into "ryan ryans", "q-tip" into "qtip tip", "john-q-public" into "john johnqpublic public", etc. My hope is to preserve the "root" words without losing the short fragment(s) during indexing. Is this in line with the concept first outlined in #6, or am I over-complicating things?
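The fragment handling described above can be sketched roughly as follows. This is a hypothetical Python illustration only (the actual module is PHP, and its code is not shown in this thread); the function name is made up, and no particular output ordering is claimed:

```python
import re

MIN_WORD = 3  # mirrors Drupal's variable_get('minimum_word_size', 3)

def split_keep_roots(word):
    """Split on hyphens, underscores, and apostrophes, then keep the
    full concatenation plus each fragment long enough to index, so
    short fragments like "q" or "s" drop out without losing roots."""
    parts = re.split(r"[-_']", word)
    variants = {"".join(parts)}
    variants.update(p for p in parts if len(p) >= MIN_WORD)
    return variants
```

Under these assumptions, "q-tip" yields {"qtip", "tip"} and "john-q-public" yields {"john", "johnqpublic", "public"}, matching the examples above.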

I've also decided to abandon the TLA checking concept for URL detection. I had a look at the way the core input filter manages URLs (when the "URL filter" is enabled), and am in favor of using the same general logic implemented there. So now a URL is only detected (and the pre-processing bypassed) if it has a known protocol or starts with "www". This will catch "http://drupal.org" (and normal absolute links) but not "drupal.org". Any thoughts on that change? As you can see, I'm trying to simplify a bit, but remain true to the needs highlighted throughout this issue.

Best,
Ryan

jhodgdon’s picture

To answer your questions in #40, you need to think about how preprocessing and indexing are actually used by the search module. Here's my understanding, followed by my opinion of what this module should be doing.

What happens during Search Indexing:
- When a node is indexed, the node is rendered and the text is preprocessed.
- Words less than the minimum search word size setting are removed.
- The words in the preprocessed text are added to the word table, and the whole text is added to the whole-text table.

What happens during Searching:
- The search terms and/or phrases entered by the person are preprocessed.
- Words less than the minimum search word size setting are removed.
- The search query uses the whole text table to match the search keywords, and the word table for scoring.

What the Search module processing does with respect to punctuation:
- Removes hyphen, underscore, and apostrophe, so that, e.g., hyphenated-word becomes hyphenatedword.
- Treats punctuation in numbers specially.
- Replaces other punctuation with a space.
- This means that hyphenated-word matches hyphenatedword, but hyphenated does not match hyphenated-word.
- Also, word,with,punctuation,inserted is just split up into word with punctuation inserted, so other punctuation is not an issue (it would match as you'd expect). I think?
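The punctuation behavior listed above can be approximated with a small sketch. This is a rough Python illustration, not the real search_simplify (which also treats numbers and Unicode specially, as noted above); the function name is hypothetical:

```python
import re

def simplify_sketch(text):
    """Rough approximation of the core behavior described above:
    hyphen, underscore, and apostrophe are simply removed, while any
    other punctuation becomes a word boundary (a space)."""
    text = re.sub(r"[-_']", "", text.lower())       # remove, don't split
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # other punct -> space
    return " ".join(text.split())                    # collapse whitespace
```

So "hyphenated-word" becomes "hyphenatedword", while "word,with,punctuation,inserted" splits cleanly into four words.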

What additional processing/functionality is needed IMO:
- We don't want to lose the existing functionality: hyphenated-word matches hyphenatedword
- In addition to what is done now, we want to be able to match hyphenated-word with hyphenated or word, and to match "hyphenated word" with hyphenated-word.
- So, this preprocessor should take hyphenated-word and make sure that hyphenatedword, hyphenated, and word all get returned.
- If the hyphenated word is longer, like multi-hyphen-word, preprocessing should return multihyphen, hyphenword, multihyphenword, multi, hyphen, and word in the search index. If it is even longer, like one-two-three-four, then we'd want onetwo, onetwothree, onetwothreefour, twothree, twothreefour, threefour, one, two, three, and four to be returned.
- If the hyphenated word contains short segments (too short for the minimum word size), then you should do the one-two-three-four processing described above, and just let the search module eliminate resulting words that are too short.
- The above needs to be done for underscores and apostrophes as well. I'm not convinced it needs to be done for any other punctuation.
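The expansion specified above — every contiguous run of segments concatenated together — can be sketched like this (Python for illustration; the function name is hypothetical, and short results are deliberately left in, since per the spec the search module itself discards words below the minimum size):

```python
def expand_segments(word, sep="-"):
    """Return every contiguous run of segments joined together, e.g.
    for 'one-two-three-four': one, onetwo, onetwothree, ... threefour,
    four. Too-short results are left for the search module to drop."""
    parts = word.split(sep)
    return {"".join(parts[i:j])
            for i in range(len(parts))
            for j in range(i + 1, len(parts) + 1)}
```

For "one-two-three-four" this produces exactly the ten variants listed above.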

I think that your module is possibly being more complicated than necessary, because I don't think you need to worry about any other punctuation, and I don't know if you need all those options. What problem are they solving? What search matching that isn't already present in the core search.module do the options give you? What search matching that is present in the core search module do they remove?

rjacobs’s picture

Thanks Jennifer. Some responses to key parts of #41 appear below:

You need to think about how preprocessing and indexing are actually used by the search module...

Indeed. Before I explored any code or iterations I absolutely did a review of core search concepts. Of course search.module makes for a fun code-review, but I also found the following article particularly informative (it even briefly refers to the punctuation problem this issue thread explores):

http://acquia.com/blog/drupal-search-how-indexing-works

We don't want to lose the existing functionality: hyphenated-word matches hyphenatedword [...] If the hyphenated word is longer, like multi-hyphen-word, preprocessing should return multihyphen, hyphenword, multihyphenword, multi, hyphen, and word in the search index....

Thanks, this is very tangible feedback. In fact, I believe this is what I already have in place within my most recent iteration with the "exhaustive indexing" option enabled (I've not shared the code for this yet, but it's used in the example/demo link I passed along in #40). The key difference is that you suggest I let core search remove "short" segments instead of having the module remove them. That's a quick adjustment. I've now enabled "exhaustive indexing" at the demo form (http://dev2.lumitycms.org/search) if you'd like to confirm all this. The text "one-two-three" now becomes "one two onetwo three twothree onetwothree" and "n.y.p.d" becomes "n y ny p yp nyp d pd ypd nypd" (of which only "nyp", "ypd", and "nypd" would persist into the index). Feel free to try your own strings....
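The exact output order quoted above can be reproduced with a small sketch: for each segment, emit every contiguous run that ends at that segment, shortest run first. This is an illustrative Python reconstruction, not the module's actual code:

```python
def exhaustive_order(word, sep="-"):
    """Reproduce the demo's emission order: for each segment, output
    every contiguous run of segments ending at it, shortest first."""
    parts = word.split(sep)
    out = []
    for j in range(len(parts)):            # run ends at segment j
        for i in range(j, -1, -1):         # run starts at i, growing left
            out.append("".join(parts[i:j + 1]))
    return " ".join(out)

# exhaustive_order("one-two-three")
#   -> "one two onetwo three twothree onetwothree"
# exhaustive_order("n.y.p.d", sep=".")
#   -> "n y ny p yp nyp d pd ypd nypd"
```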

I don't think you need to worry about any other punctuation, and I don't know if you need all those options. What problem are they solving? What search matching that isn't already present in the core search.module do the options give you? What search matching that is present in the core search module do they remove?

Well contextualized. Some thoughts:

  • Configurable controls for URL handling have been an implied value all-along. As I noted in #40, I've tried to simplify this feature a bit in the interest of removing some configuration options and reducing overall module complexity. Did anyone have any thoughts on my approach here? Feel free to test URL detection at http://dev2.lumitycms.org/search (it will report what strings were matched as a URL).
  • The ability to control/configure what punctuation is pre-processed could be an asset if different languages and contexts use punctuation in different ways. I figured that adding/removing certain characters from the list may be useful to some.
  • My idea to have an "exhaustive indexing" conf option was primarily to expose a way to not "break" the word ordering in the index. Turning this off would process "one-two-three" as just "one two three". However, I see now that this could eliminate one of the index possibilities allowed by core search ("onetwothree"), so I'm now thinking it best to have this "exhaustive" concept always on (and remove the option).

These are just my current thoughts on the configuration options. I understand that I may have gone overboard on conf options, but given that I'm looking at this in the context of a module, I figured that playing with more options could be an asset.

Cheers,
Ryan

rjacobs’s picture

Also, I assume that when you refer to "word table" and "whole-text table" in #41 you are respectively referring to the search_index and search_dataset tables? I bring that up as the usage of the search_dataset table may be a factor when considering the order of processed word chunks and phrased searching. I note this because the pre-processed version will go directly into the search_dataset table as part of a string ("one-two-three" would enter search_dataset as "one two onetwo three twothree onetwothree").

I suppose that, in theory, if a node contains the text "one-two-three some other text" it would be beneficial to have this matched with a search for just "two-three some other text" (as a phrase - in quotes). I'm not sure how the ordering of the various combined chunks could be managed by an algorithm to make this possible in all cases. Regardless, I'm leaning toward the conclusion that this is a non-issue given that the current stand-alone search system would not create a match in this rare scenario anyway. Anyone with thoughts on this, or a better understanding of "phrased" searching?

jhodgdon’s picture

Interesting point about the search_dataset table (and yes, those are the right table mappings) and the order of words. This is only important when searching for exact phrases, though, and I guess the preprocessing at search-time and index-time will be the same, so it's probably not a huge problem.

rjacobs’s picture

Indeed, the preprocessing at search-time and index-time will be the same. However the order of the words in the pre-processed version of "one-two-three" will most likely not match the order of the words of any part of the pre-processed version of "zero-one-two-three".

We would get something like (the exact ordering pattern could be adjusted to some extent - these examples just show the current pattern that my code produces):
one two onetwo three twothree onetwothree
and:
zero one zeroone two onetwo zeroonetwo three twothree onetwothree zeroonetwothree

These don't overlap without manually adjusting the order. I'm not entirely sure there is a way to consistently order the various chunks so that there is an overlap in all cases. I think it's possible to order them so that there is overlap with either the beginning or the ending of the longer string, but not both.
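The non-overlap is easy to confirm directly on the two expansions quoted above, with nothing more than a substring check (Python, illustrative only):

```python
# The shorter expansion never appears as a contiguous run inside the
# longer one, so an exact-phrase match against the dataset text fails:
short = "one two onetwo three twothree onetwothree"
long_ = ("zero one zeroone two onetwo zeroonetwo "
         "three twothree onetwothree zeroonetwothree")
assert short not in long_

# Whereas without any expansion the plain phrase would have overlapped:
assert "one two three" in "zero one two three"
```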

Anyway, I imagine I'm getting into the territory of unnecessarily esoteric logic as I don't think matching part of a word containing punctuation via a phrased search should be a requirement or expectation of the search engine. I just figure it's worth spelling out in case it somehow highlights other shortcomings of the prescribed preprocessing we are discussing. I agree that it's probably not an issue.

jhodgdon’s picture

I agree, something about this should definitely go into the README for this module.

I think that for phrase matches, we should just say that it will only match if the phrase in the text exactly matches (including hyphens, apostrophes, and underscores) the phrase in the search query. And leave it at that. The only thing that will be lost is that without this preprocessing module, you could also have matched the phrase "true time-less moment" with "true timeless moment". That's not much of a loss, considering that most people don't do phrase matching in the first place, and considering all of the other matching gains when you use this module.

If that is the specification, then I don't think that the order of the replacements matters too much, because the same ordering will be applied at index-time as at search-time, and the only time the order of words in the {search_dataset} table matters is in the case of a phrase search. But a phrase match should be one of the tests.

rjacobs’s picture

Good point about the "true time-less moment" example. Yeah, I don't imagine that's much of a loss, and I don't really see any easy way around that.

I finally got some SimpleTest scripts under control, and things are looking good. However, before I share more, another pending question has come up. Namely, I'm curious as to what extent URL matching should take place (and thus trigger a bypass of the pre-processing). I think it's fair to say that the domain part of a URL would be bypassed, but what about paths and other fragments of a URL? For example, consider the "some-path" part of the string below:

http://drupal.org/some-path

Would the hyphen be stripped or treated as a word boundary? If considering the whole string as a URL to be bypassed (which is easy enough to do the way I have things set up at the moment), this would be indexed as "drupalorg somepath".

The problem comes when someone searches for just "drupal.org" or just "some-path". Neither of these have the correct context to be identified as a URL and will therefore be pre-processed, leading to "drupal org drupalorg" or "some path somepath" respectively. Neither of these will match anything given the way search ANDs keywords together during a search. Of course this could be considered correct behavior if we only want to have a match when the full URL (protocol, path and all...) is matched.

So it all comes down to context and URL matching logic. Given the way things have evolved in my code exploration, I think I already have snippets to take on any URL matching strategy. I'm just wondering what the common expectations may be for this. I'm leaning toward bypassing pre-processing only on the domain part of a URL, though I am wondering how much the context should affect whether or not pre-processing takes place.

Any thoughts on this?

rjacobs’s picture

Thinking a bit more (regarding the points in #47), I think the current functionality is correct and that URLs should be identified inclusive of their paths, query strings, etc. After all, this really only matters if someone chooses the option to bypass URLs (which is an option that I think I'll leave exposed). This would simply mean that, if the bypass option is selected, URLs will only be matched when searching for the full URL as it appears in content (inclusive of the path, etc.), as opposed to just a fragment (such as just the domain). I'm not sure if this would be a popular need, but I suppose it's worth including as an option. Furthermore, I think the default should be to have this option disabled.

rjacobs’s picture

Ok, the most recent iteration is attached.

I believe this addresses all the points raised since the last iteration (all the way back to comment #32). It also includes a set of SimpleTest tests and a README.txt file.

Feedback is always appreciated! I know this is D6 code, but I will explore a D7 version if/when this appears to be useful enough for some general use.

Cheers,
Ryan

rjacobs’s picture

Hi again,

I just wanted to note that with the Drupal migration to git I'm now able to setup a sandbox project for this punctuation pre-processor module concept. I suppose it makes sense to focus any additional dialog (at least about the module idea) there.

The sandbox is located here:
http://drupal.org/sandbox/rjacobs/1082788

Ryan

ts145nera’s picture

Is there a port to D7?
The link at #50 doesn't work

rjacobs’s picture

Hi,

I just wanted to let everyone know that the sandbox project that I noted above is now a full project:

http://drupal.org/project/punctuation_preprocessor

Currently there is only a D6 version available, but I'll also need to get a D7 version for my own related projects (that are now shifting to D7), so that's forthcoming.

@ts145nera, the fact that the sandbox has been converted to a project would explain that broken link. I'm surprised a redirect was not automatically built during the promotion process.

Ryan

rjacobs’s picture

Hello again,

A beta Drupal 7 port of punctuation_preprocessor is now available:

http://drupal.org/project/punctuation_preprocessor

Ryan

muranod’s picture

Hope this is the right place to put this:

I have a piece of content with a node title that has several letters in parentheses. These get stripped out OK in the URL, but search will not work on the content of that particular node. The title is: House for (W)ren(t). I can search on "House" and bring up the content, but searching for anything in all of the content that comes after the parentheses in the title, including the node body, returns zero results.

Is this a known problem? I haven't yet found any issue that exactly matches this.

Here is the page:

http://danmurano.com/content/house-wrent

Edited to add Drupal version:
7.14

rjacobs’s picture

Hi muranod,

Is your issue related to the fact that the body content of your page with the title "House for (W)ren(t)" is not being indexed? Or are you interested in ways to get the word "(W)ren(t)" (with the parentheses) to be better indexed?

If the former (problems with the body content being indexed), you should first check to make sure your site has been 100% indexed, etc. (/admin/settings/search in Drupal 6 or /admin/config/search/settings in Drupal 7). If things do look correct there, then you may need to open a fresh issue or support request, as I think it's a bit off-topic for this specific thread.

If the latter (you want to index the word "(W)ren(t)") then some of the concepts in this thread, and the "punctuation preprocessor" module, may be able to help you.

muranod’s picture

Thanks, rjacobs. It's not the word itself that I care about, but that everything after that one word in the title, including the entire body content, is not being indexed, while everything else on the site is being indexed.

I will look a little deeper again and then open a separate issue, if I am still having trouble.

rjacobs’s picture

Version: 7.x-dev » 8.x-dev
Category: bug » feature

I'm going to move this to D8 as I suppose it really doesn't apply to D6 or D7 core anymore (and there is a contrib module available for the issue now). There is some potentially related talk for D8 at #257007: Convert hook_search_preprocess() into a plugin, but that's more high-level and not linked to the specific matter of "punctuation" based word splitting, etc.

However, if the intent for D8 is to keep this type of specific pre-processing out of core (and just make core more robust at handling module-based pre-processors) then perhaps this issue should just be closed.

jhodgdon’s picture

Issue summary: View changes
Status: Active » Closed (duplicate)

Well. Nothing has been done on this for several years... I think the approach on #257007: Convert hook_search_preprocess() into a plugin is probably the better way to go, so I think at this point I will just close this as a duplicate, since that approach would allow for particular sites to pick how they preprocess punctuation, which is really the need identified here.