I have a CVS site which includes a page with a long-ish membership list. After about 715 words, search results fail. Everything before that works fine. I created a new page for testing, pasting in the Fall of the House of Usher. I ran cron.php and the search page reported 100% indexed. Searching only returned results up to 212 words. Nothing past that was found.

Comments

gtcaz’s picture

This is version // $Id: search.module,v 1.138 2005/10/21 11:14:55 unconed Exp $

The only thing I've done is fix the cast error as reported here: http://drupal.org/node/34515

gtcaz’s picture

All the terms do appear in the search_dataset table.

robertgarrigos’s picture

I'm looking at this and notice that many entries in the search_index table show a score of 0 (zero), when any word is supposed to have a minimum score of 1. Words with a score of zero therefore never get found.

It's also curious that there are no zero scores in roughly the first 120 entries of that table. After that, zeros appear more and more often towards the end of the table. Around row 300, about half of the rows have a zero score. From about row 930 to the end, every row has a zero score.

(...)

The problem is at line 514 of search.module:

// Focus is a decaying value in terms of the amount of unique words up to this point.
// From 100 words and more, it decays, to e.g. 0.5 at 500 words and 0.3 at 1000 words.
$focus = min(1, .01 + 3.5 / (2 + count($results[0]) * .015));

Commenting out this line fixes the problem. However, I don't know how that would affect the search itself. What is the exact purpose of that decaying value? Why does a unique word have to score less when it's found on a page with many unique words? And why does it have to score even less if it's found nearer the end of the page?

Unless a developer can give us a reason why it was done this way, I would just take this out. I'm not uploading a patch yet, in case there is a reason I cannot see.

gtcaz’s picture

I can confirm this resolves the issue. The focus algorithm is broken and I will be commenting it out on my site. Perhaps this should be a configurable setting once it's fixed. I agree with Robert that on many pages, my members list being one of them, being near the top does not, by itself, make a result more relevant.

Steven’s picture

Fixed in CVS. The problem was not $focus, but the INSERT query, which used %d even though the scores are now floating point. The integer cast set some scores to zero, which messes up things.
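To make the integer-cast effect concrete, here is a rough sketch of what happens to a fractional score once it is forced into an integer. It reuses the $focus expression quoted earlier in this thread; the helper names (focus_for, stored_as_integer), the base score of 1 and the plain (int) cast standing in for the %d placeholder are assumptions for illustration, not the actual search.module or database-layer code.

```php
<?php
// Illustration only: a plain (int) cast stands in for the integer
// placeholder. This is not Drupal's database layer.

// The focus expression quoted earlier in this thread, as a function of
// the number of unique words seen so far (hypothetical wrapper).
function focus_for($unique_words) {
  return min(1, .01 + 3.5 / (2 + $unique_words * .015));
}

// Hypothetical helper: what is left of a float score stored as an integer.
function stored_as_integer($score) {
  return (int) $score;
}

// Assumption: a single plain occurrence of a word is worth 1.
$base_score = 1;

foreach (array(50, 100, 200, 500, 1000) as $unique_words) {
  $score = $base_score * focus_for($unique_words);
  printf("%4d unique words: score %.3f, stored as integer %d\n",
    $unique_words, $score, stored_as_integer($score));
}
```

Under these assumptions the score survives as 1 at 50 and 100 unique words, but is stored as 0 at 200, 500 and 1000 unique words, because anything below 1 is truncated. Words that occur several times, or that carry extra weight, could presumably still reach 1 or more, which would fit zeros creeping in gradually rather than all at once.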

FYI, the $focus variable is used to offset the effect that very long pages tend to match more, even though they may not be more relevant (they just contain more different words in more quantity).

In traditional full-text searches, a normalization is applied across the entire text (wordscore = score per word / # of words in the text), but this is really bad for a web-cms like Drupal, because we have comments: it would mean that each new comment would decrease the score of the existing content. It also means that very short content scores very high (due to the 1/x relationship), which is also undesirable.
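A small sketch of the traditional normalization described above, to show both side effects. The function normalized_score and the sample word lists are made up for illustration; this is not how search.module scores anything.

```php
<?php
// Hypothetical normalization: occurrences of a word divided by the total
// number of words in the text (wordscore = score per word / # of words).
function normalized_score($word, array $words) {
  if (count($words) == 0) {
    return 0;
  }
  $counts = array_count_values($words);
  $hits = isset($counts[$word]) ? $counts[$word] : 0;
  return $hits / count($words);
}

// A three-word node body, then the same node with a six-word comment added.
$node = array('drupal', 'search', 'index');
$with_comment = array_merge($node,
  array('thanks', 'for', 'the', 'tip', 'it', 'works'));

$before = normalized_score('search', $node);          // 1 hit in 3 words
$after = normalized_score('search', $with_comment);   // 1 hit in 9 words
$short = normalized_score('search', array('search')); // 1 hit in 1 word

printf("before comment: %.3f, after comment: %.3f, one-word page: %.3f\n",
  $before, $after, $short);
```

The comment never mentions "search", yet it drags the node's score for that word from 1/3 down to 1/9, while a one-word page gets the maximum score of 1.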

But, we do want some penalties for really long content.

So I arrived at the decaying focus value, which applies a penalty to later words, without affecting the score of existing content when e.g. a comment is added. Also, because it only counts unique words rather than total words, it means that a very long, but very on-topic discussion will not get that much penalty. Off-topic discussions will get much lower focus much faster.
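As a rough sketch of that behaviour: the function below applies the quoted focus expression word by word, counting unique words seen so far. The function names, the made-up word lists, the base score of 1 per occurrence and the exact point at which the unique-word count is updated are all assumptions; the real indexer may differ in the details.

```php
<?php
// Hypothetical indexer sketch, not the actual search.module code.
// Assumption: every occurrence is worth 1 before the focus multiplier.
function focused_scores(array $words) {
  $seen = array();    // unique words encountered so far
  $scores = array();  // accumulated score per word
  foreach ($words as $word) {
    $seen[$word] = TRUE;
    // Focus depends only on what has been seen up to this point.
    $focus = min(1, .01 + 3.5 / (2 + count($seen) * .015));
    if (!isset($scores[$word])) {
      $scores[$word] = 0;
    }
    $scores[$word] += $focus;
  }
  return $scores;
}

// Hypothetical helper: $count distinct made-up words sharing a prefix.
function make_words($prefix, $count) {
  $words = array();
  for ($i = 0; $i < $count; $i++) {
    $words[] = $prefix . $i;
  }
  return $words;
}

$body = make_words('body', 300);       // original content
$off_topic = make_words('other', 300); // comment with all-new vocabulary

$alone = focused_scores($body);
$with_off_topic = focused_scores(array_merge($body, $off_topic));
$with_on_topic = focused_scores(array_merge($body, $body));

printf("last body word alone: %.3f, with off-topic comment: %.3f\n",
  $alone['body299'], $with_off_topic['body299']);
printf("last off-topic word: %.3f\n", $with_off_topic['other299']);
printf("last body word after on-topic repeat: %.3f\n",
  $with_on_topic['body299']);
```

In this sketch the body words keep exactly the same scores after the off-topic comment is appended, the comment's own new words score lower because the unique-word count keeps climbing, and a comment that reuses the body's vocabulary adds to the existing scores without lowering the focus any further.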

I admit this relies on some assumptions about the type of content, but (at least after this bug fix) it only has an effect on the ordering of results.

Steven’s picture

Status: Active » Fixed

shouchen’s picture

Steven,

I noticed that node.module changed with your commit. Please see http://drupal.org/node/41973 (I'm not suggesting that your commit caused the bug I'm reporting... but since you recently changed the same code I changed when patching the bug, I thought you might be interested.)

Thanks,
Steve

Anonymous’s picture

Status: Fixed » Closed (fixed)