refactor search node_rank with hook_node_rank scoring factors [#145242]

See http://groups.drupal.org/node/4105. The summary is:

modules such as voting modules or other metrics systems can inject their own scoring factors. Wouldn't it be cool if the votes a node has received could improve its positioning in the search results?
it cleans up node.module code

The technical details are:

I've create a new hook_node_rank() function which returns an array of ranks which allows each hook_node_rank to define multiple rankings. The lower level array defines title, join, score, and terms; terms is an array; join and terms are optional. For example, hook_node_rank for comment.module:

function comment_node_rank() {
  return array(
    'node_rank_comments' => array(
      'title' => t('Number of comments'),
      'join' => 'LEFT JOIN {node_comment_statistics} c ON c.nid = i.sid',
      'score' => '2.0 - 2.0 / (1.0 + c.comment_count * %f)',
      'terms' => array(variable_get('node_cron_comments_scale', 0)),
    ),
  );
}

I moved the existing code out of node_search and implemented three hook_node_rank functions (node_, comment_, and statistics_). I think that this refactoring makes more sense, moving the code to the module that actually defines it. Thus, this patch impacts node.module, comments.module, and statistics.module.

You might notice that the weights are not included in the definition. I was able to easily extend the admin/settings/search page with all of the 'node_rank' variables, and then automatically add these to the generated query. This makes it a lot easier to add new node rankings, since the system handles the UI.

Lastly, I wrapped the score values with the SQL COALESCE(..., 0) because during testing of a 5.x system, I encountered NULL score values (I think related to comments) that we're messing up the results.

Comment	File	Size	Author
#31	145242.patch	13.67 KB	douggreen
#26	145242.patch	13.05 KB	douggreen
#23	145242.patch	12.96 KB	douggreen
#22	145242.patch	12.52 KB	douggreen
#21	145242.patch	12.58 KB	douggreen
#20	145242.patch	9.55 KB	douggreen
#18	145542_3.patch	9.32 KB	BlakeLucchesi
#17	145242.patch	8.53 KB	douggreen
#15	145242.patch	8.27 KB	douggreen
#14	hook_node_rank.patch	8.31 KB	robertdouglass
#5	145242_0.patch	8.32 KB	douggreen
#2	145242.patch	8.27 KB	douggreen
	node_rank.patch	8.46 KB	douggreen

Comments

Comment #1

nedjo

he/him/his

English

commented 23 May 2007 at 14:10

+1 in principle (haven't tested). This is part of what we need to do all over core: remove the hard-coded special treatment given to core modules (in this case, comment and statistics). Enabling contrib modules to affect scoring will indeed be a significant benefit as we get e.g. voting and other user-driven ranking parameters.

This also opens the door to other uses of node ranking outside search.

Comment #2

douggreen commented 7 June 2007 at 02:01

Status	File	Size
new	145242.patch	8.27 KB

Updated patch, fixing 'field' reference that should of been 'score'...

Comment #3

Scott Reynolds commented 8 September 2007 at 04:04

subscribing. Great idea

Comment #4

Zothos commented 10 September 2007 at 15:23

definitely +1 for the idea. :)

Comment #5

douggreen commented 11 September 2007 at 06:33

Status	File	Size
new	145242_0.patch	8.32 KB

I'm updating patch to current Drupal head. I think we're probably too late to get this into 6.x, but it doesn't break any api's. Anyone care to test and mark it RTBC?

Comment #6

nedjo

he/him/his

English

commented 12 September 2007 at 20:30

Minor comments on the code, I haven't tested.

Looks very good overall.

The fact we're attaching this new hook to node rather than search reflects a general weakness in search module (the need to search separately for different types of objects, the relatively full implementation for node and lack thereof for other object types like user, taxonomy term, etc.). It'd be nice to see a solution that moved us toward general search support rather than overdeveloped node support, but this is at least a good step toward generalizing the hard-codedness.

Code comments should be full sentences (capitalized, periods or other punctuation).

I find the use of the word 'terms' here confusing, e.g.,


+      $rankings['terms'][] = $node_rank;

They are fed into a function argument called $argument2. These seem to be arguments fed into the SQL? If so, let's call them 'arguments'. If not, they need some code comment to explain.

Comment #7

nedjo

he/him/his

English

commented 12 September 2007 at 22:03

A couple of possible longer term directions:

1. Move ranking from node to search module, generalizing to accept a 'type' argument. Search is the only place I know of in core where we have built-in support for multiple object types (node, user, etc.). Compare e.g. the node_access system, hard-coded to just nodes. On the surface, generalizing hook_node_rank() to be hook_search_rank() would seem fairly simple: add a $type argument (e.g., 'node'), add a switch ($type) test to implementations. But adding e.g. a user ranking implementation would of course be more work.

2. Factor out the ranking into e.g. rank.module so it can be used independently of search. There are many other cases in which we wish to rank content (or users) by a synthesis of multiple criteria, e.g., a "recommended content" block.

Comment #8

douggreen commented 12 September 2007 at 22:40

This is definitely an incremental improvement. $arguments2 is part of the current core search code. When submitting patches, I frequently try to change as little as possible. Are suggesting that I/we discuss more significant changes.

I agree that hook_search_rank is a better name.

I don't know that it could be generalized, however, because the rank methods actually return SQL that is related to the search. For example, the 'join' array element.

You say that you find the use of the word "terms" confusing in one place. I've used the array keys of 'title', 'join', 'score', and 'terms'. I agree that 'terms' should probably be renamed to something like 'join args'.

Comment #9

nedjo

he/him/his

English

commented 12 September 2007 at 23:29

My feeling is, if this goes in for 6.x, great, it's definitely an incremental improvement (though it's hard to say if it's enough of a priority to make it in at this point). If not, we might consider some more generalization. We have a core search system that's well developed for nodes and almost nonexistant for anything else. User searching is a skeletal "SELECT name, uid FROM {users} WHERE LOWER(name) LIKE LOWER('%%%s%%')". There's zip AFAIK for taxonomy terms. Moving logic from node to search may be a step toward more balanced implementations.

I don't know that it could be generalized, however, because the rank methods actually return SQL that is related to the search. For example, the 'join' array element.

If we wanted to consider moving ranking to a separate implementation rather than tying it to search, I guess we'd need to do something similar to what we do with db_rewrite_sql(), that is, feed in a table alias and field name. In the case of node ranking, that would be 'i', 'sid' for search and 'n', 'nid' for node.

Comment #10

nedjo

he/him/his

English

commented 18 September 2007 at 22:57

I've taken a first crack at sketching in a generalized ranking system based on this patch. See this issue http://drupal.org/node/173593 on Content Recommendation Engine (CRE) and this module in my sandbox http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/nedjo/modul.... I'm hoping this might go in the short term into CRE.

In the longer term, based on this experience, I am thinking we could probably factor out search's ranking into a small, general core ranking API that search and other modules could implement and extend.

Whether or not we do so, this patch on its is a positive step in the right direction.

Comment #11

catch

he/him

English

commented 26 October 2007 at 16:13

Status:

Needs review

» Needs work

This is a very nice idea but it no longer applies.

Comment #12

douggreen commented 27 November 2007 at 13:18

Version:	6.x-dev	» 7.x-dev
Assigned:	Unassigned	» douggreen

This issue will be one of my focus areas during the 7.x release cycle.

Comment #13

dropcube commented 11 February 2008 at 22:10

Subscribing.

Comment #14

robertdouglass commented 15 April 2008 at 08:22

Status:

Needs work

» Needs review

Status	File	Size
new	hook_node_rank.patch	8.31 KB

Rerolled against HEAD, tested, fixed a bug with an unbalanced (, and somewhat addressed Nedjo's concern about the variable name (I renamed term into factor since these are referred to scoring factors elsewhere).

Comment #15

douggreen commented 9 May 2008 at 14:54

Status:

Needs review

» Needs work

Status	File	Size
new	145242.patch	8.27 KB

LIVE FROM THE MINNESOTA SEARCH SPRINT, the attached patch does:

rerolled to current Drupal HEAD
changes the new hook to be named hook_ranking which is still consistent with the original verbage in the code, but is a better hook name without an underscore
enforce the variable names to always be prepended as 'node_rank_', and then removes these from the new hook_ranking array index. This change will make it so that a hook_ranking implementor can't pollute other variable name spaces

This is still a work in progress, but something we should finish this morning.

Next, we need to review all of the existing core modules, and see if there is any low hanging fruit to further extend the search rankings.

Comment #16

BlakeLucchesi commented 9 May 2008 at 15:30

* Tests Needed
- Relevancy: Two nodes with similar nodes, one node has more instances of the same search keyword than the other.
- Recency: Two nodes with the exact same content; different timestamp.
- Statistics: Two nodes with the exact same content; different number of views.
- Comments: Two nodes with the exact same content; different number of comments.
- Promoted: Two nodes with the exact same content; one node is promoted.
- Sticky: Two nodes with the exact same content; one node is marked sticky.

Comment #17

douggreen commented 9 May 2008 at 16:01

Status	File	Size
new	145242.patch	8.53 KB

LIVE FROM THE MINNESOTA SEARCH SPRINT, this fixes the patch, which was not searching properly. If the hook_ranking 'score' uses SQL arguments, this is now renamed to 'arguments' and the code is fixed to build the proper SQL arguments. It also removes factors with a value of 0 from the SQL.

Comment #18

BlakeLucchesi commented 10 May 2008 at 00:55

Status	File	Size
new	145542_3.patch	9.32 KB

Re-rolled to make sure that there is no division by zero error when normalizing the score factors.

Comment #19

cwgordon7 commented 10 May 2008 at 15:22

Aren't we setting ourselves up for conflicts if we use table abbreviations such as {node_comment_statistics} c? Perhaps the hook should be refactored to force node_comment_statistics.nid notation? Something like:

function comment_ranking() {
  return array(
    'comments' => array(
      'title' => t('Number of comments'),
      'join' => array(
        'table' => 'node_comment_statistics',
        'type' => 'left',
        'on' => 'node_comment_statistics.nid = search_index.sid'
      ),
      'score' => '2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * %f)',
      'arguments' => array(variable_get('node_cron_comments_scale', 0)),
    ),
  );
}

Comment #20

douggreen commented 10 May 2008 at 16:39

Status	File	Size
new	145242.patch	9.55 KB

Rerolled with a simpletest (original version courtesy of Blake and Chad). I think that the test case has uncovered a problem in the code... it's either that or the test.

Comment #21

douggreen commented 10 May 2008 at 17:31

Status	File	Size
new	145242.patch	12.58 KB

This update restores part of the original patch that was lost. The test is still failing the "recent", "comments", and "views" (statistics) tests.

Comment #22

douggreen commented 10 May 2008 at 18:55

Status	File	Size
new	145242.patch	12.52 KB

This patch fixes the "recent" test failure, which was only a problem with the test case, and not the patch code.

Comment #23

douggreen commented 10 May 2008 at 19:55

Status	File	Size
new	145242.patch	12.96 KB

And now, a finally working version, complete with simpletest. I'm marking as CNR, but I think cwgordon7 has a good point that we should consider before RTBC.

To cwgordon7's point, the existing search joins do use aliases, and one just needs to know to work around them. However, since we're opening this up to contrib, you could work around the core aliases, but you'll never know what another contrib module might alias. In this case, I think that standard should be to not use aliases at all. And if that's the standard for contrib, we should probably do this in core too.

Comment #24

douggreen commented 10 May 2008 at 19:55

Status:

Needs work

» Needs review

Comment #25

BlakeLucchesi commented 10 May 2008 at 20:04

agree to to reroll without aliases.

Comment #26

douggreen commented 10 May 2008 at 20:17

Status	File	Size
new	145242.patch	13.05 KB

Agreed, the solution then is to just not introduce SQL table aliasing in hook_ranking. While we could do this in core (because everyone will know that core implements certain aliases and will avoid them), core is also an example to contrib. We don't want contrib doing this, so we just shouldn't do it in core. I've retested and rerolled without the aliases in comment_ranking and statistics_ranking.

Comment #27

robertdouglass commented 10 May 2008 at 20:37

Status:

Needs review

» Reviewed & tested by the community

Nice work. Great new opportunities for contrib modules to interact with search.

Comment #28

BlakeLucchesi commented 11 May 2008 at 15:29

Status:

Reviewed & tested by the community

» Needs work

Changing to "needs work" because we are going to add another ranking factor which increases the score of nodes with incoming links.

Comment #29

douggreen commented 11 May 2008 at 16:27

Status:

Needs work

» Reviewed & tested by the community

And after further discussion, the new ranking factor is really a separate patch that is going to have to wait until this one commits.

Comment #30

coupet commented 11 May 2008 at 16:55

Highly interesting. subscribing.

Comment #31

douggreen commented 14 May 2008 at 01:36

Status	File	Size
new	145242.patch	13.67 KB

Dries commented on #256792, when I think he intended to comment here.

We don't use the word 'Node' in output.

We write 'Node is sticky' not 'Node is Sticky'.

Code comments are inconsistent.

It would be useful to document the score functions (i.e. the math).

So, I re-rolled this patch changing "Node is Sticky" to "Content is sticky at top of lists" and changing "Node is Promoted" to "Content is promoted to the front page". I made sure that all of the ranking scores had comments, which were accidentally lost in the refactoring. I made sure that all of the new comments were sentences. I also tried to improve the comments in the new _node_rankings() function. And lastly I diffed the old patch with the new one, just to make sure that only comments and strings were changed. I did this hoping we could leave it RTBC.