Updated: Comment #N

Problem/Motivation

The NodeSearch plugin for the Search module (Core search's ability to search nodes) allows admins to decide how to rank search results, using one or more scoring algorithms. These are provided by hook_ranking(), and the scores are supposed to be values between 0 and 1 for each node.

The ranking algorithm that ranks most recent content first is not providing values that cover this range well. This breaks the ability for admins to choose to rank more recent code higher in search results.

Proposed resolution

Fix the recency ranking implementation so that the node rankings for most recently updated content work better.

Remaining tasks

Viable patch with tests.

User interface changes

None, except search results will match ranking set-ups better.

API changes

None.

Original report by @BlakeLucchesi

During my Summer of Code project I noticed that the algorithm used to normalize and adjust the score given to nodes based on their recency was not producing values spread across the range of [0,1]. Using the devel module after completing a search I copy and pasted the query used to gather the results to see how the scores of the results were being distributed. I found that most often times the score attributed to a recent node was so small that it was essentially meaningless when compared to the score of other ranking modifiers.

Using this new algorithm nodes with recent timestamps receive scores that approach the maximum (the score given to it in the search admin screen) while previously this was not the case. I have attached a screen shot of the score results for a search done on a set of nodes created with a few weeks of the search. Additionally I have attached another screen shot of the same search using the old ranking algorithm. I'd love to get some more feedback on this.

Additional discussion:

- http://drupal.org/node/297505

- http://drupal.org/node/292217

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

BlakeLucchesi’s picture

FileSize
1.37 KB
72.03 KB
78.48 KB

Ok sorry, for some reason my patch and screen shots didn't get posted. Here they are:

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14060. If you need help with creating patches please look at http://drupal.org/patch/create

BlakeLucchesi’s picture

Fixed typo sorry.

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14062. If you need help with creating patches please look at http://drupal.org/patch/create

BlakeLucchesi’s picture

Ok working patch for real this time. I used textmate's cvs bundle the last few times and it did the diff with respect to the specific file not Drupal root directory.

douggreen’s picture

Status: Needs review » Needs work

Blake, could you give a little more detail of what this new algorithm is doing, and possibly add a comment to the code that explains it (for example, the old comment said "Exponential decay with half-life of 6 months, starting at last indexed node"). I know that 2.718 is e, what's the -5 for?

Remembering that these values must be normalized between 0 and 1, I looked at the two algorithms. The low end of both algorithms is 0, but the high end of the current algorithm does not appear to be one. Is it possible that the only problem with the current algorithm is that the decay isn't working right? Should the multiplication by 6.43e-8 (which is 1 / (60 * 60 * 24 * 365.25 / 2) which I think is a normalization factor, be inside the POW()?

I'm changing to CNR, so that we can discuss and write some simpletests. I think that a test that looped through times from a certain date forward, and showed that the algorithm produced the expected results would be helpful. For example, with the current implementation, we should have a test that starts 6 months ago, and shows that newer items are weighted twice as high as the next newer item, and so on.

douggreen’s picture

Status: Needs work » Needs review
FileSize
1.57 KB

If this is a problem, we should try to fix. Re-rolling to current HEAD.

Status: Needs review » Needs work

The last submitted patch, 303574.patch, failed testing.

douggreen’s picture

Status: Needs work » Needs review
FileSize
3.91 KB

Attached patch includes a few improvements:

  • scales from oldest changed to the newest changed, instead of oldest to REQUEST_TIME.
  • assumes vars are set, by using an update handler and sensible var defaults. These default are only used after the first node is added and before cron runs the first time. Since we now have poor mans cron, this time shouldn't be too long, yet I think the current defaults will not really fail, just not be the best possible values.
  • better comment
  • fixes test to set these vars.
jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me. Unless the test bot objects...

Dave Reid’s picture

Status: Reviewed & tested by the community » Needs review
+++ modules/node/node.module	21 Apr 2010 07:15:37 -0000
@@ -1620,17 +1633,21 @@ function node_ranking() {
+        ':time_range' => MAX($newest_time - $oldest_time, 1),

Should be max() instead of MAX since its in PHP. And needs tests possibly? Or test chunk isn't included in the patch?

Powered by Dreditor.

webchick’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

Hrm. If the tests aren't breaking now and this bug exists, it seems like we need extra test coverage for this..?

jhodgdon’s picture

- MAX is not in PHP, really, it's a fragment of a query.
- I'm not sure about a test for this. The old algorithm was bad in that it wasn't giving a reasonable distribution of relevancy rankings. We think the new one is better. The current tests for rankings just test the ordering, not the actual numbers, and the ordering hasn't changed...

jhodgdon’s picture

Duh. Scratch that about MAX, you are right it is in PHP.

webchick’s picture

Yeah, I guess what I'm saying is it seems like the old tests were insufficient for catching this bug. So it seems like we need expanded test coverage for also checking the relevancy rankings distribution, in addition to the ordering. Unless I misunderstand.

jhodgdon’s picture

I hear you. But I don't know if not having tests for relevancy distributions should block fixing this really broken one. Can we make it a separate issue, because not having these tests applies to all the other rankings too, and I also think that while writing them is important, it might not be quite as high priority as some other search issues?

douggreen’s picture

Assigned: Unassigned » douggreen

I've written a test, but it found bugs, so I'm fixing the bugs.

douggreen’s picture

Status: Needs work » Needs review
FileSize
11.15 KB

Attached patch adds a test, changed MAX to max, and fixes a problem we had with our current tests related to changed times not set properly.

douggreen’s picture

Issue tags: -Needs tests

Removing "needs tests" tag now that there's a test...

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

Assuming that the test bot agrees, I think this is ready to go.

douggreen’s picture

FileSize
10.92 KB

Ugh, removing debugging...

douggreen’s picture

One more note, this patch fixes the range of the recent node ranking to be from the oldest changed node to the newest changed node, rather than from the oldest changed node to the current time. This way, if nothing has changed in the last N days, the newest node still has the highest recency of 1.0.

chx’s picture

Good to go. Although using EXP() instead of POW would be a slightly better but we are not bound by anything, it's totally arbitrary.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 303574.patch, failed testing.

douggreen’s picture

I'll change it to use EXP(). Plus the new test uncovered a bug that I'm working on the fix for.

douggreen’s picture

FileSize
22.04 KB

I need to test this on postgres, and break out some of the changes into a separate patch, but attaching here so we don't lose it.

jhodgdon’s picture

Version: 7.x-dev » 8.x-dev
Issue summary: View changes
Issue tags: +Needs backport to D8

We should be working on this in 8.x now.

jhodgdon’s picture

Assigned: douggreen » jhodgdon
Issue tags: -Needs backport to D8 +Needs backport to D7

This look like something we should revive. I'm going to look into it. The latest patch is D7 and also has a bunch of other unrelated stuff in it, but it looks like a good start if we can distill out the relevant pieces and put them into 8.x.

jhodgdon’s picture

Issue summary: View changes
FileSize
10.8 KB
12.18 KB

I did some analysis of all the existing rankings (gotta put all that math background I have to use sometimes!).

There are 3 hook_ranking() implementations in Core:
- comment_ranking() - Nodes with more comments are ranked higher.
- statistics_ranking() - Nodes with more views are ranked higher
- node_ranking() - provides 4 ranking algorithms:
- Keyword relevance (this is the standard ranking provided actually by the SearchQuery class)
- Sticky
- Promoted to front page
- Recent - Nodes more recently posted are ranked higher

Analysis:
a) The "Sticky" and "Promoted" rankings are fine - nodes with those bits set get a 1, and nodes without it get a 0.

b) Comment and Statistics rankings both use an equation like this:
2.0 - 2.0 / (1.0 + count * scale)
where count is the # of comments or # of views for the node, and scale is calculated during cron to be 1 / (max count). This seems to be OK. I analyzed what it would do for nodes with a count of 1 to 30 (assuming the max in the database was 30), and the values ranged from 0.06 to 1. Here's a graph:
Graph of comment/statistics ranking values for node search

c) Recently updated (the subject of this issue) - This is currently in 8.x using an equation that looks like this:
2^ (updated time - scale) [2^ means "2 to the power of"]
where scale is set to the updated time of the latest indexed node.

Two problems with this:
- The latest indexed node may or may not really be the node with the latest updated time that exists.
- As reported originally in this issue, even for a node 2 hours older than the most recent node, since the times are in seconds, this drops way way too quickly down to essentially zero. [So, I confirmed this report.]

The proposal is instead to use
e^(-5 * (1 - (updated - oldest)/(newest - oldest))))
Here's a graph with hours old on the x-axis (ranging from a just-updated node to a year old):
Graph of proposed recency algorithm

This seems reasonable. Content that is a day old is ranked pretty close to 1. Content that is a week old gets a .91 ranking. Content that is a month old gets 0.66. Comment that is 3 months old gets 0.29. Year-old gets 0.0067. In any case, it's way better than what is there now.

But... the EXP function seems to be available in most variants of SQL. However, it looks like neither POW nor EXP is available in SQLite. I wonder if that is OK? We could instead use something more like what the comment and statistics rankings are using, such as:

2.0 - 2.0 / (1.0 + (updated - oldest)/(newest-oldest))

But this doesn't really fall off fast enough. If you have content that ranges from current to 1 year old, the 6-month-old content would still have a ranking of 0.67, and all the stuff from the first month would be above .95. So I think we probably do want it to be exponential. Using exp() is probably not any worse than using pow(), which is what the existing code does, and exp() seems to be supported in most other databases (unlike pow() which is power() in some databases anyway).

So I think we should stick with exp().

d) Keyword relevance

This is using an equation that looks like (scale * score * count) for each node, where scale = 1 / max(sum(score * count)). So this is just linear in the "score * count" value, and it is weighted with the most relevant node having value 1. I think we can assume this is working OK.

So, I've reconfirmed this issue, and validated that the proposed ranking should be much better than what we have now. I just need to take the patch and port it to D8. What it needs to do:

a) Each cron run, calculate the oldest and newest update times for nodes, and store them in the state system.
b) Use the equation above for the ranking instead of what is there.

I've updated the issue summary and I'm working on a patch.

jhodgdon’s picture

Hm. In looking at this code vs. what is there, I also noticed that the current code is based on the created time, and the proposal here is based on the updated time. The ranking is called "recently posted" in the UI. So I think we should actually use the created time and not the updated time in this search ranking, which would be more compatible with what was intended in the original code.

Also I have noticed that it is impossible to test vs. the updated time, because Node::preSave() always sets the updated time to the REQUEST_TIME, which is constant in tests, and there is no way to override this.
#2213411: WebTestBase::drupalCreateNode() shouldn't imply you can set 'changed' time.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
8.92 KB
6.9 KB

OK, here's my first go at a patch for this in Drupal 8. Tests work locally -- and they fail where you'd expect without the patch, so I think they are good tests. Let's see what the bot thinks.

I added some tests, and combined some test methods together. There was a test method called testDoubleRankings() but it was kind of lame (it basically just turned on two rankings and verified you could still search, to test #771596: Turning on multiple search weights causes PDO exception during search). So I removed this, and instead added this functionality test to the testRankings() method, which had nodes for all the ranking options, so that I could actually test that if you prioritized one ranking over another, the ordering was correct. Seemed much better.

Status: Needs review » Needs work

The last submitted patch, 32: 303574-d8-test-only-FAIL.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review

The test that failed was supposed to fail in exactly the way it did, to illustrate the problem of this issue. So, we just need a review.

The last submitted patch, 27: 303574.patch, failed testing.

jhodgdon’s picture

Huh, wonder why the patch in #27 suddenly got retested?!?

jhodgdon’s picture

32: 303574-d8-with-test.patch queued for re-testing.

The last submitted patch, 32: 303574-d8-with-test.patch, failed testing.

jhodgdon’s picture

Easy reroll.

dawehner’s picture

  1. +++ b/core/modules/node/node.module
    @@ -773,6 +773,18 @@ function node_permission() {
    +  if ($changed = db_query("SELECT MIN(created) AS oldest, MAX(created) AS newest FROM {node_field_data}")->fetchObject()) {
    

    Is there any way to use entity query instead? It does support aggregation to a certain limit.

  2. +++ b/core/modules/node/node.module
    @@ -794,14 +806,24 @@ function node_ranking() {
    +  if (($node_newest = \Drupal::state()->get('node.newest_update_time')) &&
    +    ($node_oldest = \Drupal::state()->get('node.oldest_update_time'))) {
    

    Should we store the information in one state to not require two queries, potentially.

jhodgdon’s picture

RE #40...

1. No, I don't think entity query would work. At least, the docs for the execute method say that it returns either an int (if it's a count query) or a list of node IDs (if it's not a count query). Neither is what we want for this query. Besides which, it seems like it would be a huge amount of overhead for a simple query like this.

2. Yes, we could store the result as one state variable. I thought it was clearer code this way, partly because I couldn't think of a good name for a combined state variable and these two names, I thought, were clear... I also don't think, in practice, that two state->get() calls right after each other like that will ever require two different queries?

jhodgdon’s picture

FileSize
9.85 KB
2.69 KB

dawehner and I discussed this in IRC, and dawehner supplied code in a pastebin that uses entity aggregate query, which I incorporated.

Also, as suggested, the state is now using one variable instead of two.

Here's a new patch. (No changes to the test part of the patch, so I didn't upload another "test-only" patch.)

dawehner credit!!

jhodgdon’s picture

42: 303574-d8-with-test-42.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, 42: 303574-d8-with-test-42.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
9.23 KB

Straight reroll. Some other patch took care of the hunk in Drupal.php.

jhodgdon’s picture

jhodgdon’s picture

45: 303574-d8-with-test-45.patch queued for re-testing.

jhodgdon’s picture

Reroll for PSR-4

dawehner’s picture

Status: Needs review » Reviewed & tested by the community

This patch feels quite good

catch’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/modules/node/node.module
@@ -755,6 +755,29 @@ function node_permission() {
+  $max_alias = 'max_created';
+  $result = \Drupal::entityQueryAggregate('node')
+    ->aggregate('created', 'MIN', NULL, $min_alias)
+    ->aggregate('created', 'MAX', NULL, $max_alias)
+    ->execute();

Could someone run an EXPLAIN on the generated query, feels like it should be fine since there's a dedicated index for the created column, but just in case.

jhodgdon’s picture

catch: I can do that, but ... is that the relevant query? That query is in the hook_cron(), so even if it is slightly slow, I don't think it matters a whole lot.

The more relevant query is probably the one that is done during search execute (implied by the hook_ranking() implementation).

Anyway, I'll see what I can do about an explain for both of them.

jhodgdon’s picture

FileSize
16.38 KB

I forgot about this...

OK, so there are a couple of queries here.

The first one is during node_cron():

+  $min_alias = 'min_created';
+  $max_alias = 'max_created';
+  $result = \Drupal::entityQueryAggregate('node')
+    ->aggregate('created', 'MIN', NULL, $min_alias)
+    ->aggregate('created', 'MAX', NULL, $max_alias)
+    ->execute();

This results in a query like this:

SELECT MIN(node_field_data.created) AS min_created, MAX(node_field_data.created) AS max_created FROM node base_table LEFT JOIN node_field_data node_field_data ON node_field_data.vid = base_table.vid

Here's a screen shot of explain output in phpmyadmin:
explain output on above query

I'll look at the other one in the next comment... I really don't know much about explain queries... not sure if this is good or bad, but this particular query is only happening once per cron run, so I am not sure why we even care too much?

jhodgdon’s picture

b) The second query happens during searches, and is in the node_ranking() code.

Without this patch, we have:

     $ranking['recent'] = array(
      'title' => t('Recently posted'),
      // Exponential decay with half-life of 6 months, starting at last indexed node
      'score' => 'POW(2.0, (GREATEST(n.created, n.changed) - :node_cron_last) * 6.43e-8)',
      'arguments' => array(':node_cron_last' => $node_cron_last),
     );

And with this patch, we have:

    $ranking['recent'] = array(
      'title' => t('Recently created'),
      'join' => array(
        'type' => 'LEFT',
        'table' => 'node_field_data',
        'alias' => 'nfd',
        'on' => 'nfd.nid = sid',
      ),
      // Exponential decay with half life of 14% of the age range of nodes.
      'score' => 'EXP(-5 * (1 - (nfd.created - :node_oldest) / :node_range))',
      'arguments' => array(
        ':node_oldest' => $node_min_max['min_created'],
        ':node_range' => max($node_min_max['max_created'] - $node_min_max['min_created'], 1),
      ),
    );

These are both added into the search query. So, to turn this on:
- Add a node.
- Go to the Search settings page, edit the NodeSearch page, and set the influence on Recent to 10 (default is 0 meaning "ignore this factor").

Then do a search. This results in a query like this with the patch:

[Note: EDIT - with and without patches were reversed in this section!]

SELECT i.langcode AS langcode, i.type AS type, i.sid AS sid, SUM((CAST (:multiply_0 AS DECIMAL(10,4))) * COALESCE((((CAST (:normalization_0 AS DECIMAL(10,4))) * i.score * t.count)), 0) / (CAST (:total_0 AS DECIMAL(10,4))) + (CAST (:multiply_1 AS DECIMAL(10,4))) * COALESCE((EXP(-5 * (1 - (nfd.created - :node_oldest) / :node_range))), 0) / (CAST (:total_1 AS DECIMAL(10,4)))) AS calculated_score FROM search_index i INNER JOIN node_field_data n ON n.nid = i.sid LEFT JOIN node_field_data nfd ON nfd.nid = sid INNER JOIN search_total t ON i.word = t.word INNER JOIN search_dataset d ON i.sid = d.sid AND i.type = d.type AND i.langcode = d.langcode WHERE (n.status = :db_condition_placeholder_0) AND( (i.word = :db_condition_placeholder_1) )AND (i.type = :db_condition_placeholder_2) AND( (d.data LIKE :db_condition_placeholder_3 ESCAPE '\\') ) GROUP BY i.langcode, i.type, i.sid HAVING (COUNT(*) >= :matches) ORDER BY calculated_score DESC LIMIT 10 OFFSET 0

And without the patch:

SELECT i.langcode AS langcode, i.type AS type, i.sid AS sid, SUM((CAST (:multiply_0 AS DECIMAL(10,4))) * COALESCE((((CAST (:normalization_0 AS DECIMAL(10,4))) * i.score * t.count)), 0) / (CAST (:total_0 AS DECIMAL(10,4))) + (CAST (:multiply_1 AS DECIMAL(10,4))) * COALESCE((POW(2.0, (GREATEST(n.created, n.changed) - :node_cron_last) * 6.43e-8)), 0) / (CAST (:total_1 AS DECIMAL(10,4)))) AS calculated_score FROM search_index i INNER JOIN node_field_data n ON n.nid = i.sid INNER JOIN search_total t ON i.word = t.word INNER JOIN search_dataset d ON i.sid = d.sid AND i.type = d.type AND i.langcode = d.langcode WHERE (n.status = :db_condition_placeholder_0) AND( (i.word = :db_condition_placeholder_1) )AND (i.type = :db_condition_placeholder_2) AND( (d.data LIKE :db_condition_placeholder_3 ESCAPE '\\') ) GROUP BY i.langcode, i.type, i.sid HAVING (COUNT(*) >= :matches) ORDER BY calculated_score DESC LIMIT 10 OFFSET 0

Ugh. I have to go and those placeholders... do you need an explain on this one?

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

These two queries are essentially the same. The difference is that with the patch, the calculation of the score goes from

POW(2.0, (GREATEST(n.created, n.changed) - 100000) * 6.43e-8)

to

EXP(-5 * (1 - (nfd.created - 100) / 50))

(so only created is used instead of changed, and a different mathematical function).

And there is a new join between node_field_data and search_index, on nfd.nid = i.sid

When I run an explain on a query that does this join, it says it is using PRIMARY for that, so I think it is OK. There is not any difference between the two queries in the WHERE, ORDER BY, or GROUP BY, or HAVING, except the score is the ORDER BY, so the ordering will be different. But if anything, the score is involving fewer database fields in the new query.

I'm provisionally going to set this back to RTBC...

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

We're not using the node.cron_last state anymore there we can should remove all sets and deletes of it - NodeSearch::indexNode() and node_uninstall() need fixing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
12.65 KB
3.27 KB

Ah, good idea.

Here's a new patch. This removes everything having to do with the state variable 'node.cron_last', and in addition, NodeSearch doesn't require State any more.

Status: Needs review » Needs work

The last submitted patch, 57: 303574-search-ranking-57.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
12.65 KB

Straight reroll. This is probably RTBC...

alexpott’s picture

Issue tags: +Needs change record

We need a CR to outline the change and removal of node.cron_last

jhodgdon’s picture

Issue tags: -Needs change record

Good idea. https://www.drupal.org/node/2337437 is the change record.

alexpott’s picture

Version: 8.0.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Committed 829c436 and pushed to 8.0.x. Thanks!

  • alexpott committed 829c436 on 8.0.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...
jhodgdon’s picture

Assigned: jhodgdon » Unassigned

Thanks alexpott!

  • alexpott committed 829c436 on 8.1.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...

  • alexpott committed 829c436 on 8.3.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...

  • alexpott committed 829c436 on 8.3.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...

  • alexpott committed 829c436 on 8.4.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...

  • alexpott committed 829c436 on 8.4.x
    Issue #303574 by jhodgdon, douggreen, BlakeLucchesi: Fixed Search...