When the db_rewrite_sql node access join returns multiple rows, the search result score normalization is wrong.

I only noticed it because I backported #146466: D6 search refactoring (backport to 5) and #145242: refactor search node_rank with hook_node_rank scoring factors to 5.x, installed both of them on a client site, and wrote a node ranking for one particular content type to have a higher rating. It worked fine for admin, but didn't work for anonymous user's because of the og db_rewrite. The use-case happens when og returns that the node access is both public and accessible by admin, in which we double the search score for these nodes.

The SQL without node access returns the expected results:

mysql> SELECT i.type, i.sid, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * SUM(i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (i.word = 'serena') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1 ORDER BY score DESC LIMIT 0, 10;+------+------+-----------------+
| type | sid  | score           |
+------+------+-----------------+
| node | 1740 | 18.000000000016 | 
| node | 5348 | 15.229196562838 | 
| node | 5292 | 10.066757050011 | 
| node | 5296 |   9.89196507437 | 
| node | 5293 |   9.80863507437 | 
| node | 5294 |   9.80863507437 | 
| node | 6744 |   9.80863507437 | 
| node | 5295 |   9.80863507437 | 
| node | 5351 | 9.5505130987287 | 
| node | 5930 | 9.5505130987287 | 
+------+------+-----------------+
10 rows in set (0.01 sec)

But with node access it returns the following, notice that sid 5348's score is doubled (and so are many others):

mysql> SELECT i.type, i.sid, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * SUM(i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN node_access na ON na.nid = n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all') OR (na.gid = 0 AND na.realm = 'og_public') OR (na.gid = 5268 AND na.realm = 'og_admin') OR (na.gid = 1740 AND na.realm = 'og_admin') OR (na.gid = 804 AND na.realm = 'ogr_access') OR (na.gid = 805 AND na.realm = 'ogr_access') OR (na.gid = 1 AND na.realm = 'workflow_access') OR (na.gid = 0 AND na.realm = 'workflow_access_owner'))) AND (i.word = 'serena') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1 ORDER BY score DESC LIMIT 0, 10;
+------+------+-----------------+
| type | sid  | score           |
+------+------+-----------------+
| node | 5348 | 30.458393125675 | 
| node | 1740 | 26.000000000031 | 
| node | 5292 | 20.133514100023 | 
| node | 5296 |  19.70060014874 | 
| node | 5294 |  19.61727014874 | 
| node | 6744 |  19.61727014874 | 
| node | 5295 |  19.61727014874 | 
| node | 5293 |  19.61727014874 | 
| node | 5352 | 19.101026197457 | 
| node | 5307 | 19.101026197457 | 
+------+------+-----------------+
10 rows in set (0.00 sec)

When I modified the second query to see what's happening, it's obvious that we have a node_access issue:

mysql> SELECT i.type, i.sid, na.*, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * (i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN node_access na ON na.nid = n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all') OR (na.gid = 0 AND na.realm = 'og_public') OR (na.gid = 5268 AND na.realm = 'og_admin') OR (na.gid = 1740 AND na.realm = 'og_admin') OR (na.gid = 804 AND na.realm = 'ogr_access') OR (na.gid = 805 AND na.realm = 'ogr_access') OR (na.gid = 1 AND na.realm = 'workflow_access') OR (na.gid = 0 AND na.realm = 'workflow_access_owner'))) AND (i.word = 'serena') AND i.type = 'node' ORDER BY score DESC LIMIT 0, 10;+------+------+------+------+-----------+------------+--------------+--------------+------------------+
| type | sid  | nid  | gid  | realm     | grant_view | grant_update | grant_delete | score            |
+------+------+------+------+-----------+------------+--------------+--------------+------------------+
| node | 1740 | 1740 | 1740 | og_admin  |          1 |            1 |            1 | 18.0000000000156 | 
| node | 1740 | 1740 |    0 | og_public |          1 |            0 |            0 | 18.0000000000156 | 
| node | 5348 | 5348 |    0 | og_public |          1 |            0 |            0 | 15.2291965628376 | 
| node | 5348 | 5348 | 1740 | og_admin  |          1 |            1 |            1 | 15.2291965628376 | 
| node | 5292 | 5292 | 1740 | og_admin  |          1 |            1 |            1 | 10.0667570500113 | 
| node | 5292 | 5292 |    0 | og_public |          1 |            0 |            0 | 10.0667570500113 | 
| node | 5296 | 5296 |    0 | og_public |          1 |            0 |            0 | 9.89196840770338 | 
| node | 5296 | 5296 | 1740 | og_admin  |          1 |            1 |            1 | 9.89196840770338 | 
| node | 5295 | 5295 | 1740 | og_admin  |          1 |            1 |            1 | 9.80863507436998 | 
| node | 5294 | 5294 |    0 | og_public |          1 |            0 |            0 | 9.80863507436998 | 
+------+------+------+------+-----------+------------+--------------+--------------+------------------+

You can see that we're doubling the score for several of the nodes, based on the node access join.

I think that this problem has been around for a long time. I just reviewed the 5.x search code, which also does a GROUP BY, and is thus has the same bug. I suspect that this is fairly common.

I'm not sure yet what the solution is. We need to divide the result by the number of node access rows that match. But I don't think that this is COUNT(*) or COUNT(i.score) because if I had searched for two words, "serena island", the query is supposed to find two matches in {search_index} and we're supposed to SUM() those two scores. However, now with the node access join, in the two word search, we're actually returning four rows, and doubling the score of both words!

Also note that the last scoring factor "10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)" comes from my custom node ranking made possible with #145242: refactor search node_rank with hook_node_rank scoring factors , but that this has nothing to do with this issue, other than being the factor that made me notice that the scoring is not working properly.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

douggreen’s picture

Version: 7.x-dev » 6.x-dev

Ah, great, this is already fixed in 7.x. The solution is to just remove the SUM(). Does anyone know the d.o. issue where we removed that in 7.x so that we can backport that issue to 6.x and 5.x.

-   $columns2 = str_replace('i.relevance', '('. (1.0 / $normalize) .' * SUM(i.score * t.count))', $columns2);
+   $columns2 = str_replace('i.relevance', '('. (1.0 / $normalize) .' * (i.score * t.count))', $columns2);
brianV’s picture

Issue tags: +Needs backport to D6

just tagging for needs backport.

jhodgdon’s picture

Status: Active » Patch (to be ported)

Looks like this still needs a fix in Drupal 6.

Albert Volkman’s picture

Status: Patch (to be ported) » Needs review
FileSize
724 bytes

Backport per @douggreen's suggestion.

jhodgdon’s picture

Issue summary: View changes
Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs backport to D6

Looking through old issues... This would be OK to go into 6.x if we are still fixing 6.x bugs. I agree that this is what d7 is doing.

jhodgdon’s picture

jhodgdon’s picture

Status: Reviewed & tested by the community » Closed (won't fix)

Talked with Gabor (the Drupal 6 branch maintainer) and D6 issues are really not being committed unless they're really essential -- we really don't have a test system for Drupal 6 and it's too dangerous. So... sorry, but this is a "won't fix".