When the db_rewrite_sql node access join returns multiple rows, the search result score normalization is wrong.
I only noticed it because I backported #146466: D6 search refactoring (backport to 5) and #145242: refactor search node_rank with hook_node_rank scoring factors to 5.x, installed both of them on a client site, and wrote a node ranking for one particular content type to have a higher rating. It worked fine for admin, but didn't work for anonymous user's because of the og db_rewrite. The use-case happens when og returns that the node access is both public and accessible by admin, in which we double the search score for these nodes.
The SQL without node access returns the expected results:
mysql> SELECT i.type, i.sid, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * SUM(i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (i.word = 'serena') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1 ORDER BY score DESC LIMIT 0, 10;+------+------+-----------------+
| type | sid | score |
+------+------+-----------------+
| node | 1740 | 18.000000000016 |
| node | 5348 | 15.229196562838 |
| node | 5292 | 10.066757050011 |
| node | 5296 | 9.89196507437 |
| node | 5293 | 9.80863507437 |
| node | 5294 | 9.80863507437 |
| node | 6744 | 9.80863507437 |
| node | 5295 | 9.80863507437 |
| node | 5351 | 9.5505130987287 |
| node | 5930 | 9.5505130987287 |
+------+------+-----------------+
10 rows in set (0.01 sec)
But with node access it returns the following, notice that sid 5348's score is doubled (and so are many others):
mysql> SELECT i.type, i.sid, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * SUM(i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN node_access na ON na.nid = n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all') OR (na.gid = 0 AND na.realm = 'og_public') OR (na.gid = 5268 AND na.realm = 'og_admin') OR (na.gid = 1740 AND na.realm = 'og_admin') OR (na.gid = 804 AND na.realm = 'ogr_access') OR (na.gid = 805 AND na.realm = 'ogr_access') OR (na.gid = 1 AND na.realm = 'workflow_access') OR (na.gid = 0 AND na.realm = 'workflow_access_owner'))) AND (i.word = 'serena') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1 ORDER BY score DESC LIMIT 0, 10;
+------+------+-----------------+
| type | sid | score |
+------+------+-----------------+
| node | 5348 | 30.458393125675 |
| node | 1740 | 26.000000000031 |
| node | 5292 | 20.133514100023 |
| node | 5296 | 19.70060014874 |
| node | 5294 | 19.61727014874 |
| node | 6744 | 19.61727014874 |
| node | 5295 | 19.61727014874 |
| node | 5293 | 19.61727014874 |
| node | 5352 | 19.101026197457 |
| node | 5307 | 19.101026197457 |
+------+------+-----------------+
10 rows in set (0.00 sec)
When I modified the second query to see what's happening, it's obvious that we have a node_access issue:
mysql> SELECT i.type, i.sid, na.*, (1 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.0434782608696)), 0) + 8 * COALESCE(((149.956535618 * (i.score * t.count))), 0) + 5 * COALESCE((n.sticky), 0) + 5 * COALESCE((n.promote), 0) + 10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN i18n_node i18n ON n.nid = i18n.nid INNER JOIN node_access na ON na.nid = n.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid WHERE n.status = 1 AND (i18n.language ='en' OR i18n.language ='en' OR i18n.language ='' OR i18n.language IS NULL) AND (na.grant_view >= 1 AND ((na.gid = 0 AND na.realm = 'all') OR (na.gid = 0 AND na.realm = 'og_public') OR (na.gid = 5268 AND na.realm = 'og_admin') OR (na.gid = 1740 AND na.realm = 'og_admin') OR (na.gid = 804 AND na.realm = 'ogr_access') OR (na.gid = 805 AND na.realm = 'ogr_access') OR (na.gid = 1 AND na.realm = 'workflow_access') OR (na.gid = 0 AND na.realm = 'workflow_access_owner'))) AND (i.word = 'serena') AND i.type = 'node' ORDER BY score DESC LIMIT 0, 10;+------+------+------+------+-----------+------------+--------------+--------------+------------------+
| type | sid | nid | gid | realm | grant_view | grant_update | grant_delete | score |
+------+------+------+------+-----------+------------+--------------+--------------+------------------+
| node | 1740 | 1740 | 1740 | og_admin | 1 | 1 | 1 | 18.0000000000156 |
| node | 1740 | 1740 | 0 | og_public | 1 | 0 | 0 | 18.0000000000156 |
| node | 5348 | 5348 | 0 | og_public | 1 | 0 | 0 | 15.2291965628376 |
| node | 5348 | 5348 | 1740 | og_admin | 1 | 1 | 1 | 15.2291965628376 |
| node | 5292 | 5292 | 1740 | og_admin | 1 | 1 | 1 | 10.0667570500113 |
| node | 5292 | 5292 | 0 | og_public | 1 | 0 | 0 | 10.0667570500113 |
| node | 5296 | 5296 | 0 | og_public | 1 | 0 | 0 | 9.89196840770338 |
| node | 5296 | 5296 | 1740 | og_admin | 1 | 1 | 1 | 9.89196840770338 |
| node | 5295 | 5295 | 1740 | og_admin | 1 | 1 | 1 | 9.80863507436998 |
| node | 5294 | 5294 | 0 | og_public | 1 | 0 | 0 | 9.80863507436998 |
+------+------+------+------+-----------+------------+--------------+--------------+------------------+
You can see that we're doubling the score for several of the nodes, based on the node access join.
I think that this problem has been around for a long time. I just reviewed the 5.x search code, which also does a GROUP BY, and is thus has the same bug. I suspect that this is fairly common.
I'm not sure yet what the solution is. We need to divide the result by the number of node access rows that match. But I don't think that this is COUNT(*) or COUNT(i.score) because if I had searched for two words, "serena island", the query is supposed to find two matches in {search_index} and we're supposed to SUM() those two scores. However, now with the node access join, in the two word search, we're actually returning four rows, and doubling the score of both words!
Also note that the last scoring factor "10 * COALESCE((IF(n.type = 'campaign', 1, 0)), 0)" comes from my custom node ranking made possible with #145242: refactor search node_rank with hook_node_rank scoring factors , but that this has nothing to do with this issue, other than being the factor that made me notice that the scoring is not working properly.
Comment | File | Size | Author |
---|---|---|---|
#4 | search_score_normalization-437998-4.patch | 724 bytes | Albert Volkman |
Comments
Comment #1
douggreen CreditAttribution: douggreen commentedAh, great, this is already fixed in 7.x. The solution is to just remove the SUM(). Does anyone know the d.o. issue where we removed that in 7.x so that we can backport that issue to 6.x and 5.x.
Comment #2
brianV CreditAttribution: brianV commentedjust tagging for needs backport.
Comment #3
jhodgdonLooks like this still needs a fix in Drupal 6.
Comment #4
Albert Volkman CreditAttribution: Albert Volkman commentedBackport per @douggreen's suggestion.
Comment #5
jhodgdonLooking through old issues... This would be OK to go into 6.x if we are still fixing 6.x bugs. I agree that this is what d7 is doing.
Comment #6
jhodgdonThe duplicate entries in the result here are causing other problems:
#1842604: Searching for multiple words does not work correctly in the presence of a node access module
Comment #7
jhodgdonTalked with Gabor (the Drupal 6 branch maintainer) and D6 issues are really not being committed unless they're really essential -- we really don't have a test system for Drupal 6 and it's too dangerous. So... sorry, but this is a "won't fix".