Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
In mysql 5.0.21, the comment block sql
SELECT DISTINCT(c.nid), c.subject, c.cid, c.timestamp FROM drupal_comments c INNER JOIN drupal_node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 ORDER BY c.timestamp DESC LIMIT 0, 10;
ends up doing a full table scan on the comments table:
+----+-------------+-------+------+-------------------------------------+------+---------+-----------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------------+------+---------+-----------------+------+----------------------------------------------+
| 1 | SIMPLE | c | ALL | lid | NULL | NULL | NULL | 1092 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | n | ref | PRIMARY,status,node_status_type,nid | nid | 4 | c.nid | 1 | Using where; Distinct |
+----+-------------+-------+------+-------------------------------------+------+---------+-----------------+------+----------------------------------------------+
This makes the comment block uselessly slow with a large number of comments.
Comment | File | Size | Author |
---|---|---|---|
#67 | comments_speedup_0.patch | 2.76 KB | robertDouglass |
#65 | comments_speedup.patch | 2.77 KB | robertDouglass |
#62 | comment_speedup_10.diff | 2.43 KB | pwolanin |
#58 | comment_speedup_9a.patch | 2.46 KB | pwolanin |
#57 | comment_speedup_9.patch | 2.46 KB | pwolanin |
Comments
Comment #1
Jax CreditAttribution: Jax commentedThe only useful key would be status. So the proposed patch adds a key on status just like it's done for the nodes table.
Comment #2
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedstatus, if i remember right, is the flag that tells you if a comment is published or not. I imagine in most cases, that wouldn't be particularly selective (almost all comments would be published) in which case the database should correctly decide to ignore the key and do the full table scan anyway...
right?
Comment #3
Jax CreditAttribution: Jax commentedDo you have any reference documentation that explains why and when a full table scan would be desired? I wasn't aware that that ever would perform better.
Comment #4
pwolanin CreditAttribution: pwolanin commentedDoes this only display one comment per node, or really the 10 most recent comments?
Could this query be faster to do in two parts:
1) limit the query by finding the 10 nodes with the most recent comment timestamp
2) use these nid values to speed the query of comments.
I'm a little fuzzy on the mysql syntax, since there are notations like:
in the node_comment_statistics table.
Here's the original code (with linebreaks for readibility):
Anyhow, how about something like (leaving off the db_rewrite_sql calls for simplicity):
Maybe this could also be written as a compound query?
Comment #5
Wesley Tanaka CreditAttribution: Wesley Tanaka commented> Do you have any reference documentation that explains why and when a full table scan would be desired?
web search for index selectivity
Comment #6
pwolanin CreditAttribution: pwolanin commentedI tested the code below and it seems to work, but I don't have enough comments to tell if there's a speed difference:
Comment #7
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedWould you be willing to upload that in patch form?
Comment #8
pwolanin CreditAttribution: pwolanin commentedOk, here's a patch against 4.7. Seems to work (at least in as much as the recent comments are displayed).
Comment #9
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedDid anybody test this?
I think adding an index on c.status is probably better.
Comment #10
pwolanin CreditAttribution: pwolanin commented@killes - I don't think having an index an c.status will help at all, since most comments wil have the same status.
Comment #11
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedwell, in any case we'd need some benchmarks.
Comment #12
pwolanin CreditAttribution: pwolanin commenteddo you know of anyone who has a backup site suitable for testing this? What's the most appropriate way to benchmark this?
Comment #13
pwolanin CreditAttribution: pwolanin commentedlooking at the exisintg query, maybe it would help to put a new index on c.timestamp?
Comment #14
pwolanin CreditAttribution: pwolanin commentedquick test with ~455 nodes and ~4600 comments generated by the devel module.
Local install, MySQL 5.0, PHP 5.1.2, Apache 2.0,
the current core version takes ~120-185 ms to run after a new comment is added.
Substituting the code at the theme level (see below), the 2 queries run in about 3-4 ms each, with the longest I saw of about of 6-7 ms each - so total query time is reduced ~10-fold.
for testing purposes, you can just substitute my version at the theme level:
Comment #15
pwolanin CreditAttribution: pwolanin commentedalso, an index on c.timestamp didn't seem to make a difference.
Comment #16
pwolanin CreditAttribution: pwolanin commentedsince there is an apparent marked gain in speed, this at leat deserves a review ;-)
Comment #17
FiReaNGeL CreditAttribution: FiReaNGeL commentedTry the following and report back. Before, add a status|timestamp index (in the same index) for the comments table. I think its better than the original one in terms of speed. Don't have a huge database to test it on though. Also, I don't know if pgsql / mysql 4.0 support this kind of query (i think they should, but we better check).
SELECT DISTINCT (k.nid), k.subject, k.cid, k.timestamp
FROM node n, (
SELECT c.nid, c.cid, c.subject, c.timestamp
FROM comments c
WHERE c.status =0
ORDER BY c.timestamp DESC
) AS k
WHERE k.nid = n.nid
AND n.status =1
LIMIT 10 ;
Comment #18
FiReaNGeL CreditAttribution: FiReaNGeL commentedLittle follow up. If we assume that the last 1000 comments are in at least 10 separate nodes (not a wild assumption), we can do
EXPLAIN SELECT distinct(k.nid), k.subject, k.cid, k.timestamp FROM node n, (SELECT c.nid, c.cid, c.subject, c.timestamp from comments c where c.status = 0 order by c.timestamp DESC LIMIT 1000) as k WHERE k.nid = n.nid AND n.status = 1 LIMIT 10;
(Notice the additional limit 1000 in the subselect), which is way, way faster than my other query. We could set this limit to 2000, or even 5000 for safety purposes and get fast performance.
Comment #19
Dries CreditAttribution: Dries commentedUnfortunately, we can't really rely on assumptions like that. There are sites that might do more comments per node.
Comment #20
FiReaNGeL CreditAttribution: FiReaNGeL commentedYes, I know. But 5000 concurrent, back-to-back comments in less than 10 nodes? And even if that somehow happens, you'll get only those in the 'most recently commented on' block, which would be accurate. But I understand its a less than perfect solution (althought the performance gain from it on large site is crazy, as only the small 5000 comments table need to be file sorted, instead of the whole comments table).
Will try to find something better, but I'll probably use this version on my site. Thanks!
Comment #21
FiReaNGeL CreditAttribution: FiReaNGeL commentedBTW, my query in comment 17 should get benchmarked compared to the original one, im pretty sure its better and doesnt make any crazy assumptions :)
Comment #22
IncrediblyKenzi CreditAttribution: IncrediblyKenzi commentedWe applied the patch in comment #8 on a site with 300,000 comments in the DB.
The original query was causing some significant backlog on the mysql side and a subsequent "too many connections" error from MySQL itself. After applying the patch, the site has been purring along nicely.
Thanks for the help!
Comment #23
chx CreditAttribution: chx commentedThen let this get in. I rerolled #8 against HEAD and did nothing else.
Comment #24
chx CreditAttribution: chx commented*SIGH* I have a sloppy day.
Comment #25
chx CreditAttribution: chx commentedWow. How many style errors there can be in five lines? Apparently, tons. I think I catched them all now.
Comment #26
Dries CreditAttribution: Dries commentedNeeds a code comment or two.
Comment #27
pwolanin CreditAttribution: pwolanin commentedadded code comments
Comment #28
RobRoy CreditAttribution: RobRoy commentedComment should be in "This is a sentence form." so
- The header comment should end with a period.
- "//select" should be "// Select" and that should end with a period too, same with the next one.
Comment #29
pwolanin CreditAttribution: pwolanin commentedOk, fixed comment formatting.
Comment #30
Jax CreditAttribution: Jax commentedYou forgot the space in the comment after the
//
:)Comment #31
pwolanin CreditAttribution: pwolanin commenteddoh!
Comment #32
robertDouglass CreditAttribution: robertDouglass commentedIf you use the IN syntax, you need to make sure to only do the query if there are actual items to be selected from. Otherwise you generate a SQL error because of
IN ()
.Why not select into a temporary table?
Comment #33
robertDouglass CreditAttribution: robertDouglass commentedI tested #31. The results were not better with 50 nodes and 500 comments.
Before the patch
After the patch
Different comments in a different order are selected:
There are two queries to track, and the first one runs slower than the original query.
Conclusion
This is an important issue to solve but I'm not convinced this is the solution. Why are different comments selected and in a different order? Was the original query wrong?
I'll investigate further and look into using a temporary table as well.
Comment #34
robertDouglass CreditAttribution: robertDouglass commentedComment #35
robertDouglass CreditAttribution: robertDouglass commentedI did some more testing with fewer comments (40) and the performance gap widens:
unpatched, the average time is: 0.883
patched, the average time (of the two queries added together) is: 1.631
I was unable to recreate the order problems I had in the first test (where I used devel module to generate the comments), so it is possible that it won't happen under normal circumstances.
Comment #36
robertDouglass CreditAttribution: robertDouglass commentedAt 20,000 comments, the patched version starts to show an advantage:
unpatched average time: 332.26
patched average time: 68.156
The patched version is around 5 times faster.
Comment #37
robertDouglass CreditAttribution: robertDouglass commentedNote that my approach to testing has been inherently flawed in favor of the path because I've been adding the sum of the database queries. The path introduces an extra PHP loop, so I should be testing the return time of the function itself.
Comment #38
robertDouglass CreditAttribution: robertDouglass commentedI don't think db_query_temporary is capable of querying a range, so I'm giving up on trying the same approach with a temporary table. In all, I think the performance gain on larger sites is worth the performance hit on smaller sites. RTBC.
Comment #39
robertDouglass CreditAttribution: robertDouglass commentedOops, I meant to upload this updated version which protects against running the second query (with the dangerous IN()) if there are no comments.
Comment #40
pwolanin CreditAttribution: pwolanin commented@RobertDouglas- the short answer is that I haven't put a lot of effort into perfecting this code, since it's essentially just the first thing that came into my head. Also, I'm not an SQL expert by any stretch.
Ok, just did too much reading on mysql.com and postgresql.org. By temporary table, do you mean like this?
I think this syntax works for both servers.
Is this used anywhere in core?To answer my own question, the DB layer provides db_query_temporary() for making temporary tables, and this is used by the search module.Comment #41
robertDouglass CreditAttribution: robertDouglass commentedComment #42
pwolanin CreditAttribution: pwolanin commentedlooks like we were posting at the same time.
wouldn't it be simpler like the attached?
Comment #43
robertDouglass CreditAttribution: robertDouglass commented@pwolanin: yeah, search is the reason why we got db_query_temporary(), but I am very disappointed that it doesn't seem to be able to support the LIMIT 0, 10 part. As for your alternate flow structure, I'll leave it to the core team to decide which syntax they like best. Some people like to see the returns at the end of the function.
Any more people test this? It'd be good to get more confirmation that the set and order of comments shown in the block are the same as before.
Comment #44
pwolanin CreditAttribution: pwolanin commentedWell, we can open a Drupal 6.x feature request for an implementation of db_query_range_temporary(). I'm guessing there is some difference among SQL variants so that db_query and db_query_range need to be kept separate.
Comment #45
drummSince no themes in contrib seem to implement theme_comment_block(), lets go ahead and take all SQL out of that themeable function while we are here.
Comment #46
pwolanin CreditAttribution: pwolanin commentedOk, all SQL moved to a reusable helper function.
Comment #47
robertDouglass CreditAttribution: robertDouglass commentedI like this last patch the best! I can imagine doing things like keyword analysis on those last 10 comments and theming the block totally differently, whereas before, it was a pointless theme function.
Comment #48
chx CreditAttribution: chx commentedIdea is cool. But recent_comments break Drupal namespacing rules , it must start with comment_
Comment #49
pwolanin CreditAttribution: pwolanin commentedok, for a function name- what would you prefer?
comment_get_recent()
comment_find_recent()
comment_recent_comments()
comment_recent()
looks like X_get_X is pretty common in core now.
Comment #50
pwolanin CreditAttribution: pwolanin commentedok, pciked comment_get_recent()- patch attached. A question, however. Would it make this function more useful to return all the fields form the comment table, rather than just the 4 currently selected?
Comment #51
pwolanin CreditAttribution: pwolanin commentedlike this- all the fields
Comment #52
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedSelecting more fields is slower than selecting just the fields you need
Comment #53
m3avrck CreditAttribution: m3avrck commentedNote, I just applied patch #8 (nice and simple) to fix my 4.7 site that was very sluggish.
Comment #54
robertDouglass CreditAttribution: robertDouglass commented@m3avrck: did you consider #51? It is the one that is currently being considered for core inclusion.
Comment #55
m3avrck CreditAttribution: m3avrck commentedI had a look over but I need a 4.7 specific fix and wasn't sure of the state of #51 in being applied to 4.7 I had been referred to this thread to apply patch #8 specifically since my referrer was already using it on a 4.7 site :-)
Seems like the patch in 51 should be even more efficient.
Comment #56
pwolanin CreditAttribution: pwolanin commentedBased on the comment in #52, the #50 patch should be faster than the #51. Both are for HEAD, but are essentially the same code as above for 4.7, just refactored into a separate function.
Comment #57
pwolanin CreditAttribution: pwolanin commentedBased on the comments, I think this is RTBC (anyone?). I prefer the last patch (#51), since it makes the new function much more general and useful. Since the bottleneck was the query, I don't think returning all the data for 10 rows will cause a measurable speed difference (actual benchmarking appreciated).
Updated patch from #51 attached that applies without offset to current HEAD.
Comment #58
pwolanin CreditAttribution: pwolanin commenteddoh- typo in the comments fixed. No change to actual code.
Comment #59
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedOn a current real database I have, running the c.* query 10000 times took 29 second, and running the query which just pulls the 4 columns 10000 times 16 seconds (that was piping output to a file)
Piping output to /dev/null (which I imagine is the best possible case for the c.* query, since allocating memory in PHP and shuffling the data around for all the extra columns will take extra time), the c.* query took 25 seconds and the 4 column query took 15 seconds.
The difference in speed will change depending on how large the extra columns are (like the comment body, etc)
Maybe the new function could be called "comment_get_recent_headers" or something, if that would match the "getting summary data only" functionality better?
Comment #60
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedMinor thing: instead of
you could do:
Comment #61
robertDouglass CreditAttribution: robertDouglass commentedI'm convinced that the 4 column approach is better. I like this patch. Let's get it in before RC.
Comment #62
pwolanin CreditAttribution: pwolanin commentedOk, attached patch against current HEAD, returns only the 4 columns, fixes type in code comment, and utilzes Wesley's suggestion for compacting the code.
Comment #63
Dries CreditAttribution: Dries commentedBut, but ... if I understand the performance results, the benefits are marginal?
- from should be FROM, in that SQL query.
- I'd rename $cnode to $comment.
- Indentation of the PHPdoc looks awkward.
- I think we need to add a little paragraph explaining why we do this in two queries.
- Long code comments should be wrapped.
Comment #64
pwolanin CreditAttribution: pwolanin commented@Dries - the last benchmarking was between two versions of this patch NOT between the current code and patch code. So, the marginal difference in recent comments is for returning only 4 fields from the comment table row, as opposed to returning the entire row. As for the patch speed see #36:
More benchmarking would certainly be good. I'll work on the other issues.
Comment #65
robertDouglass CreditAttribution: robertDouglass commentedPatch rerolled according to Dries' suggestions. I'm also benchmarking.
Comment #66
pwolanin CreditAttribution: pwolanin commentedtwo minor points:
1) I personally find the re-use of $comments potentially confusing. I'd rather see $cnode be renamed to $node (or something-node), since actually what we are selecting in the first step are the nodes with the most recent comments.
2) in my code, I think I should not have removed the initialization
$items = array()
, since otherwise we get the non-E_ALL testif (NULL)
Comment #67
robertDouglass CreditAttribution: robertDouglass commentedagree.
Comment #68
robertDouglass CreditAttribution: robertDouglass commentedBenchmarks. I tested with ab against the front page with only the recent comments block enabled.
0 nodes, 0 comments
no patch
Requests per second: 42.03 [#/sec] (mean)
Requests per second: 42.36 [#/sec] (mean)
Requests per second: 42.23 [#/sec] (mean)
with patch
Requests per second: 42.13 [#/sec] (mean)
Requests per second: 42.26 [#/sec] (mean)
Requests per second: 42.58 [#/sec] (mean)
5 nodes, 20 comments
no patch
Requests per second: 29.18 [#/sec] (mean)
Requests per second: 29.53 [#/sec] (mean)
Requests per second: 28.55 [#/sec] (mean)
with patch
Requests per second: 27.66 [#/sec] (mean)
Requests per second: 28.80 [#/sec] (mean)
Requests per second: 29.44 [#/sec] (mean)
100 nodes, 1000 comments
no patch
Requests per second: 13.48 [#/sec] (mean)
Requests per second: 13.86 [#/sec] (mean)
Requests per second: 13.74 [#/sec] (mean)
with patch
Requests per second: 13.73 [#/sec] (mean)
Requests per second: 13.68 [#/sec] (mean)
Requests per second: 13.44 [#/sec] (mean)
200 nodes, 2000 comments
no patch
Requests per second: 8.20 [#/sec] (mean)
Requests per second: 8.10 [#/sec] (mean)
Requests per second: 8.04 [#/sec] (mean)
with patch
Requests per second: 12.26 [#/sec] (mean)
Requests per second: 12.28 [#/sec] (mean)
Requests per second: 12.15 [#/sec] (mean)
Requests per second: 12.59 [#/sec] (mean)
2000 nodes, 10000 comments
no patch
Requests per second: 2.12 [#/sec] (mean)
Requests per second: 2.12 [#/sec] (mean)
Requests per second: 2.08 [#/sec] (mean)
with patch
Requests per second: 9.90 [#/sec] (mean)
Requests per second: 9.50 [#/sec] (mean)
Requests per second: 9.70 [#/sec] (mean)
Comment #69
pwolanin CreditAttribution: pwolanin commentedfantastic.
Comment #70
ChrisKennedy CreditAttribution: ChrisKennedy commentedMy host doesn't have ab, but I eventually figured out how to insert 3400 comments into my test site (only a few nodes). When viewing the front page with comments and devel blocks enabled the patch caused the average execution time to drop by about 50%.
Comment #71
drummCommitted to HEAD.
Comment #72
pwolanin CreditAttribution: pwolanin commentedthanks- maybe for 4.7 too?
Comment #73
chx CreditAttribution: chx commentedThis is an API change. I doubt this needs porting.
Comment #74
chx CreditAttribution: chx commentedAs ChrisKennedy points out, no modules or themes will break just they won't get faster. And if someone wants to use this function in his 4.7 theme in such a way that's compatible with every 4.7 then it's still possible to do:
Comment #75
Zen CreditAttribution: Zen commentedIMO, this speedup patch is incomplete as the first node_comment_statistics query does not use a key.. A compound index for comment_count and last_comment_timestamp ought to be added and an appropriate update will be required.
It would also be nice if core committers decide on a best practice when it comes to dealing with queries using IN ().. Some of you insist that _all_ (including internal) input be vetted using %ds, while others do not care in cases where the input is assuredly safe.
chx: I don't see a 4.7 patch. Which patch are you RTBC-ing?
Thanks,
-K
P.S IMO, however, the recent comment block should live up to its name and only show _recent_ comments (based on a configurable period of time) and not _only_ the last X comments. This would likely make it possible to retrieve everything in one (probably) efficient set of JOINS.
Comment #76
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedI disagree with this -- it sounds like a case of changing a UI because of implementation details.
The "latest N comments" behavior is better because the block is the same size even if there has been a recent spate of comments or a recent dearth.
Comment #77
pwolanin CreditAttribution: pwolanin commented@Zen I don't understand your comment- the table
node_comment_statistics
has definedKEY node_comment_timestamp (last_comment_timestamp)
.Comment #78
FiReaNGeL CreditAttribution: FiReaNGeL commentedpwolanin : He means a compound index on both keys at the same time.
Comment #79
pwolanin CreditAttribution: pwolanin commentedOk, I see- I guess the real question is whether a compound index would make a difference in real life? Maybe it would for sites with many new nodes where few of those nodes have comments.
Comment #80
pwolanin CreditAttribution: pwolanin commentedI just discovered there is a minor problem with the code committed to 5.x/HEAD:
This query:
Actually should (I think) be written something like:
to conform to the API for db_rewrite_sql
Otherwise, this causes SQL errors in modules that implement hook_db_rewrite_sql: http://drupal.org/node/111421
Comment #81
pwolanin CreditAttribution: pwolanin commentedplease follow-up discussion about the bug in #80 at this separate issue: http://drupal.org/node/111830
Comment #82
catchpwolanin has linked to the db_rewrite_sql issue, and indexes are being dealt with here: http://drupal.org/node/164532
so marking as duplicate.