(follow-up issue from http://drupal.org/node/44276)
that bug fix further slowed down an already slow query. some EXPLAIN SELECT investigation of the queries that drive the "Recent posts" page shows that it's already doing the worst possible kind of query: a filesort on the *entire* {node_comment_statistics} table. :( the new patch doesn't (and can't) fix that, and in fact makes it a little worse.
so, perhaps it'd be nice to speed this query up. here's my initial idea (from http://drupal.org/node/44276#comment-171227):
we'd have to add a new column to the {node} table, sort of like "changed" called something like "updated". we'd need to set it anytime either the node is edited, or a comment is posted. we'd have an index on this. that would avoid:
- the GREATEST()
- ORDER BY on a computed value
- one of the INNER JOIN's (on {node_comment_statistics})
i'm guessing the speedup could be quite huge. but, this will probably be a very contentious DB schema change, and it's *way* too late in D5 for that.
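A minimal sketch of the idea (using sqlite as a stand-in for MySQL, with a hypothetical cut-down schema, not the real Drupal tables): a denormalized `updated` column that both node saves and new comments bump, so the tracker can ORDER BY an indexed value instead of a computed GREATEST():

```python
import sqlite3

# Toy schema: {node} gains an indexed `updated` column alongside `changed`.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, changed INT, updated INT);
CREATE INDEX node_updated ON node (updated);
""")

def node_save(nid, now):
    # node edit: both changed and updated move forward
    db.execute("UPDATE node SET changed = ?, updated = ? WHERE nid = ?",
               (now, now, nid))

def comment_insert(nid, now):
    # new comment: only the denormalized updated column moves
    db.execute("UPDATE node SET updated = ? WHERE nid = ?", (now, nid))

db.executemany("INSERT INTO node VALUES (?, ?, ?)", [(1, 10, 10), (2, 20, 20)])
node_save(2, 40)        # node 2 is edited
comment_insert(1, 99)   # then a comment on node 1 makes it the most recent

# the tracker sort now hits a plain indexed column, no GREATEST(), no join
rows = db.execute("SELECT nid FROM node ORDER BY updated DESC").fetchall()
```

The ordering comes straight off the `node_updated` index; node 1 sorts first because its comment (timestamp 99) outranks node 2's edit (timestamp 40).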
Comment | File | Size | Author |
---|---|---|---|
#269 | tracker.patch | 27.99 KB | catch |
#266 | tracker_105639.patch | 28.08 KB | drewish |
#264 | tracker_105639.patch | 27.8 KB | drewish |
#263 | tracker_105639.patch | 24.55 KB | drewish |
#262 | tracker_105639.patch | 24.55 KB | drewish |
Comments
Comment #1
ChrisKennedy commented:
That would be handy - my one Views patch simulates such a column: http://drupal.org/node/89421
Comment #2
killes@www.drop.org commented:
Great suggestion and surely useful. However, the new column belongs in node_comment_statistics. Maybe we could rename it to node_statistics.
Comment #3
dww commented:
why? then we still have the additional INNER JOIN. is the thought that comments get added faster than node updates, so we'd get into insert/update locking hell again or something? if this value belongs in {node_statistics} (a good name, if that's what it's turning into), what about {node}.changed itself? just a little unclear on the logic for using the separate table, when one of the goals was avoiding the INNER JOIN itself.
thanks,
-derek
Comment #4
geodaniel commented:
Just a note that the INNER JOIN to {node_comment_statistics} is also causing issues in 4.7.x and 5.x when the comment module is not enabled: #110556.
Comment #5
pwolanin commented:
With 5.x and multiple cache tables, why (for 6.x at least) can't the result be cached?
On d.o, the tracker page takes a long time to load. However, maybe cache_clear_all() gets called too often for caching the tracker to be of benefit unless there's a minimum cache lifetime?
Comment #6
pwolanin commented:
The performance of the tracker on d.o has been terrible recently...
Comment #7
mindless commented:
Here's another idea, perhaps a first step on the way to a more involved change. This one doesn't require any db change: it removes the INNER JOIN on the users table and queries for the needed usernames afterwards. Much faster on our system with 59k nodes, 227k comments.
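The shape of that two-query approach, sketched against a toy sqlite schema (table and column names simplified; this is an illustration of the concept, not the actual patch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, uid INT, changed INT);
CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT);
""")
db.executemany("INSERT INTO node VALUES (?, ?, ?)",
               [(1, 7, 30), (2, 8, 20), (3, 7, 10)])
db.executemany("INSERT INTO users VALUES (?, ?)", [(7, "alice"), (8, "bob")])

# first query: the heavy tracker query, with no users join at all
posts = db.execute(
    "SELECT nid, uid FROM node ORDER BY changed DESC LIMIT 25").fetchall()

# second query: resolve usernames for just the uids on this one page
# (they are ints straight out of the DB, so a placeholder list is safe)
uids = sorted({uid for _, uid in posts})
marks = ",".join("?" * len(uids))
names = dict(db.execute(
    f"SELECT uid, name FROM users WHERE uid IN ({marks})", uids).fetchall())
```

The second query touches at most one page's worth of uids, so its cost stays small no matter how large the site gets.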
Comment #8
kbahey commented:
@mindless:
Can you please install the devel module and enable the query log and the page generation time, and post the before the patch values, and after the patch?
This will quantify the difference more precisely.
Comment #9
mindless commented:
query time (ms):
tracker/ Before: 5041.671 After: 2828.042
tracker/{uid} Before: 3657.517 After: 1790.737
Comment #10
kbahey commented:
You have my +1.
The fact that no schema changes are required makes it +2 for me.
Comment #11
mindless commented:
Forgot to include the additional query for usernames in the numbers; it adds just 22.692 ms to each "After" value above.
Comment #12
FiReaNGeL commented:
mindless: it's a good improvement, but wouldn't it be better to actually fix the problem, which, as dww suggested, requires a database change? 3-second queries are not what I would call acceptable, even on a 'big' site.
That being said, it's still an improvement. If nobody is up to the task of implementing the required changes (and I'm not such a person, sadly), it's way better than nothing.
Comment #13
kbahey commented:
The issue with this is that now, for each comment that is added, you need to update the node table. This means you are incurring a lock and write on the node table.
On busy sites, this can be a bottleneck, since others cannot read the node table until the lock/write is done.
killes has decoupled a column in the users table and moved it to its own table because of that reason. So now we are moving the other way.
Comment #14
killes@www.drop.org commented:
mindless: make sure to properly use the db API and escape input values.
Comment #15
dww commented:
right, the patch in #7 would allow SQL injection...
Comment #16
mindless commented:
sorry, i was proposing an idea/concept in the form of a patch, that wasn't intended as a patch for review. anyone feel free to take the idea and go with it..
Comment #17
pwolanin commented:
Is there really an SQL injection issue? These uid values all came from the DB.
Attached is a patch against the latest DRUPAL-5, though it should apply fine to HEAD with offset. It's mostly just a rewrite of the patch and approach suggested by mindless. Includes code to escape all the uids in the query, and other minor improvements. In general, this seems like a good approach, since at most we will have 25 uids in this second query.
Note, the patch looks bigger than it really is, since some of the code was wrapped in an if{} block, and hence there is a change in indentation.
Comment #18
pwolanin commented:
Any thought on this? Improving tracker performance on d.o is really critical.
Comment #19
FiReaNGeL commented:
dww's suggestion is, in my opinion, the only viable one long term, and it isn't that hard to implement.
Comment #20
pwolanin commented:
I assume you mean the suggestion at the top of adding a new column or table? Yes, for Drupal 6, that might be best.
However, please take a look at the attached patch in terms of an immediate improvement that might be applied to Drupal 5. It's similar in spirit to the change in code made to the recent comments block to get a speedup.
Comment #21
moshe weitzman commented:
this query is now a smoking gun for drupal.org suckiness. please review and RTBC if it makes sense. Would be nice to test with lots of nodes and comments.
Comment #22
Dries commented:
The patch's code looks good, however, I'd like to see two things happen:
1. benchmarking -- does this really have a performance gain? I think this needs to be benchmarked first. So far, we've only been guessing.
2. documentation -- can we document in the code why we use this approach? Novice developers might otherwise think that this can be optimized by merging the two queries into one query. One or two lines of documentation would be sufficient.
Comment #23
pwolanin commented:
@Dries - I don't have a setup for benchmarking at this point - can you or Moshe recruit someone with the setup and experience?
#2 I'll address tonight (EDT).
Comment #24
mindless commented:
fwiw, http://gallery.menalto.com/tracker is using this patch. Afraid I didn't do any measurements, but when I put it in place the performance of tracker pages definitely improved. Site has 61k nodes and 232k comments.
Comment #25
pwolanin commented:
Ok, new patch attached against the latest DRUPAL-5. I added 2 lines of comment, and switched back to the original, simpler implode for the uids. Since these are ints coming straight out of the DB, it seems silly to run the extra code to escape them. We did the same in comment.module on line 263.
Comment #26
kbahey commented:
Can't we test this on scratch.d.o for a before and after scenario?
This would give us LOTS of nodes and comments, and users and will tell us definitely if it would help d.o or not, and by how much?
Comment #27
merlinofchaos commented:
Subscribing
Comment #28
pwolanin commented:
ok, my uninformed attempt to benchmark this: I inserted 10000 users, 6800 nodes, 18000 comments, then disabled the devel module. Ran some queries to get the right data into {node_comment_statistics} (I think the generate functionality of devel needs work).
Added the line
$uid = mt_rand(1, 10000);
at the top of the function to avoid any caching of the query results.
Serving from my laptop (2.0 GHz Intel, Mac 10.4, Apache 1.33 with mod_php, PHP 4.4.4, MySQL 4.1) across my local network to another computer running ab (so latency ~1 ms).
Hitting /tracker without the patch (and a random uid) I get ~1.05 pages/sec at c=1, or ~0.59/sec at c=5.
Hitting /tracker with the patch (and a random uid) I get ~1.20 pages/sec at c=1, or ~0.76/sec at c=5.
However, in both cases the number of "failed requests" was >95% of the total (due to "Length"; I'm not sure what that means).
So, this should probably be repeated with a larger number of nodes/comments per user. With this ratio, only a few items show up on each user's tracker page.
Comment #29
FiReaNGeL commented:
OK, fiddled around a little with this tonight. The proposed patch in the above comments is only a temporary fix - code freeze is looming for D6, and I'd rather fix the problem than apply a temporary band-aid.
First of all, simply adding a db column to replace the GREATEST value and indexing it, as dww suggested, would help, but isn't sufficient. The second part of the problem lies in the "AND (n.uid = %d OR c.uid = %d)" section of the query; MySQL couldn't use an index on the proposed 'updated or commented' column because of the WHERE clause on another table.
This specific WHERE wants to filter on nodes that are (a) created by the user or (b) ones in which the user commented.
So let's do a UNION of those two things, which is much faster. WARNING: tons of MySQL incoming (even if it's just one query).
I replaced the %d for testing purposes. On my dev box (16000 nodes, not many comments), this query is 3 times faster and SHOULD give the same results as the other one. With the db change and index dww proposed, this should get much faster.
Can anyone bench this query alongside the original one and the proposed patch? Also, is UNION OK with Postgres? I have no idea. I'm also in the process of moving my girlfriend into my apartment, so I don't have time to code this right now.
Also, I think the proposed 'updated or commented' column shouldn't go in 'node' directly, as kbahey pointed out. Maybe we could move 'changed' from node to its own table along with the new 'updated or commented' column.
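The actual query was attached rather than inlined, but the general shape of the UNION idea might look something like this (a hedged sqlite sketch with a cut-down schema; not the posted query): fetch "nodes the user authored" and "nodes the user commented on" as two separately-indexable subqueries, then sort the small merged result:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, uid INT, status INT, changed INT);
CREATE TABLE comments (cid INTEGER PRIMARY KEY, nid INT, uid INT, timestamp INT);
""")
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)",
               [(1, 5, 1, 100), (2, 9, 1, 200), (3, 9, 1, 300)])
db.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)",
               [(1, 2, 5, 250)])  # user 5 commented on node 2

# each arm of the UNION can use its own simple index; the outer
# GROUP BY / ORDER BY only touches the small combined result
rows = db.execute("""
SELECT nid, MAX(ts) AS ts FROM (
  SELECT n.nid, n.changed AS ts
    FROM node n WHERE n.uid = ? AND n.status = 1
  UNION
  SELECT c.nid, c.timestamp AS ts
    FROM comments c JOIN node n ON n.nid = c.nid
   WHERE c.uid = ? AND n.status = 1
) GROUP BY nid ORDER BY ts DESC LIMIT 10
""", (5, 5)).fetchall()
```

For user 5, node 2 (commented on at 250) outranks node 1 (authored, last changed at 100), and node 3 doesn't appear at all.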
Comment #30
pwolanin commented:
@FiReaNG3L - as with the recent comments query in comment module, I think there are problems, since the LIMIT syntax varies a bit by DB (which is why we have db_query_range()), as does support for compound queries.
So, even with the new column, it seems likely to be necessary to do the second query as in the patch.
Comment #31
kbahey commented:
Looking at an EXPLAIN of this query, I get this:
What this tells me is that it is doing a temporary table for the node_comment_statistics table and doing filesort on it, and that should be the slow part.
All we need from this table are the last_comment_timestamp and comment_count columns. These columns are 1:1 with the node table and, according to normalization rules, can (or should) be in the node table itself. Yes, this will cause an UPDATE on the node table every time a new comment is added or deleted, but it may not be an issue. So,
1. Can someone with ssh access on d.o do an explain on the query on d.o, and paste the output here, to confirm if the access path and contention is indeed in the node_comment_statistics table?
2. If there are no significant objections, I can go ahead and create a patch for this, and we can test it on scratch.d.o for a definitive yes/no.
Comment #32
pwolanin CreditAttribution: pwolanin commentedWhat about the suggested solution of making this a new column in {node_comment_statistics}? Yes, you'll still have the join, but you could index this new column and so the penalty of sorting a computed value would go away.
I think (from comments above) writing the node table for each comment update is a bad idea- that's why {node_comment_statistics} is a separate table.
Comment #33
kbahey commented:
Here are the EXPLAINs again with the correct queries (the first ones were not as per the module).
With the GREATEST and node_comment_statistics:
Without the GREATEST and the node_comment_statistics table at all:
I am getting variances from site to site as to keys and such, but all do a temp and file sort without a where for the first query (the existing one in tracker module).
Can we confirm that this is the same on d.o ... someone with ssh access ... please?
Comment #34
dww commented:
on d.o:
and:
Comment #35
killes@www.drop.org commented:
@Fireangel: Your UNION query takes about the same time as the original one :( It also only returned 6 results.
@kbahey: Exchanging temporary table usage for table locks strikes me as a bad idea.
Comment #36
pwolanin commented:
@killes - Do you see any improvement with just the patch in #25?
Comment #37
kbahey commented:
@dww: thanks
@killes: true, locking is not good, but we don't know how much of that will happen vs. the filesort of 105,000 nodes every time an authenticated users hits tracker.
Also, my suggestion does not avoid the GREATEST problem anyway, which causes the filesort.
Seems the most promising solution (from a performance point of view) is to have a new derived column with GREATEST precalculated in it, as pwolanin said.
Comment #38
killes@www.drop.org commented:
@pwolanin: Yes, #25 helps a bit, but not that much. It brought the query down from 23 to 19 seconds.
Comment #39
killes@www.drop.org commented:
SELECT DISTINCT(t.id) AS nid, t.title, t.type
FROM ((SELECT c.nid AS id, n.title, n.type, c.timestamp AS timestamp
       FROM comments c INNER JOIN node n ON n.nid = c.nid
       WHERE c.uid = 83 AND n.status = 1 ORDER BY timestamp DESC)
      UNION
      (SELECT n2.nid AS id, n2.title AS title, n2.type, c.last_comment_timestamp AS timestamp
       FROM node n2 INNER JOIN node_comment_statistics c ON n2.nid = c.nid
       WHERE n2.uid = 83 AND n2.status = 1 ORDER BY timestamp DESC)
      UNION
      (SELECT n3.nid AS id, n3.title AS title, n3.type, n3.changed AS timestamp
       FROM node n3 WHERE n3.uid = 83 AND n3.status = 1 ORDER BY timestamp DESC)) AS t
ORDER BY t.timestamp DESC LIMIT 0, 10;
I've been playing with the above query, which brought the execution time down to 16 seconds.
Comment #40
pwolanin commented:
@killes: even that much improvement makes it seem worth committing #25 as the basis for further improvement?
Comment #41
killes@www.drop.org commented:
I should mention that my query gets a slightly different set of results. Mainly the order seems to differ.
@pwolanin: It probably won't hurt.
Comment #42
Dries commented:
As the node table and the node_comment_statistics table have a 1-on-1 mapping, we might as well merge both into one table.
Comment #43
merlinofchaos commented:
Before we merge node_comment_statistics into node, we should check into why we even need to.
If there is a 1::1 mapping, node_comment_statistics should be sorting on its primary key. Why would it need a filesort?
Is the problem just that we're not indexing that table properly?
Comment #44
magico commented:
It doesn't need any DISTINCT; there are only INNER JOINs, which should return only one row each. Removing the DISTINCT should give some speedup.
I removed the DISTINCT; the most expensive part, the LEFT JOIN with the "comments" table, is now filtered before the final join.
Cannot test with thousands of records, but the query works.
Comment #45
magico commented:
It's 2 queries:
* the first should have the DISTINCT removed, no more optimization to do except merge the tables schema
* the second, should be applied like it is
Comment #46
moshe weitzman commented:
i already tried to merge node_comment_statistics into node but the initial benchmarks were not promising. folks are welcome to keep trying though. see http://drupal.org/node/75122
Comment #47
merlinofchaos commented:
Removing DISTINCT won't help. node_access will just add it back.
Comment #48
kbahey commented:
@magico: the DISTINCT is unnecessary, I agree. But even if it is removed, I think that db_rewrite_sql() will add it back (a common problem).
@Dries: I suggested the same in #31 (merging the fields back into the node table). The issue is that the node table will then be updated with every comment post and delete, which *might* cause extra locking. This is the best solution from a schema/normalization point of view, but may create bottlenecks.
@merlinofchaos: The main problem is that we use GREATEST(col1, col2), which must do a filesort of the entire table, since it does not know which of the two values is greater per row; hence the performance problem.
Creating a new, indexed column called last_update should solve this bottleneck, but that field is then a derived field that depends on other columns. If a module updates the others but forgets to update the derived field, then we have data inconsistency.
None of the above solutions appeal to me so far.
Comment #49
magico commented:
@kbahey: don't forget the second query, which has a different JOIN.
Could you spend a few minutes and test both queries, running them at the server (if you have access)?
Thanks
Comment #50
FiReaNGeL commented:
@kbahey: 'Creating a new column called last_update and making it an index should solve this bottleneck, but that field is now a derived field that depends on other columns. If a module updates the others, but forgets to update the derived field, then we have data inconsistency.'
I don't see this as a problem. Modules shouldn't update node.changed and comment.last_updated_timestamp directly anyway.
@killes: Did you try it as-is? Because I wrongly put status = 0 (instead of 1) while testing; this could explain why you only got 6 results :) The LIMITs in the subqueries should also go, as this strategy will screw up paging.
I'll rework the UNION query, as I think it's key to optimizing this beast. The reason is that we want a merge of two different things: recent posts by the user, and posts that the user recently commented in (correct me if I'm wrong). I'm sure we can optimize those two parts separately and merge them, and that it would be faster. If we do both in one query, it can't use indexes effectively because of the "AND (n.uid = %d OR c.uid = %d)" (a WHERE clause on two different tables, so it can't optimize for an ORDER BY).
I also think that the joins on users and node_comment_statistics can be LEFT JOINs, because there's a 1:1 mapping and they can't be null.
That being said, the db change (adding a new column) is still needed.
Also, removing the DISTINCT, if it's useless, is needed. We should rework the DB API if it's adding it back uselessly, as DISTINCT can be costly.
We should get the DB change in and optimize the query from there.
Comment #51
dww commented:
to clarify, in english, what this query is trying to accomplish: it lists the nodes a given user either created or commented on, sorted by the most recent activity on each (whether that's the node being edited or a new comment being posted).
for obvious reasons, this "latest activity time" stuff is the killer in this query, which is why, way back in the original post, i suggested we just store that timestamp ourselves, update it whenever either the node is edited or a comment is added, have an index on that field, and (probably) see all of our problems go away. i'm fine with renaming {node_comment_statistics} to just {node_statistics} and putting the column in there, to avoid LOCKs on the {node} table for every new comment...
yes, it's "derived" info, in the sense that it's always equal to either {node}.changed or {node_comment_statistics}.last_comment_timestamp. however, last_comment_timestamp is only set in exactly 2 places in all of core: 1) comment_nodeapi('insert') and 2) _comment_update_node_statistics(). {node}.changed is only touched via node_save(). so, there's only 3 places we'd have to change to support this field. no contribs should ever be updating these columns directly, anyway.
i don't want to spend the time on a patch if it's going to be vetoed outright by the D6 core maintainers, but i've always believed this is the simplest and most effective solution to this problem. should we at least try it and see? the patch should be relatively simple (~100 lines, including the schema upgrade), and we could run it on s.d.o for testing. but, if it's doomed from the start, i don't want to put effort into that approach.
Comment #52
moshe weitzman commented:
@dww - i always thought that was the best solution too. go for it.
the "but it is another derived column" argument is not worth discussing anymore. this problem requires some compromise.
Comment #53
dww commented:
here's a patch for the new column:
- schema upgrade (coded for both mysql and pgsql).
- changes to update the column correctly in comment.module and node.module.
- changes the tracker query to use the new column instead of GREATEST().
i tested the upgrade path on mysql, but haven't had a chance to test pgsql yet. the schema update is probably more complicated than it should be, since it started out renaming the table to just {node_statistics}, but that seemed like a lot more of a diff than i wanted to mess with in this patch. so, we could probably simplify things in there, but it certainly works as-is (and has the benefit of a single query to update all the records, even on mysql 4.0 sites like d.o). ;)
in my testing, this all seems to work great. i don't have a good environment for stress testing and benchmarking, so help with that would be most appreciated.
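One fiddly part of such an upgrade is the backfill step: seeding the new column from the larger of the two existing timestamps in one query. A hedged sketch of what that step looks like (sqlite standing in for MySQL, sqlite's two-argument MAX() playing the role of GREATEST(); the `last_updated` column name is assumed for illustration, not taken from the patch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, changed INT);
CREATE TABLE node_comment_statistics (nid INTEGER PRIMARY KEY,
                                      last_comment_timestamp INT,
                                      last_updated INT);
""")
db.executemany("INSERT INTO node VALUES (?, ?)", [(1, 50), (2, 300)])
db.executemany("INSERT INTO node_comment_statistics VALUES (?, ?, NULL)",
               [(1, 120), (2, 80)])

# single-query backfill: each row gets the later of "last comment" and
# "last node edit", after which writes keep the column current
db.execute("""
UPDATE node_comment_statistics
   SET last_updated = MAX(last_comment_timestamp,
                          (SELECT changed FROM node
                            WHERE node.nid = node_comment_statistics.nid))
""")
rows = db.execute("SELECT nid, last_updated FROM node_comment_statistics "
                  "ORDER BY nid").fetchall()
```

Node 1's last comment (120) beats its last edit (50); node 2's edit (300) beats its last comment (80).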
Comment #54
FiReaNGeL commented:
Thanks dww for this great initiative! I manually patched my 5.x install and everything is working #1; the upgrade path worked too.
Sadly, I think I have bad news to report; it didn't change a thing on my dev box (performance-wise). For some reason, MySQL refuses to use the new index to sort.
I'd be interested in seeing the results on scratch.d.o.
It would also be interesting to have others bench it.
The EXPLAIN changed a bit
It's strange, because like dww and others I thought that an additional column to get rid of the ORDER BY GREATEST would help things. The problem is that MySQL can't use it.
From: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
So the problem is that we're doing a WHERE on many tables, and MySQL can only use one index at a time, so it cannot sort by our new index.
The following query is 3 times faster when compared to the original query (both pre and post dww's patch, with a proper ORDER BY on the new column and without the GREATESTs) and gives the same results (on my dev box at least). It's better than my last one, which couldn't be paged and gave funky results:
And for the record, I found out that the DISTINCT was needed (for the cases where you commented in your own node). UNION queries deduplicate by default, so no need to add it back.
Comment #55
Dries commented:
I think dww's approach makes sense. Good analysis.
I'd still like to see this combined with http://drupal.org/node/75122 -- maybe that will make the index work!
Comment #56
killes@www.drop.org commented:
@fireangel: I don't get a performance improvement from your new query.
@dww: can you patch s.d.o?
@Dries: I am afraid #75122 will give us some additional table locks. However, my offer to benchmark it still stands if somebody updates the patch.
Comment #57
dww commented:
i patched s.d.o and ran the db update (took just over a minute to run, which is pretty good IMHO). sadly, neither http://scratch.drupal.org/tracker nor http://scratch.drupal.org/tracker/46549 seems any faster. :( stupid mysql.
Comment #58
FiReaNGeL commented:
dww: yeah, it seems consistent with what I saw on my dev box :(
killes: strange; maybe the improvements I see are MySQL 5.1-specific, or the size of the d.o db changes things. Still, I find it strange and disappointing.
Here's another idea: we could restrict "recent posts" to a month, with links to see the past 6 months, 1 year, and since the dawn of (unix) time. At the end of the pager, we could add a link such as 'click here to see more' or something along those lines.
This would definitely be faster. And most of the time, I don't think people are interested in seeing all their post history all the time.
Comment #59
kbahey commented:
Strange and sad.
I was hoping this would improve things.
:-(
Comment #60
pwolanin commented:
Can we combine dww's patch with #25 to avoid the join?
Perhaps also a compound index is needed (at least for MySQL)?
Comment #61
kbahey commented:
I am trying to contact a MySQL expert about this (why there is no improvement with the latest patch).
Will let everyone know if I find something useful.
Comment #62
magico commented:
Can someone instruct me on how I can generate thousands of nodes and comments in a heterogeneous way, so I can test my queries above?
Thanks
Comment #63
dww commented:
@magico: install devel.module and devel_generate.module.
Comment #64
pwolanin commented:
@dww - I think devel module needs a patch for generate - it was not properly populating the {node_comment_statistics} table for me. I'll assume that the current code is just a refactoring of the 5.x scripts, which saved nodes directly to the DB rather than calling node_save().
Comment #65
David Strauss commented:
Here is a patch against Drupal 5.1. I chose to make it against Drupal 5.1 because Drupal.org desperately needs the update. It can easily be adapted to Drupal 6.
This patch splits the bad tracker queries into two queries: one to get recent nodes and another to get recent comments. The results are intelligently combined in PHP. With this change, the queries can now run in logarithmic time.
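The combine-in-the-application step might look something like this (a Python sketch of the concept only; the real patch is PHP and its details may differ): since each query returns rows already sorted by timestamp, the code only has to merge two sorted streams and keep the newest N, never re-sorting the full history:

```python
import heapq
from itertools import islice

# (timestamp, item) rows, as each query would return them: already
# sorted newest-first, and each stream individually LIMIT-ed by SQL
recent_nodes = [(300, "node 3"), (100, "node 1")]
recent_comments = [(250, "comment on node 2"), (90, "comment on node 1")]

# heapq.merge expects ascending inputs, so negate the timestamp key to
# merge the descending streams; islice keeps only one page of results
top = list(islice(
    heapq.merge(recent_nodes, recent_comments, key=lambda r: -r[0]), 3))
```

The merge touches at most one page's worth of rows from each stream, which is where the claimed speedup over a global filesort comes from.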
Comment #66
David Strauss commented:
I've updated the patch to address a problem where you could only see results from nodes you posted.
Comment #67
David Strauss commented:
Ignore the patches I've posted so far. I'm working on a few fixes.
Comment #68
David Strauss commented:
Here's my latest patch, produced in cooperation with killes, who helped by testing on scratch.drupal.org.
This patch slightly changes tracker functionality. The code is optimized around showing a reasonably long history of recent comments and node updates. It doesn't show the entire history if it's really long. (If the history is short, then the functionality should be nearly identical.) This trade off should be alright because any site with enough nodes and comments that the limit affects the results probably has performance issues with the current module's implementation.
The following indexes (shown below in MySQL syntax) need to be added:
ALTER TABLE `comments` ADD INDEX `tracker_user` ( `uid` , `status` , `timestamp` );
ALTER TABLE `comments` ADD INDEX `tracker_global` ( `status` , `timestamp` );
ALTER TABLE `node` ADD INDEX `tracker_user` ( `uid` , `status` , `changed` );
ALTER TABLE `node` ADD INDEX `tracker_global` ( `status` , `changed` );
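As a rough sanity check of what these compound indexes buy (using sqlite rather than MySQL, so the optimizer details differ): a (uid, status, changed) index can satisfy both the equality WHERE clauses and the ORDER BY for the per-user query, so no separate sort pass is needed:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, uid INT, status INT, changed INT);
CREATE INDEX tracker_user ON node (uid, status, changed);
""")

# ask the planner how it would run the per-user tracker query; the
# detail text is always the last column of each plan row
plan = " ".join(str(row[-1]) for row in db.execute(
    "EXPLAIN QUERY PLAN SELECT nid FROM node "
    "WHERE uid = 1 AND status = 1 ORDER BY changed DESC"))
```

The plan text names `tracker_user`, confirming the index drives the query; the equality columns (uid, status) come first in the index so the trailing `changed` column is left in sorted order.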
Comment #69
David Strauss commented:
webchick requested that $sanity_limit pull from a variable (even though there won't be an interface exposed).
Comment #70
Senpai commented:
@#51 by dww:
Why is it necessary for Tracker to display *all* posts and comments by a single user, and then try to sort that mountain of data by most recent activity? Wouldn't it be much more efficient to offer the user a chance to see nodes created by that uid OR comments posted by that uid? I mean no disrespect, but it appears very difficult from a user friendly standpoint to offer a complete list of everything that user has ever done, and lump it into page after page of links. Not to mention that nodes and comments are still two different entities, so why not treat them as such?
What if, since you're going back into Tracker to optimize, you look at splitting the query into two parts? Offer users a tab to see their own comments, or a tab to see their own posts. After all, in all my years of surfing sites, I've never needed to look up all my own comments while searching for content that I created, and I've never needed to see stories I've posted while I was looking for that thread I commented to.
You could cut down on the amount of data retrieved by almost 50% if you went this route. Now, having said that, I have no idea how much it would jack up the Tracker module to attempt this. It might be darn easy, but then again, it could require a complete refactor. I just don't know.
Comment #71
pwolanin commented:
I'm not really liking that you run individual queries to get this data:
Wouldn't it be more efficient to run through the array, collect the nid and uid values and get them in 2 queries?
Also, I don't like the naming of the array as $results. Almost everywhere in core, $results is used as the name of the variable holding the resource returned from a db_query(). How about calling it something like $tracker or just $rows?
Comment #72
David Strauss commented:
@Senpai:
It was one of the first things I asked killes about. Killes thought that it wouldn't be friendly to users. This whole problem becomes quite trivial if you don't combine comments with nodes. As you can see from my patch, I actually split nodes and comments into two separate queries to get the speed increase.
Comment #73
David Strauss commented:
Global comments query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000;
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+
| 1 | SIMPLE | c | ref | lid,tracker_global | tracker_global | 1 | const | 359 | Using where |
| 1 | SIMPLE | n | ref | PRIMARY,status,node_status_type,nid,tracker_global | PRIMARY | 4 | fk_portal.c.nid | 1 | Using where |
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+
Global nodes query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
| 1 | SIMPLE | n | ref | status,node_status_type,tracker_global | tracker_global | 4 | const | 222 | Using where |
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
1 row in set (0.00 sec)
User nodes query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
| 1 | SIMPLE | n | ref | status,uid,node_status_type,tracker_user,tracker_global | tracker_user | 8 | const,const | 77 | Using where |
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
1 row in set (0.00 sec)
User comments query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
| 1 | SIMPLE | c | ref | lid,tracker_user,tracker_global | tracker_user | 5 | const,const | 47 | Using where |
| 1 | SIMPLE | n | ref | PRIMARY,status,node_status_type,nid,tracker_global | nid | 4 | fk_portal.c.nid | 1 | Using where |
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)
These are running on my company's portal, but the query plan shouldn't differ elsewhere.
Comment #74
David Strauss
@pwolanin
"Wouldn't it be more efficient to run through the array, collect the nid and uid values and get them in 2 queries?"
It would only decrease chatting with the database. If you give MySQL a disjoint set of values to match (like a list of uids), it will basically re-run the query to satisfy each value. Plus, the collection, pre-processing, and post-processing would possibly cancel out the meager reduction in communication with the DB.
Comment #75
webchick
Output from devel module. Values are ms, number of times the query ran, and the query executed.
drupal.org, without the patch:
/tracker:
/tracker/24967:
scratch.drupal.org, with the patch:
/tracker:
/tracker/24967:
Comment #76
Dries CreditAttribution: Dries commented
Might be better to run both on the same database (i.e. run both on scratch.drupal.org) -- it might make a big difference.
That said, this seems like a clear winner. I'll try to look at it more closely tomorrow morning. In the meantime, feel free to work on this more. We might be able to squeeze more out of this.
Comment #77
webchick
Right you are. Here are the numbers from scratch.drupal.org, sans patch, so we're comparing apples to apples:
tracker:
tracker/24967:
Comment #78
FiReaNGeL CreditAttribution: FiReaNGeL commented
Great improvement! Same strategy I tried to apply intra-MySQL with my union query, but this is definitely better! Kudos, David :)
Comment #79
dww
i helped webchick get the numbers on s.d.o, and everyone agreed this looks like a vast improvement. feeling bold, i applied the patch on d.o, and the tracker pages clearly speak for themselves. if we want to tweak this and make it even better, great, but d.o is finally usable again, even on the old db hardware. in order to stop the bleeding, i decided to take bold action...
huge thanks to david strauss for this work!
Comment #80
webchick
drupal.org, post-patch:
tracker:
tracker/24967:
We owe David Strauss, and his company Four Kitchen Studios who paid to let him work on this all day ;), a great deal of gratitude. If possible, please credit both in the commit message.
Btw, it looks like we can tweak this further by getting rid of the username query in /tracker.
Comment #81
catch
on my site. 15,000 nodes, 176,000 comments, 3,750 users
etch vps, 1gb ram, php5.2, mysql5, apache2.2, apc.
before patch:
/tracker
478.79 1 pager_query SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN users u ON n.uid = u.uid INNER JOIN node_comment_statistics l ON n.nid = l.nid WHERE n.status = 1 ORDER BY last_updated DESC LIMIT 0, 25
/tracker/11
1021.4 1 pager_query SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 11 OR c.uid = 11) ORDER BY last_updated DESC LIMIT 0, 25
739.03 1 pager_query SELECT COUNT(DISTINCT(n.nid)) FROM node n LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 11 OR c.uid = 11)
after patch
/tracker
415.74 1 tracker_page SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
/tracker/11
338.51 1 tracker_page SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 11 ORDER BY last_updated DESC LIMIT 0, 1000
And a lot of page hits are getting served up by the MySQL query cache (0.35 speeds); not sure if that's relevant, but it gave me a shock when I saw it.
Comment #82
kbahey CreditAttribution: kbahey commented
Excellent work David!
Silly MySQL got fooled by discrete queries vs. one combined. No filesorts anywhere anymore.
I guess we can return the new database server for a refund now ... :-)
Comment #83
dww
@catch are you sure you added the new indexes that the patch relies on? see comment #68 for details. the patch doesn't (yet) include a schema update to add these indexes via update.php so you have to do it manually.
Comment #84
David Strauss
@catch
Your results look like you haven't run the index updates.
Comment #85
webchick
Ok, here is David's patch with the following changes:
1) Rolled against HEAD, rather than 5.x.
2) Minor re-formatting/re-wording of comments to fit coding standards
3) Upgrade path. I included this in a tracker.install file, because I wasn't sure if those indexes slowed down the site if you're not using tracker. I also am not sure if we should have tracker.install adding and removing indexes willy-nilly (on uninstall, etc.)... on scratch.drupal.org this locked the site for over a minute.
Remaining issues:
1) (I think) Global tracker should not be doing the username query.
2) There's a small bug... if you go to http://drupal.org/tracker?page=55, the last entry is an empty post by Anonymous posted 37 years 20 weeks ago.
I will need help to solve these, and also input from someone who knows databases well on where these index changes should go (system.install or tracker.install).
Comment #86
webchick
Sigh. How about a patch? :)
Comment #87
Michelle
I noticed that my recent posts is now a bit odd... The status update post, for example, is showing that it was last updated 1 day 12 hours ago even though there are 7 new posts since I last read it just a few hours ago. I also seem to remember there being other more recent posts at the top, though I can't swear to that.
On the 37-year post in #85, that might not be new. Some time ago... several months at least, maybe even a year or more ago, I got curious and looked at the end of the tracker, and I remember there being some weirdness with impossibly old posts. I know that's vague and not terribly helpful, but I thought I should mention that it might be an existing bug from before this work was done.
Michelle
Comment #88
David Strauss
Replace:
With:
I inaccurately assumed that something would get added every loop, but that's not the case because of duplicates.
Comment #89
David Strauss
Hey, I even messed up the comment:
Comment #90
David Strauss
I keep pasting the wrong thing:
This fixes the phantom post on the last tracker page.
Comment #91
mcurry CreditAttribution: mcurry commented
subscribing
Comment #92
catch
@dww/david strauss: oops.
Won't be able to run again until a bit later but will post back!
Comment #93
xamox CreditAttribution: xamox commented
Thanks! The site seems 10x faster!
Comment #94
catch
OK with patch, same spec as before, with the index update this time:
I love you all.
Comment #95
catch
arrgh, that one was the query cache fooling me. Still about
21.47
which is 20-50 times as fast on 15k nodes, 170k comments, 3.7k users. Amazing work.
Comment #96
Michelle
Ok, I just found another post that is showing up properly as updated in all recent posts and not in my tracker. http://drupal.org/node/112421 is the post. And http://drupal.org/node/144228 is still doing it as well as I reported a few comments back.
Michelle
Comment #97
David Strauss
@Michelle
Node 144228 is the 8th item in your tracker. Node 112421 is on the second page of your tracker.
Comment #98
webernet CreditAttribution: webernet commented
There also seems to be a small bug - the last entry in each tracker is strange.
For example:
Comment #99
David Strauss
@webernet This issue is corrected with the change in #90.
@Michelle I should clarify that there is a change in the method the tracker uses to determine how recent something is. The old tracker used the latest comment timestamp even if it wasn't your comment. My tracker patch uses your latest comment timestamp on a node. I think this is an improvement. Why should a node you commented on last year appear on the first page of your tracker just because someone else commented on it today?
Comment #100
pwolanin CreditAttribution: pwolanin commented
@David - I think ordering should be based on when any comment is made - isn't that the point of the tracker? How else can I see when my comment provoked another comment, or someone else had related information. I think there was a suggestion above for separate tabs with "my recent comments" and "my recent posts", but that's separate.
Comment #101
David Strauss
@pwolanin You can easily see when your (recent) comments have created discussion. Just go through the first few pages of your tracker and look for "X new" next to the items. Moreover, if the goal is to see how your comments have created discussion, then we should create a better, thread-based model for tracking.
Comment #102
pwolanin CreditAttribution: pwolanin commented
@David - I disagree. I already know when I made a comment; the purpose (to me at least) of the tracker is to know when others comment on the same threads.
In any case, this patch should not change the function of the tracker, at least not without more serious discussion.
Comment #103
David Strauss
@pwolanin Always keep in mind that the tracker isn't just for your use. I might want to see nodes you've recently commented on. If you'd like to see discussions you've sparked, I'd prefer to add a tracker tab for that purpose alone.
Comment #104
FiReaNGeL CreditAttribution: FiReaNGeL commented
I agree with pwolanin here; to me the tracker is useful to check for replies to comments I made in the past. As I understand it, this functionality is gone since your (quite excellent) patch. A tradeoff to be made maybe, but I'd like to keep that functionality :)
Comment #105
dww
i agree w/ pwolanin, too. the underlying behavior should stay the same: any edit to the node or comment (regardless of who made it) counts as "activity", and the tracker should be sorted by most recent activity, period.
also, sorry i haven't gotten around to re-patching d.o based on the bug fixes posted here. however, i'd like to just wait for a new patch (a real patch, please) that a) includes the bug fixes and b) corrects the tracker behavior to return it to its original behavior...
thanks,
-derek
Comment #106
Michelle
Just getting back to my computer now. Ok, so the weirdness I've been experiencing is intentional by the sound of it, but I don't agree. I use my tracker all day long to see if any threads I've commented on have been updated. Having to go through every single page of my tracker looking for updated markings every time would suck big time. They really need to float to the top just like they do in the all posts tracker. To answer your question, yes, if someone comments on a thread I participated in a year ago, I want to see it.
Michelle
Comment #107
David Strauss
@dww I'm not sure we can return tracker behavior to the original behavior with acceptable performance. Currently, my tracker code JOINs comments with nodes to produce its results. Listing the tracker items by the latest comment on the node requires another JOIN from nodes to the node_comment_statistics table. Such a JOIN may push the query back into temp_table/filesort territory. I initially tried to optimize the global feed to use the node_comment_statistics table instead of comments, but changing the JOIN sent the query back into using temp_table/filesort because of the condition requiring nodes to be published. MySQL seems to want to apply WHERE restrictions first, even if they're on the node table and very inclusive. Returning tracker to the original behavior may involve larger schema changes than adding a few indexes.
Comment #108
catch
another -1 on the functionality change. If people have to go back through ten pages to find replies to their posts, then we lose the benefit of the query optimisation from all those extra page requests.
Comment #109
catch
Also, "last updated" now displays the last time you posted, not when the node was last updated, which isn't what it says. This is obviously a side-effect of the change in behaviour, but in terms of UI it's a bug since that column isn't there to tell you when you commented.
Comment #110
kbahey CreditAttribution: kbahey commented
David
How about pushing all the fields (or the two we care about) from the {node_comment_statistics} columns back into the {node} table? This will avoid the join you talk about, and it is a 1:1 relationship anyway.
Moshe tried a bit with this. See above for the issue on that.
Comment #111
Dries CreditAttribution: Dries commented
I also use the tracker to look for follow-ups on my posts/comments.
Comment #112
webchick
IMO, splitting these into two tabs would not be the end of the world. It would make it easier for me to find handbook pages I wrote vs. random forum posts where I helped someone (I care a lot more about making sure I'm keeping up with the comments on the former :)). And people who want just a single tab with the old behaviour on their sites can just use Views, which comes with a 'tracker' view for this purpose.
But I didn't quite understand the most recent comments... does that mean we'd actually need *four* tabs? "recent posts", "recent comments", "my recent posts", "my recent comments" ? If so that would be... icky. ;)
Comment #113
David Strauss
@kbahey This would be a good idea. It's the {node_comment_statistics} and {node} JOIN that kills the query. What MySQL does is see the published = 1 restriction on {node} and no restriction on {node_comment_statistics}, so it starts the join from {node}.
@catch We only lose the benefit if they go through about 300 pages. :-) Fixing last updated is easy: we only need to load those values for the current page.
@webchick Along those lines, we could split it into "My posts" and "My conversations." The latter would list nodes where you've commented in descending order of the latest comment by anyone.
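The per-page fix mentioned in #113 could look something like this sketch (the query shape is assumed from the originals earlier in this thread; the nid list is hypothetical):

```sql
-- After the fast query picks the 25 nids for the current page, a second,
-- cheap lookup fetches the real "last updated" values just for those rows.
-- Primary-key lookups on a handful of nids need no filesort.
SELECT n.nid,
       GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
       l.comment_count
FROM node n
INNER JOIN node_comment_statistics l ON n.nid = l.nid
WHERE n.nid IN (101, 102, 103);  -- the nids already chosen for this page
```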
Comment #114
David Strauss
I'm going to play around with joins to {node_comment_statistics} to see if there's anything that won't require a temp_table.
Comment #115
David Strauss
After some research and experimentation, I've concluded that placing the fields from {node_comment_statistics} in {node} will not help. The problem is that, for the user tracker, we put conditions on {comments} but sort on a different table. This makes sorting using a key impossible, forcing a filesort.
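To illustrate the problem described in #115, here is a sketch (not a query from the patch): even with the {node_comment_statistics} fields folded into {node}, the user tracker filters on one table and sorts on another:

```sql
-- The WHERE restricts {comments} (c.uid, c.status), but the sort key lives
-- on the joined table, so no single index can drive both the filter and
-- the sort; MySQL of this era typically reports
-- "Using temporary; Using filesort" in the Extra column.
EXPLAIN SELECT n.nid, n.changed
FROM comments c
INNER JOIN node n ON n.nid = c.nid
WHERE c.uid = 1 AND c.status = 0
ORDER BY n.changed DESC;
```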
Comment #116
David Strauss
Okay, here's what needs to happen to the database schema to maintain tracker performance while returning it to the original functionality. The last_comment_timestamp column needs to appear in {comments}. (Yes, it's a violation of normalization, but it's the only way to make this work.)
This needs to run every time a comment is posted:
UPDATE {comments} SET last_comment_timestamp = %d WHERE nid = %d (with the obvious values for timestamp and nid)
Comment #117
David Strauss
Updates for system.install:
ALTER TABLE {comments} ADD last_comment_timestamp INT(11) NOT NULL;
ALTER TABLE {comments} DROP INDEX tracker_user, ADD INDEX tracker_user (uid, status, last_comment_timestamp);
This patch may need some revisions, but I'd rather post early so people can have a look.
Comment #118
David Strauss
One more update for system.install (in addition to those in #117):
ALTER TABLE {comments} DROP INDEX tracker_global, ADD INDEX tracker_global (status, last_comment_timestamp);
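With both indexes in place, the WHERE and the ORDER BY of each tracker query can be satisfied from {comments} alone. A sketch of the shape the queries take (assuming the patched queries keep the form shown in #73; the LIMIT values are illustrative):

```sql
-- Global tracker: tracker_global (status, last_comment_timestamp) covers
-- both the filter and the sort, so rows come back pre-ordered.
SELECT n.nid, n.title, c.last_comment_timestamp AS last_updated
FROM comments c
INNER JOIN node n ON n.nid = c.nid
WHERE c.status = 0 AND n.status = 1
ORDER BY c.last_comment_timestamp DESC
LIMIT 0, 25;

-- User tracker: tracker_user (uid, status, last_comment_timestamp) does
-- the same for a single user's comments.
SELECT n.nid, n.title, c.last_comment_timestamp AS last_updated
FROM comments c
INNER JOIN node n ON n.nid = c.nid
WHERE c.uid = 1 AND c.status = 0 AND n.status = 1
ORDER BY c.last_comment_timestamp DESC
LIMIT 0, 25;
```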
Comment #119
catch
happy to test changes, but only have a 5.1 site to test on. Is it cheeky to ask for a 5.1 version of that patch?
Comment #120
David Strauss
Complete patch for 5.1, including schema updates.
Comment #121
David Strauss
This is a reordering of the updates in the 5.1 patch to create the index only after the column gets populated.
Comment #122
David Strauss
Comment #123
catch
@David Strauss, thanks for the 5.1 patch
I've applied that. The database update went fine and the patch applied OK. Attached a new patch which just fixes the typo db_queryd > db_query and removes some folder information to make it apply more easily.
Something's going on with the main tracker query though:
On {comments} I have the indexes tracker_user (uid, status, last_comment_timestamp) and tracker_global (status, last_comment_timestamp).
last_comment_timestamp seems to have been updated OK in the {comments} table as well, so I'm not sure why it's so much slower than tracker/1.
Comment #124
catch
argg, the last patch added my own typo. This one ought to be OK. Going to try to see whether my 2,000-4,000ms /tracker query is a result of the new query or my own incompetence applying patches ;)
Comment #125
catch
alright, for some reason $sanity_limit isn't being applied when the query gets executed:
I hard coded LIMIT 0, 1000 into $comments_sql
which is fine. can confirm that "my recent posts" now functions as it used to as well.
Can't work out why $sanity_limit isn't running though, it seems to be the same as before.
Comment #126
David Strauss
You replaced db_queryd (which was admittedly a typo) with db_query. It should have been replaced with db_query_range. Here's an updated patch.
Comment #127
David Strauss
The LIMIT patch I posted should fix the problem. Because placing a LIMIT on the query speeds it up so much, the 2 to 4 seconds spent on the non-limited query must be due to the sheer quantity of results. Every comment is in that result set when there's no LIMIT.
Comment #128
catch
We have c. 190k comments, I think that's why the LIMIT made such a difference.
Changed to db_query_range (manually, but patch looks fine), took out hard coding:
RTBC?
Comment #129
David Strauss
RTBC?
Comment #130
catch
"ready to be committed" - i.e. it works great, no bugs left that I can see, so it's just time to commit it to core. I'd imagine core committers will want another review or two before it goes in, so I didn't mark it as such.
Comment #131
David Strauss
Considering the necessity of having this patch on Drupal.org and the fact that it's already made for 5.1, I'd like to roll this patch into 5.2 and 6.x-dev.
Comment #132
David Strauss
@catch The patch still needs PostgreSQL schema updates.
Comment #133
catch
I think the way it's done now is committed to 6.x then backported to 5.x-dev (which'd mean it'd get into the 5.2 release when that comes out, assuming a security fix doesn't beat your patch going in).
BTW, I know this was for drupal.org, but it's made a big difference to the load on my own site as well, so thanks again!
Comment #134
kbahey CreditAttribution: kbahey commented
I have serious objections to this.
So far the denormalizations we have been doing were of the benign type, such as having a 1:1 relationship broken into two tables to minimize locks, etc.
This one is far less benign, because of the sheer number of rows involved. We have far more data redundancy.
Before I say more, let us see the impact of this via benchmarks. If it doesn't do much good, then it is dead in the womb. If it does improve things, we can weigh the trade-off of redundancy vs. performance.
Comment #135
David Strauss
@kbahey Object all you want, but maintaining the original tracker interface at a good level of performance requires it. The only way to avoid temporary tables and filesorts is to have the WHERE and ORDER BY criteria in the same table. (If they're not in the same table, an index cannot span them.) Because the WHERE criteria are in the comments table, the ORDER BY criteria must exist there, too. Take a close look at the tracker query for comments, and you'll see what I mean.
The UPDATE query I've written shouldn't have too bad of an effect. It scales linearly with the number of comments on a node. It only takes logarithmic time to identify the comments associated with a node because of the nid index on comments. The UPDATE also won't affect row sizes because we're only updating an integer. Keeping the rows the same size goes a long way to keeping performance up.
{node_comment_statistics} is a bad example of denormalization. I'm working toward removing it from Drupal entirely. A query has to JOIN it with {node} to make it useful. The extra JOIN reverses many of the performance benefits it was designed to provide.
Comment #136
Chris Johnson CreditAttribution: Chris Johnson commented
I tend to agree with David here. Denormalizing by duplicating data to improve performance is a classic case of when denormalization is justified.
Comment #137
David Strauss
One other place I could denormalize instead of {comments} is {history}. By putting an interacted (boolean) and node_last_updated (timestamp) into there, it would be easy to query for nodes that the user had interacted with but that had been updated since the last visit. It would be slightly less duplication of data than in {comments}. It would also allow us to have only one query for tracker because node_last_updated would be updated with a node change or a comment.
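A sketch of what that {history} variant might look like (all column and index names below are hypothetical illustrations of the idea, not code from any patch):

```sql
-- Hypothetical denormalization into {history}:
ALTER TABLE history ADD interacted TINYINT NOT NULL DEFAULT 0;
ALTER TABLE history ADD node_last_updated INT NOT NULL DEFAULT 0;
ALTER TABLE history ADD INDEX tracker (uid, interacted, node_last_updated);

-- The user tracker would then need only one query: nodes the user has
-- touched, ordered by the latest activity from anyone (node edit or
-- comment), with filter and sort both satisfied by the single index above.
SELECT h.nid, h.node_last_updated
FROM history h
WHERE h.uid = 1 AND h.interacted = 1
ORDER BY h.node_last_updated DESC
LIMIT 0, 25;
```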
Comment #139
David Strauss
I'm sending this back to "code needs work" so I can move the denormalization to {history}.
Comment #140
macm CreditAttribution: macm commented
I have tabs created by jQuery in tpl pages.
I need to print tracker_page, "Recent posts of a user", or "My recent posts" into a tab. How can I do this?
I've tried a lot of things but always fail.
So my tab is at /mylanguage/user/1#mytracker
Is there an easy way?
Could I get and render the path 'tracker/' . arg(1) into a tab (/mylanguage/user/1#mytracker)?
Folks, sorry if this isn't the appropriate place, but I couldn't find a better one and this node was in tracker.module.
Feel free to delete it, Drupal webmasters.
I tried adapting this function in my page, but it is very expensive to do with my level of knowledge of Drupal's core.
Anyway, thanks; you core developers are amazing.
Regards
Comment #141
dww
@macm: Please don't hijack issues. This issue is about fixing the performance of the tracker page. Yours is a completely separate support request. So, file your request separately and leave this issue to address the underlying performance problems.
Comment #142
dww
This is David's issue now, not mine, but I can't assign it directly to him using the UI (yet), so I'm just unassigning myself...
Comment #143
David Strauss
This patch is superseded by http://drupal.org/node/148849.
Comment #144
kbahey CreditAttribution: kbahey commented
So, I consulted the folks from MySQLPerformanceBlog and they generously investigated the issue for us free of charge, because we are an Open Source project (thanks Peter, Vadim and Maciej).
They gave a detailed analysis of the queries, and their end advice was to rewrite them using unions.
Attached is a patch to tracker.module for 5.2, for testing on drupal.org.
Here are the results of testing on my development server, with restarting MySQL in between the tests to nullify the effect of the query cache. The tests were done using 10,000 nodes and 10,000 comments.
However, I have been getting a lot of variance on this machine lately, so the results need to be replicated somewhere, preferably directly on d.o (what do we have to lose?)
The schema is the standard Drupal 5.2 schema, so if you have any changes to that, roll them back first.
Before the patch
This is the performance of this query without the MySQL query cache, using the existing 5.2 code.
With patch
Two variants are tested for this patch.
Variant 1: Using the original count query
This variant rewrites the query using a UNION with an ORDER BY, but leaves $sql_count as it is. This proved faster on the test data set described above, which is different from d.o (no users, and fewer rows), but that may not be the case for d.o due to volume and cardinality.
Variant 2: Using a union in the count query
This uses a UNION for both queries. That was slower on my test server, but may be faster on d.o.
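A sketch of what variant 2's count query might look like (illustrative only; the attached patch is authoritative):

```sql
-- Counting distinct nids across the same two UNION branches used for the
-- listing query. UNION already removes duplicates, so COUNT(*) over the
-- derived table yields the distinct-nid total. MySQL 4.1+ supports
-- selecting from a derived table like this.
SELECT COUNT(*) FROM (
  (SELECT n.nid FROM node n WHERE n.status = 1 AND n.uid = 1)
  UNION
  (SELECT n.nid FROM comments c
   INNER JOIN node n ON n.nid = c.nid
   WHERE n.status = 1 AND c.status = 0 AND c.uid = 1)
) AS t;
```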
Comment #145
David Strauss
While those changes win points for less invasiveness, they will never approach the performance of the rewrites I've suggested (and largely finished in my NCS/node merge patch). It seems better than what the base Drupal code has now, but I guarantee it will be far slower than the current patched tracker on Drupal.org.
Comment #146
kbahey CreditAttribution: kbahey commented
David
I recommended to Dries that we disable tracker altogether if it is the only source of contention in the database. At least issue submission, browsing and the rest of the site will be fast, rather than the continued pain of what d.o is today.
Barring that, this patch is not a replacement for whatever schema changes you are doing which are needed to improve d.o overall. But I see those are long term.
This is (at least) a stop gap until we get d.o on D6 or D7 (whichever one has the schema changes). If it proves to be worth keeping, then so be it.
Worth testing at least to see if they are more functional than what is present at the moment on d.o (which does not give people what they are looking for).
Here is the analysis done on a copy of drupal.org by Maciej.
** 1. **
SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid =
l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid =
c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid =
1 OR c.uid = 1) ORDER BY last_updated DESC
********
EXPLAIN
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: ALL
possible_keys: PRIMARY,node_status_type,nid,uid
key: NULL
key_len: NULL
ref: NULL
rows: 140437
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: l
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.nid
rows: 1
Extra:
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: u
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.uid
rows: 1
Extra: Using where
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: lid
key: lid
key_len: 4
ref: drupal_export.n.nid
rows: 4
Extra: Using where; Distinct
********
As you can see, the main problem with this query is that it does a full table scan on `node`. This is because you tied together conditions from two different tables in one OR-ed WHERE clause; also, the index that contains `status` cannot be used because almost all rows have this column value equal to 1. This is why the engine cannot filter out any rows from `node` before doing the join. What you can do is split the two OR-ed conditions into a UNION of two SELECTs.
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM users u JOIN node n ON n.uid = u.uid JOIN
node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND n.uid = 1)
UNION
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON
n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status
= 1 AND c.status = 0 AND c.uid = 1)
ORDER BY last_updated DESC;
This query reads all the data efficiently through indexes and then sorts the final result set with filesort. If the returned data can fit into the in-memory temporary table, whose size is limited by the tmp_table_size and max_heap_table_size MySQL parameters, then this query should run decently fast. On my low-end test server the original and rewritten queries took a few minutes and 0.45s respectively (rows examined: 644642 vs 15229). Although 15k rows is still a fairly large number for a busy on-line application, I guess this cannot be avoided without some changes in the schema.
** 2. **
SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM node
n INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid
AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 9575
OR c.uid = 9575) ORDER BY n.changed DESC
********
EXPLAIN
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: ALL
possible_keys: node_status_type,uid
key: NULL
key_len: NULL
ref: NULL
rows: 140437
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: u
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.uid
rows: 1
Extra: Using where
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: lid
key: lid
key_len: 4
ref: drupal_export.n.nid
rows: 4
Extra: Using where; Distinct
********
Again the same problem. So I rewrote the query to:
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM
users u JOIN node n ON n.uid = u.uid WHERE n.status = 1 AND u.uid = 1) UNION
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM
users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid WHERE
c.status = 0 AND n.status = 1 AND u.uid = 1) ORDER BY changed DESC;
Statistics are:
Time: 0.58s vs 1 minute
Examined rows: 11522 vs 503170
Comment #147
Gerhard Killesreiter CreditAttribution: Gerhard Killesreiter commented
@David: Your rewrite will be great for Drupal 6, but we are still in D5 land and I need a solution "now". Your patch that is running on d.o isn't really helping, since the results it provides aren't that great. As soon as we've resolved our performance issues I'll want to revert the patch and have something that provides the original result set.
Comment #148
catch
@Killes: is drupal.org running the patch from #126? iirc it's running an earlier one, and #126 provides a better result set.
Both the UNION and the long term schema changes are obviously better long term, but it'd help until they're sorted.
Comment #149
dww
d.o is still running the patch from #69. Seems like we should use #126, but that includes the schema change for last_comment_timestamp in {comments} (not just some new indexes). So, I'd need to coordinate with killes and make sure he's happy with this, do a DB dump, etc, etc. David: anything else we should be aware of when moving from #69 to #126? I don't have time right now to carefully study every detail of both patches. Thanks.
Comment #150
cog.rusty CreditAttribution: cog.rusty commented
I am getting my user name as the author of every node I have commented on.
When hovering, the uid under the name is the correct node author.
Could this patch be responsible? I don't understand much but here's a theory:
.... UNION (SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1)....
n.uid -- found from the node table
u.name -- found not via the node table but via the comments table with "FROM users u JOIN comments c ON c.uid = u.uid"
Previously it was found using "n.nid = c.nid", while "c.uid" was only in the WHERE clause, not in the JOINs.
Comment #151
cog.rusty CreditAttribution: cog.rusty commented
The above was about the /user/[uid]/track page.
The /tracker/user page displays "anonymous" as the author and has more problems.
Comment #152
kbahey CreditAttribution: kbahey commented
This version contains a fix for the wrong user name displayed.
Gerhard, please install on d.o.
Comment #153
dww
Reviewed #152, backed out #144 from d.o and applied #152.
Confirmed that the user names are now correct in the tracker page.
Comment #154
Ron Williams CreditAttribution: Ron Williams commented
On the user tracker page, clicking the "last" button appears to take users to page "59" unless they have more than 59 pages of postings.
Comment #155
Dries CreditAttribution: Dries commented
It would be great if we could test the performance improvements of this patch in a better way. Applying it on drupal.org is one thing, but it doesn't tell us whether it is actually faster. We've made a dozen other configuration changes to Drupal.org the past couple of days. A more structured approach to testing this patch -- once the patch is in its final state -- would help convince me to commit it once it's ready.
This should be fairly easy to do: just run the queries 20 times from the mysql command line and average the result.
Comment #156
dww
@Dries: I agree. I just want to point out that we're only applying these patches on d.o to stop the bleeding. And, there have been some mysql numbers posted above in the issue (see #75, #77, #80, #144, #146). Killes decided to move d.o to #144 -- I only applied #152 because #144 was buggy and needed a fix.
Given the explain results on #75-#80 vs. #144 and #146, I don't see why Killes switched d.o. Yes, the tracker results weren't ideal for the earlier patch, but the performance was orders of magnitude better. #152 seems like a minor improvement over what's in core already, which requires no schema changes. #126 would be a significant improvement, but requires a schema change. Killes doesn't want to make that change on d.o unless we're sure that schema change will be going into core for D6...
Comment #157
cog.rusty CreditAttribution: cog.rusty commented
I am getting 40 pages of nodes and 21 pages of "No posts available" in my tracker. $sql_count does not seem to agree with the rows returned.
Comment #158
kbahey CreditAttribution: kbahey commented
Here is another version that tries to correct the pager issue.
Note that the code has not changed for /tracker. The changes are effective only when you are using /tracker/UID.
Comment #159
kbahey CreditAttribution: kbahey commented
Attached.
Comment #160
catch
So the API changes aren't going to get into 6.x now, but what about the UNION query?
Comment #161
David Strauss
@catch I support using the UNION query in lieu of the more fundamental changes that are going to have to wait until Drupal 7.
Comment #162
catch
No longer applies - split patch must have gone in since this was rolled.
Comment #163
catch
Re-roll of kbahey's patch from #159 against HEAD. Didn't change anything I hope except some minor punctuation in code comments.
Comment #164
catch
Very quick review.
Patch applies cleanly, tracker works as expected in both /tracker and user/1/track situations - so no real change in functionality.
Still needs some proper benchmarking for the actual performance implications which I'm not really set up to do. Marking as needs review.
Comment #165
dmitrig01 CreditAttribution: dmitrig01 commented
I benchmarked it with lotsa nodes (forgot how many)... here are the results
Without
With
Not much faster but there is at least some gained speed
Comment #166
joshk CreditAttribution: joshk commented
This applies cleanly and causes no errors, but my benchmarking test results aren't quite as expected.
I used devel_generate to add about 2000 nodes, and ran an ab series also with a higher concurrency to see if that brought out a bigger delta on the times. Unfortunately, I got the opposite of the expected result:
Before
After
The same pattern repeated itself, although wasn't as pronounced, at a -c10 level. Could be my system, but...
Comment #167
catch
OK try this baby out. Modified to use UNION ALL instead of plain UNION. UNION ALL is supported by pgsql so shouldn't be an issue.
The MySQL performance blog explains why - it could be six times faster.
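For illustration, a toy sqlite3 sketch of the semantic difference: UNION must deduplicate the combined result (typically via a temporary table and sort), while UNION ALL simply concatenates the branch results as-is:

```python
import sqlite3

# Two branches whose results overlap on the value 2.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (x INTEGER);
CREATE TABLE b (x INTEGER);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (2), (3);
""")

# UNION removes the duplicate row; UNION ALL keeps it.
union = con.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()
union_all = con.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()

print(len(union))      # 3 distinct rows
print(len(union_all))  # 4 rows, duplicate kept
```

The flip side is that UNION ALL is only a drop-in replacement when the branches cannot emit the same row twice, or when duplicates are acceptable.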
No change in functionality at my end, but again needs benchmarking which I'm not set up for.
Comment #168
Gerhard Killesreiter CreditAttribution: Gerhard Killesreiter commented
This patch has an error in the count query. In the second part you need to exclude the nodes that the user authored, otherwise you'll count them twice.
Comment #169
catch
Well I'm just re-rolling and this is hardly my strong point, but here's a stab at it:
Changed:
SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid WHERE c.status = %d AND n.status = 1 AND u.uid = %d
to this?
SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid WHERE c.status = %d AND n.status = 1 AND u.uid = %d AND n.uid <> c.uid
Patch just in case I'm not barking up the wrong tree.
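A toy sqlite3 sketch of the double-count killes describes, on a deliberately simplified schema (the real query also filters on the status columns): a node the user both authored and commented on is counted by both branches unless the comment branch excludes the user's own nodes:

```python
import sqlite3

# User 9 authored node 1 and commented on nodes 1 and 2.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, uid INTEGER);
CREATE TABLE comments (cid INTEGER PRIMARY KEY, nid INTEGER, uid INTEGER);
INSERT INTO node VALUES (1, 9), (2, 5);
INSERT INTO comments VALUES (10, 1, 9), (11, 2, 9);
""")

# Buggy count: the two branches are summed, so node 1 is counted twice.
buggy = con.execute("""
    SELECT (SELECT COUNT(*) FROM node WHERE uid = 9)
         + (SELECT COUNT(DISTINCT nid) FROM comments WHERE uid = 9)
""").fetchone()[0]

# Fixed count: the comment branch skips nodes the user authored
# (the n.uid <> c.uid condition from the patch).
fixed = con.execute("""
    SELECT (SELECT COUNT(*) FROM node WHERE uid = 9)
         + (SELECT COUNT(DISTINCT c.nid) FROM comments c
            JOIN node n ON n.nid = c.nid
            WHERE c.uid = 9 AND n.uid <> c.uid)
""").fetchone()[0]

print(buggy, fixed)  # 3 vs 2: node 1 was counted in both branches
```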
Comment #170
cosmicdreams CreditAttribution: cosmicdreams commented
From my reading of that altered query, it appears you are excluding the nodes that the author commented on (as well as the comments from that node that were authored by that user).
If you wanted to exclude the records from nodes that the user has authored, you should have this in your WHERE clause:
Comment #171
catch
OK re-roll with that change:
Comment #172
Zothos CreditAttribution: Zothos commented
I would prefer using the != operator. I know they are the same, but != has a greater readability :P
Comment #173
catch
Zothos, I dimly remember a recent development list thread where it was decided to standardise on <>, will leave it like that for now. This could really do with benchmarks on the UNION ALL change.
Comment #174
catch
no longer applies :(
Comment #175
Gábor Hojtsy
There are tab issues in the patch too.
Also moving to the bugs queue, where D6 blockers are waiting in line.
Comment #176
catch
OK I tried to re-roll this again, but because the if ($uid) check has been taken out of tracker.pages.inc somewhere along the line, you get an undefined variable on the tracker/1 page.
So this needs some love to set the $uid somewhere useful, and it still needs review and benchmarking after that. I don't think I'll have time to do that today so it'd be great if someone could take it on.
Comment #177
catch
here's a re-roll, it works.
Still needs review/benchmarks.
Comment #178
catch
OK well my re-roll was no good, but I fixed that in my testing install and did some very primitive profiling with devel module. Overall the main /tracker was slower by about 30-40%, and although /tracker/4 was quicker, it gave slightly different results to the current core query (about 10 nodes more in a 12 page pager).
So I'm going to bump this to D7 since there doesn't appear to be a satisfactory solution without the major schema changes (node activity counter etc.) proposed earlier in this issue. I've attached the devel results for reference.
Comment #179
David Strauss
Implemented in Contrib for Drupal 5: http://drupal.org/project/tracker2
Drupal 6 and 7 should only require simplification from the Drupal 5 version.
Comment #180
chipk CreditAttribution: chipk commented
FWIW - our use of tracker only needs visibility to nodes, not comments. As an interim fix, we modified 'tracker_page()' (i.e. the SQL in both $sql and $sql_count as well as the call to 'page_query()') to eliminate the join and logic related to table 'node_comment_statistics'. On a deployment of 160k+ nodes, one typical benchmark query shrank from 2680ms to 15ms. Code changes as follows:
Comment #181
David Strauss
@chipk: Your implementation may work for your situation, but it undermines a large portion of what Tracker is for, which is tracking comments.
Comment #182
apotek CreditAttribution: apotek commented
Using UNION, we get 2 temporary tables created and have to run filesort twice. Not sure how this would improve things. It takes a heck of a lot longer.
Comment #183
Michelle
Putting the title back. "Using UNION" is not a good title for an issue.
Michelle
Comment #184
catch
Latest approach to this is http://drupal.org/project/tracker2 - this needs a port to Drupal 6, then a port to Drupal 7, then conversion into a patch.
Comment #185
Gerhard Killesreiter CreditAttribution: Gerhard Killesreiter commented
Considering everything, the tracker query has been proven to not really be salvageable. This issue should thus be marked "won't fix".
Porting tracker2 to core might be an option. I have discussed this with Angie and she did say that this is not a priority for her. I think it is fair to say that you shouldn't use the core tracker for a large site. We might put a warning into the docs and maybe include a link to tracker2.
Comment #186
catch
We should either have tracker2 in core, or remove tracker from core and rename tracker2 to tracker in contrib. Having core modules limping along with site-killing queries release after release doesn't help anyone.
Comment #187
killes@www.drop.org CreditAttribution: killes@www.drop.org commented
Is the query really site killing for your average Drupal site? I doubt it.
Comment #188
dww
I don't understand why we don't just put tracker2 in core. Even if it's not "a priority" for webchick doesn't mean it can't happen. A brief skim of the code looks like there's nothing that fundamentally evil about tracker2, and if anything, it'd be better if it were in core, instead of modifying core's DB schema from the outside. ;) Maybe I didn't fully grasp something about how tracker2 works, but it seems like a case where a little DB de-normalization goes a long way. Why not put it in core?
Comment #189
Damien Tournoud CreditAttribution: Damien Tournoud commented
@dww: I think everyone agrees that the best solution is to bring tracker2 into core. The only remaining question is: who will take the lead on that?
Comment #190
jrabeemer CreditAttribution: jrabeemer commented
+1 for tracker2 in core, i was completely unaware of the slow performance of tracker module in core..
sub
Comment #191
kbahey CreditAttribution: kbahey commented
I can work on porting it to Drupal 7, but only if Angie or Dries say that it would go in 7.x, and not just wasted effort.
Comment #192
jrabeemer CreditAttribution: jrabeemer commented
No work is wasted effort. Worst case, it gets delayed to D8. If you can demonstrate good performance gains, it will most likely go in core.
Comment #193
battochir CreditAttribution: battochir commented
subscribing
Comment #194
swentel CreditAttribution: swentel commented
Ok, attached is a patch which ports tracker2 to D7. I added it with the .txt extension because I don't want the testbot to run.
Weird thing is, I did some benchmarks with 50000 nodes and noticed no real performance difference between tracker 1 & 2 at this point. So basically, try this out, see if it makes any difference and we can discuss further if this is really so critical to fix.
Comment #195
battochir CreditAttribution: battochir commented
Quick question. Is there a working port of tracker2 to D6 anywhere? It's awesome that tracker2 is getting so much love for D7. Somehow, I thought D6 was important. From what I see, many people have not upgraded to D6 from D5 because many modules have not been updated yet or they are waiting for D7 because the update to D6 is so painful. D6 is a big leap from D5. Anyway, D6 tracker still has those expensive queries that hurt big-traffic sites.
edit: Sorry. I just saw that this thread is for D7. I didn't mean to hijack the thread.
Comment #196
Michelle
battochir - You want this issue: #294091: Port tracker2 to 6.x
Michelle
Comment #197
battochir CreditAttribution: battochir commented
Thank you Michelle. That's what I wanted to know, much appreciated.
cheers,
wim
Comment #198
catch
Seems like tracker tests actually pass with this patch. Moving to needs review for a general look at it (re-rolled for some whitespace fixes, needs some comment/variable name tidying and dbtng conversion still).
Comment #199
dww
Very quick skim revealed this:
tracker_track_user() is wrong. hook_menu() is already auto-loading the user object, but then tracker_track_user() tries to reload the user object again using arg(). tracker_track_user() isn't properly inspecting either of the arguments passed into it from the menu system, in fact.
Comment #200
dwwOh, and those should be
return drupal_access_denied();
, etc...
Comment #201
swentel CreditAttribution: swentel commented
Still needs work, but gets rid of the tracker_track_user function, we don't need that at all. One thing that doesn't work yet is the 'my recent posts' tab. I simply can't get it to work, unless I change the user_uid_optional_load function. Not sure if I'm doing anything wrong, or if this is a bug in the load function, I'll try to talk to peter or chx tomorrow. Note, this also doesn't work in the current version of tracker - and D6 also suffers from it.
Edit: that bug is known, so waiting for #298722 to get in
Comment #202
catch
Maybe related to #298722: _menu_translate returns FALSE before to_arg is available ?
Comment #203
David Strauss
I'll take a look at the "my recent posts" problem.
Comment #204
apotek CreditAttribution: apotek commented
@Michelle (#183)
Yes, sorry about that. I'm still not used to the way these issue trackers work. I keep thinking the "Issue title" is the subject of my post, rather than a way to modify the title of the *entire thread*. It's an odd implementation when any logged in user can do that. My mistake, and certainly not intentional.
Comment #205
catch
Here's a re-roll. The bug swentel found is already well documented in HEAD, so there's no reason to slow this up.
It's possible some of this might be centralised in materialized views if/when it gets committed - but might as well keep this one live - we can't ship yet another release with this query.
Comment #207
catch
hmm
Comment #208
David Strauss
This seems to look OK, but I can't RTBC what is mostly my own code.
Comment #209
catch
Patch was missing tracker.pages.inc changes. Also this all needs converting to dbtng. Here's benchmarks:
ab -c1 -n500
HEAD:
/tracker: 6.15 [#/sec]
/uid/track: 1.16 [#/sec]
Patch:
/tracker: 12.75 [#/sec]
/uid/track: 12.86 [#/sec]
Comment #211
catch
OK this one updates 95% of the patch to dbtng. Gabor has indicated he might want to have a backport of this in Drupal 6, so let's get this one in.
A few todos and things needing discussion (and leaving at CNR for the test bot if nothing else).
I don't think we need a UI for the indexing variable - or if we do, it should be in admin/settings/performance instead of a tiny form by itself at admin/settings/tracker
There's a few places where it looks like we could replace some update, select, insert logic with db_merge.
A couple of places need to use db_select() because they're wrapped in db_rewrite_sql(), didn't get to those yet.
And also there's one INSERT INTO .. SELECT FROM which afaik isn't currently handled by db_insert() at all - so while that shouldn't hold this patch up, it'd be good to open another issue so we can do that.
Comment #214
David Strauss
I don't see a reason to "backport" this to Drupal 6. It should probably remain the separate Tracker 2 module.
Comment #215
catch
I didn't think it was on the table - but #362151: Test and compare tracker2, views, Drupal 6 tracker suggests otherwise. If we can do it with no UI or API changes, then there's no particular reason not to simply cp contributions/modules/tracker2 modules/tracker - then we no longer have a pointless slow tracker module in Drupal 6 core.
Would need to co-ordinate the install/upgrade path of course for people upgrading from D6 - D7.
Comment #220
lilou CreditAttribution: lilou commented
Add tag.
Comment #221
Berdir
Ok, I worked a bit on that one. DBTNG has currently no direct support for "INSERT INTO () SELECT ...". Imho, there are 3 options:
- Execute it with db_query and simply replace %d and so on with named parameters
- Do a select and then use multi-insert to insert the values... (That is what I did in the patch)
- Implement support for something like db_insert('tracker_node')->from($query)
There are still some things to improve..
- tracker_batch_node_alter: at least the $op = 'delete' handling is imho unnecessary, that will try to delete the node twice, or not?
- _tracker_add and _tracker_remove query the node table even when not necessary (no need to fetch the information we already have in hook_node_*)
- More tests, for cron stuff, comment handling and so on...
Comment #222
killes@www.drop.org CreditAttribution: killes@www.drop.org commented
"# DBTNG has currently no direct support for "INSERT INTO () SELECT ... ". Imho, there are 3 options
- Execute it with db_query and simply replace %d and so on with named parameters
- Do a select and then use multi-insert to insert the values... (That is what I did in the patch)
- Implement support for something like db_insert('tracker_node')->from($query) "
IMO DBTNG should support INSERT INTO () SELECT, otherwise I'd consider it incomplete.
Failing this, I'd use "Execute it with db_query and simply replace %d and so on with named parameters".
The nice thing about "INSERT INTO SELECT" is that everything is done inside MySQL and PHP does not need to bother with this.
Comment #223
catch
Looks like we'll get INSERT INTO () SELECT via #481288: Add support for INSERT INTO ... SELECT FROM ...
Comment #224
killes@www.drop.org CreditAttribution: killes@www.drop.org commented
Yes, great, we should wait until that has landed and then rework this patch to use it.
Comment #225
Berdir
Updated the patch, note that it will fail without the patch in #481288: Add support for INSERT INTO ... SELECT FROM .... As soon as that one is in, this should pass the tests.
What I did
- converted the insert query to INSERT INTO () SELECT
- fixed the admin/content/node form alter stuff for publish/unpublish, I'm pretty sure that never worked ;)
- Updated the update function to set the tracker_index_nid variable
- Added tests for cron indexing and admin/content/node unpublishing.
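For reference, a toy sqlite3 sketch of the INSERT INTO ... SELECT pattern being discussed (hypothetical, simplified tracker_node columns; the real patch goes through DBTNG), which keeps the whole copy inside the database rather than round-tripping rows through application code:

```python
import sqlite3

# Simplified source and destination tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY, status INTEGER, changed INTEGER);
CREATE TABLE tracker_node (nid INTEGER PRIMARY KEY, published INTEGER, changed INTEGER);
INSERT INTO node VALUES (1, 1, 100), (2, 0, 200), (3, 1, 300);
""")

# One statement populates the tracker table from the node table; no row
# data ever passes through the calling code.
con.execute("""
    INSERT INTO tracker_node (nid, published, changed)
    SELECT nid, status, changed FROM node WHERE nid <= ?
""", (3,))

print(con.execute("SELECT COUNT(*) FROM tracker_node").fetchone()[0])  # 3
```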
Comment #227
Berdir
Let's see if this needs a re-roll, the tests should now pass.
Comment #229
Berdir
Ok well, re-roll...
Comment #231
Berdir
Wrong path in the tests...
Comment #233
Berdir
Another path change... re-rolled.
Comment #234
Berdir
I can't see a reason why the special count query on {node} is necessary and it does give false results during indexing... simply removing it seems to work just fine...
Comment #235
catch
1000 nodes, anonymous user, no page caching:
HEAD:
Patch:
This doesn't really need benchmarking though since the same code is responsible for Drupal.org being up at the moment instead of down.
Quick nitpick review on the patch:
Don't think we need this. However maybe indexing one batch of nodes on enable would help to avoid wtfs?
Should say "Create the new tracker_node and tracker_user tables.
ID should be uppercase.
"All content has finished tracker indexing". Not sure what to do about Max node ID here, this is why I hate content/node/post inconsistency.
No phpdoc.
need updating to hook_node_insert(), the function itself is right though.
No phpdoc, same for other private functions here.
indentation is off.
Comment #236
catch
Implemented my own review.
Comment #238
catch
Forgot -N.
Comment #240
catch
re-rolled.
Comment #241
catch
Minus a syntax error.
Comment #243
catch
Can't reproduce locally, trying again.
Comment #244
catch
doh. sorry.
Comment #245
catch
Removing a pointless drupal_set_message() from tracker.install
Comment #246
pwolanin CreditAttribution: pwolanin commented
It's unclear to me what tracker_cron is doing - is this only supposed to run once per node to initially populate the table - i.e. instead of trying to do a large batch operation when the module is installed?
Comment #247
David Strauss
@pwolanin
"is this only supposed to run once per node to initially populate the table - i.e. instead of trying to do a large batch operation when the module is installed?"
Yes. Large sites can't build all trackers all at once.
Comment #248
pwolanin CreditAttribution: pwolanin commented
ok, then please add some code comments to that effect.
Comment #249
drewish CreditAttribution: drewish commented
These seem pointless:
but I guess there's a couple other instances of them in core.
The schema descriptions for the tracker_node and tracker_user tables seem way too terse and don't really explain what they do.
The field descriptions also don't seem to match up with user.install and node.install. Which say things like:
And in the image.module we got told we had to add foreign key info ala:
Incorrect indenting:
These comments need to be proper sentences and end in periods:
I feel like this comment is wrong... it's a callback not a callback argument. Also wouldn't this be better off in .pages.inc or .admin.inc?:
The setting for this either needs to be a select with a fixed list of values (like search module) or it needs a validate function to ensure that they provide a number.
Why is tracker_batch_node_alter() not called tracker_batch_node_submit()?
Not related to this patch specifically but I was trying to figure out if tracker_comment_update() and tracker_comment_publish() were doing the same thing and realized that hook_comment_unpublish() is never getting called. Need to open a new issue for that.
I feel like this comment is pointless and should be dropped, considering it's right before a function named add_css... Or barring that it needs a period:
The closing curly brace and else need to be on separate line:
These comments need to be proper sentences and end in periods:
Initially I'd thought there was duplicate code in this test then realized it just needs comments explaining that you're checking the user's page first then the site wide tracker page.
The comment refers to a "comment" when it should refer to a "node".
Comment #250
drewish CreditAttribution: drewish commented
I started addressing some of the issues I raised but ran out of time. Also removed some junk from tracker_page()—like $header that wasn't used and drupal_add_css() that duplicated the '#attached_css'.
Comment #251
catch
drewish, thanks for the review and the re-roll. I'm moving continent tomorrow so unlikely to be able to look at this again before the end of the week, hoping someone else can jump in on remaining cleanup.
Comment #252
drewish CreditAttribution: drewish commented
Well this has been a nice change from what I've been working on so I'll see if I can't clear up my remaining issues this afternoon.
Comment #253
drewish CreditAttribution: drewish commented
My last patch was missing the .install file... corrected that and fixed all the schema problems I'd identified. I also did some more clean up on the settings form. I fixed a bug in the DBTNG conversion tracker_cron() where it was supposed to be matching all users other than the author.
For some reason I can't do anything with simpletest locally so i'll get some bot feedback.
One other issue I noticed is that we're double processing the comments. We catch them on both insert/update and publish. I don't know if it's a huge deal I just wanted to mention that it seemed a little tacky to me.
Comment #254
catch
Looked over this, I think we should kill the settings page. While we expose a setting for search, I've seen indexing 100 nodes cause 25,000 queries - since it can be an insert for every word, loading thousands of comments etc. - whereas the tracker indexing is a known quantity. We can leave the variable there for sites which need to raise it, but it's unlikely to need lowering.
Spotted a couple of missing periods in tracker.test as well, but otherwise not much to complain about (obviously I didn't spot these when re-rolling myself).
Comment #255
drewish CreditAttribution: drewish commented
Dropped the admin settings page and cleaned up the comment ops where we were doing expensive no-ops.
Comment #256
catch
OK this is looking a lot tidier, but do we need the no-op hook implementation? Seems like that would be better documented in _publish()
Comment #257
drewish CreditAttribution: drewish commented
I'm okay with dropping that but it seemed harder to document it on the other hook. I'm not sure which is a worse example for core to set: not having the operation be clear, risking someone "fixing" it later, or having an empty hook implementation. I'll roll a no-op free version.
Comment #258
drewish CreditAttribution: drewish commented
Comment #259
David Strauss
Is tracker_form_node_admin_content_alter() still necessary? Don't D6 and later call normal node_save() stuff even for batches in the admin area?
$comment = (array) $comment;
should go. It was in my code because the hook variously got objects and arrays. We don't have that problem anymore.
Comment #260
Berdir
I tested the content_alter thingy and afaik, it is still required for publish/unpublish, I removed it for the delete.
Comment #261
drewish CreditAttribution: drewish commented
had a diversion to get devel_generate working again then was able to test the bulk updates. looks like publish/unpublish now calls the full node_save() so we can drop tracker_form_node_admin_content_alter() and tracker_batch_node_submit() since they're just doing double processing.
Comment #262
drewish CreditAttribution: drewish commented
Realizing our test coverage sucks. Two big fatal errors that went uncaught:
- tracker_comment_delete() was calling _tracker_delete() which isn't even a real function.
- _tracker_remove() was referencing {comments} which has been renamed to {comment}.
So things that need to be tested before this is committed:
- Adding unpublished comment: Commenter should not be in tracker.
- Adding published comment: Commenters and owner should be in tracker.
- Publishing a comment: Commenters and owner should be in tracker.
- Deleting a comment: Owner still in tracker, commenter no longer in tracker.
- Need better testing of the per user tracker.
Comment #263
drewish CreditAttribution: drewish commented
this fixes up some of the comment tests.
Comment #264
drewish CreditAttribution: drewish commented
that was clearly the same version of the patch.
Comment #265
TheRec CreditAttribution: TheRec commented
Single quotes should be used to be consistent with the rest of the file (2 occurrences of the exact same line in the patch)
Missing a comma at the end (2 occurrences of the exact same line in the patch)
Single quotes should be used to be consistent with the rest of the file.
Single quotes should be used to be consistent with the rest of the file.
Beer-o-mania starts in 3 days! Don't drink and patch.
Comment #266
drewish CreditAttribution: drewish commented
corrected the issues TheRec raised. also cleaned up the tests a bit more. this is pretty close aside from publishing/unpublishing comments.
Comment #267
catch
Since this is a schema change, I think we're better off trying to get it in before freeze, and adding those final tests in a followup. If you beat webchick or Dries to a commit, please leave at RTBC.
Comment #268
webchick
Just glanced through this patch quickly, but...
that's gonna be a problem. ;)
Beer-o-mania starts in 2 days! Don't drink and patch.
Comment #269
catch
We removed the settings page, so this no longer needs a permission to administer it, doh.
No other changes.
Comment #270
pwolanin CreditAttribution: pwolanin commented
add tag
Comment #271
Dries CreditAttribution: Dries commented
Reviewed this patch and it looks good. Thanks!