(follow-up issue from http://drupal.org/node/44276)

that bug fix further slowed down an already slow query. some EXPLAIN SELECT investigation of the queries that drive the "Recent posts" page shows that it's already doing the worst possible kind of query: a filesort on the *entire* {node_comment_statistics} table. :( the new patch doesn't (and can't) fix that, and in fact makes it a little worse.

so, perhaps it'd be nice to speed this query up. here's my initial idea (from http://drupal.org/node/44276#comment-171227):

we'd have to add a new column to the {node} table, sort of like "changed", called something like "updated". we'd need to set it anytime either the node is edited or a comment is posted. we'd have an index on this. that would avoid:

- the GREATEST()
- ORDER BY on a computed value
- one of the INNER JOIN's (on {node_comment_statistics})

i'm guessing the speedup could be quite huge. but, this will probably be a very contentious DB schema change, and it's *way* too late in D5 for that.
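
to make that concrete, here's a rough, untested sketch of what the tracker query could then look like (the "updated" column name and this exact query are hypothetical, not from an actual patch):

  // assumes a new, indexed {node}.updated column that gets bumped whenever
  // the node is saved or a comment is posted to it (names are placeholders).
  $sql = "SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name,
                 n.updated AS last_updated
          FROM {node} n
          INNER JOIN {users} u ON n.uid = u.uid
          WHERE n.status = 1
          ORDER BY n.updated DESC";
  // db_query_range() takes care of the per-database LIMIT syntax.
  $result = db_query_range(db_rewrite_sql($sql), 0, 25);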

Comment | File | Size | Author
#269 | tracker.patch | 27.99 KB | catch
#266 | tracker_105639.patch | 28.08 KB | drewish
#264 | tracker_105639.patch | 27.8 KB | drewish
#263 | tracker_105639.patch | 24.55 KB | drewish
#262 | tracker_105639.patch | 24.55 KB | drewish
#261 | tracker_105639.patch | 24.66 KB | drewish
#258 | tracker_105639.patch | 25.63 KB | drewish
#255 | tracker_105639.patch | 25.84 KB | drewish
#253 | tracker_105639.patch | 27.33 KB | drewish
#250 | tracker_105639.patch | 20.79 KB | drewish
#245 | tracker.patch | 25.88 KB | catch
#244 | tracker.patch | 25.98 KB | catch
#243 | tracker.patch | 5.57 KB | catch
#241 | tracker.patch | 25.98 KB | catch
#240 | tracker.patch | 25.45 KB | catch
#238 | tracker.patch | 25.81 KB | catch
#236 | tracker.patch | 20.24 KB | catch
#234 | tracker_24.patch | 24.95 KB | Berdir
#233 | tracker_23.patch | 25.17 KB | Berdir
#231 | tracker_22.patch | 25.18 KB | Berdir
#229 | tracker_21.patch | 25.18 KB | Berdir
#225 | tracker_20.patch | 24.97 KB | Berdir
#221 | tracker_18.patch | 21.82 KB | Berdir
#211 | tracker.patch | 19.52 KB | catch
#209 | tracker.patch | 18.32 KB | catch
#207 | tracker.patch | 13.35 KB | catch
#205 | tracker.patch | 19.18 KB | catch
#201 | tracker_9.patch | 19.09 KB | swentel
#198 | tracker.patch | 19.28 KB | catch
#194 | tracker2.txt | 19.55 KB | swentel
#178 | tracker-devel.txt | 472 bytes | catch
#177 | tracker_union_all.patch | 4.36 KB | catch
#171 | tracker_union_all_0.patch | 4.11 KB | catch
#169 | tracker_union_all.patch | 4.11 KB | catch
#167 | tracker_union_all.patch | 4.09 KB | catch
#163 | tracker_union_reroll.patch | 4.08 KB | catch
#159 | tracker-union-3.patch | 4.07 KB | kbahey
#152 | tracker-union-2.patch | 3.88 KB | kbahey
#144 | tracker-union.patch | 3.37 KB | kbahey
#126 | tracker_patch_5-1_1.patch | 8.61 KB | David Strauss
#124 | tracker_patch_5-1_0_1.patch | 8.61 KB | catch
#123 | tracker_patch_5-1_0_0.patch | 8.61 KB | catch
#121 | tracker_patch_5-1_0.patch | 8.71 KB | David Strauss
#120 | tracker_patch_5-1.patch | 8.65 KB | David Strauss
#117 | tracker_update.patch | 6.4 KB | David Strauss
#86 | tracker_5.patch | 9.35 KB | webchick
#69 | split_tracker_queries_2.diff | 5.99 KB | David Strauss
#68 | split_tracker_queries_1.diff | 5.96 KB | David Strauss
#66 | split_tracker_queries_0.diff | 5.12 KB | David Strauss
#65 | split_tracker_queries.diff | 5.14 KB | David Strauss
#53 | tracker_last_updated.patch.txt | 10.7 KB | dww
#25 | tracker_query_25.patch | 4.77 KB | pwolanin
#17 | tracker_query_17.patch | 4.54 KB | pwolanin
#7 | tracker_query.patch.txt | 3.1 KB | mindless

Comments

ChrisKennedy’s picture

That would be handy - one of my Views patches simulates such a column: http://drupal.org/node/89421

killes@www.drop.org’s picture

Great suggestion and surely useful. However, the new column belongs in node_comment_statistics. Maybe we could rename that table to node_statistics.

dww’s picture

why? then we still have the additional INNER JOIN. is the thought that comments get added faster than node updates, so we'd get into insert/update locking hell again or something? if this value belongs in {node_statistics} (good name, if that's what it's turning into) what about {node}.changed itself? just a little unclear on the logic for using the separate table, when one of the goals was avoiding the INNER JOIN itself.

thanks,
-derek

geodaniel’s picture

Just a note that the INNER JOIN to {node_comment_statistics} is also causing issues in 4.7.x and 5.x when the comment module is not enabled: #110556.

pwolanin’s picture

With 5.x and multiple cache tables, why (for 6.x at least) can't the result be cached?

On d.o, the tracker page takes a long time to load. However, maybe cache_clear_all() gets called too often for caching the tracker to be of benefit unless there's a minimum cache lifetime?

pwolanin’s picture

Priority: Normal » Critical

The performance of the tracker on d.o has been terrible recently...

mindless’s picture

FileSize
3.1 KB

Here's another idea, perhaps a first step on the way to a more involved change. This one doesn't require any db change: it removes the inner join on the users table and queries for the needed usernames afterwards. Much faster on our system with 59k nodes, 227k comments.
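
Roughly, the idea is something like this (sketch only, not the attached patch; variable names are made up):

  // main tracker query with the {users} join removed.
  $result = pager_query(db_rewrite_sql("SELECT n.nid, n.title, n.type, n.changed, n.uid,
    GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
    FROM {node} n INNER JOIN {node_comment_statistics} l ON n.nid = l.nid
    WHERE n.status = 1 ORDER BY last_updated DESC"), 25);

  // collect the (at most 25) uids shown on this page...
  $nodes = array();
  $uids = array();
  while ($node = db_fetch_object($result)) {
    $nodes[] = $node;
    $uids[$node->uid] = (int) $node->uid;
  }
  // ...and fetch their names in one small extra query.
  $names = array();
  if ($uids) {
    $name_result = db_query('SELECT uid, name FROM {users} WHERE uid IN ('. implode(',', $uids) .')');
    while ($account = db_fetch_object($name_result)) {
      $names[$account->uid] = $account->name;
    }
  }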

kbahey’s picture

@ mindless

Can you please install the devel module, enable the query log and the page generation time, and post the values from before the patch and after the patch?

This will quantify the difference more precisely.

mindless’s picture

query time(ms):
tracker/ Before: 5041.671 After: 2828.042
tracker/{uid} Before: 3657.517 After: 1790.737

kbahey’s picture

Status: Active » Needs review

You have my +1.

The fact that no schema changes are required makes it +2 for me.

mindless’s picture

Forgot to include the additional query for usernames in the numbers; it adds just 22.692 ms to each "After" value above.

FiReaNGeL’s picture

mindless: it's a good improvement, but wouldn't it be better to actually fix the problem, which, as dww suggested, requires a database change? 3-second queries are not what I would call acceptable, even on a 'big' site.

That being said, it's still an improvement. If nobody is up to the task of implementing the required changes (and I'm not such a person, sadly), it's way better than nothing.

kbahey’s picture

something like "updated". we'd need to set it anytime either the node is edited, or a comment is posted. we'd have an index on this.

The issue with this is that now for each comment that is added, you need to update the node table. This means you are incurring a lock and write on the node table.

On busy sites, this can be a bottleneck, since others cannot read the node table until the lock/write is done.

killes decoupled a column from the users table and moved it to its own table for exactly that reason. So now we would be moving the other way.

killes@www.drop.org’s picture

mindless: make sure to properly use the db API and escape input values

dww’s picture

Status: Needs review » Needs work

right, the patch in #7 would allow SQL injection...

mindless’s picture

sorry, i was proposing an idea/concept in the form of a patch; it wasn't intended as a patch for review. anyone feel free to take the idea and go with it.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
4.54 KB

Is there really an SQL injection issue? These uid values all came from the DB.

Attached is a patch against the latest DRUPAL-5, though it should apply fine to HEAD with offset. It's mostly just a rewrite of the patch and approach suggested by mindless. It includes code to escape all the uids in the query and other minor improvements. In general, this seems like a good approach, since at most we will have 25 uids in this second query.

Note, the patch looks bigger than it really is since some of the code was wrapped in an if {} block, and hence there is a change in indentation.

pwolanin’s picture

Any thought on this? Improving tracker performance on d.o is really critical.

FiReaNGeL’s picture

The suggestion dww did is in my opinion the only viable one, long term, and isn't that hard to implement.

pwolanin’s picture

I assume you mean the suggestion at the top of adding a new column or table? Yes, for Drupal 6, that might be best.

However, please take a look at the attached patch in terms of an immediate improvement that might be applied to Drupal 5. It's similar in spirit to the change in code made to the recent comments block to get a speedup.

moshe weitzman’s picture

this query is now a smoking gun for drupal.org suckiness. please review and RTBC if it makes sense. Would be nice to test with lots of nodes and comments.

Dries’s picture

The patch's code looks good, however, I'd like to see two things happen:

1. benchmarking -- does this really have a performance gain? I think this needs to be benchmarked first. So far, we've only been guessing.

2. documentation -- can we document in the code why we use this approach? Novice developers might otherwise think that this can be optimized by merging the two queries into one query. One or two lines of documentation would be sufficient.

pwolanin’s picture

@Dries - I don't have a setup for benchmarking at this point- can you or Moshe recruit someone with the setup and experience?

#2 I'll address tonight (EDT).

mindless’s picture

fwiw, http://gallery.menalto.com/tracker is using this patch. Afraid I didn't do any measurements, but when I put it in place the performance of tracker pages definitely improved. Site has 61k nodes and 232k comments.

pwolanin’s picture

FileSize
4.77 KB

Ok, new patch attached against the latest DRUPAL-5. I added 2 lines of comments and switched back to the original, simpler implode for the uids. Since these are ints coming straight out of the DB, it seems silly to run the extra code to escape them. We did the same in comment.module on line 263.

kbahey’s picture

Can't we test this on scratch.d.o for a before and after scenario?

This would give us LOTS of nodes, comments, and users, and would tell us definitively whether it would help d.o or not, and by how much.

merlinofchaos’s picture

Subscribing

pwolanin’s picture

ok my uninformed attempt to benchmark this: I inserted 10000 users, 6800 nodes, 18000 comments, then disabled the devel module. Ran some queries to get the right data into {node_comment_statistics} (I think the generate functionality of devel needs work).

Added the line $uid = mt_rand(10000); at the top of the function to avoid any caching of the query results.

Serving from my laptop (2.0 GHz Intel, Mac 10.4, Apache 1.33 with mod_php, PHP 4.4.4, MySQL 4.1) across my local network to another computer running ab (so latency ~1 ms).

Hitting /tracker without the patch (and random uid) I get ~1.05 pages/sec at c=1, or ~0.59/sec at c=5
Hitting /tracker with the patch (and with random uid) I get ~1.20 pages/sec at c=1, or ~0.76/sec at c=5

However, in both cases the number of "failed requests" was >95% of the total (due to "Length"; I'm not sure what that means).

So, this should probably be repeated with a larger number of nodes/comments per user. With this ratio, only a few items show up on each user's tracker page.

FiReaNGeL’s picture

OK, I fiddled around a little with this tonight. The proposed patch in the above comments is only a temporary fix - code freeze is looming for D6, and I'd rather fix the problem than apply a temporary band-aid.

First of all, simply adding a db column to replace the GREATEST value and indexing it, as dww suggested, would help, but isn't sufficient. The second part of the problem lies in the "AND (n.uid = %d OR c.uid = %d)" section of the query; mysql couldn't use an index on the proposed 'updated or commented' column because of the where clause on another table.

This specific where clause wants to filter on nodes that were either A) created by the user, or B) commented on by the user.

So let's do a UNION of those two things, which is much faster. WARNING: Tons of mysql incoming (even if it's just one query).

SELECT DISTINCT(t.nid), t.title, t.type, t.changed, t.uid, t.last_updated, t.comment_count, t.name
FROM (
  (SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name,
          GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
   FROM node n
   INNER JOIN node_comment_statistics l ON n.nid = l.nid
   INNER JOIN users u ON n.uid = u.uid
   LEFT JOIN comments c ON n.nid = c.nid
   WHERE n.uid = 1
   AND n.status = 1
   ORDER BY last_updated DESC
   LIMIT 0, 30)
  UNION
  (SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name,
          GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
   FROM node n
   INNER JOIN node_comment_statistics l ON n.nid = l.nid
   INNER JOIN users u ON n.uid = u.uid
   LEFT JOIN comments c ON n.nid = c.nid
   WHERE c.uid = 1
   AND (c.status = 0 OR c.status IS NULL)
   AND n.status = 1
   ORDER BY last_updated DESC
   LIMIT 0, 30)
) AS t
ORDER BY last_updated DESC
LIMIT 0, 30

I replaced the %d for testing purposes. On my dev box (16000 nodes, not many comments), this query is 3 times faster and SHOULD give the same results as the other one. With the db change and index dww proposed, this should get much faster.

Can anyone bench this query alongside the original one and the proposed patch? Also, is UNION OK with postgres? I have no idea. Also, I'm in the process of moving my girlfriend into my apartment, so I don't have time to code this right now.

Also, I think the proposed 'updated or commented' column shouldn't go in 'node' directly, as kbahey pointed out. Maybe we could move 'changed' from node to its own table along with the new 'updated or commented' column.

pwolanin’s picture

@FiReaNGeL - as with the recent comments query in comment module, I think there are problems since the LIMIT syntax varies a bit by DB (which is why we have db_query_range()), as does support for compound queries.

So, even with the new column it seems likely to be necessary to do the second query as in the patch.

kbahey’s picture

Looking at an explain of this query, I get this:

mysql> explain SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, l.last_comment_timestamp AS last_post, l.comment_count  FROM node n  INNER JOIN node_comment_statistics l ON n.nid = l.nid  INNER JOIN users u ON n.uid = u.uid  LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL)  WHERE n.status = 1  AND (n.uid = 9575 OR c.uid = 9575)  ORDER BY last_post DESC;
+----+-------------+-------+--------+--------------------+---------+---------+---------------------+------+---------------------------------+
| id | select_type | table | type   | possible_keys      | key     | key_len | ref                 | rows | Extra                           |
+----+-------------+-------+--------+--------------------+---------+---------+---------------------+------+---------------------------------+
|  1 | SIMPLE      | l     | ALL    | PRIMARY            | NULL    | NULL    | NULL                | 8323 | Using temporary; Using filesort |
|  1 | SIMPLE      | n     | eq_ref | PRIMARY,status,uid | PRIMARY | 4       | adsoftheworld.l.nid |    1 | Using where                     |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY            | PRIMARY | 4       | adsoftheworld.n.uid |    1 | Using where                     |
|  1 | SIMPLE      | c     | ref    | lid                | lid     | 4       | adsoftheworld.n.nid |    9 | Using where; Distinct           |
+----+-------------+-------+--------+--------------------+---------+---------+---------------------+------+---------------------------------+

What this tells me is that it is using a temporary table for the node_comment_statistics table and doing a filesort on it, and that should be the slow part.

All we need from this table are the columns last_comment_timestamp and comment_count. These columns are 1:1 with the node table, and according to normalization rules they can (or should) be in the node table itself. Yes, this will cause an UPDATE on the node table every time a comment is added or deleted, but it may not be an issue.

So,

1. Can someone with ssh access on d.o do an explain on the query on d.o, and paste the output here, to confirm if the access path and contention is indeed in the node_comment_statistics table?

2. If there are no significant objections, I can go ahead and create a patch for this, and we can test it on scratch.d.o for a definitive yes/no.
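
Roughly, the schema side of what I am proposing would boil down to something like this (just a sketch in MySQL syntax, not a real patch yet):

  $ret = array();
  $ret[] = update_sql("ALTER TABLE {node} ADD last_comment_timestamp int NOT NULL default '0'");
  $ret[] = update_sql("ALTER TABLE {node} ADD comment_count int NOT NULL default '0'");
  // Backfill the new columns from the existing statistics table
  // (multi-table UPDATE, MySQL only).
  $ret[] = update_sql("UPDATE {node} n, {node_comment_statistics} l
    SET n.last_comment_timestamp = l.last_comment_timestamp, n.comment_count = l.comment_count
    WHERE n.nid = l.nid");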

pwolanin’s picture

What about the suggested solution of making this a new column in {node_comment_statistics}? Yes, you'll still have the join, but you could index this new column and so the penalty of sorting a computed value would go away.

I think (from comments above) that writing to the node table for each comment update is a bad idea - that's why {node_comment_statistics} is a separate table.

kbahey’s picture

Here are the explains again with the correct queries (the first ones were not as per the module).

With the GREATEST and node_comment_statistics:

mysql> explain SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, 
  GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count 
  FROM node n 
  INNER JOIN node_comment_statistics l ON n.nid = l.nid 
  INNER JOIN users u ON n.uid = u.uid 
  LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) 
  WHERE n.status = 1 
  AND (n.uid = 9575 OR c.uid = 9575) 
  ORDER BY last_updated DESC;
+--+------+-----+------+------------------+-------+-----+-------------------+----+-------------------------------+
|id|type  |table|type  |possible_keys     |key    |k_len|ref                |rows|Extra                          |
+--+------+-----+------+------------------+-------+-----+-------------------+----+-------------------------------+
| 1|SIMPLE|l    |ALL   |PRIMARY           |NULL   |NULL |NULL               |8323|Using temporary; Using filesort|
| 1|SIMPLE|n    |eq_ref|PRIMARY,status,uid|PRIMARY|4    |adsoftheworld.l.nid|   1|Using where                    |
| 1|SIMPLE|u    |eq_ref|PRIMARY           |PRIMARY|4    |adsoftheworld.n.uid|   1|Using where                    |
| 1|SIMPLE|c    |ref   |lid               |lid    |4    |adsoftheworld.n.nid|   9|Using where; Distinct          |
+--+------+-----+------+------------------+-------+-----+-------------------+----+-------------------------------+
4 rows in set (0.00 sec)

Without the GREATEST and the node_comment_statistics table at all:

mysql> explain SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name 
  FROM node n 
  INNER JOIN users u ON n.uid = u.uid 
  LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) 
  WHERE n.status = 1 
  AND (n.uid = 9575 OR c.uid = 9575) 
  ORDER BY n.changed DESC;
+--+------+-----+------+-------------+-------+-----+-------------------+----+---------------------------------------------+
|id|type  |table|type  |possible_keys|key    |k_len|ref                |rows| Extra                                       |
+--+------+-----+------+-------------+-------+-----+-------------------+----+---------------------------------------------+
| 1|SIMPLE|n    |ref   |status,uid   |status |4    |const              |8237| Using where; Using temporary; Using filesort|
| 1|SIMPLE|u    |eq_ref|PRIMARY      |PRIMARY|4    |adsoftheworld.n.uid|   1| Using where                                 |
| 1|SIMPLE|c    |ref   |lid          |lid    |4    |adsoftheworld.n.nid|   9| Using where; Distinct                       |
+--+------+-----+------+-------------+-------+-----+-------------------+----+---------------------------------------------+
3 rows in set (0.00 sec)

I am getting variances from site to site as to keys and such, but all of them do a temporary table and filesort for the first query (the existing one in the tracker module).

Can we confirm that this is the same on d.o ... someone with ssh access ... please?

dww’s picture

on d.o:

mysql> explain SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, 
    ->   GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count 
    ->   FROM node n 
    ->   INNER JOIN node_comment_statistics l ON n.nid = l.nid 
    ->   INNER JOIN users u ON n.uid = u.uid 
    ->   LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) 
    ->   WHERE n.status = 1 
    ->   AND (n.uid = 9575 OR c.uid = 9575) 
    ->   ORDER BY last_updated DESC;
+-------+--------+----------------------------------+---------+---------+-------+--------+----------------------------------------------+
| table | type   | possible_keys                    | key     | key_len | ref   | rows   | Extra                                        |
+-------+--------+----------------------------------+---------+---------+-------+--------+----------------------------------------------+
| n     | ALL    | PRIMARY,uid,node_status_type,nid | NULL    |    NULL | NULL  | 104980 | Using where; Using temporary; Using filesort |
| l     | eq_ref | PRIMARY                          | PRIMARY |       4 | n.nid |      1 |                                              |
| u     | eq_ref | PRIMARY                          | PRIMARY |       4 | n.uid |      1 | Using where                                  |
| c     | ref    | lid                              | lid     |       4 | n.nid |      4 | Using where; Distinct                        |
+-------+--------+----------------------------------+---------+---------+-------+--------+----------------------------------------------+
4 rows in set (0.00 sec)

and:

mysql> explain SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name 
    ->   FROM node n 
    ->   INNER JOIN users u ON n.uid = u.uid 
    ->   LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) 
    ->   WHERE n.status = 1 
    ->   AND (n.uid = 9575 OR c.uid = 9575) 
    ->   ORDER BY n.changed DESC;
+-------+--------+----------------------+---------+---------+-------+--------+----------------------------------------------+
| table | type   | possible_keys        | key     | key_len | ref   | rows   | Extra                                        |
+-------+--------+----------------------+---------+---------+-------+--------+----------------------------------------------+
| n     | ALL    | uid,node_status_type | NULL    |    NULL | NULL  | 104980 | Using where; Using temporary; Using filesort |
| u     | eq_ref | PRIMARY              | PRIMARY |       4 | n.uid |      1 | Using where                                  |
| c     | ref    | lid                  | lid     |       4 | n.nid |      4 | Using where; Distinct                        |
+-------+--------+----------------------+---------+---------+-------+--------+----------------------------------------------+
3 rows in set (0.00 sec)

killes@www.drop.org’s picture

@Fireangel: Your Union query takes about the same time as the original one :( It also only returned 6 results.

@kbahey: Exchanging temporary table usage for table locks strikes me as a bad idea.

pwolanin’s picture

@killes - Do you see any improvement with just the patch in #25?

kbahey’s picture

@dww; thanks

@killes: true, locking is not good, but we don't know how much of that will happen vs. the filesort of 105,000 nodes every time an authenticated user hits tracker.

Also, my suggestion does not avoid the GREATEST problem anyway, which causes the filesort.

Seems the most promising solution (from a performance point of view) is to have a new derived column with GREATEST precalculated in it, as pwolanin said.

killes@www.drop.org’s picture

@pwolanin: Yes, #25 helps a bit, but not that much. It brought the query down from 23 to 19 seconds.

killes@www.drop.org’s picture

SELECT DISTINCT(t.id) AS nid, t.title, t.type
FROM (
  (SELECT c.nid AS id, n.title, n.type, c.timestamp AS timestamp
   FROM comments c INNER JOIN node n ON n.nid = c.nid
   WHERE c.uid = 83 AND n.status = 1
   ORDER BY timestamp DESC)
  UNION
  (SELECT n2.nid AS id, n2.title AS title, n2.type, c.last_comment_timestamp AS timestamp
   FROM node n2 INNER JOIN node_comment_statistics c ON n2.nid = c.nid
   WHERE n2.uid = 83 AND n2.status = 1
   ORDER BY timestamp DESC)
  UNION
  (SELECT n3.nid AS id, n3.title AS title, n3.type, n3.changed AS timestamp
   FROM node n3
   WHERE n3.uid = 83 AND n3.status = 1
   ORDER BY timestamp DESC)
) AS t
ORDER BY t.timestamp DESC
LIMIT 0, 10;

I've been playing with the above query, which brought the execution time down to 16 seconds.

pwolanin’s picture

@killes- even that much improvement makes it seem worth committing #25 as the basis for further improvement?

killes@www.drop.org’s picture

I should mention that my query gets a slightly different set of results. Mainly the order seems to differ.

@pwolanin: It probably won't hurt.

Dries’s picture

As the node table and the node_comment_statistics table have a 1-on-1 mapping, we might as well merge both into one table.

merlinofchaos’s picture

Before we merge node_comment_statistics into node, we should check into why we even need to.

If there is a 1::1 mapping, node_comment_statistics should be sorting on its primary key. Why would it need a filesort?

Is the problem just that we're not indexing that table properly?

magico’s picture

SELECT
  DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
  GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
FROM
  {node} n
  INNER JOIN {users} u ON n.uid = u.uid
  INNER JOIN {node_comment_statistics} l ON n.nid = l.nid
WHERE
  n.status = 1
ORDER BY
  last_updated DESC

It does not need any DISTINCT. There are only INNER JOINs, each of which should return exactly one row.
Removing the DISTINCT should give some speedup.

SELECT
  n.nid, n.title, n.type, n.changed, n.uid, u.name,
  GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
FROM
  node n
  INNER JOIN users u ON u.uid = n.uid
  INNER JOIN node_comment_statistics l ON l.nid = n.nid
  LEFT JOIN (
    SELECT n.nid
    FROM node n INNER JOIN comments c ON c.nid = n.nid AND (c.status = %d OR c.status IS NULL)
    WHERE n.uid = %d OR c.uid = %d
    GROUP BY n.nid
  ) c ON c.nid = n.nid
WHERE
  n.status = 1
ORDER BY
  last_updated DESC

I removed the DISTINCT; the most expensive part, the LEFT JOIN against "comments", is now filtered in a subquery before the final join.

Cannot test with thousands of records, but the query works.

magico’s picture

It's 2 queries:
* the first should have the DISTINCT removed; there is no more optimization to do short of merging the table schemas
* the second should be applied as it is

moshe weitzman’s picture

i already tried to merge node_comment_statistics into node but the initial benchmarks were not promising. folks are welcome to keep trying though. see http://drupal.org/node/75122

merlinofchaos’s picture

Removing DISTINCT won't help. node_access will just add it back.

kbahey’s picture

@magico, the DISTINCT is unnecessary, I agree. But even if it is removed, I think that db_rewrite_sql() will add it back (a common problem).

@Dries, I suggested the same in #31 (merging back the fields in the node table). The issue is now that the node table will be updated with every comment post and delete, which *might* cause extra locking. This is the best solution from a schema/normalization point of view, but may create bottlenecks.

@merlinofchaos, the main problem is that we use GREATEST(col1, col2), which forces a filesort of the entire table since the database does not know, per row, which of the two values is greater; hence the performance problem.

Creating a new column called last_update and making it an index should solve this bottleneck, but that field is now a derived field that depends on other columns. If a module updates the others, but forgets to update the derived field, then we have data inconsistency.

None of the above solutions appeal to me so far.

magico’s picture

@kbahey: don't forget the second query, that has a different JOIN.

Could you spend some minutes and test both queries, running them at the server (if you have access)?

Thanks

FiReaNGeL’s picture

@kbahey: 'Creating a new column called last_update and making it an index should solve this bottleneck, but that field is now a derived field that depends on other columns. If a module updates the others, but forgets to update the derived field, then we have data inconsistency.'

I don't see this as a problem. Modules shouldn't be updating node.changed and node_comment_statistics.last_comment_timestamp directly anyway.

@killes: Did you try it as-is? Because I wrongly put status = 0 (instead of 1) while testing; this could explain why you only got 6 results :) The LIMITs in the subqueries should also go, as this strategy will screw up paging.

I'll rework the UNION query, as I think it's the key to optimizing this beast. The reason is that we want a merge of two different things: recent posts by the user, and posts that the user recently commented in (correct me if I'm wrong). I'm sure we can optimize those two parts separately and merge them, and that it would be faster. If we do both in one query, it can't use indexes effectively because of the "AND (n.uid = %d OR c.uid = %d)" (a where clause on two different tables, so it can't optimize for an ORDER BY).

I also think that the joins on users and node_comment_statistics can be LEFT JOINs, because there's a 1-1 mapping and they can't be null.

That being said, the db change (adding a new column) is still needed.

Also, removing the DISTINCT, if it's useless, is needed. We should rework the DB API if it's adding it back uselessly, as DISTINCT can be costly.

We should get in the DB change and optimize the query from there.

dww’s picture

to clarify, in english, what this query is trying to accomplish:

find all the nodes that the user has either composed or commented in, and sort the list by the "latest activity time", the most recent of either a) the last time a comment was added, or b) when the last edit of the node happened.

for obvious reasons, this "latest activity time" stuff is the killer in this query, which is why way back in the original post, i suggested we just store that timestamp ourselves, update it whenever either the node is edited or a comment is added, have an index on that field, and (probably) see all of our problems go away. i'm fine with renaming {node_comment_statistics} to just {node_statistics} and putting the column in there, to avoid LOCKs on the {node} table for every new comment...

yes, it's "derived" info, in the sense that it's always equal to either {node}.changed or {node_comment_statistics}.last_comment_timestamp. however, last_comment_timestamp is only set in exactly 2 places in all of core: 1) comment_nodeapi('insert') and 2) _comment_update_node_statistics(). {node}.changed is only touched via node_save(). so, there's only 3 places we'd have to change to support this field. no contribs should ever be updating these columns directly, anyway.
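
for illustration only (not from an actual patch; the column name just follows the suggestion above, and $nid / $node come from the surrounding comment/node code), keeping the new column in sync could be as simple as:

  // whenever a comment is added or the statistics are recalculated,
  // bump the derived timestamp along with last_comment_timestamp:
  db_query("UPDATE {node_comment_statistics} SET last_updated = %d WHERE nid = %d", time(), $nid);
  // and in node_save(), so node edits count as activity too:
  db_query("UPDATE {node_comment_statistics} SET last_updated = %d WHERE nid = %d", $node->changed, $node->nid);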

i don't want to spend the time on a patch if it's going to be vetoed outright by the D6 core maintainers, but i've always believed this is the simplest and most effective solution to this problem. should we at least try it and see? the patch should be relatively simple (~100 lines, including the schema upgrade), and we could run it on s.d.o for testing. but, if it's doomed from the start, i don't want to put effort into that approach.

moshe weitzman’s picture

@dww - i always thought that was the best solution too. go for it.

the "but it is another derived column" argument is not worth discussing anymore. this problem requires some compromise.

dww’s picture

Assigned: Unassigned » dww
FileSize
10.7 KB

here's a patch for the new column:
- schema upgrade (coded for both mysql and pgsql).
- changes to update the column correctly in comment.module and node.module.
- changes the tracker query to use the new column instead of GREATEST().

i tested the upgrade path on mysql, but haven't had a chance to test pgsql yet. the schema update is probably more complicated than it should be, since it started out as a rename of the table to just {node_statistics}, but that seemed like a lot more of a diff than i wanted to mess with in this patch. so, we could probably simplify things in there, but it certainly works as is (and has the benefit of a single query to update all the records, even on mysql 4.0 sites like d.o). ;)

in my testing, this all seems to work great. i don't have a good environment for stress testing and benchmarking, so help with that would be most appreciated.

FiReaNGeL’s picture

Thanks dww for this great initiative! I manually patched my 5.x install and everything is working #1; the upgrade path worked too.

Sadly, I think I have bad news to report; it didn't change a thing on my dev box (performance-wise). For some reason, mysql refuses to use the new index to sort.

I'd be interested in seeing the results on scratch.d.o.
It would also be interesting to have others bench it.

The EXPLAIN changed a bit

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	u 	ALL 	PRIMARY 	NULL 	NULL 	NULL 	3 	Using temporary; Using filesort
1 	SIMPLE 	n 	ref 	PRIMARY,node_status_type,uid 	uid 	4 	zscience.u.uid 	5894 	Using where
1 	SIMPLE 	l 	eq_ref 	PRIMARY 	PRIMARY 	4 	zscience.n.nid 	1 	 
1 	SIMPLE 	c 	ref 	lid,status 	lid 	4 	zscience.n.nid 	2 	Using where; Distinct

It's strange, 'cause I thought, as dww and others did, that an additional column to get rid of the ORDER BY GREATEST would help things. The problem is that mysql can't use it.

From: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html

In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. These cases include the following:

You are joining many tables, and the columns in the ORDER BY are not all from the first non-constant table that is used to retrieve rows. (This is the first table in the EXPLAIN output that does not have a const join type.)

So the problem is that we're doing a WHERE on many tables, and mysql can only use one index at a time, so it cannot sort by our new index.

The following query is 3 times faster than the original query (both pre and post dww's patch, with a proper ORDER BY on the new column and without the GREATESTs) and gives the same results (on my dev box at least). It's better than my last one, which couldn't be paged and gave funky results:

(SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name,
        GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
 FROM node n
 LEFT JOIN node_comment_statistics l ON n.nid = l.nid
 LEFT JOIN users u ON n.uid = u.uid
 WHERE n.uid = 1
 AND n.status = 1)
UNION
(SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name,
        GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count
 FROM node n
 LEFT JOIN comments c ON n.nid = c.nid
 LEFT JOIN node_comment_statistics l ON n.nid = l.nid
 LEFT JOIN users u ON n.uid = u.uid
 WHERE c.uid = 1
 AND n.status = 1
 AND (c.status = 0 OR c.status IS NULL))
ORDER BY last_updated DESC
LIMIT 0, 10

And for the record, I found out that the DISTINCT was needed (for the cases where you commented in your own node). UNION queries de-duplicate by default, so there's no need to add it back.

Dries’s picture

I think dww's approach makes sense. Good analysis.

I'd still like to see this combined with http://drupal.org/node/75122 -- maybe that will make the index work!

killes@www.drop.org’s picture

@fireangel: I don't get a performance improvement from your new query.

@dww: can you patch s.d.o?

@Dries: I am afraid #75122 will give us some additional table locks. However, my offer to benchmark it still stands if somebody updates the patch.

dww’s picture

i patched s.d.o and ran the db update (took just over a minute to run, which is pretty good IMHO). sadly, neither http://scratch.drupal.org/tracker nor http://scratch.drupal.org/tracker/46549 seems any faster. :( stupid mysql.

FiReaNGeL’s picture

dww: yeah, it seems consistent with what i saw on my dev box :(

killes: strange; maybe the improvements i see are mysql 5.1 specific, or the size of the d.o db changes things. Still, i find it strange and disappointing.

Here's another idea: we could restrict "recent posts" to a month, with links to see the past 6 months, 1 year, and since the dawn of (unix) time. At the end of the pager, we could add a link such as 'click here to see more' or something along those lines.

This would definitely be faster. And I don't think people are interested in seeing their entire post history most of the time.
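
For example, something along these lines (purely a sketch; the cutoff and the query are illustrative, not a patch):

  // only consider activity from the last 30 days by default.
  $cutoff = time() - 30 * 24 * 60 * 60;
  $sql = "SELECT n.nid, n.title, n.type, n.changed, n.uid, l.last_comment_timestamp, l.comment_count
          FROM {node} n
          INNER JOIN {node_comment_statistics} l ON n.nid = l.nid
          WHERE n.status = 1 AND l.last_comment_timestamp > %d
          ORDER BY l.last_comment_timestamp DESC";
  $result = db_query_range(db_rewrite_sql($sql), $cutoff, 0, 25);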

kbahey’s picture

Strange and sad.

I was hoping this would improve things.

:-(

pwolanin’s picture

Can we combine dww's patch with #25 to avoid the join?

Perhaps also a compound index is needed (at least for MySQL)?

kbahey’s picture

I am trying to contact a MySQL expert for this (why there is no improvement with the latest patch).

Will let everyone know if I find something useful.

magico’s picture

Can someone instruct me on how I can generate thousands of nodes and comments in a heterogeneous way, so I can test my queries above?

Thanks

dww’s picture

@magico: install devel.module and devel_generate.module.

pwolanin’s picture

@dww - I think the devel module needs a patch for generate - it was not properly populating the {node_comment_statistics} table for me. I'll assume that the current code is just a refactoring of the 5.x scripts, which saved nodes directly to the DB rather than calling node_save().

David Strauss’s picture

FileSize
5.14 KB

Here is a patch against Drupal 5.1. I chose to make it against Drupal 5.1 because drupal.org desperately needs the update. It can easily be adapted to Drupal 6.

This patch splits the bad tracker queries into two queries: one to get recent nodes and another to get recent comments. The results are intelligently combined in PHP. With this change, the queries can now run in logarithmic time.
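
The merge step presumably looks roughly like this (a sketch of the technique, not the actual patch; $node_result and $comment_result stand for the two query results, each ordered by last_updated DESC):

  $posted = array();
  while ($row = db_fetch_object($node_result)) {
    $posted[] = $row;
  }
  $commented = array();
  while ($row = db_fetch_object($comment_result)) {
    $commented[] = $row;
  }
  // Merge the two pre-sorted lists on last_updated and de-duplicate on nid.
  $merged = array();
  while (($posted || $commented) && count($merged) < 25) {
    if (!$commented || ($posted && $posted[0]->last_updated >= $commented[0]->last_updated)) {
      $row = array_shift($posted);
    }
    else {
      $row = array_shift($commented);
    }
    if (!isset($merged[$row->nid])) {
      $merged[$row->nid] = $row;
    }
  }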

David Strauss’s picture

I've updated the patch to address a problem where you could only see results from nodes you posted.

David Strauss’s picture

Ignore the patches I've posted so far. I'm working on a few fixes.

David Strauss’s picture

Here's my latest patch, produced in cooperation with killes, who helped by testing on scratch.drupal.org.

This patch slightly changes tracker functionality. The code is optimized around showing a reasonably long history of recent comments and node updates. It doesn't show the entire history if it's really long. (If the history is short, the functionality should be nearly identical.) This trade-off should be alright because any site with enough nodes and comments for the limit to affect the results probably has performance issues with the current module's implementation.

The following indexes (shown below in MySQL syntax) need to be added:
ALTER TABLE `comments` ADD INDEX `tracker_user` ( `uid` , `status` , `timestamp` );
ALTER TABLE `comments` ADD INDEX `tracker_global` ( `status` , `timestamp` );
ALTER TABLE `node` ADD INDEX `tracker_user` ( `uid` , `status` , `changed` );
ALTER TABLE `node` ADD INDEX `tracker_global` ( `status` , `changed` );
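
(If/when this gets a proper upgrade path, wrapped in an update function it would look roughly like the following -- MySQL syntax only, and the function name/number is made up:)

function tracker_update_1() {
  $ret = array();
  $ret[] = update_sql("ALTER TABLE {comments} ADD INDEX tracker_user (uid, status, timestamp)");
  $ret[] = update_sql("ALTER TABLE {comments} ADD INDEX tracker_global (status, timestamp)");
  $ret[] = update_sql("ALTER TABLE {node} ADD INDEX tracker_user (uid, status, changed)");
  $ret[] = update_sql("ALTER TABLE {node} ADD INDEX tracker_global (status, changed)");
  return $ret;
}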

David Strauss’s picture

webchick requested that $sanity_limit pull from a variable (even though there won't be an interface exposed).
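
i.e. presumably something along these lines ($sql being one of the tracker queries, and the variable name being a guess):

  // falls back to the hard-coded 1000-row cap if the variable isn't set.
  $sanity_limit = variable_get('tracker_sanity_limit', 1000);
  $result = db_query_range($sql, 0, $sanity_limit);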

Senpai’s picture

@ #51 by dww

Why is it necessary for Tracker to display *all* posts and comments by a single user, and then try to sort that mountain of data by most recent activity? Wouldn't it be much more efficient to offer the user a chance to see nodes created by that uid OR comments posted by that uid? I mean no disrespect, but it seems very difficult, from a user-friendliness standpoint, to offer a complete list of everything that user has ever done and lump it into page after page of links. Not to mention that nodes and comments are still two different entities, so why not treat them as such?

What if, since you're going back into Tracker to optimize, you look at splitting the query into two parts? Offer users a tab to see their own comments, or a tab to see their own posts. After all, in all my years of surfing sites, I've never needed to look up all my own comments while searching for content that I created, and I've never needed to see stories I've posted while I was looking for that thread I commented to.

You could cut down on the amount of data retrieved by almost 50% if you went this route. Now, having said that, I have no idea how much it would jack up the Tracker module to attempt this. It might be darn easy, but then again, it could require a complete refactor. I just don't know.

pwolanin’s picture

I'm not really liking that you run individual queries to get this data:

+  foreach ($results as $node) {
+    // Load these data here (instead of in a JOIN) to only load data for the current page
+    $node->comment_count = db_result(db_query('SELECT comment_count FROM {node_comment_statistics} WHERE nid = %d', $node->nid));
+    if (!isset($names[$node->uid])) {
+      $names[$node->uid] = db_result(db_query('SELECT name FROM {users} WHERE uid = %d', $node->uid));
+    }
+    $node->name = $names[$node->uid];

Wouldn't it be more efficient to run through the array, collect the nid and uid values and get them in 2 queries?

Also, I don't like the naming of the array as $results. Almost everywhere in core $results is used as the name of the variable holding the resource returned from a db_query(). How about calling it something like $tracker or just $rows?

David Strauss’s picture

@Senpai
It was one of the first things I asked killes about. Killes thought that it wouldn't be friendly to users. This whole problem becomes quite trivial if you don't combine comments with nodes. As you can see from my patch, I actually split nodes and comments into two separate queries to get the speed increase.

David Strauss’s picture

Global comments query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000;
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+
| 1 | SIMPLE | c | ref | lid,tracker_global | tracker_global | 1 | const | 359 | Using where |
| 1 | SIMPLE | n | ref | PRIMARY,status,node_status_type,nid,tracker_global | PRIMARY | 4 | fk_portal.c.nid | 1 | Using where |
+----+-------------+-------+------+----------------------------------------------------+----------------+---------+-----------------+------+-------------+

Global nodes query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
| 1 | SIMPLE | n | ref | status,node_status_type,tracker_global | tracker_global | 4 | const | 222 | Using where |
+----+-------------+-------+------+----------------------------------------+----------------+---------+-------+------+-------------+
1 row in set (0.00 sec)

User nodes query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
| 1 | SIMPLE | n | ref | status,uid,node_status_type,tracker_user,tracker_global | tracker_user | 8 | const,const | 77 | Using where |
+----+-------------+-------+------+---------------------------------------------------------+--------------+---------+-------------+------+-------------+
1 row in set (0.00 sec)

User comments query:
EXPLAIN SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1 ORDER BY last_updated DESC;
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
| 1 | SIMPLE | c | ref | lid,tracker_user,tracker_global | tracker_user | 5 | const,const | 47 | Using where |
| 1 | SIMPLE | n | ref | PRIMARY,status,node_status_type,nid,tracker_global | nid | 4 | fk_portal.c.nid | 1 | Using where |
+----+-------------+-------+------+----------------------------------------------------+--------------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)

These are running on my company's portal, but the query plan shouldn't differ elsewhere.

David Strauss’s picture

@pwolanin

"Wouldn't it be more efficient to run through the array, collect the nid and uid values and get them in 2 queries?"

It would only decrease chatting with the database. If you give MySQL a disjoint set of values to match (like a list of uids), it will basically re-run the query to satisfy each value. Plus, the collection, pre-processing, and post-processing would possibly cancel out the meager reduction in communication with the DB.

webchick’s picture

Output from the devel module. Values are: ms, # of times the query ran, query executed.

drupal.org, without the patch:

/tracker:

37711.93	1	pager_query	SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN users u ON n.uid = u.uid INNER JOIN node_comment_statistics l ON n.nid = l.nid WHERE n.status = 1 ORDER BY last_updated DESC LIMIT 0, 25

/tracker/24967:

49569.59	1	pager_query	SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 24967 OR c.uid = 24967) ORDER BY last_updated DESC LIMIT 0, 25
1.06	1	node_last_viewed	SELECT timestamp FROM history WHERE uid = '24967' AND nid = 79504

scratch.drupal.org, with the patch:
/tracker:

44.68	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
115.69	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
5.2	1	tracker_page	SELECT comment_count FROM node_comment_statistics WHERE nid = 141187
5.49	1	tracker_page	SELECT name FROM users WHERE uid = 23

/tracker/24967:

35.22	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 24967 ORDER BY last_updated DESC LIMIT 0, 1000
36.75	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 24967 ORDER BY last_updated DESC LIMIT 0, 1000
2.7	1	tracker_page	SELECT comment_count FROM node_comment_statistics WHERE nid = 94153
0.64	21	tracker_page	SELECT name FROM users WHERE uid = 24967

Dries’s picture

Might be better to run both on the same database (i.e. run both on scratch.drupal.org) -- it might make a big difference.

That said, this seems like a clear winner. I'll try to look at it more closely tomorrow morning. In the meantime, feel free to work on this more. We might be able to squeeze more out of this.

webchick’s picture

Right you are. Here are the numbers from scratch.drupal.org, sans patch, so we're comparing apples to apples:

tracker:

758.58	1	pager_query	SELECT COUNT(n.nid) FROM node n WHERE n.status = 1
16099.19	1	pager_query	SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN users u ON n.uid = u.uid INNER JOIN node_comment_statistics l ON n.nid = l.nid WHERE n.status = 1 ORDER BY last_updated DESC LIMIT 0, 25

tracker/24967:

17896.58	1	pager_query	SELECT COUNT(DISTINCT(n.nid)) FROM node n LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 24967 OR c.uid = 24967)
22946.22	1	pager_query	SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 24967 OR c.uid = 24967) ORDER BY last_updated DESC LIMIT 0, 25

FiReaNGeL’s picture

Great improvement! Same strategy I tried to apply within MySQL with my UNION query, but this is definitely better! Kudos, David :)

dww’s picture

i helped webchick get the numbers on s.d.o, and everyone agreed this looks like a vast improvement. feeling bold, i applied the patch on d.o, and the tracker pages clearly speak for themselves. if we want to tweak this and make it even better, great, but d.o is finally usable again, even on the old db hardware. in order to stop the bleeding, i decided to take bold action...

huge thanks to david strauss for this work!

webchick’s picture

drupal.org, post-patch:

tracker:

19.44	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
22.28	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
29.38	1	tracker_page	SELECT comment_count FROM node_comment_statistics WHERE nid = 144841
11.19	1	tracker_page	SELECT name FROM users WHERE uid = 59521

tracker/24967:

12.27	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 24967 ORDER BY last_updated DESC LIMIT 0, 1000
26.09	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 24967 ORDER BY last_updated DESC LIMIT 0, 1000
8.42	1	tracker_page	SELECT comment_count FROM node_comment_statistics WHERE nid = 79504
5.13	1	tracker_page	SELECT name FROM users WHERE uid = 24967

We owe David Strauss, and his company Four Kitchen Studios who paid to let him work on this all day ;), a great deal of gratitude. If possible, please credit both in the commit message.

Btw, it looks like we can tweak this further by getting rid of the username query in /tracker.

catch’s picture

on my site. 15,000 nodes, 176,000 comments, 3,750 users

etch vps, 1gb ram, php5.2, mysql5, apache2.2, apc.

before patch:

/tracker
478.79 1 pager_query SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN users u ON n.uid = u.uid INNER JOIN node_comment_statistics l ON n.nid = l.nid WHERE n.status = 1 ORDER BY last_updated DESC LIMIT 0, 25

/tracker/11
1021.4 1 pager_query SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 11 OR c.uid = 11) ORDER BY last_updated DESC LIMIT 0, 25
739.03 1 pager_query SELECT COUNT(DISTINCT(n.nid)) FROM node n LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 11 OR c.uid = 11)

after patch

/tracker
415.74 1 tracker_page SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000

/tracker/11
338.51 1 tracker_page SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 11 ORDER BY last_updated DESC LIMIT 0, 1000

And on a lot of page hits it's getting served straight from the MySQL query cache (0.35 ms timings), not sure if that's relevant but it gave me a shock when I saw it.

kbahey’s picture

Excellent work David!

Silly MySQL was getting fooled by the one combined query; it handles the discrete queries much better. No filesorts anywhere anymore.

I guess we can return the new database server for a refund now ... :-)

dww’s picture

@catch are you sure you added the new indexes that the patch relies on? see comment #68 for details. the patch doesn't (yet) include a schema update to add these indexes via update.php so you have to do it manually.

David Strauss’s picture

@catch

Your results look like you haven't run the index updates.

webchick’s picture

Ok, here is David's patch with the following changes:
1) Rolled against HEAD, rather than 5.x.
2) Minor re-formatting/re-wording of comments to fit coding standards
3) Upgrade path. I included this in a tracker.install file, because I wasn't sure whether those indexes would slow down the site if you're not using tracker. I'm also not sure if we should have tracker.install adding and removing indexes willy-nilly (on uninstall, etc.)... on scratch.drupal.org this locked the site for over a minute.

Remaining issues:
1) (I think) Global tracker should not be doing the username query.
2) There's a small bug... if you go to http://drupal.org/tracker?page=55, the last entry is an empty post by Anonymous posted 37 years 20 weeks ago.

I will need help to solve these, and also input from someone who knows databases well on where these index changes should go (system.install or tracker.install).

webchick’s picture

FileSize
9.35 KB

Sigh. How about a patch? :)

Michelle’s picture

I noticed that my recent posts is now a bit odd... The status update post, for example, is showing that it was last updated 1 day 12 hours ago even though there are 7 new posts since I last read it just a few hours ago. I also seem to remember there being other more recent posts at the top, though I can't swear to that.

On the 37 week post in #85, that might not be new. Some time ago... Several months at least, maybe even a year or more ago, I got curious and looked at the end of the tracker and I remember there being some weirdness with impossibly old posts. I know that's vague and not terribly helpful, but I thought I should mention that it might be an existing bug from before this work was done.

Michelle

David Strauss’s picture

Replace:

    // Case where only $node is set
    else {
      $results[$commented_node->nid] = $commented_node;
    }

With:

    // Case where only $node is set
    else if ($commented_node) {
      $results[$commented_node->nid] = $commented_node;
    }

I inaccurately assumed that something would get added every loop, but that's not the case because of duplicates.

David Strauss’s picture

Hey, I even messed up the comment:

    // Case where only $commented_node is set
    else {
      $results[$commented_node->nid] = $commented_node;
    }
David Strauss’s picture

I keep pasting the wrong thing:

    // Case where only $commented_node is set
    else if ($commented_node) {
      $results[$commented_node->nid] = $commented_node;
    }

This fixes the phantom post on the last tracker page.

mcurry’s picture

subscribing

catch’s picture

@dww/david strauss: oops.

Won't be able to run again until a bit later but will post back!

xamox’s picture

Thanks! the site seems 10x faster!

catch’s picture

OK with patch, same spec as before, with the index update this time:

/tracker
1.49	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000
1.33	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000

/tracker/11

21.22	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.timestamp AS last_updated FROM comments c INNER JOIN node n ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 11 ORDER BY last_updated DESC LIMIT 0, 1000

3.34	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 11 ORDER BY last_updated DESC LIMIT 0, 1000

I love you all.

catch’s picture

arrgh, that one was query cache fooling me. Still about

21.47

which is 20-50 times as fast on 15k nodes, 170k comments, 3.7k users. Amazing work.

Michelle’s picture

Ok, I just found another post that is showing up properly as updated in all recent posts and not in my tracker. http://drupal.org/node/112421 is the post. And http://drupal.org/node/144228 is still doing it as well as I reported a few comments back.

Michelle

David Strauss’s picture

@Michelle

Node 144228 is the 8th item in your tracker. Node 112421 is on the second page of your tracker.

webernet’s picture

There also seems to be a small bug - the last entry in each tracker is strange.

For example:

<tr class="even"><td></td><td><a href="/node/"></a> </td><td>Anonymous (not verified)</td><td class="replies">0</td><td>37 years 20 weeks ago</td> </tr>
David Strauss’s picture

@webernet This issue is corrected with the change in #90.

@Michelle I should clarify that there is a change in the method the tracker uses to determine how recent something is. The old tracker used the latest comment timestamp even if it wasn't your comment. My tracker patch uses your latest comment timestamp on a node. I think this is an improvement. Why should a node you commented on last year appear on the first page of your tracker just because someone else commented on it today?
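Concretely, the difference is just which timestamp drives the sort; both shapes appear in the queries quoted earlier in this issue:

-- Old behaviour: sort by the most recent activity on the node by anyone.
ORDER BY GREATEST(n.changed, l.last_comment_timestamp) DESC
-- Patched behaviour: sort by the timestamp of your own most recent comment on the node.
ORDER BY c.timestamp DESC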

pwolanin’s picture

@David - I think ordering should be based on when any comment is made - isn't that the point of the tracker? How else can I see when my comment provoked another comment, or someone else had related information. I think there was a suggestion above for separate tabs with "my recent comments" and "my recent posts", but that's separate.

David Strauss’s picture

@pwolanin You can easily see when your (recent) comments have created discussion. Just go through the first few pages of your tracker and look for "X new" next to the items. Moreover, if the goal is to see how your comments have created discussion, then we should create a better, thread-based model for tracking.

pwolanin’s picture

@David - I disagree. I already know when I made a comment; the purpose (to me at least) of the tracker is to know when others comment on the same threads.
In any case, this patch should not change the function of the tracker - at least not without more serious discussion.

David Strauss’s picture

@pwolanin Always keep in mind that the tracker isn't just for your use. I might want to see nodes you've recently commented on. If you'd like to see discussions you've sparked, I'd prefer to add a tracker tab for that purpose alone.

FiReaNGeL’s picture

I agree with pwolanin here; to me the tracker is useful for checking for replies to comments I made in the past. As I understand it, this functionality is gone with your (quite excellent) patch. Maybe there's a tradeoff to be made, but I'd like to keep that functionality :)

dww’s picture

i agree w/ pwolanin, too. the underlying behavior should stay the same: any edit to the node or comment (regardless of who made it) counts as "activity", and the tracker should be sorted by most recent activity, period.

also, sorry i haven't gotten around to re-patching d.o based on the bug fixes posted here. however, i'd like to just wait for a new patch (a real patch, please) that a) includes the bug fixes and b) returns the tracker to its original behavior...

thanks,
-derek

Michelle’s picture

Just getting back to my computer now. Ok, so the weirdness I've been experiencing is intentional by the sound of it, but I don't agree. I use my tracker all day long to see if any threads I've commented on have been updated. Having to go through every single page of my tracker looking for updated markings every time would suck big time. They really need to float to the top just like they do in the all posts tracker. To answer your question, yes, if someone comments on a thread I participated in a year ago, I want to see it.

Michelle

David Strauss’s picture

@dww I'm not sure we can return tracker behavior to the original behavior with acceptable performance. Currently, my tracker code JOINs comments with nodes to produce its results. Listing the tracker items by the latest comment on the node requires another JOIN from nodes to the node_comment_statistics table. Such a JOIN may push the query back into temp_table/filesort territory. I initially tried to optimize the global feed to use the node_comment_statistics table instead of comments, but changing the JOIN sent the query back into using temp_table/filesort because of the condition requiring nodes to be published. MySQL seems to want to apply WHERE restrictions first, even if they're on the node table and very inclusive. Returning tracker to the original behavior may involve larger schema changes than adding a few indexes.

catch’s picture

another -1 on functionality change. If people have to go back through ten pages to find replies to their posts, then we lose the benefit of the query optimisation from all those extra page requests.

catch’s picture

Also, "last updated" now displays the last time you posted, not when the node was last updated, which isn't what it says. This is obviously a side-effect of the change in behaviour, but in terms of UI it's a bug since that column isn't there to tell you when you commented.

kbahey’s picture

David

How about pushing all the fields (or the two we care about) from node_comment_statistics back into the node table? This would avoid the join you talk about, and it is a 1:1 relationship anyway.

Moshe tried a bit with this. See above for the issue on that.

Dries’s picture

I also use the tracker to look for follow-ups on my posts/comments.

webchick’s picture

IMO, splitting these into two tabs would not be the end of the world. It would make it easier for me to find handbook pages I wrote vs. random forum posts where I helped someone (I care a lot more about making sure I'm keeping up with the comments on the former :)). And people who want just a single tab with the old behaviour on their sites can just use Views, which comes with a 'tracker view for this purpose.

But I didn't quite understand the most recent comments... does that mean we'd actually need *four* tabs? "recent posts", "recent comments", "my recent posts", "my recent comments" ? If so that would be... icky. ;)

David Strauss’s picture

@kbahey This would be a good idea. It's the {node_comment_statistics} and {node} JOIN that kills the query. What MySQL does is see the published = 1 restriction on {node} and no restriction on {node_comment_statistics}, so it starts from {node} with what amounts to a full table scan.

@catch We only lose the benefit if they go through about 300 pages. :-) Fixing last updated is easy: we only need to load those values for the current page.

@webchick Along those lines, we could split it into "My posts" and "My conversations." The latter would list nodes where you've commented, in descending order of the latest comment by anyone.

David Strauss’s picture

I'm going to play around with joins to {node_comment_statistics} to see if there's anything that won't require a temp_table.

David Strauss’s picture

After some research and experimentation, I've concluded that placing the fields from {node_comment_statistics} in {node} will not help. The problem is that, for the user tracker, we put conditions on {comments} but sort on a different table. This makes sorting using a key impossible, forcing a filesort.
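To illustrate (a rough sketch of the shape, not the exact patch code): the user tracker filters on {comments} but would have to sort on {node_comment_statistics}, and no single index can cover a WHERE and an ORDER BY that live in different tables:

SELECT n.nid, n.title, l.last_comment_timestamp AS last_updated
FROM comments c
INNER JOIN node n ON n.nid = c.nid
INNER JOIN node_comment_statistics l ON l.nid = n.nid
WHERE c.uid = %d AND c.status = 0 AND n.status = 1
-- The filter columns live in {comments} but the sort column lives in {node_comment_statistics},
-- so MySQL builds a temporary table and filesorts it no matter which indexes exist.
ORDER BY l.last_comment_timestamp DESC
LIMIT 0, 25;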

David Strauss’s picture

Okay, here's what needs to happen to the database schema to maintain tracker performance while returning it to the original functionality. The last_comment_timestamp column needs to appear in {comments}. (Yes, it's a violation of normalization, but it's the only way to make this work.)

This needs to run every time a comment is posted:
UPDATE {comments} SET last_comment_timestamp = %d WHERE nid = %d (with the obvious values for timestamp and nid)
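With that column in place, the user tracker can keep its filter and its sort in the same table (a sketch; this is the same shape that shows up in the benchmarks further down):

SELECT n.nid, n.title, n.type, n.changed, n.uid, c.last_comment_timestamp AS last_updated
FROM node n
INNER JOIN comments c ON n.nid = c.nid
-- WHERE and ORDER BY now both hit {comments}, so one composite index can serve them.
WHERE n.status = 1 AND c.status = 0 AND c.uid = %d
ORDER BY last_updated DESC
LIMIT 0, 25;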

David Strauss’s picture

FileSize
6.4 KB

Updates for system.install:
ALTER TABLE {comments} ADD last_comment_timestamp INT(11) NOT NULL;
ALTER TABLE {comments} DROP INDEX tracker_user, ADD INDEX tracker_user (uid, status, last_comment_timestamp);

This patch may need some revisions, but I'd rather post early so people can have a look.

David Strauss’s picture

One more update for system.install (in addition to those in #117):
ALTER TABLE {comments} DROP INDEX tracker_global, ADD INDEX tracker_global (status, last_comment_timestamp);
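A quick sanity check after running these updates is to EXPLAIN the global tracker query and confirm that the new index is actually picked (a sketch of what to look for, not captured output):

EXPLAIN SELECT n.nid, c.last_comment_timestamp AS last_updated
FROM node n
INNER JOIN comments c ON n.nid = c.nid
WHERE n.status = 1 AND c.status = 0
ORDER BY last_updated DESC
LIMIT 0, 25;
-- The row for {comments} should show key: tracker_global, and Extra should not contain
-- "Using temporary" or "Using filesort".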

catch’s picture

happy to test changes, but I only have a 5.1 site to test on. Is it cheeky to ask for a 5.1 version of that patch?

David Strauss’s picture

FileSize
8.65 KB

Complete patch for 5.1, including schema updates.

David Strauss’s picture

FileSize
8.71 KB

This is a reordering of the updates in the 5.1 patch to create the index only after the column gets populated.

David Strauss’s picture

Title: try to improve tracker query performance » Tracker performance improvements
Assigned: dww » David Strauss
catch’s picture

FileSize
8.61 KB

@David Strauss, thanks for 5.1 patch

I've applied that. The database update went fine and the patch applied OK. Attached is a new patch which just fixes the db_queryd > db_query typo and removes some folder information to make it apply more easily.

Something's going on with the main tracker query though:

/tracker/1

6.84	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, n.changed AS last_updated FROM node n WHERE n.status = 1 AND n.uid = 1 ORDER BY last_updated DESC LIMIT 0, 1000
6.01	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.last_comment_timestamp AS last_updated FROM node n INNER JOIN comments c ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1 ORDER BY last_updated DESC

/tracker

4842.86	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.last_comment_timestamp AS last_updated FROM node n INNER JOIN comments c ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC

on comments I have indexes: tracker_user (uid, status, last_comment_timestamp) and tracker_global (status, last_comment_timestamp)

last_comment_timestamp seems to have been updated OK in the comments table as well, so I'm not sure why it's so much slower than tracker/1.

catch’s picture

FileSize
8.61 KB

argg, last patch added my own typo in. This one ought to be ok. Going to try to see whether my 2,000-4,000ms /tracker query is a result of the new query or my own incompetence applying patches ;)

catch’s picture

alright, for some reason $sanity_limit isn't being called when the query gets executed:

I hard coded LIMIT 0, 1000 into $comments_sql

17.11	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.last_comment_timestamp AS last_updated FROM node n INNER JOIN comments c ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000

which is fine. can confirm that "my recent posts" now functions as it used to as well.

Can't work out why $sanity_limit isn't running though, it seems to be the same as before.

David Strauss’s picture

FileSize
8.61 KB

You replaced db_queryd (which was admittedly a typo) with db_query. It should have been replaced with db_query_range. Here's an updated patch.

David Strauss’s picture

The LIMIT patch I posted should fix the problem. Because placing a LIMIT on the query speeds it up so much, the 2 to 4 seconds spent on the non-limited query must be due to the sheer quantity of results. Every comment is in that result set when there's no LIMIT.

catch’s picture

We have c. 190k comments, I think that's why the LIMIT made such a difference.

Changed to db_query_range (manually, but patch looks fine), took out hard coding:

17.19	1	tracker_page	SELECT n.nid, n.title, n.type, n.changed, n.uid, c.last_comment_timestamp AS last_updated FROM node n INNER JOIN comments c ON n.nid = c.nid WHERE n.status = 1 AND c.status = 0 AND 1 ORDER BY last_updated DESC LIMIT 0, 1000

RTBC?

David Strauss’s picture

catch’s picture

"ready to be committed" - i.e. it works great, no bugs left that I can see, so just time to commit to core. I'd imagine core comitters will want another review or two before it goes in, so I didn't mark it as such.

David Strauss’s picture

Considering the necessity of having this patch on Drupal.org and the fact that it's already made for 5.1, I'd like to roll this patch into 5.2 and 6.x-dev.

David Strauss’s picture

@catch The patch still needs PostgreSQL schema updates.

catch’s picture

I think the way it's done now is committed to 6.x then backported to 5.x-dev (which'd mean it'd get into the 5.2 release when that comes out assuming a security fix doesn't beat your patch going in).

BTW, I know this was for drupal.org, but it's made a big difference to the load on my own site as well, so thanks again!

kbahey’s picture

The last_comment_timestamp column needs to appear in {comments}. (Yes, it's a violation of normalization, but it's the only way to make this work.)

This needs to run every time a comment is posted:
UPDATE {comments} SET last_comment_timestamp = %d WHERE nid = %d (with the obvious values for timestamp and nid)

I have serious objections to this.

So far the denormalizations we have been doing were of the benign type, such as having a 1:1 relationship broken into two tables to minimize the locks, etc.

This one is far less benign, because of the sheer number of rows involved. We have far more data redundancy.

Before I say more, though, let us see the impact of this via benchmarks. If it doesn't do much good, then it is dead in the womb. If it does improve things, we can weigh the trade-off of redundancy vs. performance.

David Strauss’s picture

@kbahey Object all you want, but maintaining the original tracker interface at a good level of performance requires it. The only way to avoid temporary tables and filesorts is to have the WHERE and ORDER BY criteria in the same table. (If they're not in the same table, an index cannot span them.) Because the WHERE criteria are in the comments table, the ORDER BY criteria must exist there, too. Take a close look at the tracker query for comments, and you'll see what I mean.
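As a worked example of that rule, using the tracker_user (uid, status, last_comment_timestamp) index from the updates above: the equality conditions match the leading index columns and the sort column comes last, so MySQL can read the rows already in the right order:

SELECT c.nid, c.last_comment_timestamp
FROM comments c
WHERE c.uid = %d AND c.status = 0         -- equality filters: the index prefix (uid, status)
ORDER BY c.last_comment_timestamp DESC    -- trailing index column: no filesort needed
LIMIT 0, 25;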

The UPDATE query I've written shouldn't have too bad of an effect. It scales linearly with the number of comments on a node. It only takes logarithmic time to identify the comments associated with a node because of the nid index on comments. The UPDATE also won't affect row sizes because we're only updating an integer. Keeping the rows the same size goes a long way to keeping performance up.

{node_comment_statistics} is a bad example of denormalization. I'm working toward removing it from Drupal entirely. A query has to JOIN it with {node} to make it useful. The extra JOIN reverses many of the performance benefits it was designed to provide.

Chris Johnson’s picture

I tend to agree with David here. Denormalizing by duplicating data to improve performance is a classic case of when denormalization is justified.

David Strauss’s picture

One other place I could denormalize instead of {comments} is {history}. By putting an interacted (boolean) and node_last_updated (timestamp) into there, it would be easy to query for nodes that the user had interacted with but that had been updated since the last visit. It would be slightly less duplication of data than in {comments}. It would also allow us to have only one query for tracker because node_last_updated would be updated with a node change or a comment.
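A rough sketch of what the per-user tracker query could look like under that approach (the interacted and node_last_updated columns are only proposed here and don't exist yet; {history} already carries uid and nid):

-- Hypothetical: one {history} row per (uid, nid), kept current on node save and comment save.
SELECT h.nid, h.node_last_updated AS last_updated
FROM history h
WHERE h.uid = %d AND h.interacted = 1
ORDER BY h.node_last_updated DESC
LIMIT 0, 25;
-- An index on (uid, interacted, node_last_updated) would let this read in sorted order,
-- and titles etc. could be loaded separately for just the 25 rows on the current page.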

David Strauss’s picture

Status: Needs review » Needs work

I'm sending this back to "code needs work" so I can move the denormalization to {history}.

macm’s picture

Title: Tracker performance improvements » Custom Tracker page
Version: 6.x-dev » 5.1
Assigned: David Strauss » macm
Category: task » support
Priority: Critical » Normal
Status: Needs work » Active

I have tabs created by Jquery in tpl pages.

I need print tracker_page, "Recent posts of a user" or "My recent posts" into a tab. How can I do this?

I tried a lot of things but always fail.

So my tab is in /mylanguage/user/1#mytracker

There is a easy way?

Could I get and render the path 'tracker/' .arg(1) --- into a tab(/mylanguage/user/1#mytracker)?

Folks sorry if here isnt appropriate place but I cant find best and this node was in tracker.module.

Fell free to delete with Drupal Webmaster.

I tried adapt this function in my page, but is very expensive to do this with my level of knowledge of Drupal's core.

Anyway thanks cores you are amazing.

Regards

dww’s picture

Title: Custom Tracker page » Tracker performance improvements
Version: 5.1 » 6.x-dev
Assigned: macm » dww
Category: support » task
Priority: Normal » Critical
Status: Active » Needs work

@macm: Please don't hijack issues. This issue is about fixing the performance of the tracker page. Yours is a completely separate support request. So, file your request separately and leave this issue to address the underlying performance problems.

dww’s picture

Assigned: dww » Unassigned

This is David's issue now, not mine, but I can't assign it directly to him using the UI (yet), so I'm just unassigning myself...

David Strauss’s picture

Assigned: Unassigned » David Strauss
Status: Needs work » Closed (fixed)

This patch is superseded by http://drupal.org/node/148849.

kbahey’s picture

Status: Closed (fixed) » Needs review
FileSize
3.37 KB

So, I consulted the folks from MySQLPerformanceBlog and they generously investigated the issue for us free of charge, because we are an Open Source project (thanks Peter, Vadim and Maciej).

They gave a detailed analysis of the queries, and their end advice was to rewrite them using unions.

Attached is a patch to tracker.module for 5.2, for testing on drupal.org.

Here are the results of testing on my development server, restarting MySQL between tests to nullify the effect of the query cache. The tests were done using 10,000 nodes and 10,000 comments.

However, I have been getting a lot of variance on this machine lately, so the results need to be replicated somewhere, preferably directly on d.o (what do we have to lose?).

The schema is the standard Drupal 5.2 schema, so if you have any changes to that, roll them back first.

Before the patch

This is the performance of this query without the MySQL query cache, using the existing 5.2 code.

Executed 75 queries in 1640.97 milliseconds. Page execution time was 1734.76 ms.
ms	#	where	query

874.04	1	pager_query	SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid = l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 1 OR c.uid = 1) ORDER BY last_updated DESC LIMIT 0, 25

296.64	1	pager_query	SELECT COUNT(DISTINCT(n.nid)) FROM node n LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 1 OR c.uid = 1)

With patch

Two variants are tested for this patch.

Variant 1: Using the original count query

This variant rewrites the query using a UNION with an ORDER BY, but leaves $sql_count as it is. It proved faster on the test data set described above, which is different from d.o (no users, and far fewer rows), but that may not be the case for d.o due to volume and cardinality.

Executed 75 queries in 586.61 milliseconds. Page execution time was 666.62 ms.
ms	#	where	query

363.42	1	pager_query	(SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN node n ON n.uid = u.uid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND n.uid = 1) UNION (SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1) ORDER BY last_updated DESC LIMIT 0, 25

146.03	1	pager_query	SELECT COUNT(DISTINCT(n.nid)) FROM node n LEFT JOIN comments c ON n.nid = c.nid AND (c.status = 1 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 0 OR c.uid = 1)

Variant 2: Using a union in the count query

This uses a UNION for both queries. That was slower on my test server, but may be faster on d.o.

Executed 75 queries in 851.03 milliseconds. Page execution time was 997.75 ms.
ms	#	where	query

446.17	1	pager_query	(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN node n ON n.uid = u.uid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND n.uid = 1) UNION (SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1) ORDER BY last_updated DESC LIMIT 0, 25

338.77	1	pager_query	SELECT SUM(cnt) FROM ((SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM users u JOIN node n ON n.uid = u.uid WHERE n.status = 1 AND u.uid = 1) UNION (SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid WHERE c.status = 0 AND n.status = 1 AND u.uid = 1)) AS sum_cnt
David Strauss’s picture

While those changes win points for less invasiveness, they will never approach the performance of the rewrites I've suggested (and largely finished in my NCS/node merge patch). It seems better than what the base Drupal code has now, but I guarantee it will be far slower than the current patched tracker on Drupal.org.

kbahey’s picture

David

I recommended to Dries that we disable tracker altogether if it is the only source of contention in the database. At least issue submission, browsing and the rest of the site will be fast, rather than the continued pain of what d.o is today.

Barring that, this patch is not a replacement for whatever schema changes you are doing which are needed to improve d.o overall. But I see those are long term.

This is (at least) a stop gap until we get d.o on D6 or D7 (whichever one has the schema changes). If it proves to be worth keeping, then so be it.

Worth testing at least to see if they are more functional than what is present at the moment on d.o (which does not give people what they are looking for).

Here is the analysis done on a copy of drupal.org by Maciej.

** 1. **
SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM node n INNER JOIN node_comment_statistics l ON n.nid =
l.nid INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid =
c.nid AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid =
1 OR c.uid = 1) ORDER BY last_updated DESC
********
EXPLAIN
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: ALL
possible_keys: PRIMARY,node_status_type,nid,uid
key: NULL
key_len: NULL
ref: NULL
rows: 140437
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: l
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.nid
rows: 1
Extra:
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: u
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.uid
rows: 1
Extra: Using where
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: lid
key: lid
key_len: 4
ref: drupal_export.n.nid
rows: 4
Extra: Using where; Distinct
********
As you can see, the main problem with this query is that it does a full table scan on `node`. This is because you tied together conditions from two different tables in the WHERE clause (n.uid = 1 OR c.uid = 1); also, the index that contains `status` cannot be used because almost all rows have this column value equal to 1. This is why the engine cannot filter out any rows from `node` before doing the join. What you can do is split the two OR-ed conditions into a UNION of two SELECTs.

(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM users u JOIN node n ON n.uid = u.uid JOIN
node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND n.uid = 1)

UNION
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name,
GREATEST(n.changed, l.last_comment_timestamp) AS last_updated,
l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON
n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status
= 1 AND c.status = 0 AND c.uid = 1)
ORDER BY last_updated DESC;

This query reads all the data efficiently through indexes and then sorts the final result set with a filesort. If the returned data can fit into the in-memory temporary table, whose size is limited by the tmp_table_size and max_heap_table_size MySQL parameters, then this query should run decently fast. On my low-end test server the original and rewritten queries took a few minutes and 0.45s respectively (rows examined: 644642 vs 15229). 15k rows is still a rather large number for a busy on-line application, but I guess this cannot be avoided without some changes to the schema.

** 2. **
SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM node
n INNER JOIN users u ON n.uid = u.uid LEFT JOIN comments c ON n.nid = c.nid
AND (c.status = 0 OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = 9575
OR c.uid = 9575) ORDER BY n.changed DESC
********
EXPLAIN
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: n
type: ALL
possible_keys: node_status_type,uid
key: NULL
key_len: NULL
ref: NULL
rows: 140437
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: u
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: drupal_export.n.uid
rows: 1
Extra: Using where
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: lid
key: lid
key_len: 4
ref: drupal_export.n.nid
rows: 4
Extra: Using where; Distinct
********
Again the same problem. So I rewrote the query to:

(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM
users u JOIN node n ON n.uid = u.uid WHERE n.status = 1 AND u.uid = 1) UNION
(SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name FROM
users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid WHERE
c.status = 0 AND n.status = 1 AND u.uid = 1) ORDER BY changed DESC;

Statistics are:
Time: 0.58s vs 1 minute
Examined rows: 11522 vs 503170

Gerhard Killesreiter’s picture

@David: Your rewrite will be great for Drupal 6, but we are still in D5 land and I need a solution "now". Your patch that is running on d.o isn't really helping since the results it provides aren't that great. As soon as we've resolved our performance issues I'll want to revert the patch and have something that provides the original result set.

catch’s picture

@Killes: is drupal.org running the patch from #126? iirc it's running an earlier one, and #126 provides a better result set.

Both the UNION and the long-term schema changes are obviously better than what we have now; the UNION would at least help until the schema changes are sorted.

dww’s picture

d.o is still running the patch from #69. Seems like we should use #126, but that includes the schema change for last_comment_timestamp in {comments} (not just some new indexes). So, I'd need to coordinate with killes and make sure he's happy with this, do a DB dump, etc, etc. David: anything else we should be aware of when moving from #69 to #126? I don't have time right now to carefully study every detail of both patches. Thanks.

cog.rusty’s picture

I am getting my user name as the author of every node I have commented on.
When hovering, the uid under the name is the correct node author.
Could this patch be responsible? I don't understand much but here's a theory:

.... UNION (SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM users u JOIN comments c ON c.uid = u.uid JOIN node n ON n.nid = c.nid JOIN node_comment_statistics l ON l.nid = n.nid WHERE n.status = 1 AND c.status = 0 AND c.uid = 1)....

n.uid -- found from the node table
u.name -- found not via the node table but via the comments table with "FROM users u JOIN comments c ON c.uid = u.uid"

Previously it was found using "n.nid = c.nid", while "c.uid" was only in the WHERE clause, not in the JOINs.

cog.rusty’s picture

The above was about the /user/[uid]/track page.
The /tracker/user page displays "anonymous" as the author and has more problems.

kbahey’s picture

FileSize
3.88 KB

This version contains a fix for the wrong user name displayed.

Gerhard, please install on d.o.

dww’s picture

Reviewed #152, backed out #144 from d.o and applied #152.
Confirmed that the user names are now correct in the tracker page.

Ron Williams’s picture

On the user tracker page, clicking the "last" button appears to take users to page "59" unless they have more than 59 pages of postings.

Dries’s picture

It would be great if we could test the performance improvements of this patch in a better way. Applying it on drupal.org is one thing, but it doesn't tell us whether it is actually faster. We've made a dozen other configuration changes to Drupal.org over the past couple of days. A more structured approach to testing this patch -- once the patch is in its final state -- would help convince me to commit it when it's ready.

This should be fairly easy to do: just run the queries 20 times from the mysql command line and average the result.

dww’s picture

@Dries: I agree. I just want to point out that we're only applying these patches on d.o to stop the bleeding. And, there have been some mysql numbers posted above in the issue (see #75, #77, #80, #144, #146). Killes decided to move d.o to #144 -- I only applied #152 because #144 was buggy and needed a fix.

Given the explain results on #75-#80 vs. #144 and #146, I don't see why Killes switched d.o. Yes, the tracker results weren't ideal for the earlier patch, but the performance was orders of magnitude better. #152 seems like a minor improvement over what's in core already, which requires no schema changes. #126 would be a significant improvement, but requires a schema change. Killes doesn't want to make that change on d.o unless we're sure that schema change will be going into core for D6...

cog.rusty’s picture

I am getting 40 pages of nodes and 21 pages of "No posts available" in my tracker. $sql_count does not seem to agree with the rows returned.

kbahey’s picture

Here is another version that tries to correct the pager issue.

Note that the code has not changed for /tracker. The changes are effective only when you are using /tracker/UID.

Index: tracker.module
===================================================================
RCS file: /cvs/drupal/drupal/modules/tracker/tracker.module,v
retrieving revision 1.143.2.2
diff -u -F^f -r1.143.2.2 tracker.module
--- tracker.module  26 Jul 2007 19:16:50 -0000  1.143.2.2
+++ tracker.module  4 Aug 2007 09:20:56 -0000
@@ -78,11 +78,22 @@ function tracker_page($uid = 0) {

   // TODO: These queries are very expensive, see http://drupal.org/node/105639
   if ($uid) {
-    $sql = 'SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM {node} n INNER JOIN {node_comment_statistics} l ON n.nid = l.nid INNER JOIN {users} u ON n.uid = u.uid LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d) ORDER BY last_updated DESC';
+    // This is the original slow query that kills drupal.org
+    //$sql = 'SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM {node} n INNER JOIN {node_comment_statistics} l ON n.nid = l.nid INNER JOIN {users} u ON n.uid = u.uid LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d) ORDER BY last_updated DESC';
+
+    // This is a newer variant with UNION
+    $sql = '(SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM {users} u JOIN {node} n ON n.uid = u.uid JOIN {node_comment_statistics} l ON l.nid = n.nid WHERE n.status = 1 AND n.uid = %d) UNION (SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid JOIN {node_comment_statistics} l ON l.nid = n.nid WHERE n.status = 1 AND c.status = %d AND c.uid = %d) ORDER BY last_updated DESC';
     $sql = db_rewrite_sql($sql);
-    $sql_count = 'SELECT COUNT(DISTINCT(n.nid)) FROM {node} n LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d)';
+
+    // Variant 1: This is the original count query
+    //$sql_count = 'SELECT COUNT(DISTINCT(n.nid)) FROM {node} n LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d)';
+
+    // Variant 2: This is a UNION variant of the above query. Slower on 10,000 nodes,
+    // but needs to be tested on drupal.org to see how it behaves with the volume and cardinality there
+    $sql_count = 'SELECT SUM(cnt) FROM ((SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {node} n ON n.uid = u.uid WHERE n.status = 1 AND u.uid = %d) UNION (SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid WHERE c.status = %d AND n.status = 1 AND u.uid = %d)) AS sum_cnt';
+
     $sql_count = db_rewrite_sql($sql_count);
-    $result = pager_query($sql, 25, 0, $sql_count, COMMENT_PUBLISHED, $uid, $uid);
kbahey’s picture

FileSize
4.07 KB

Attached.

catch’s picture

So the API changes aren't going to get into 6.x now, but what about the UNION query?

David Strauss’s picture

@catch I support using the UNION query in lieu of the more fundamental changes that are going to have to wait until Drupal 7.

catch’s picture

Status: Needs review » Needs work

No longer applies - split patch must have gone in since this was rolled.

catch’s picture

FileSize
4.08 KB

Re-roll of kbahey's patch from #159 against HEAD. Didn't change anything I hope except some minor punctuation in code comments.

catch’s picture

Status: Needs work » Needs review

Very quick review.

Patch applies cleanly, tracker works as expected in both /tracker and user/1/track situations - so no real change in functionality.

Still needs some proper benchmarking for the actual performance implications which I'm not really set up to do. Marking as needs review.

dmitrig01’s picture

I benchmarked it with lotsa nodes (forgot how many)... here are the results

Without

ab -n1000 -c10 http://localhost:8888/drupal-head/tracker
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests


Server Software:        Apache/2.0.59
Server Hostname:        localhost
Server Port:            8888

Document Path:          /drupal-head/tracker
Document Length:        10168 bytes

Concurrency Level:      10
Time taken for tests:   144.553750 seconds
Complete requests:      1000
Failed requests:        166
   (Connect: 0, Length: 166, Exceptions: 0)
Write errors:           0
Total transferred:      10685077 bytes
HTML transferred:       10162077 bytes
Requests per second:    6.92 [#/sec] (mean)
Time per request:       1445.538 [ms] (mean)
Time per request:       144.554 [ms] (mean, across all concurrent requests)
Transfer rate:          72.18 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   673 1440 225.6   1398    2350
Waiting:      663 1432 225.8   1389    2343
Total:        673 1440 225.6   1398    2350

Percentage of the requests served within a certain time (ms)
  50%   1398
  66%   1495
  75%   1571
  80%   1633
  90%   1792
  95%   1875
  98%   1938
  99%   1986
 100%   2350 (longest request)

With

ab -n1000 -c10 http://localhost:8888/drupal-head/tracker
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests


Server Software:        Apache/2.0.59
Server Hostname:        localhost
Server Port:            8888

Document Path:          /drupal-head/tracker
Document Length:        10168 bytes

Concurrency Level:      10
Time taken for tests:   131.945790 seconds
Complete requests:      1000
Failed requests:        194
   (Connect: 0, Length: 194, Exceptions: 0)
Write errors:           0
Total transferred:      10684567 bytes
HTML transferred:       10161567 bytes
Requests per second:    7.58 [#/sec] (mean)
Time per request:       1319.458 [ms] (mean)
Time per request:       131.946 [ms] (mean, across all concurrent requests)
Transfer rate:          79.08 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:   270 1313 196.5   1281    2163
Waiting:      262 1305 196.4   1274    2155
Total:        270 1313 196.5   1281    2163

Percentage of the requests served within a certain time (ms)
  50%   1281
  66%   1323
  75%   1358
  80%   1397
  90%   1534
  95%   1735
  98%   1944
  99%   2039
 100%   2163 (longest request)

Not much faster, but there is at least some speed gained.

joshk’s picture

This applies cleanly and causes no errors, but my benchmark results aren't quite as expected.

I used devel_generate to add about 2000 nodes, and also ran an ab series with higher concurrency to see if that brought out a bigger delta in the times. Unfortunately, I got the opposite of the expected result:

Before

 ab -n1000 -c50 http://localhost/~joshk/drupal-HEAD/?q=tracker
This is ApacheBench, Version 1.3d <$Revision: 1.73 $> apache-1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests
Server Software:        Apache/1.3.33                                      
Server Hostname:        localhost
Server Port:            80

Document Path:          /~joshk/drupal-HEAD/?q=tracker
Document Length:        15130 bytes

Concurrency Level:      50
Time taken for tests:   425.018 seconds
Complete requests:      1000
Failed requests:        645
   (Connect: 0, Length: 645, Exceptions: 0)
Broken pipe errors:     0
Total transferred:      15627773 bytes
HTML transferred:       15138773 bytes
Requests per second:    2.35 [#/sec] (mean)
Time per request:       21250.90 [ms] (mean)
<b>Time per request:       425.02 [ms] (mean, across all concurrent requests)</b>
Transfer rate:          36.77 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.1      0     2
Processing:  1980 20604 1812.9  20505 26773
Waiting:     1978 20604 1812.9  20505 26772
Total:       1980 20604 1812.8  20505 26773

Percentage of the requests served within a certain time (ms)
  50%  20505
  66%  20809
  75%  21041
  80%  21219
  90%  21805
  95%  22437
  98%  25153
  99%  26428
 100%  26773 (last request)

After

ab -n1000 -c50 http://localhost/~joshk/drupal-HEAD/?q=tracker
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests
Server Software:        Apache/1.3.33                                      
Server Hostname:        localhost
Server Port:            80

Document Path:          /~joshk/drupal-HEAD/?q=tracker
Document Length:        15155 bytes

Concurrency Level:      50
Time taken for tests:   520.055 seconds
Complete requests:      1000
Failed requests:        256
   (Connect: 0, Length: 256, Exceptions: 0)
Broken pipe errors:     0
Total transferred:      15636909 bytes
HTML transferred:       15147909 bytes
Requests per second:    1.92 [#/sec] (mean)
Time per request:       26002.75 [ms] (mean)
<b>Time per request:       520.05 [ms] (mean, across all concurrent requests)</b>
Transfer rate:          30.07 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.3      0     9
Processing:  6458 25441 2599.1  24822 33852
Waiting:     6458 25441 2599.1  24822 33852
Total:       6458 25441 2599.0  24822 33852

Percentage of the requests served within a certain time (ms)
  50%  24822
  66%  26500
  75%  27565
  80%  27918
  90%  28907
  95%  29545
  98%  30391
  99%  31080
 100%  33852 (last request)

The same pattern repeated itself, although it wasn't as pronounced, at a -c10 level. Could be my system, but...

catch’s picture

FileSize
4.09 KB

OK try this baby out. Modified to use UNION ALL instead of plain UNION. UNION ALL is supported by pgsql, so that shouldn't be an issue.

The MySQL performance blog explains why this helps - it could be as much as six times faster.

No change in functionality at my end, but again needs benchmarking which I'm not set up for.
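For reference, the only change is the set operator between the two branches: UNION has to de-duplicate the combined result (usually via a temporary table), while UNION ALL just appends the two result sets. A sketch using the same two branches as the patch (columns trimmed):

(SELECT n.nid, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated /* ... */
 FROM users u JOIN node n ON n.uid = u.uid
 JOIN node_comment_statistics l ON l.nid = n.nid
 WHERE n.status = 1 AND n.uid = %d)
UNION ALL
(SELECT n.nid, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated /* ... */
 FROM users u JOIN comments c ON c.uid = u.uid
 JOIN node n ON n.nid = c.nid
 JOIN node_comment_statistics l ON l.nid = n.nid
 WHERE n.status = 1 AND c.status = %d AND c.uid = %d)
ORDER BY last_updated DESC;
-- Trade-off: UNION ALL no longer removes rows that appear in both branches, so nodes the
-- user both authored and commented on have to be excluded from one branch explicitly.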

Gerhard Killesreiter’s picture

Status: Needs review » Needs work

This patch has an error in the count query. In the second part you need to exclude the nodes that the user authored, otherwise you'll count them twice.

catch’s picture

Status: Needs work » Needs review
FileSize
4.11 KB

Well I'm just re-rolling and this is hardly my strong point, but here's a stab at it:

Changed:

SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid WHERE c.status = %d AND n.status = 1 AND u.uid = %d

to this?

SELECT COUNT(DISTINCT(n.nid)) AS cnt FROM {users} u JOIN {comments} c ON c.uid = u.uid JOIN {node} n ON n.nid = c.nid WHERE c.status = %d AND n.status = 1 AND u.uid = %d AND n.uid <> c.uid

Patch just in case I'm not barking up the wrong tree.

cosmicdreams’s picture

From my reading of that altered query, it appears you are excluding the nodes that the author commented on (as well as the comments from that node that were authored by that user)

If you wanted to exclude the records from nodes that the user has authored, you should have this in your WHERE clause:

AND n.uid <> u.uid
catch’s picture

FileSize
4.11 KB

OK re-roll with that change:

Zothos’s picture

I would prefer using the != operator. I know they are the same, but != has greater readability :P

catch’s picture

Zothos, I dimly remember a recent development list discussion where it was decided to standardise on <>, so I'll leave it like that for now. This could really do with benchmarks on the UNION ALL change.

catch’s picture

Status: Needs review » Needs work

no longer applies :(

Gábor Hojtsy’s picture

Category: task » bug

There are tab issues in the patch too.

Also moving to the bugs queue, where D6 blockers are waiting in line.

catch’s picture

OK I tried to re-roll this again, but because the if ($uid) check has been taken out of tracker.pages.inc somewhere along the line, you get an undefined variable on the tracker/1 page.

So this needs some love to set the $uid somewhere useful, and it still needs review and benchmarking after that. I don't think I'll have time to do that today so it'd be great if someone could take it on.

catch’s picture

Status: Needs work » Needs review
FileSize
4.36 KB

here's a re-roll, it works.

Still needs review/benchmarks.

catch’s picture

Version: 6.x-dev » 7.x-dev
Status: Needs review » Needs work
FileSize
472 bytes

OK well my re-roll was no good, but I fixed that in my testing install and did some very primitive profiling with the devel module. Overall the main /tracker was slower by about 30-40%, and although /tracker/4 was quicker, it gave slightly different results from the current core query (about 10 nodes more in a 12-page pager).

So I'm going to bump this to D7 since there doesn't appear to be a satisfactory solution without the major schema changes (node activity counter etc.) proposed earlier in this issue. I've attached the devel results for reference.

David Strauss’s picture

Implemented in Contrib for Drupal 5: http://drupal.org/project/tracker2

Drupal 6 and 7 should only require simplification from the Drupal 5 version.

chipk’s picture

FWIW - our use of tracker only needs visibility into nodes, not comments. As an interim fix, we modified 'tracker_page()' (i.e. the SQL in both $sql and $sql_count, as well as the call to 'pager_query()') to eliminate the join and logic related to the 'node_comment_statistics' table. On a deployment of 160k+ nodes, one typical benchmark query shrank from 2680ms to 15ms. Code changes as follows:

  // TODO: These queries are very expensive, see http://drupal.org/node/105639
  if ($uid) {
    //$sql = 'SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, GREATEST(n.changed, l.last_comment_timestamp) AS last_updated, l.comment_count FROM {node} n INNER JOIN {node_comment_statistics} l ON n.nid = l.nid INNER JOIN {users} u ON n.uid = u.uid LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d) ORDER BY last_updated DESC';
    $sql = 'SELECT DISTINCT(n.nid), n.title, n.type, n.changed, n.uid, u.name, n.changed AS last_updated FROM {node} n INNER JOIN {users} u ON n.uid = u.uid WHERE n.status = 1 AND n.uid = %d ORDER BY last_updated DESC';
    $sql = db_rewrite_sql($sql);
    //$sql_count = 'SELECT COUNT(DISTINCT(n.nid)) FROM {node} n LEFT JOIN {comments} c ON n.nid = c.nid AND (c.status = %d OR c.status IS NULL) WHERE n.status = 1 AND (n.uid = %d OR c.uid = %d)';
    $sql_count = 'SELECT COUNT(DISTINCT(n.nid)) FROM {node} n WHERE n.status = 1 AND n.uid = %d';
    $sql_count = db_rewrite_sql($sql_count);
    //$result = pager_query($sql, 25, 0, $sql_count, COMMENT_PUBLISHED, $uid, $uid);
    $result = pager_query($sql, 25, 0, $sql_count, $uid);
  }
David Strauss’s picture

@chipk: Your implementation may work for your situation, but it undermines a large portion of what Tracker is for, which is tracking comments.

apotek’s picture

Title: Tracker performance improvements » Using UNION

Using UNION, we get 2 temporary tables created and have to run filesort twice. Not sure how this would improve things. It takes a heck of a lot longer.

Michelle’s picture

Title: Using UNION » Tracker performance improvements

Putting the title back. "Using UNION" is not a good title for an issue.

Michelle

catch’s picture

Latest approach to this is http://drupal.org/project/tracker2 - this needs a port to Drupal 6, then a port to Drupal 7, then conversion into a patch.

Gerhard Killesreiter’s picture

Considering everything, the tracker query has been proven to not really be salvageable. This issue should thus be marked "won't fix".

Porting tracker2 to core might be an option. I have discussed this with Angie and she did say that this is not a priority for her. I think it is fair to say that you shouldn't use the core tracker for a large site. We might put a warning into the docs and maybe include a link to tracker2.

catch’s picture

We should either have tracker2 in core, or remove tracker from core and rename tracker2 to tracker in contrib. Having core modules limping along with site-killing queries release after release doesn't help anyone.

killes@www.drop.org’s picture

Is the query really site-killing for your average Drupal site? I doubt it.

dww’s picture

I don't understand why we don't just put tracker2 in core. Even if it's not "a priority" for webchick doesn't mean it can't happen. A brief skim of the code looks like there's nothing that fundamentally evil about tracker2, and if anything, it'd be better if it were in core, instead of modifying core's DB schema from the outside. ;) Maybe I didn't fully grasp something about how tracker2 works, but it seems like a case where a little DB de-normalization goes a long way. Why not put it in core?

Damien Tournoud’s picture

@dww: I think everyone agrees that the best solution is to bring tracker2 into core. The only remaining question is: who will take the lead on that?

jrabeemer’s picture

+1 for tracker2 in core. I was completely unaware of the slow performance of the tracker module in core.

sub

kbahey’s picture

I can work on porting it to Drupal 7, but only if Angie or Dries say that it would go in 7.x, and not just wasted effort.

jrabeemer’s picture

No work is wasted effort. Worst case, it gets delayed to D8. If you can demonstrate good performance gains, it will most likely go in core.

battochir’s picture

subscribing

swentel’s picture

FileSize
19.55 KB

Ok, attached is a patch which ports tracker2 to D7. I added it with the .txt extension because I don't want the testbot to run it.
Weird thing is, I did some benchmarks with 50000 nodes and noticed no real performance difference between tracker 1 & 2 at this point. So basically, try this out, see if it makes any difference and we can discuss further if this is really so critical to fix.

battochir’s picture

Quick question. Is there a working port of tracker2 to D6 anywhere? It's awesome that tracker2 is getting so much love for D7. Somehow, I thought D6 was important. From what I see, many people have not upgraded to D6 from D5 because many modules have not been updated yet or they are waiting for D7 because the update to D6 is so painful. D6 is a big leap from D5. Anyway, D6 tracker still has those expensive queries that hurt big-traffic sites.

edit: Sorry. I just saw that this thread is for D7. I didn't mean to hijack the thread.

Michelle’s picture

battochir - You want this issue: #294091: Port tracker2 to 6.x

Michelle

battochir’s picture

Thank you Michelle. That's what I wanted to know much appreciated.

cheers,

wim

catch’s picture

Status: Needs work » Needs review
FileSize
19.28 KB

Seems like tracker tests actually pass with this patch. Moving to needs review for a general look at it (re-rolled for some whitespace fixes, needs some comment/variable name tidying and dbtng conversion still).

dww’s picture

Status: Needs review » Needs work

Very quick skim revealed this:

tracker_track_user() is wrong. hook_menu() is already auto-loading the user object, but then tracker_track_user() tries to reload the user object again using arg(). tracker_track_user() isn't properly inspecting either of the arguments passed into it from the menu system, in fact.
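
For illustration, a minimal sketch of the direction described here (the callback and parameter names are illustrative, not the final patch): the %user wildcard in the router path already loads the account, so the page callback can simply accept it as a parameter and never needs arg() or a second user_load().

function tracker_page($account = NULL) {
  if (isset($account)) {
    // Per-user listing for the menu-loaded account.
  }
  else {
    // Site-wide listing.
  }
}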

dww’s picture

Oh, and those should be return drupal_access_denied();, etc...

swentel’s picture

FileSize
19.09 KB

Still needs work, but this gets rid of the tracker_track_user() function; we don't need that at all. One thing that doesn't work yet is the 'my recent posts' tab. I simply can't get it to work unless I change the user_uid_optional_load() function. Not sure if I'm doing anything wrong, or if this is a bug in the load function; I'll try to talk to peter or chx tomorrow. Note that this also doesn't work in the current version of tracker, and D6 suffers from it as well.

// If I change isset to !empty, then the 'my recent posts' tab works.
function user_uid_optional_load($arg) {
  return user_load(isset($arg) ? $arg : $GLOBALS['user']->uid);
}
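
For reference, the variant described above would look like this (illustrative only; the real fix belongs in the separate issue mentioned below):

function user_uid_optional_load($arg) {
  // !empty() also treats an empty string from the menu system as "not set",
  // so the tab falls back to the current user.
  return user_load(!empty($arg) ? $arg : $GLOBALS['user']->uid);
}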

Edit: that bug is known, so waiting for #298722 to get in

catch’s picture

David Strauss’s picture

I'll take a look at the "my recent posts" problem.

apotek’s picture

@Michelle (#183)
Yes, sorry about that. I'm still not used to the way these issue trackers work. I keep thinking the "Issue title" is the subject of my post, rather than a way to modify the title of the *entire thread*. It's an odd implementation when any logged-in user can do that. My mistake, and certainly not intentional.

catch’s picture

Status: Needs work » Needs review
FileSize
19.18 KB

Here's a re-roll. The bug swentel found is already well documented in HEAD, so there's no reason to hold this up.

It's possible some of this might be centralised in materialized views if/when it gets committed - but might as well keep this one live - we can't ship yet another release with this query.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
13.35 KB

hmm

David Strauss’s picture

This seems to look OK, but I can't RTBC what is mostly my own code.

catch’s picture

FileSize
18.32 KB

The patch was missing the tracker.pages.inc changes. Also, this all needs converting to DBTNG. Here are benchmarks:

ab -c1 -n500

HEAD:
/tracker: 6.15 [#/sec]
/uid/track: 1.16 [#/sec]

Patch:
/tracker: 12.75 [#/sec]
/uid/track: 12.86 [#/sec]

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
19.52 KB

OK, this one updates 95% of the patch to DBTNG. Gabor has indicated he might want a backport of this in Drupal 6, so let's get this one in.

A few todos and things needing discussion (and leaving at CNR for the test bot if nothing else).

I don't think we need a UI for the indexing variable - or if we do, it should be in admin/settings/performance instead of a tiny form by itself at admin/settings/tracker

There are a few places where it looks like we could replace some update/select/insert logic with db_merge().
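
For example, a rough sketch (the table and column names are assumed from the tracker2 schema and may not match the final patch):

db_merge('tracker_node')
  ->key(array('nid' => $node->nid))
  ->fields(array(
    'published' => $node->status,
    'changed' => $node->changed,
  ))
  ->execute();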

A couple of places need to use db_select() because they're wrapped in db_rewrite_sql(); I didn't get to those yet.
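
Roughly, those would become dynamic selects carrying the node_access tag so they can still be altered (a sketch only, not the exact query from the patch):

$query = db_select('node', 'n')->extend('PagerDefault');
$query->addTag('node_access');
$query->fields('n', array('nid', 'title', 'changed'));
$query->condition('n.status', 1);
$query->limit(25);
$result = $query->execute();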

There's also one INSERT INTO ... SELECT FROM which, as far as I know, isn't currently handled by db_insert() at all - so while that shouldn't hold this patch up, it'd be good to open another issue so we can add it.

David Strauss’s picture

I don't see a reason to "backport" this to Drupal 6. It should probably remain the separate Tracker 2 module.

catch’s picture

I didn't think it was on the table - but #362151: Test and compare tracker2, views, Drupal 6 tracker suggests otherwise. If we can do it with no UI or API changes, then there's no particular reason not to simply cp contributions/modules/tracker2 modules/tracker - then we no longer have a pointless slow tracker module in Drupal 6 core.

We'd need to co-ordinate the install/upgrade path, of course, for people upgrading from D6 to D7.

Status: Needs review » Needs work

The last submitted patch failed testing.

lilou’s picture

Issue tags: +Performance

Add tag.

Berdir’s picture

Status: Needs work » Needs review
FileSize
21.82 KB

OK, I worked a bit on this one.

  1. Converted the SELECT queries to DBTNG. I am not sure about a few things - for example, why do we need to query {node} for the count query? Wouldn't that actually report wrong numbers if the tracker_node table does not yet contain all nodes?
  2. DBTNG has currently no direct support for "INSERT INTO () SELECT ... ". Imho, there are 3 options
    - Execute it with db_query and simply replace %d and so on with named parameters
    - Do a select and then use multi-insert to insert the values... (That is what I did in the patch)
    - Implement support for something like db_insert('tracker_node')->from($query)
  3. Used db_merge() where possible
  4. Renamed hook_nodeapi_$op() to hook_node_$op()
  5. Moved hook_comment() to hook_comment_$op()
  6. Added an update function (untested).

There are still some things to improve:
- tracker_batch_node_alter(): at least the $op = 'delete' handling is IMHO unnecessary - won't that try to delete the node twice?
- _tracker_add() and _tracker_remove() query the {node} table even when not necessary (there's no need to fetch information we already have in hook_node_*).
- More tests, for the cron handling, comment handling, and so on.

killes@www.drop.org’s picture

"# DBTNG has currently no direct support for "INSERT INTO () SELECT ... ". Imho, there are 3 options
- Execute it with db_query and simply replace %d and so on with named parameters
- Do a select and then use multi-insert to insert the values... (That is what I did in the patch)
- Implement support for something like db_insert('tracker_node')->from($query) "

IMO DBTNG should support INSERT INTO () SELECT, otherwise I'd consider it incomplete.

Failing this, I'd use "Execute it with db_query and simply replace %d and so on with named parameters".

The nice thing about "INSERT INTO ... SELECT" is that everything is done inside MySQL and PHP does not need to bother with it.

catch’s picture

Looks like we'll get INSERT INTO () SELECT via #481288: Add support for INSERT INTO ... SELECT FROM ...
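
Once that lands, the conversion should look roughly like this (a sketch; the destination columns are assumptions based on the tracker2 schema):

$max_nid = variable_get('tracker_index_nid', 0);

$query = db_select('node', 'n');
$query->addField('n', 'nid');
$query->addField('n', 'status', 'published');
$query->addField('n', 'changed');
$query->condition('n.nid', $max_nid, '<=');

db_insert('tracker_node')
  ->from($query)
  ->execute();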

killes@www.drop.org’s picture

Status: Needs review » Needs work

Yes, great, we should wait until that has landed and then rework this patch to use it.

Berdir’s picture

Status: Needs work » Needs review
FileSize
24.97 KB

Updated the patch, note that it will fail without the patch in #481288: Add support for INSERT INTO ... SELECT FROM .... As soon as that one is in, this should pass the tests.

What I did:
- Converted the insert query to INSERT INTO ... SELECT.
- Fixed the admin/content/node form alter stuff for publish/unpublish; I'm pretty sure that never worked ;)
- Updated the update function to set the tracker_index_nid variable
- Added tests for cron indexing and admin/content/node unpublishing.

Status: Needs review » Needs work

The last submitted patch failed testing.

Berdir’s picture

Status: Needs work » Needs review

Let's see if this needs a re-roll, the tests should now pass.

Status: Needs review » Needs work

The last submitted patch failed testing.

Berdir’s picture

Status: Needs work » Needs review
FileSize
25.18 KB

Ok well, re-roll...

Status: Needs review » Needs work

The last submitted patch failed testing.

Berdir’s picture

Status: Needs work » Needs review
FileSize
25.18 KB

Wrong path in the tests...

Status: Needs review » Needs work

The last submitted patch failed testing.

Berdir’s picture

Status: Needs work » Needs review
FileSize
25.17 KB

Another path change... re-rolled.

Berdir’s picture

FileSize
24.95 KB

I can't see a reason why the special count query on {node} is necessary, and it gives false results during indexing... simply removing it seems to work just fine.

catch’s picture

1000 nodes, anonymous user, no page caching:

HEAD:

Concurrency Level:      1
Time taken for tests:   58.802 seconds
Complete requests:      500
Failed requests:        0
Write errors:           0
Total transferred:      6383000 bytes
HTML transferred:       6150500 bytes
Requests per second:    8.50 [#/sec] (mean)
Time per request:       117.605 [ms] (mean)
Time per request:       117.605 [ms] (mean, across all concurrent requests)
Transfer rate:          106.01 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   108  118  10.8    114     181
Waiting:      104  114  10.6    110     175
Total:        108  118  10.8    114     181

Patch:

Concurrency Level:      1
Time taken for tests:   49.343 seconds
Complete requests:      500
Failed requests:        0
Write errors:           0
Total transferred:      6383000 bytes
HTML transferred:       6150500 bytes
Requests per second:    10.13 [#/sec] (mean)
Time per request:       98.687 [ms] (mean)
Time per request:       98.687 [ms] (mean, across all concurrent requests)
Transfer rate:          126.33 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    93   99   6.4     97     149
Waiting:       89   95   6.2     93     143
Total:         94   99   6.4     97     149

This doesn't really need benchmarking though since the same code is responsible for Drupal.org being up at the moment instead of down.

Quick nitpick review on the patch:

+    drupal_set_message(t('Tracker will index from node %nid downward.', array('%nid' => $max_nid)));

I don't think we need this. However, maybe indexing one batch of nodes on enable would help to avoid WTFs?

+ * Create new tables.

Should say "Create the new tracker_node and tracker_user tables.

+        'description' => 'The id of the node.',

ID should be uppercase.

+      '#markup' => t('Max node ID for indexing on the next cron run: @max', array('@max' => $max_nid)),
+    );
+  }
+  else {
+    $form['max_nid'] = array(
+      '#markup' => t('Existing nodes have finished tracker indexing.'),

"All content has finished tracker indexing". Not sure what to do about Max node ID here, this is why I hate content/node/post inconsistency.

+function tracker_batch_node_alter($form, &$form_state) {

No phpdoc.

+ * Implement hook_nodeapi_insert().

Needs updating to hook_node_insert(); the function itself is right, though.

+function _tracker_add($nid, $uid, $changed) {

No phpdoc, same for other private functions here.

+   // Add a comment to the last node as other user.

Indentation is off.

catch’s picture

FileSize
20.24 KB

Implemented my own review.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
25.81 KB

Forgot -N.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
25.45 KB

re-rolled.

catch’s picture

FileSize
25.98 KB

Minus a syntax error.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
5.57 KB

Can't reproduce locally, trying again.

catch’s picture

FileSize
25.98 KB

doh. sorry.

catch’s picture

FileSize
25.88 KB

Removing a pointless drupal_set_message() from tracker.install

pwolanin’s picture

It's unclear to me what tracker_cron is doing - is this only supposed to run once per node to initially populate the table - i.e. instead of trying to do a large batch operation when the module is installed?

David Strauss’s picture

@pwolanin
"is this only supposed to run once per node to initially populate the table - i.e. instead of trying to do a large batch operation when the module is installed?"

Yes. Large sites can't build all trackers all at once.
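
In other words, each cron run indexes one slice of pre-existing nodes, working downward from the highest nid, and records where it stopped. A sketch (variable and helper names follow the patch but may differ in detail; comment handling is omitted):

function tracker_cron() {
  $max_nid = variable_get('tracker_index_nid', 0);
  if ($max_nid > 0) {
    $batch_size = variable_get('tracker_batch_size', 1000);
    $last_nid = 0;
    // Index the next slice of nodes, newest first.
    $result = db_query_range('SELECT nid, uid, changed FROM {node} WHERE nid <= :max ORDER BY nid DESC', 0, $batch_size, array(':max' => $max_nid));
    foreach ($result as $row) {
      _tracker_add($row->nid, $row->uid, $row->changed);
      $last_nid = $row->nid;
    }
    // Remember where to resume; 0 means initial indexing is finished.
    variable_set('tracker_index_nid', $last_nid > 1 ? $last_nid - 1 : 0);
  }
}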

pwolanin’s picture

Status: Needs review » Needs work

OK, then please add some code comments to that effect.

drewish’s picture

These seem pointless:

+  // Create tables.
+  // Remove tables

but I guess there are a couple of other instances of them in core.

The schema descriptions for the tracker_node and tracker_user tables seem way too terse and don't really explain what they do.

The field descriptions also don't seem to match up with user.install and node.install, which say things like:

Primary Key: {users}.uid for user.

And in image.module we were told we had to add foreign key info, à la the following (a sketch for {tracker_node} follows the example):

  $schema['image_effects'] = array(
    'description' => 'Stores configuration options for image effects.',
    'fields' => array(
      'ieid' => array(
        'description' => 'The primary identifier for an image effect.',
        'type' => 'serial',
        'unsigned' => TRUE,
        'not null' => TRUE,
      ),
      'isid' => array(
        'description' => 'The {image_styles}.isid for an image style.',
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'default' => 0,
      ),
// ...
    ),
// ...
    'foreign keys' => array(
      'isid' => array('image_styles' => 'isid'),
    ),
  );
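
Applied to this patch, the {tracker_node} definition might end up roughly like this (a sketch; the nid field follows the patch, the description text is only illustrative):

  $schema['tracker_node'] = array(
    'description' => 'Tracks when nodes were last changed or commented on.',
    'fields' => array(
      'nid' => array(
        'description' => 'The {node}.nid of the node being tracked.',
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'default' => 0,
      ),
// ...
    ),
    'foreign keys' => array(
      'nid' => array('node' => 'nid'),
    ),
    'primary key' => array('nid'),
  );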

Incorrect indenting:

+function tracker_update_7000() {
+    $schema['tracker_node'] = array(

These comments need to be proper sentences and end in periods:

+      // Prepare a starting point for the next run
+      // If all nodes have been indexed, set to zero to skip future cron runs

I feel like this comment is wrong... it's a callback, not a callback argument. Also, wouldn't this be better off in .pages.inc or .admin.inc?

+/**
+ * Menu callback argument. Prints a listing of active nodes on the site.
+ */
+function tracker_admin_settings() {

The setting for this either needs to be a select with a fixed list of values (like search module) or it needs a validate function to ensure that they provide a number.

+  $form['tracker_batch_size'] = array(
+    '#title' => t('Batch size'),
+    '#description' => t('Number of items to index during each cron run.'),
+    '#type' => 'textfield',
+    '#size' => 6,
+    '#maxlength' => 7,
+    '#default_value' => variable_get('tracker_batch_size', 1000),
+  );
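
For instance, a minimal validate callback could look like this (a sketch only; the function name is illustrative):

function tracker_admin_settings_validate($form, &$form_state) {
  $size = $form_state['values']['tracker_batch_size'];
  if (!is_numeric($size) || intval($size) <= 0) {
    form_set_error('tracker_batch_size', t('The batch size must be a positive number.'));
  }
}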

Why is tracker_batch_node_alter() not called tracker_batch_node_submit()?

Not related to this patch specifically but I was trying to figure out if tracker_comment_update() and tracker_comment_publish() were doing the same thing and realized that hook_comment_unpublish() is never getting called. Need to open a new issue for that.

I feel like this comment is pointless and should be dropped, considering it's right before a call to drupal_add_css()... Or barring that, it needs a period:

+  // Add CSS
+  drupal_add_css(drupal_get_path('module', 'tracker') . '/tracker.css', array('preprocess' => FALSE));

The closing curly brace and the else need to be on separate lines:

+  } else {
+    $query = db_select('tracker_node', 't')->extend('PagerDefault');
   }

These comments need to be proper sentences and end in periods:

+    // Now, get the data and put into the placeholder array
+    // Finally display the data

Initially I thought there was duplicate code in this test, then realized it just needs comments explaining that you're checking the user's page first and then the site-wide tracker page.

+    $this->drupalLogin($this->user);
+    $this->drupalGet('tracker/' . $this->user->uid);
+
+    // Assert that all node titles are displayed.
+    foreach ($edits as $i => $edit) {
+      $this->assertText($edit['title'], t('Node @i is displayed on the tracker listing pages.', array('@i' => $i)));
+    }
+    $this->assertText('1 new', t('New comment is counted on the tracker listing pages.'));
+    $this->assertText('updated', t('Node is listed as updated'));
+
+    $this->drupalGet('tracker');
+
+    // Assert that all node titles are displayed.
+    foreach ($edits as $i => $edit) {
+      $this->assertText($edit['title'], t('Node @i is displayed on the tracker listing pages.', array('@i' => $i)));
+    }
+    $this->assertText('1 new', t('New comment is counted on the tracker listing pages.'));

The comment refers to a "comment" when it should refer to a "node".

+    // Unpublish the comment.
+    $edit = array(
+      'operation' => 'unpublish',
+      'nodes[' . $node->nid . ']' => $node->nid,
+    );
+    $this->drupalPost('admin/content', $edit, t('Update'));
drewish’s picture

FileSize
20.79 KB

I started addressing some of the issues I raised but ran out of time. Also removed some junk from tracker_page()—like $header that wasn't used and drupal_add_css() that duplicated the '#attached_css'.

catch’s picture

drewish, thanks for the review and the re-roll. I'm moving continents tomorrow, so I'm unlikely to be able to look at this again before the end of the week; hoping someone else can jump in on the remaining cleanup.

drewish’s picture

Well this has been a nice change from what I've been working on so I'll see if I can't clear up my remaining issues this afternoon.

drewish’s picture

Status: Needs work » Needs review
FileSize
27.33 KB

My last patch was missing the .install file... corrected that and fixed all the schema problems I'd identified. I also did some more cleanup on the settings form. And I fixed a bug in the DBTNG conversion of tracker_cron() where it was supposed to be matching all users other than the author.

For some reason I can't do anything with SimpleTest locally, so I'll get some bot feedback.

One other issue I noticed is that we're double-processing the comments: we catch them on both insert/update and publish. I don't know if it's a huge deal; I just wanted to mention that it seemed a little tacky to me.

catch’s picture

I looked over this; I think we should kill the settings page. While we expose a similar setting for search, I've seen search indexing of 100 nodes cause 25,000 queries - it can mean an insert for every word, loading thousands of comments, etc. - whereas tracker indexing is a known quantity. We can leave the variable there for sites that need to raise it, but it's unlikely to need lowering.

Spotted a couple of missing periods in tracker.test as well, but otherwise not much to complain about (obviously I didn't spot these when re-rolling myself).

drewish’s picture

FileSize
25.84 KB

Dropped the admin settings page, and cleaned up the comment ops where we were doing expensive no-ops.

catch’s picture

OK, this is looking a lot tidier, but do we need the no-op hook implementation? It seems like that would be better documented in _publish():

+/**
+ * Implement hook_comment_insert().
+ *
+ * @see tracker_comment_publish()
+ */
+function tracker_comment_insert($comment) {
+  // Do nothing here. If the comment is published we'll take care of adding the
+  // user in tracker_comment_publish(). If it's unpublished, we don't want to
+  // bother adding to the tracker.
+}
drewish’s picture

I'm okay with dropping that, but it seemed harder to document it on the other hook. I'm not sure which is the worse example for core to set: not making the operation clear and risking someone "fixing" it later, or having an empty hook implementation. I'll roll a no-op-free version.

drewish’s picture

FileSize
25.63 KB
David Strauss’s picture

Status: Needs review » Needs work

Is tracker_form_node_admin_content_alter() still necessary? Don't D6 and later call normal node_save() stuff even for batches in the admin area?

$comment = (array) $comment; should go. It was in my code because the hook variously got objects and arrays. We don't have that problem anymore.

Berdir’s picture

I tested the content_alter thingy, and as far as I can tell it is still required for publish/unpublish; I removed it for the delete.

drewish’s picture

Status: Needs work » Needs review
FileSize
24.66 KB

I had a diversion to get devel_generate working again, then was able to test the bulk updates. It looks like publish/unpublish now calls the full node_save(), so we can drop tracker_form_node_admin_content_alter() and tracker_batch_node_submit(), since they're just doing double processing.

drewish’s picture

FileSize
24.55 KB

I'm realizing our test coverage sucks. Two big fatal errors went uncaught:
- tracker_comment_delete() was calling _tracker_delete() which isn't even a real function.
- _tracker_remove() was referencing {comments} which has been renamed to {comment}.

So things that need to be tested before this is committed (a sketch of the first case follows the list):
- Adding unpublished comment: Commenter should not be in tracker.
- Adding published comment: Commenters and owner should be in tracker.
- Publishing a comment: Commenters and owner should be in tracker.
- Deleting a comment: Owner still in tracker, commenter no longer in tracker.
- Need better testing of the per user tracker.
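
A sketch of the first case (the setUp() properties and the comment form field names here are assumptions, not taken from the patch):

function testTrackerUnpublishedComment() {
  // Post a comment as a user without the permission to skip approval, so
  // the comment starts out unpublished.
  $this->drupalLogin($this->other_user);
  $edit = array(
    'subject' => $this->randomName(),
    'comment' => $this->randomName(),
  );
  $this->drupalPost('comment/reply/' . $this->node->nid, $edit, t('Save'));

  // The node must not appear on the commenter's tracker page yet.
  $this->drupalGet('tracker/' . $this->other_user->uid);
  $this->assertNoText($this->node->title, t('Unpublished comment does not add the commenter to the tracker.'));
}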

drewish’s picture

FileSize
24.55 KB

this fixes up some of the comment tests.

drewish’s picture

FileSize
27.8 KB

that was clearly the same version of the patch.

TheRec’s picture

Status: Needs review » Needs work
+++ modules/tracker/tracker.install
@@ -0,0 +1,215 @@
+    'description' => "Tracks when nodes were last changed or commented on, for each user that authored the node or one of its comments.",

Single quotes should be used to be consistent with the rest of the file (2 occurrences of the exact same line in the patch).

+++ modules/tracker/tracker.install
@@ -0,0 +1,215 @@
+      'users' => 'uid'

Missing a comma at the end (2 occurrences of the exact same line in the patch).

+++ modules/tracker/tracker.pages.inc
@@ -8,63 +8,67 @@
+    $result = db_query("SELECT n.nid, n.title, n.type, n.changed, n.uid, u.name, l.comment_count FROM {node} n INNER JOIN {node_comment_statistics} l ON n.nid = l.nid INNER JOIN {users} u ON n.uid = u.uid WHERE n.nid IN (:nids)", array(':nids' => array_keys($nodes)));

Single quotes should be used to be consistent with the rest of the file.

+++ modules/tracker/tracker.test
@@ -51,22 +49,37 @@ class TrackerTest extends DrupalWebTestCase {
+    $this->assertNoText($unpublished->title, t("Unpublished nodes do not show up in the author's tracker listing."));
+    $this->assertText($my_published->title, t("Nodes show up in the author's tracker listing."));
+    $this->assertNoText($other_published_no_comment->title, t("Other user's nodes do not show up in the user's tracker listing."));
+    $this->assertText($other_published_my_comment->title, t("Other user's nodes do not show up in the user's tracker listing."));

Single quotes should be used to be consistent with the rest of the file.


drewish’s picture

Status: Needs work » Needs review
FileSize
28.08 KB

Corrected the issues TheRec raised, and cleaned up the tests a bit more. This is pretty close, aside from publishing/unpublishing comments.

catch’s picture

Status: Needs review » Reviewed & tested by the community

Since this is a schema change, I think we're better off trying to get it in before the freeze and adding those final tests in a follow-up. If you review this before webchick or Dries commit it, please leave it at RTBC.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Just glanced through this patch quickly, but...

+++ modules/tracker/tracker.module
@@ -55,11 +57,103 @@ function tracker_menu() {
+function tracker_perm() {

that's gonna be a problem. ;)


catch’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
27.99 KB

We removed the settings page, so this no longer needs a permission to administer it, doh.

No other changes.

pwolanin’s picture

Issue tags: +API change

add tag

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Reviewed this patch and it looks good. Thanks!

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -API change

Automatically closed -- issue fixed for 2 weeks with no activity.