Many queries split their WHERE and SORT BY clauses partly into {node} and partly into {node_comment_statistics}. This causes many queries to use temporary tables and filesorts. Because {node_comment_statistics} and {node} are one-to-one, we can just put them in the same table.

Resolving this issue will allow #145353 to proceed with more optimizations.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

David Strauss’s picture

Title: Mode {node_comment_statistics} into {node} » Merge {node_comment_statistics} into {node}

Fixed title.

David Strauss’s picture

Title: Merge {node_comment_statistics} into {node} » Merge {node_comment_statistics} and {node_counter} into {node}

Expand issue scope.

David Strauss’s picture

Status: Active » Needs work
FileSize
19.89 KB

This is an untested first draft. It may even have syntax errors. The next post will have the new node.install.

Major changes versus HEAD beyond the patch's basic description:
* Remove denormalization for last_comment_name. It wasn't a useful denormalization because we don't significantly use it for WHERE or ORDER BY clauses. A join with users will suffice without a significant performance hit.
* Rename last_comment_timestamp and last_comment_uid to last_activity_timestamp and last_activity_uid. This is more consistent with the actual use because they contain the node's attributes if no comments exist.
* Rename the old {node_counter} fields to views_total, etc. to be more consistent with other field name styles.

David Strauss’s picture

FileSize
1.67 KB

The promised node.install.

Gábor Hojtsy’s picture

These often changing values are stored in a different table, so a database query cache can get the node table records actually cached, instead of changing on all page views. So you could really lose a lot on there. This should be benchmarked.

David Strauss’s picture

@Gábor Hojtsy I assure you we pay for it more when we join two really big tables in a query with conditions split between the two tables. It forces MySQL to write the joined table out to disk as a temporary table and run filesort on it. I doubt the query cache does much to help relative to the cost of creative massive temporary tables on SELECTS.

Moreover, if the tables are combined and keys are properly cached, there's minimal disk access to satisfy the query from the source tables.

The query cache is neat, but I'll always prefer rapid results from the source to caching slow queries. It's even an issue of reliability: if you depend on the query cache for performance and restart MySQL, your server could get hammered into crashing with the load.

Finally, the queries that are really slow are the ones joining {node} and {node_comment_statistics}. Those queries can't use the cache if either table is updated, so there's no loss there. Only queries that just access {node} (and maybe {node_revisions}) benefit from the separation, and those queries tend to be very fast already.

I'll still run the benchmarks to show how dramatic the improvement will be. :-)

Gábor Hojtsy’s picture

Well, there is an effort to break up more tables into 1-1 relationships. One of the reasons is the query cache, the other is avoiding table locks. When you update often changing fields like the node counter (which is updated on all page views), you need to lock the table (in worse databases) or at least the row in question. That forces other request serving threads wait for the lock to be lifted, so they can update the data, and so on.

A proposed patch to break user table's access column into a 1-1 related table is described here: http://drupal.org/node/109037

If you think this is a generally bad practice, IMHO it is best to start a thread about this on the development list, so it gets more attention. There are certainly competing reasons to do it this way or the other, and these suggestions born from drupal.org experiences. These might might not be applicable to all Drupal sites for sure, so feel free to voice your concerns. I think this is worth taken to the development community at large.

David Strauss’s picture

@Gábor Hojtsy If someone runs a large-enough site that table locking is a problem, my response is that they need to move to a better database or configuration. I don't think it's necessary for Drupal to consider scalability on sub-par configurations because those configurations will fail regardless of what we do.

My goal is to make Drupal scale well with row-level locks but still run with databases allowing only table-level locks.

That said, I'll take this to the development list.

David Strauss’s picture

I have posted some related comments to the developers' listserv:

http://lists.drupal.org/pipermail/development/2007-June/024284.html

catch’s picture

Having experienced the tracker query improvements on my own site, which are phenomenal, an uninformed +1, and subscribe. Again, I'm happy to test patches on my 5.1 site, and 6.x when a test-site upgrade is feasible (after freeze I guess).

David Strauss’s picture

FileSize
31.24 KB

Here's an update rerolled against HEAD. Still untested.

Dries’s picture

I think it makes sense to merge these tables, but it would be great if we could benchmark this with and without the query cache. Complex queries are obviously going to benefit from this patch a LOT. Simple queries might pay a small performance cost because the query cache is now less effective. It's not clear what the net result will be, but looking at the tracker module it's clear that you can do a lot of small queries for the cost of a complex query. Even a number of simple benchmarks would be welcome.

As a bonus, this patch will fix various bugs (some of which are in core) that are a result of the node_statistics_comments table being hard to understand and use. Two random examples:
* http://drupal.org/node/106659
* http://drupal.org/node/139537

David: n.last_activity_timestamp AS last_updated is probably something that can be factored out. Also, it might be better to just name that field n.last_activity or activity. That would be more consistent with n.created and n.changed. Not sure what would be preferred but I think we can take this a little bit further in order to get most out of this. :-)

killes@www.drop.org’s picture

merging node_counter into node will make it essentially impossible to run Drupal (with node view counting enabled) with MySQL and MyISAM backend for any decent page load. The locks will simply eat you alive.

Now, you can argue that you should not be running MyISAM btu InnoDb in that case. However, InnoDb isn't without problems itself and certainly has not decreased the load on the DB server e.g. on drupal.org.

David Strauss’s picture

merging node_counter into node will make it essentially impossible to run Drupal (with node view counting enabled) with MySQL and MyISAM backend for any decent page load. The locks will simply eat you alive.

If so, the locks should eat you alive today. Even with {node_counter}, there's a whole-table lock for UPDATE with nearly every page load. The difference is that the lock would be on {node} instead. On the other hand, if someone has a block to show the most popular nodes, they'll get a huge performance boost.

We could keep node_counter separate on two conditions:
* The counter for a node only gets tracked after a node is published
* The counter is deleted if a node is unpublished

Now, you can argue that you should not be running MyISAM btu InnoDb in that case. However, InnoDb isn't without problems itself and certainly has not decreased the load on the DB server e.g. on drupal.org.

According to nnewton, the switch to InnoDB has changed the performance profile of the Drupal.org database so that it's actually maxing out the hardware. Lock contention in MyISAM prevented that before. This opens up many optimization opportunities that didn't exist with MyISAM. InnoDB is no silver bullet, but it at least gives you the ability to improve things.

dww’s picture

as per the RFC on the devel list[1] about what constitutes "activity", i strongly believe the activity of a node (for the purposes of sorting the tracker and forums, for example) should mean the most recent of:

- node created
- node updated
- last comment added to the node

see http://drupal.org/node/44276 for some prior art on this topic. everyone there was convinced the above behavior is right, so i'd rather not rehash that debate in here.

[1] http://lists.drupal.org/pipermail/development/2007-June/024460.html

dww’s picture

oh, and speaking of prior art, see http://drupal.org/node/44276#comment-171227 (comment #30) from me about starting to do exactly what this issue proposes. ;) so, certainly a +1 from me, pending benchmark results. i'd be very interested in seeing benchmarks for this on MyISAM and InnoDB and PostgreSQL, with and without the query cache. while i'm generally drawn to the idea that we should optimize performance for the really huge sites, i am a little worried about things that are going to kill performance on the crappy shared hosting environments, since i do have a few sites i care about that live in that world. ;)

catch’s picture

tracking.

robertDouglass’s picture

subscribe

David Strauss’s picture

After some consideration of the actual data use and demoting comment.module to a less-privileged status, I'm going to take this patch in a slightly different direction. I'm going to set up an activity-tracking interface for node that comment.module will use to post its own activity.

Functions:
* node_activity_save($node) records the activity and returns $aid, the activity ID
* node_activity_delete($aid) deletes the activity

Modules will be responsible for tracking the $aid values for the activities they submit if there's a chance they will want to undo an activity.

So, upon posting a comment to a node, the comment module will call node_activity_save() for the node and record the $aid with the comment. If a comment is deleted, the comment module will call node_activity_delete() with the comment's associated $aid. In either case, the node will update its overall activity statistics (for the global tracker and forums) and the user activity statistics (for the user trackers). node_activity_save() will use the current time() and $user->uid in recording the activity. If there's no other activity or $node->changed post-dates other activities, the node's author will be the person responsible for the latest activity.

The cool thing about this system is that any module will be able to publish to the activity record. For example, a voting module could record activity whenever a user votes.

moshe weitzman’s picture

subscribe

Wim Leers’s picture

Subscribing.

vivek.puri’s picture

David, can activity-tracking interface be done using actions ?

moshe weitzman’s picture

Any progress here?

David Strauss’s picture

Okay. I'm taking tomorrow off to finish this important performance update. Here's how the activity system will work.

1. An activity happens on or is removed from a node. This activity can be through any module.
2. The module recording the activity calls node_update_activity($node).
3. node_update_activity() calls hook_activity($node). Hook implementations respond with the latest timestamp they have on record for any activity on the node.
4. node_update_activity() records the latest activity into {node}.
5. node_update_activity() calls hook_user_activity($node, $uid). Hook implementations respond with TRUE if that $uid has activity on that node and FALSE otherwise. This determines whether that node can show up on the user's tracker.
6. If any module returns TRUE, the user gets an entry in a table tracking node/user activity, which will be {node_user_activity}. If all returns are FALSE, the rows for that nid and uid are removed from {node_user_activity}.
7. node_update_activity() records that latest activity on all users associated with the node in {node_user_activity}.
8. node_update_activity() calls hook_activity_updated($node) so that other modules (like taxonomy) can update their denormalized activity records for the node.

David Strauss’s picture

FileSize
34.57 KB

Here's a new draft of the patch. This is getting so big that I'm considering ignoring forums optimization for the moment.

David Strauss’s picture

FileSize
46.64 KB

Many fixes. Forum optimizations removed for simplicity.

cburschka’s picture

I'm curious... the last patches were all submitted without changing the status from CNW. Is it still incomplete, or ready for review?

David Strauss’s picture

Status: Needs work » Needs review

I'll go ahead and mark this for review, but it's not up to my usual, tested standard for marking a patch ready for review.

David Strauss’s picture

I'd like to add that this patch does all the important things already. It doesn't need a database expert to finish. It needs someone willing to tie up the loose ends I don't have the time to do.

Dries’s picture

The latest version of the patch is too intrusive to be committed this late in the game. The activity monitor sounds nice, but it introduces a whole new API. Also, it is not entirely clear why the activity monitor is so important for this patch -- wasn't it best left for another patch?

David Strauss’s picture

Version: 6.x-dev » 7.x-dev

The activity monitor simply takes the API necessary to implement this sort of consolidated node/comment tracking and makes it available to all modules. The alternative to an API is privileging the node and comment modules by querying them specifically to get their latest activity records.

David Strauss’s picture

The crux of the performance improvement created by this patch relies on consolidating the latest activity data for each node so that querying the latest activity records doesn't require calculated fields. That's why there's an API scheme that asks modules, "What was your latest activity related to this node?" and consolidates the answers.

kbahey’s picture

Subscribe.

moshe weitzman’s picture

"That's why there's an API scheme that asks modules, "What was your latest activity related to this node?" and consolidates the answers."

thats exactly analogous to the node access arbitrator in core. i.e. node_access_acquire_grants(). that model has worked well there, and should work well here too.

catch’s picture

Version: 7.x-dev » 6.x-dev

Could still make it into 6.x

catch’s picture

merging node_counter into node will make it essentially impossible to run Drupal (with node view counting enabled) with MySQL and MyISAM backend for any decent page load. The locks will simply eat you alive.

The fact that node_counter is written on every page view, to me at least, suggests that improvements in the 'popular' block and page would be undone by the locks on the table. Having said that, I've got statistics module switched off on my site, as I imagine will most medium-large sites and any that use aggressive caching.

However this argument is much less of an issue for node_activity / node_comment_statistics since they're always going to have less writes than node_counter regardless, and node_loads are affected by node and comment locks anyway right?

I'd be very keen to see this get into D6 having used the #126 patch for a while on my site and seeing the results, but I can only help with basic reviews and benchmarks. If I can be of help at this stage, any guidance on how to do so would be handy.

AjK’s picture

FileSize
48.58 KB

re-roll patch against head.

AjK’s picture

Status: Needs review » Needs work

David did say it was "untested". Well, unfortunately I failed to get this to run. I dd try to work through all the obvious problems and when php display no more warnings and no errors I was left with a front page with no teasers and just a pager :(

I've set aside today to try and bechmark/review these various patches so I'm halting further work on getting this patch to work and moving on to review another patch (and hopefully get time later to come back to this and fix it up if no one else does).

Zothos’s picture

someone plz sync it with head. I tried it. But failed... XD

catch’s picture

Version: 6.x-dev » 7.x-dev

Not going to make it in to 6.x now.

m3avrck’s picture

subscribing

I'm very in favor of this patch and would love to see one written that applies very early in the next Dev cycle so we can backport this :-)

dharamgollapudi’s picture

Subscribing...

Liam McDermott’s picture

#55246: Node updates are not reflected in node_comment_statistics correctly is related to this bug. Just wanted to link the two together so that one can be marked fixed if this change happens.

David Strauss’s picture

I don't think they're related. #148849 isn't even a bug.

Dries’s picture

David: do you still intend to work on this patch? I still think we should merge both tables, and squeeze more page views out of the Drupal comment system (forums, blogs, etc).

Liam McDermott’s picture

I don't think they're related. #148849 isn't even a bug

Well if this patch goes through that bug gets fixed. So yes, I think they are related.

bjaspan’s picture

I'm concerned about merging comment information into the node table because it seems like it makes comments unimplementable, or at least more complicated to handle, on any system in which Drupal can operate on a remote data like it can on local data. e.g. If the comment module writes directly to the node table, then when we create a comment on a Flickr photo which does not exist locally, how does that work?

To put it another way, if comments have a special relationship with node, then comments cannot be a normal field, which we were hoping to make them into.

David Strauss’s picture

@Liam McDermott
"Well if this patch goes through that bug gets fixed."

Considering that {node_comment_statistics} doesn't duplicate any information from {node} other than the foreign key, I think your confidence is misplaced.

@Dries
"David: do you still intend to work on this patch?"

It's high on my list, after deploying Tracker 2 to Drupal.org and integrating the Tracker 2 changes into Drupal 7 core.

David Strauss’s picture

@Liam McDermott
Never mind, I see how this patch could possibly fix #55246. It's been quite a while since I've looked at the changes.

catch’s picture

BJaspan. There's a lot of hard coding between node and comment - module_exists, function_exists etc. all over the place and node_comment_statistics is used regardless of whether comment module is enabled or not.

One of the earlier patches on this issue was a node_activity column to abstract this out for modules other than comment to hook into. I don't see a particular reason why something similar couldn't be applied to other core objects if we need to keep track of (internal) field updates happening to the fields independently rather than direct on the object.

In terms of field storage, there'll need to be a way to deal with WHERE and ORDER BY queries across joins - not having a site available at all (as d.o. was last year in large part due to this issue) is more important to me than whether I can attach comments to external items or not. IMO that will probably require duplicating data conditionally based on whether those particularly expensive queries are being used (and not when they're not) although that may not be a goer.

RobLoach’s picture

David Strauss’s picture

@catch The patch to create a node_activity column (not sure of the exact name I used) was one written by me for Drupal 6, but I wasn't able to keep it updated with HEAD and tested. One solution for WHERE and ORDER BY across JOINs is implemented in my Tracker 2 module.

meba’s picture

Subscribing

Dave Reid’s picture

Title: Merge {node_comment_statistics} and {node_counter} into {node} » Merge {node_comment_statistics} into {node}

FYI, {node_counter} has been moved to the statistics.module where it belongs.

dwightaspinwall’s picture

This thread caught my attention while I was looking for solutions to the problem expressed in the title, namely the inefficient execution of SELECTs whose ORDER BY uses the GREATEST of {node_comment_statistics}.last_comment_timestamp and {node}.changed. Using Views, it is quite easy and common to emit such SELECTs (here's mine for example, which yields painfully slow 2 second queries on fast hardware with a node table with only 60,000 rows):

SELECT node.nid AS nid, node.title AS node_title, legacy_HCILibrary.blurb AS legacy_HCILibrary_blurb, node_comment_statistics.comment_count AS node_comment_statistics_comment_count, GREATEST(node.changed, node_comment_statistics.last_comment_timestamp) AS node_comment_statistics_last_updated 
FROM node node 
LEFT JOIN hci_research hci_research ON node.nid = hci_research.nid 
LEFT JOIN legacy_HCILibrary legacy_HCILibrary ON hci_research.researchid = legacy_HCILibrary.id 
INNER JOIN node_comment_statistics node_comment_statistics ON node.nid = node_comment_statistics.nid 
WHERE (node.type in ('hci_research')) AND (node.status <> 0) AND (node.vid IN ( SELECT tn.vid FROM term_node tn LEFT JOIN term_hierarchy th ON th.tid = tn.tid LEFT JOIN term_hierarchy th1 ON th.parent = th1.tid WHERE tn.tid = 242 OR th1.tid = 242 )) 
ORDER BY node_comment_statistics_last_updated DESC 
LIMIT 0, 5

Having spent an enjoyable time reading this thread as well as the dev post at http://lists.drupal.org/pipermail/development/2007-June/024284.html I am left uncertain as to where things have been left. Clearly, the patch didn't make it into D6. What is the recommended solution? Is it Tracker 2? This seems to be the right approach but the module shows little activity in terms of downloads or issue queue.

catch’s picture

Tracker 2 is being used on Drupal.org, there's a related thread about tracker at #105639: Tracker performance improvements. There's also http://drupal.org/project/materialized_view which aims to do this more generically - was some discussion of getting it into core before fields went in, but don't know where that stands now.

Damien Tournoud’s picture

Materialized View is also used on drupal.org, and optimizes nicely several forum queries.

dwightaspinwall’s picture

From what I can see, materialized_view doesn't exist yet; no downloads on the project page (for D6 or otherwise) and even HEAD isn't accessible via CVS.

Tracker appears to still suffer from query inefficiency and wouldn't help my case.

So that leaves Tracker 2 for my D6 site, which I will install and test now.

catch’s picture

Version: 7.x-dev » 8.x-dev

Tracker 2 is now in core.

Forum and taxonomy both implement their own tables similar to tracker2's. So a bit of progress, but sorting this properly is a Drupal 8 task.

David Strauss’s picture

Forum and taxonomy both implement their own tables similar to tracker2's.

When did this happen?

catch’s picture

#412518: Convert taxonomy_node_* related code to use field API + upgrade path.

Forum implementation needs cron indexing still as an official followup.

moshe weitzman’s picture

Status: Needs work » Closed (won't fix)

At this point, I'd say that this is a Won't Fix given that forum and taxonomy have solved their problems by keeping own tables in D7. Please reopen if you think this should stay open.

dww’s picture

Status: Closed (won't fix) » Needs work

We're about to denormalize {node_comment_statistics} into {project_issues} over at #921210: Fix performance of "My issues" so that the "My issues" block on the d.o redesign dashboard doesn't completely melt the DB servers. So, I guess you could say project_issue is doing what forum and taxonomy do. But, it seems crazy to keep these values out of {node}. It forces DB-killing queries if you use views to make a "recent posts" block when it'd otherwise be a trivial query against {node} alone...

catch’s picture

Priority: Critical » Major

It makes sense to keep these in {node}, we default to innodb which has row-level locking now. We may not want to load them into the $node object in entity loading so they don't get stale with http://drupal.org/project/entitycache. Downgrading all D8 criticals to major per http://drupal.org/node/45111

quicksketch’s picture

Priority: Major » Normal

Unless this should hold up all new features, it should be "normal" per the final outcome of http://drupal.org/node/45111 (which resulted in http://drupal.org/node/1201874). Obviously there isn't a lot of traction on this issue, it shouldn't hold up other development.

andypost’s picture

Seems we could close this, now statistics live in {comment_entity_statistics} table with own field.
It makes sense to get rid of loading statistics in hook_entity_load

andypost’s picture

Title: Merge {node_comment_statistics} into {node} » Refactor {comment_entity_statistics} into performant Field

Suppose this could be better

andypost’s picture

Filed related #2101183: Move {comment_entity_statistics} to proper service to move all direct access to table to proper service

larowlan’s picture