On a site with around 1M nodes and xmlsitemap the function xmlsitemap_node_xmlsitemap_index_links results in very slow queries

SELECT n.nid FROM node n LEFT JOIN xmlsitemap x ON x.type = 'node' AND n.nid = x.id WHERE x.id IS NULL AND n.type IN ('type1', 'type2', 'type3') ORDER BY n.nid DESC LIMIT 0, 100;

seems to examine all rows and results in very large lock times.

Just as a quick work around - would it make sense to write a custom cron job to delete all records from xmlsitemap corresponding to nodes created earlier than say 2 weeks back? Would the module try to insert these nodes back into the xmlsitemap table? or would it move forward processing the limited set of nodes present in the xmlsitemap table.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

zyxware’s picture

Oops spoke too soon. I would have to edit the above join query to add the date filter as well.

So the updated query would be

SELECT n.nid FROM node n LEFT JOIN xmlsitemap x ON x.type = 'node' AND n.nid = x.id WHERE x.id IS NULL AND n.type IN ('type1', 'type2', 'type3') AND n.created > (unix_timestamp() - 1209600) ORDER BY n.nid DESC LIMIT 0, 100;

Would it work if I do both - i.e. delete nodes from xmlsitemap table in a cron and change the query in xmlsitemap_node_xmlsitemap_index_links to as shown above. If this works this would kill 2 issues in one shot - the slow query and an older issue with 404s (#760396: 404 page not found errors on accessing certain sitemap pages)

Dave Reid’s picture

Is there any way we can improve the query itself? We could possibly run a COUNT() on the node table first, then on the xmlsitemap table to see if the numbers are not the same, then we'd need to continue the import process.

Anonymous’s picture

Can you confirm on your database that n.type and x.type are indexed? The x.id and n.nid columns should be the primary key to their respective databases.

zyxware’s picture

Thanks for looking into this.

@Dave Reid - if we can keep a last_insert id could we run a select * from node where nid > xmlsitemap_last_inserted? Also insert could be in forced batches so that rebuild could work as well when xmlsitemap_last_inserted = 0. So select * from node where nid > xmlsitemap_last_inserted and nid < xmlsitemap_last_inserted + 5000

@earnie - there is an index type_subtype on (type, subtype) on xmlsitemap table and node_status_type with (nid, status, type) on node table.

I am also curious if my hack would work to make the module work like a lite version? I checked out the 'lite' query and it is extremely fast because of the filter on the created date.

zyxware’s picture

It looks like my hack do work in serving both purposes - limiting queries and limiting the sitemap to just the recent nodes. Attaching the patch for those who are interested in something like this.

zyxware’s picture

Even with the above patch in #5 a full rebuild would still try to include all nodes. The work around is to avoid selecting node in the select box for 'Link types' and then rebuild the sitemap and then let cron do the rest.

Anonymous’s picture

@earnie - there is an index type_subtype on (type, subtype) on xmlsitemap table and node_status_type with (nid, status, type) on node table.

The original query will do a full scan of the node table each time because node_status_type index isn't used. You need an index on just type itself or the original query needs to give n.nid, n.status and n.type in the WHERE clause.

zyxware’s picture

@earnie - thanks for the clarification.

I am also uploading the updated patch for 6.x-2.0-beta2 version of the module for those who are interested.

Dave Reid’s picture

Title: Performance issues with xmlsitemap_node_xmlsitemap_index_links » Add option to discard sitemap entries older that a certain amount of time
Version: 6.x-2.0-beta1 » 7.x-2.x-dev
Component: xmlsitemap.module » Code
Category: support » feature

We could probably make some kind of setting for 'disregard and discard sitemap entries older than x amount of time', but the problem is that we'd need to add a 'timestamp' column to the entity info so that this could be done in a generalized way.

zyxware’s picture

Can't the lastmod column in xmlsitemap be used for this? So instead of 'older than' it would be 'discard entries that have been created or updated more than x days ago'

Dave Reid’s picture

Yes, that works for items that already exist. But we need a reliable way to prevent those entries from getting re-indexed on the next cron. That's mostly what I was referring to. The module is headed towards a generic entity-based system that needs to rely on entity APIs only.

Coyote’s picture

If we're going to expire/remove nodes from sitemap based on date, how about optionally doing so based on a date field, rather than just creation/modification date.

For instance, the project I'm currently working on includes classes that occur on certain dates, for customers to sign up for - but the current xml sitemap basically just tosses all nodes of a certain type into the sitemap, even after those classes have passed. The class nodes themselves may be created weeks before the class occurs, so the creation/modification date isn't really tied to the actual class date.

I imagine that this would be useful to just about any date-field based content.

tobias.grasse’s picture

Any progress on this, especially considering a node date field for expiry information? I could really use this functionality, since we don't want to index ~ 55k of entries which have long passed.

wheelercreek’s picture

Issue summary: View changes

This looks like an old request but I'm running into a need for it..
I also need an option to just have the most recent content indexed in the sitemap.. or maybe to exclude content older than ___ days / weeks.

Another option that might work for me is to just be able to have the sitemap sort by modification date DESC, so at least the newest items are listed first (Or give a higher priority to newer content).

This is for an online news site, and I notice that the sitemap is showing content by oldest first, which I can resort in the browser - but I don't think that's relevant to how Google reads it.

mcrittenden’s picture

Here's a patch that applies to 7.x-2.3 which adds this functionality. It looks for a "xmlsitemap_node_seconds_to_keep" variable (defaults to a keeping a year) so that you can change the time or set it to 0 or FALSE to disable this feature entirely and keep all nodes.

ckoppenhaver’s picture

Re-rolling the patch to work with 2.6.