On a site with around 1M nodes and xmlsitemap the function xmlsitemap_node_xmlsitemap_index_links results in very slow queries
SELECT n.nid FROM node n LEFT JOIN xmlsitemap x ON x.type = 'node' AND n.nid = x.id WHERE x.id IS NULL AND n.type IN ('type1', 'type2', 'type3') ORDER BY n.nid DESC LIMIT 0, 100;
seems to examine all rows and results in very large lock times.
Just as a quick work around - would it make sense to write a custom cron job to delete all records from xmlsitemap corresponding to nodes created earlier than say 2 weeks back? Would the module try to insert these nodes back into the xmlsitemap table? or would it move forward processing the limited set of nodes present in the xmlsitemap table.
Comments
Comment #1
zyxware CreditAttribution: zyxware commentedOops spoke too soon. I would have to edit the above join query to add the date filter as well.
So the updated query would be
SELECT n.nid FROM node n LEFT JOIN xmlsitemap x ON x.type = 'node' AND n.nid = x.id WHERE x.id IS NULL AND n.type IN ('type1', 'type2', 'type3') AND n.created > (unix_timestamp() - 1209600) ORDER BY n.nid DESC LIMIT 0, 100;
Would it work if I do both - i.e. delete nodes from xmlsitemap table in a cron and change the query in xmlsitemap_node_xmlsitemap_index_links to as shown above. If this works this would kill 2 issues in one shot - the slow query and an older issue with 404s (#760396: 404 page not found errors on accessing certain sitemap pages)
Comment #2
Dave ReidIs there any way we can improve the query itself? We could possibly run a COUNT() on the node table first, then on the xmlsitemap table to see if the numbers are not the same, then we'd need to continue the import process.
Comment #3
Anonymous (not verified) CreditAttribution: Anonymous commentedCan you confirm on your database that n.type and x.type are indexed? The x.id and n.nid columns should be the primary key to their respective databases.
Comment #4
zyxware CreditAttribution: zyxware commentedThanks for looking into this.
@Dave Reid - if we can keep a last_insert id could we run a select * from node where nid > xmlsitemap_last_inserted? Also insert could be in forced batches so that rebuild could work as well when xmlsitemap_last_inserted = 0. So select * from node where nid > xmlsitemap_last_inserted and nid < xmlsitemap_last_inserted + 5000
@earnie - there is an index type_subtype on (type, subtype) on xmlsitemap table and node_status_type with (nid, status, type) on node table.
I am also curious if my hack would work to make the module work like a lite version? I checked out the 'lite' query and it is extremely fast because of the filter on the created date.
Comment #5
zyxware CreditAttribution: zyxware commentedIt looks like my hack do work in serving both purposes - limiting queries and limiting the sitemap to just the recent nodes. Attaching the patch for those who are interested in something like this.
Comment #6
zyxware CreditAttribution: zyxware commentedEven with the above patch in #5 a full rebuild would still try to include all nodes. The work around is to avoid selecting node in the select box for 'Link types' and then rebuild the sitemap and then let cron do the rest.
Comment #7
Anonymous (not verified) CreditAttribution: Anonymous commentedThe original query will do a full scan of the node table each time because node_status_type index isn't used. You need an index on just type itself or the original query needs to give n.nid, n.status and n.type in the WHERE clause.
Comment #8
zyxware CreditAttribution: zyxware commented@earnie - thanks for the clarification.
I am also uploading the updated patch for 6.x-2.0-beta2 version of the module for those who are interested.
Comment #9
Dave ReidWe could probably make some kind of setting for 'disregard and discard sitemap entries older than x amount of time', but the problem is that we'd need to add a 'timestamp' column to the entity info so that this could be done in a generalized way.
Comment #10
zyxware CreditAttribution: zyxware commentedCan't the lastmod column in xmlsitemap be used for this? So instead of 'older than' it would be 'discard entries that have been created or updated more than x days ago'
Comment #11
Dave ReidYes, that works for items that already exist. But we need a reliable way to prevent those entries from getting re-indexed on the next cron. That's mostly what I was referring to. The module is headed towards a generic entity-based system that needs to rely on entity APIs only.
Comment #12
Coyote CreditAttribution: Coyote commentedIf we're going to expire/remove nodes from sitemap based on date, how about optionally doing so based on a date field, rather than just creation/modification date.
For instance, the project I'm currently working on includes classes that occur on certain dates, for customers to sign up for - but the current xml sitemap basically just tosses all nodes of a certain type into the sitemap, even after those classes have passed. The class nodes themselves may be created weeks before the class occurs, so the creation/modification date isn't really tied to the actual class date.
I imagine that this would be useful to just about any date-field based content.
Comment #13
tobias.grasse CreditAttribution: tobias.grasse commentedAny progress on this, especially considering a node date field for expiry information? I could really use this functionality, since we don't want to index ~ 55k of entries which have long passed.
Comment #14
wheelercreek CreditAttribution: wheelercreek commentedThis looks like an old request but I'm running into a need for it..
I also need an option to just have the most recent content indexed in the sitemap.. or maybe to exclude content older than ___ days / weeks.
Another option that might work for me is to just be able to have the sitemap sort by modification date DESC, so at least the newest items are listed first (Or give a higher priority to newer content).
This is for an online news site, and I notice that the sitemap is showing content by oldest first, which I can resort in the browser - but I don't think that's relevant to how Google reads it.
Comment #15
mcrittenden CreditAttribution: mcrittenden at Little Blue Labs commentedHere's a patch that applies to 7.x-2.3 which adds this functionality. It looks for a "xmlsitemap_node_seconds_to_keep" variable (defaults to a keeping a year) so that you can change the time or set it to 0 or FALSE to disable this feature entirely and keep all nodes.
Comment #16
ckoppenhaver CreditAttribution: ckoppenhaver at Phase2 commentedRe-rolling the patch to work with 2.6.