I've marked CNW because I need to include an admin/settings UI form to allow the admin to control the number of rows to update in one cron run. I'm creating the issue to get feedback on this patch. The patch also corrects these issues #354453: priority_override doesn't get a value and #351570: No node entries after last update and included in this because I ran into these issues when testing the patch.

CommentFileSizeAuthor
#16 xmlsitemap-DRUPAL-6--1.patch7.08 KBAnonymous (not verified)
#4 xmlsitemap_node-hook_cron-DRUPAL-6--1.patch3.36 KBAnonymous (not verified)
#2 xmlsitemap_node-hook_cron-DRUPAL-6--1.patch3.39 KBAnonymous (not verified)
xmlsitemap_node-hook_cron-DRUPAL-6--1.patch3.38 KBAnonymous (not verified)

Comments

scottm316’s picture

I couldn't get the patch to apply. I got the file line "function xmlsitemap_node_api(&$node, $op.....{"

and the patch line "switch ($op)" do not match! Maybe I'm doing something wrong. I am using TortoiseSVN, and most recent xmlsitemap 1.x-dev. I also have the issue of no node entries. http://drupal.org/node/351570

Anonymous’s picture

Try this one.

cd modules/xmlsitemap
patch -p0 < xmlsitemap_node-hook_cron-DRUPAL-6--1.patch

The -dev is synchronized with CVS via a scheduled CRON. The patch is created against the CVS code for the 6.x-1.x-dev branch.

scottm316’s picture

I'm sorry I tried using the second patch but get the same error. Using 6.x-1.x-dev branch.

Anonymous’s picture

Keeping up with the CVS changes.

@scottm316: Make sure you apply the patch to the current CVS. The file export might not be up-to-date yet.

scottm316’s picture

I can forward you the .rej file if necessary.

[root@vstation001 xmlsitemap]# patch -p0 < xmlsitemap_node-hook_cron-DRUPAL-6--1_0.patch
patching file xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 110.
Hunk #2 succeeded at 155 with fuzz 2 (offset -1 lines).
1 out of 3 hunks FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.module.rej

avpaderno’s picture

TortoiseSVN is not for CVS; you should be using TortoiseCVS.

avpaderno’s picture

@earnie: You added a fix for the misspelled property in your patch, when that is already fixed in CVS.

The code you are proposing uses a query that select all the rows in the xmlsitemap_node table (WHERE nid NOT IN (SELECT nid FROM {xmlsitemap_node}); I imagine it would be a problem when that table contains almost all the nodes contained in Drupal core table, and the number of nodes is a big enough number.

I think the concept could be applied, but rather than using hook_cron(), I would use a different hook. I would like to do like Poormanscron does, and allow the module to handle few nodes at every page load.
The vantage of the approach is that the administrator of a Drupal-powered site would not need to go to a specific XML Sitemap page to update the data contained in its tables (it's also true for the solution you propose), and XML Sitemap would not consume excessive time resources like it is happening now with hook_xmlsitemap_links().

Anonymous’s picture

I forgot to cut the misspell correction from the patch, sorry. I've modified the query in my version like so:

  $sql = "SELECT n.nid FROM {node} n LEFT JOIN {xmlsitemap_node} xn ON n.nid = xn.nid WHERE xn.nid IS NULL AND n.promote = 1 AND n.status = 1 LIMIT %d";

This will only select rows from {node} and none from {xmlsitemap_node}. I'll upload a patch later, I want to work out the xmlsitemap_term issues so that I can begin using this module in production.

avpaderno’s picture

The checking for n.promote must be eliminated: the site map doesn't contain only the nodes that are promoted to front page.

avpaderno’s picture

it would be better if the number of nodes added in the module table during a cron session would be set in a Drupal variable which is set in the settings page of XML Sitemap.

Anonymous’s picture

Thanks for the hint on the promote filter and yes the LIMIT is being set by a variable_get with a default of 20. I haven't coded the settings page yet.

avpaderno’s picture

@#8: If you use that query, then you must also add a check for the returned nid.
I would change it in SELECT n.nid FROM {node} n LEFT JOIN {xmlsitemap_node} xn ON n.nid = xn.nid WHERE xn.nid IS NULL AND n.nid > %d AND n.status = 1 LIMIT %d.

The code must then save the last nid, and use it for the query.

avpaderno’s picture

Talking of the hook to implement for this case, I think that hook_cron() is the best candidate. If then it's required to add less rows and to do it more frequently, then it is possible to use Poormanscron. The only alternative I have found is to implement hook_exit(), which creates some problems as the code should bootstrap Drupal to be sure the required functions are available (but then it would not be sure also other modules would be loaded).

Anonymous’s picture

I don't get your suggestion of n.nid > %d. Only n.nid where xn.nid is null (i.e. the nid doesn't exist in xmlsitemap_node) will be returned. I've tested this using the same SQL via phpmyadmin. Can you explain what you're thinking for n.nid > %d and from where the value comes?

avpaderno’s picture

Maybe it's useless, but my idea was to start from the last node inserted, and start the query from the node having the successive nid.
This can help in the case the data of one node added in the module table get removed from it; in that case, the query you proposed would start again from that nid, while mine would start from the previous saved row.

Of course, the code must save the nid obtained from the query.

Anonymous’s picture

Title: hook_cron support for missing nodes in xmlsitemap_node » hook_cron support for missing nodes and terms
Component: xmlsitemap_node.module » Code
Status: Needs work » Needs review
StatusFileSize
new7.08 KB

I've implemented this feature for both nodes and terms. The patch includes a UI change to control the number of rows updated at once; one for nodes and a different one for terms. It also implements some changes in xmlsitemap_node that I found necessary to make it easier to implement this feature. I'm using feedAPI and autotag modules and this patch has good results with version 1.5 of feedAPI and version 1.18 of autotag.

Anonymous’s picture

I meant to also say that I don't see merit for starting greater than the last node updated since all nodes missing should exist in xmlsitemap_node and the query only chooses missing nodes.

avpaderno’s picture

If, for any causes, a row is being removed from the database table right after hook_cron() has added it, the implementation of hook_cron() suggested would try all the time to add it again.

To save the node ID of the last saved node is something that the Drupal core modules do too, when they use the batch operations to operate on nodes. It should not be required to do such thing in those cases too, but they do it the same; I would guess there is a reason.

avpaderno’s picture

Status: Needs review » Needs work

The patch needs to be changed.

  • The code still uses the %s placeholder in the SQL queries for the priority_override field, which should replaced by %f (like the rest of the module code correctly does, in the last commits).
  • The patch includes changed to code that is not directly involved in the implementation of hook_cron().
  • The patch uses INSERT INTO queries, when it would be better to use the drupal_write_record() that would also avoid to change the SQL query placeholders in the case, i.e., a field would be changed from integer to float.
  • Rather than having two different modules implementing hook_cron(), it would be better to have a single module that implements it, and then calls a hook implemented in the other modules.
avpaderno’s picture

Status: Needs work » Closed (won't fix)
Issue tags: +Drupal core hooks, +custom hooks

It will not implemented; to populate the modules database tables has been implemented a different system.

Anonymous’s picture

Fair enough; I almost did this myself.

avpaderno’s picture

The proposed patch has not been changed from February 6, and the code actually implemented in the project modules follows a completely different approach to update the content of the module database tables.

webchick’s picture

avpaderno’s picture

Title: hook_cron support for missing nodes and terms » hook_cron() support for missing nodes and terms
Status: Closed (won't fix) » Needs work

Following what reported by webchick, I am reopening this issue.
The batch operation code used to populate the module database tables has been removed, and it will be implemented a different system based on hook_cron().

The patches reported in this queue need to be update for the last code changes.

Anonymous’s picture

I won't be able to update the patch anytime soon if someone wants to beat me to it please do.

webchick’s picture

Since this whole migration to using cron for use with high-performance sites is "my fault," ;) I'd like to help with this.

I'm currently trying to wrap my head around how this module functions to know how to properly refactor it.

avpaderno’s picture

For an example of how the xmlsitemap_node table is populated (although the function has a different purpose), see the xmlsitmap_node_set_priority() function, and what is done when the table contains already the data for the node passed to the function.

I have to still understand how to implement code that is generic enough to be valid for each modules of the project, and all third-party modules integrating with XML Sitemap.

avpaderno’s picture

The implementation of the cron task should also follow a specific order in populating the database tables; in the same way done for the implementation of hook_xmlsitemap_links(), where XML Sitemap: Node is the first to populate the site map table (followed by XML Sitemap: Terms), also in this case XML Sitemap: Node should be called before XML Sitemap: Terms.

webchick’s picture

Assigned: Unassigned » webchick

Ok, I've now done enough investigative work that I think I basically understand what's going on. Assigning to myself.

Thanks for all of your help (and quick commits!) Kiam! :)

webchick’s picture

Category: feature » bug
Priority: Normal » Critical

Oh, and this now qualifies as a critical bug, since there is currently nothing in the code that populates legacy content.

avpaderno’s picture

I have a little question. Should the code be implemented for nodes only, or made more generic (and work for all the modules)?
If the code needs to be generic, then it will more complicated too. I guess that a single module should coordinate the others, otherwise they would try to populate their database tables at the same time, which is not a good thing.

Anonymous’s picture

Each module requiring a table build should contain its own hook_cron implementation IMO. The drupal_cron_run function does a module_invoke_all('cron') so the execution will happen sequentially based on the sorting weight of the module. Some control over how many rows or how much time is taken of the entire cron run should be managed globally though.

avpaderno’s picture

I was thinking to a control about which module should execute the cron task when the cron tasks are executed more than once; I mean that if the first time it's xmlsitemap_node to execute its hook_cron(), then the second time should be xmlsitemap_term (i.e.).

Otherwise, the cron tasks will be executed for all the XML Sitemap modules at the same time. I don't think this could be a problem, but I imagine the difficulties to calibrate the settings to make the automatic update as smooth as possible.

Anonymous’s picture

If you control the run by number of rows to process or a slice of time then the user would decide how much of the 240 seconds should go to xmlsitemap. I suppose a hook_cron_sitemap could be created for each module and xmlsitemap_cron would do the calling of each based on if there is time left.

webchick’s picture

What I have been doing to this point (my xmlsitemap module atm is a complete mess, so there's nothing really to upload in terms of a patch yet) is copy/pasting lots of code from search module.

Search module has a similar architecture to xmlsitemap, where there is a central module (search.module) running the show, and calling hooks in other modules that wish to respond (hook_update_index). So what I've done is basically:

/**
 * Implementation of hook_cron().
 */
function xmlsitemap_cron() {
  xmlsitemap_update();
}

/**
 * Updates the sitemap.
 */
function xmlsitemap_update() {
  $limit = variable_get('xmlsitemap_update_limit', 100);
  // Get link updates from everyone else.
  module_invoke_all('xmlsitemap_links', $limit);
  // Split the sitemap into chunks if necessary.
  // blah blah...
  menu_rebuild();
}

With xmlsitemap_update in its own function, rather than its stuffing its logic put into hook_cron(), we can call it from other places. For example, I put an "Update" button on the xmlsitemap settings page that can be clicked if someone can't or doesn't want to run cron. I figure we could also add in a toggle for NON high-performance sites to have the sitemap update each time a piece of content is posted, rather than at cron time.

However, when trying to refactor the hook_xmlsitemap_links implementations, I realized that no matter how performant you try and make the update step, right now it's completely wasted energy because modules completely invalidate their own cache every. single. time. the sitemap needs to be updated (see #442242: Add a 'reindex' column to xmlsitemap table?). I can't figure out a way around this problem without doing some more re-structuring in the data model.

So basically, now I'm feeling a little stuck. I want to keep going, but I'm likely to basically re-write huge chunks of the module if I do, and I know there are hordes of people begging/pleading/screaming for a 6.x release. On the other hand, the only reason I'm working on this at all is because one of our clients with 100,000+ nodes and 10 comments/second on their site needs it. And I can't deploy a module that can't handle this kind of traffic. This gives me the only other option of forking, which I don't want to do, since a) everyone and his grandmother already knows about this module, b) Kiam is a very active and responsible maintainer (and I'm definitely not ;)), and c) there have already been forks of this module, each with their own set of tweaky bugs.

So, I don't know; what do you guys think?

avpaderno’s picture

I think that the code should not depend on hook_xmlsitemap_links() that should be removed to raise the performance, and to avoid continuos passages of data from a database table to another.

I don't understand what you mean when you say that the modules invalidate their data every single time; the code is supposed to force the update of the modules database tables only when some settings are changed, and that should not happen all times.

I see that the actual code is removing the rows saved in the database table even when it should not (especially when it does it after the setting are changed), and it would be better if in some cases the code would simply mark the site map content as requiring to be regenerated.

EDIT: I created a report for the wrong behavior I described (see #443172: Database table is emptied all times some settings are changed).

Anonymous’s picture

To calm the masses Kiam could go back and grab the CVS by date that was somewhat working and tag it 6.1-alpha, then create a release from the tag. We then continue with the rework as we need.

avpaderno’s picture

That should be a step backward, and it would also mean that the changed I did until now have been useless (even when I fixed previously existing issues).

Anonymous’s picture

No, Kiam, nothing changes except the whining. It would require a separate checkout for the tagged date into a different folder on your client.

avpaderno’s picture

I should mention that I can only create an alpha version when the exhausted memory, and the PHP timeout problems are resolved. This is a directive given me by Darren Oh.

webchick’s picture

Just a note that I won't be able to get back to this before Monday afternoon. If anyone wants to try and jump in before then, feel free!

avpaderno’s picture

Assigned: webchick » avpaderno
Issue tags: -custom hooks

I will check out this, and see what damage I can do. :-)

avpaderno’s picture

Status: Needs work » Active

The code for xmlsitemap_node.module has been implemented.

avpaderno’s picture

Status: Active » Fixed

The code has been implemented for xmlsitemap_term.module as well. The change has been committed in CVS.

Feel free to reopen the issue if the support must be provided for other modules too.

Anonymous’s picture

Do you need one for the users table? I will test today.

avpaderno’s picture

Title: hook_cron() support for missing nodes and terms » hook_cron() support for populating the database tables
Status: Fixed » Active

You are right; xmlsitemap_user.module is the only module that doesn't implement the cron hook.

avpaderno’s picture

Status: Active » Fixed

hook_cron() has been implemented in xmlsitemap_user.module.

Thanks again to earnie for his support, and for being the original promoter of this method to populate the database table with data coming from Drupal core tables.

Status: Fixed » Closed (fixed)
Issue tags: -Drupal core hooks

Automatically closed -- issue fixed for 2 weeks with no activity.