Closed (fixed)
Project:
XML sitemap
Version:
6.x-1.x-dev
Component:
Code
Priority:
Critical
Category:
Bug report
Assigned:
Reporter:
Anonymous (not verified)
Created:
29 Jan 2009 at 16:01 UTC
Updated:
21 Aug 2018 at 03:31 UTC
Jump to comment: Most recent, Most recent file
Comments
Comment #1
scottm316 commentedI couldn't get the patch to apply. I got the file line "function xmlsitemap_node_api(&$node, $op.....{"
and the patch line "switch ($op)" do not match! Maybe I'm doing something wrong. I am using TortoiseSVN, and most recent xmlsitemap 1.x-dev. I also have the issue of no node entries. http://drupal.org/node/351570
Comment #2
Anonymous (not verified) commentedTry this one.
cd modules/xmlsitemap
patch -p0 < xmlsitemap_node-hook_cron-DRUPAL-6--1.patch
The -dev is synchronized with CVS via a scheduled CRON. The patch is created against the CVS code for the 6.x-1.x-dev branch.
Comment #3
scottm316 commentedI'm sorry I tried using the second patch but get the same error. Using 6.x-1.x-dev branch.
Comment #4
Anonymous (not verified) commentedKeeping up with the CVS changes.
@scottm316: Make sure you apply the patch to the current CVS. The file export might not be up-to-date yet.
Comment #5
scottm316 commentedI can forward you the .rej file if necessary.
[root@vstation001 xmlsitemap]# patch -p0 < xmlsitemap_node-hook_cron-DRUPAL-6--1_0.patch
patching file xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 110.
Hunk #2 succeeded at 155 with fuzz 2 (offset -1 lines).
1 out of 3 hunks FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.module.rej
Comment #6
avpadernoTortoiseSVN is not for CVS; you should be using TortoiseCVS.
Comment #7
avpaderno@earnie: You added a fix for the misspelled property in your patch, when that is already fixed in CVS.
The code you are proposing uses a query that select all the rows in the xmlsitemap_node table (
WHERE nid NOT IN (SELECT nid FROM {xmlsitemap_node}); I imagine it would be a problem when that table contains almost all the nodes contained in Drupal core table, and the number of nodes is a big enough number.I think the concept could be applied, but rather than using
hook_cron(), I would use a different hook. I would like to do like Poormanscron does, and allow the module to handle few nodes at every page load.The vantage of the approach is that the administrator of a Drupal-powered site would not need to go to a specific XML Sitemap page to update the data contained in its tables (it's also true for the solution you propose), and XML Sitemap would not consume excessive time resources like it is happening now with
hook_xmlsitemap_links().Comment #8
Anonymous (not verified) commentedI forgot to cut the misspell correction from the patch, sorry. I've modified the query in my version like so:
This will only select rows from {node} and none from {xmlsitemap_node}. I'll upload a patch later, I want to work out the xmlsitemap_term issues so that I can begin using this module in production.
Comment #9
avpadernoThe checking for
n.promotemust be eliminated: the site map doesn't contain only the nodes that are promoted to front page.Comment #10
avpadernoit would be better if the number of nodes added in the module table during a cron session would be set in a Drupal variable which is set in the settings page of XML Sitemap.
Comment #11
Anonymous (not verified) commentedThanks for the hint on the promote filter and yes the LIMIT is being set by a variable_get with a default of 20. I haven't coded the settings page yet.
Comment #12
avpaderno@#8: If you use that query, then you must also add a check for the returned
nid.I would change it in
SELECT n.nid FROM {node} n LEFT JOIN {xmlsitemap_node} xn ON n.nid = xn.nid WHERE xn.nid IS NULL AND n.nid > %d AND n.status = 1 LIMIT %d.The code must then save the last
nid, and use it for the query.Comment #13
avpadernoTalking of the hook to implement for this case, I think that
hook_cron()is the best candidate. If then it's required to add less rows and to do it more frequently, then it is possible to use Poormanscron. The only alternative I have found is to implementhook_exit(), which creates some problems as the code should bootstrap Drupal to be sure the required functions are available (but then it would not be sure also other modules would be loaded).Comment #14
Anonymous (not verified) commentedI don't get your suggestion of n.nid > %d. Only n.nid where xn.nid is null (i.e. the nid doesn't exist in xmlsitemap_node) will be returned. I've tested this using the same SQL via phpmyadmin. Can you explain what you're thinking for n.nid > %d and from where the value comes?
Comment #15
avpadernoMaybe it's useless, but my idea was to start from the last node inserted, and start the query from the node having the successive
nid.This can help in the case the data of one node added in the module table get removed from it; in that case, the query you proposed would start again from that
nid, while mine would start from the previous saved row.Of course, the code must save the
nidobtained from the query.Comment #16
Anonymous (not verified) commentedI've implemented this feature for both nodes and terms. The patch includes a UI change to control the number of rows updated at once; one for nodes and a different one for terms. It also implements some changes in xmlsitemap_node that I found necessary to make it easier to implement this feature. I'm using feedAPI and autotag modules and this patch has good results with version 1.5 of feedAPI and version 1.18 of autotag.
Comment #17
Anonymous (not verified) commentedI meant to also say that I don't see merit for starting greater than the last node updated since all nodes missing should exist in xmlsitemap_node and the query only chooses missing nodes.
Comment #18
avpadernoIf, for any causes, a row is being removed from the database table right after
hook_cron()has added it, the implementation ofhook_cron()suggested would try all the time to add it again.To save the node ID of the last saved node is something that the Drupal core modules do too, when they use the batch operations to operate on nodes. It should not be required to do such thing in those cases too, but they do it the same; I would guess there is a reason.
Comment #19
avpadernoThe patch needs to be changed.
hook_cron().INSERT INTOqueries, when it would be better to use thedrupal_write_record()that would also avoid to change the SQL query placeholders in the case, i.e., a field would be changed from integer to float.hook_cron(), it would be better to have a single module that implements it, and then calls a hook implemented in the other modules.Comment #20
avpadernoIt will not implemented; to populate the modules database tables has been implemented a different system.
Comment #21
Anonymous (not verified) commentedFair enough; I almost did this myself.
Comment #22
avpadernoThe proposed patch has not been changed from February 6, and the code actually implemented in the project modules follows a completely different approach to update the content of the module database tables.
Comment #23
webchickSubscribe.
Comment #24
avpadernoFollowing what reported by webchick, I am reopening this issue.
The batch operation code used to populate the module database tables has been removed, and it will be implemented a different system based on
hook_cron().The patches reported in this queue need to be update for the last code changes.
Comment #25
Anonymous (not verified) commentedI won't be able to update the patch anytime soon if someone wants to beat me to it please do.
Comment #26
webchickSince this whole migration to using cron for use with high-performance sites is "my fault," ;) I'd like to help with this.
I'm currently trying to wrap my head around how this module functions to know how to properly refactor it.
Comment #27
avpadernoFor an example of how the xmlsitemap_node table is populated (although the function has a different purpose), see the
xmlsitmap_node_set_priority()function, and what is done when the table contains already the data for the node passed to the function.I have to still understand how to implement code that is generic enough to be valid for each modules of the project, and all third-party modules integrating with XML Sitemap.
Comment #28
avpadernoThe implementation of the cron task should also follow a specific order in populating the database tables; in the same way done for the implementation of
hook_xmlsitemap_links(), where XML Sitemap: Node is the first to populate the site map table (followed by XML Sitemap: Terms), also in this case XML Sitemap: Node should be called before XML Sitemap: Terms.Comment #29
webchickOk, I've now done enough investigative work that I think I basically understand what's going on. Assigning to myself.
Thanks for all of your help (and quick commits!) Kiam! :)
Comment #30
webchickOh, and this now qualifies as a critical bug, since there is currently nothing in the code that populates legacy content.
Comment #31
avpadernoI have a little question. Should the code be implemented for nodes only, or made more generic (and work for all the modules)?
If the code needs to be generic, then it will more complicated too. I guess that a single module should coordinate the others, otherwise they would try to populate their database tables at the same time, which is not a good thing.
Comment #32
Anonymous (not verified) commentedEach module requiring a table build should contain its own hook_cron implementation IMO. The drupal_cron_run function does a module_invoke_all('cron') so the execution will happen sequentially based on the sorting weight of the module. Some control over how many rows or how much time is taken of the entire cron run should be managed globally though.
Comment #33
avpadernoI was thinking to a control about which module should execute the cron task when the cron tasks are executed more than once; I mean that if the first time it's xmlsitemap_node to execute its
hook_cron(), then the second time should be xmlsitemap_term (i.e.).Otherwise, the cron tasks will be executed for all the XML Sitemap modules at the same time. I don't think this could be a problem, but I imagine the difficulties to calibrate the settings to make the automatic update as smooth as possible.
Comment #34
Anonymous (not verified) commentedIf you control the run by number of rows to process or a slice of time then the user would decide how much of the 240 seconds should go to xmlsitemap. I suppose a hook_cron_sitemap could be created for each module and xmlsitemap_cron would do the calling of each based on if there is time left.
Comment #35
webchickWhat I have been doing to this point (my xmlsitemap module atm is a complete mess, so there's nothing really to upload in terms of a patch yet) is copy/pasting lots of code from search module.
Search module has a similar architecture to xmlsitemap, where there is a central module (search.module) running the show, and calling hooks in other modules that wish to respond (hook_update_index). So what I've done is basically:
With xmlsitemap_update in its own function, rather than its stuffing its logic put into hook_cron(), we can call it from other places. For example, I put an "Update" button on the xmlsitemap settings page that can be clicked if someone can't or doesn't want to run cron. I figure we could also add in a toggle for NON high-performance sites to have the sitemap update each time a piece of content is posted, rather than at cron time.
However, when trying to refactor the hook_xmlsitemap_links implementations, I realized that no matter how performant you try and make the update step, right now it's completely wasted energy because modules completely invalidate their own cache every. single. time. the sitemap needs to be updated (see #442242: Add a 'reindex' column to xmlsitemap table?). I can't figure out a way around this problem without doing some more re-structuring in the data model.
So basically, now I'm feeling a little stuck. I want to keep going, but I'm likely to basically re-write huge chunks of the module if I do, and I know there are hordes of people begging/pleading/screaming for a 6.x release. On the other hand, the only reason I'm working on this at all is because one of our clients with 100,000+ nodes and 10 comments/second on their site needs it. And I can't deploy a module that can't handle this kind of traffic. This gives me the only other option of forking, which I don't want to do, since a) everyone and his grandmother already knows about this module, b) Kiam is a very active and responsible maintainer (and I'm definitely not ;)), and c) there have already been forks of this module, each with their own set of tweaky bugs.
So, I don't know; what do you guys think?
Comment #36
avpadernoI think that the code should not depend on
hook_xmlsitemap_links()that should be removed to raise the performance, and to avoid continuos passages of data from a database table to another.I don't understand what you mean when you say that the modules invalidate their data every single time; the code is supposed to force the update of the modules database tables only when some settings are changed, and that should not happen all times.
I see that the actual code is removing the rows saved in the database table even when it should not (especially when it does it after the setting are changed), and it would be better if in some cases the code would simply mark the site map content as requiring to be regenerated.
EDIT: I created a report for the wrong behavior I described (see #443172: Database table is emptied all times some settings are changed).
Comment #37
Anonymous (not verified) commentedTo calm the masses Kiam could go back and grab the CVS by date that was somewhat working and tag it 6.1-alpha, then create a release from the tag. We then continue with the rework as we need.
Comment #38
avpadernoThat should be a step backward, and it would also mean that the changed I did until now have been useless (even when I fixed previously existing issues).
Comment #39
Anonymous (not verified) commentedNo, Kiam, nothing changes except the whining. It would require a separate checkout for the tagged date into a different folder on your client.
Comment #40
avpadernoI should mention that I can only create an alpha version when the exhausted memory, and the PHP timeout problems are resolved. This is a directive given me by Darren Oh.
Comment #41
webchickJust a note that I won't be able to get back to this before Monday afternoon. If anyone wants to try and jump in before then, feel free!
Comment #42
avpadernoI will check out this, and see what damage I can do. :-)
Comment #43
avpadernoThe code for xmlsitemap_node.module has been implemented.
Comment #44
avpadernoThe code has been implemented for xmlsitemap_term.module as well. The change has been committed in CVS.
Feel free to reopen the issue if the support must be provided for other modules too.
Comment #45
Anonymous (not verified) commentedDo you need one for the users table? I will test today.
Comment #46
avpadernoYou are right; xmlsitemap_user.module is the only module that doesn't implement the cron hook.
Comment #47
avpadernohook_cron()has been implemented in xmlsitemap_user.module.Thanks again to earnie for his support, and for being the original promoter of this method to populate the database table with data coming from Drupal core tables.