  • Support for more than one URL list. The default list would be the one the module already provides, but users would have the possibility to add more URL lists. Rather than having one big list, it would be possible to create several short URL lists, which would reduce the resources used to generate each one; considering that both Yahoo! and Google support more than one URL list, this could help the module avoid exhausting the available resources when the number of nodes is very high. Having co-maintained XML sitemap for 6 months, I can say that is probably the only way to resolve the resource problem when there are too many URLs to add to a single list.
  • Support for caching the URL list. Caching the generated URL list can shorten the time the module takes to output it. The downside of caching the output is that it tends not to be updated as quickly as directly generated output would be, but adding more options for how the list is generated can help with this problem.

Any feedback is welcome.

Comments

deekayen’s picture

I could see some urllist option expansion like urllist by content type or author. Did you have any particular sub-list content categories in mind?

I'm more concerned about the caching. URL lists are not high-traffic locations, so the only benefit seems to be making it faster for a bot to download. A bot has no feelings about the delay unless it times out entirely. I don't see why the module should lose the ability to show the most recent URL list in light of that. Whether cached or not, the server still has to make the effort to generate the file, and it seems like useless overhead to keep updating the list even when it isn't being queried.

apaderno’s picture

I was thinking more of a list based on the content type, but there could be other criteria.
The reason to have multiple lists rather than one single list is to make the generation of each URL list faster; not to make the bot wait less, but to avoid PHP timeout problems that would leave the output incomplete, or make the server less responsive to the requests it receives. That is the same reason a cache could be desirable.

The problems found in the development of XML sitemap are similar to the problems the URL list code could have. Apart from the fact that this module's output format is simpler (which means there is no problem of unclosed XML tags caused by a PHP timeout), creating a list that could be very long causes the same problems in both modules.

  • Returning the URL list by reading chunks of text from a prepared cache file is faster than printing each list item to the default output (see the sketch after this list).
  • Creating the list before the search engines request it lowers the server load; the server then has more time to reply to user requests.
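
To give an idea of the first point, here is a minimal sketch, assuming the list has already been written to a plain-text cache file during cron; the function name and the chunk size are assumptions, not existing module code.

  <?php
  // Sketch only: stream a pre-built cache file in 8 KB chunks instead of
  // printing each URL while the database is being queried.
  function urllist_output_cached_list($cache_file) {
    drupal_set_header('Content-Type: text/plain; charset=utf-8');
    $handle = fopen($cache_file, 'rb');
    if ($handle) {
      while (!feof($handle)) {
        print fread($handle, 8192);
      }
      fclose($handle);
    }
  }
  ?>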

I actually think that the user should be allowed to create different lists based on the content type. That way, it would be possible to optimize the creation of the cache files.
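
Something like the following could work for per-content-type lists; this is only a sketch, and the paths, the callback name, and the query are assumptions rather than the module's current code.

  <?php
  // Sketch only: expose one short URL list per content type
  // (e.g. urllist-story.txt, urllist-page.txt).
  function urllist_menu() {
    $items = array();
    foreach (node_get_types('names') as $type => $name) {
      $items['urllist-' . $type . '.txt'] = array(
        'page callback' => 'urllist_page_for_type',
        'page arguments' => array($type),
        'access arguments' => array('access content'),
        'type' => MENU_CALLBACK,
      );
    }
    return $items;
  }

  function urllist_page_for_type($type) {
    // Output plain text, one absolute URL per published node of this type.
    drupal_set_header('Content-Type: text/plain; charset=utf-8');
    $result = db_query("SELECT nid FROM {node} WHERE status = 1 AND type = '%s'", $type);
    while ($row = db_fetch_object($result)) {
      print url('node/' . $row->nid, array('absolute' => TRUE)) . "\n";
    }
  }
  ?>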

deekayen’s picture

You don't have to argue about the point that a cached list will load faster when requested. I agree that it likely would.

Even in an excessive example, like Drupal.org with 400,000+ nodes, I don't think cache is *the* solution to preventing incomplete lists. Yes, it is *a* solution. On the other hand, a content type selector might be. Didn't gsitemap have a feature at one time that would self-link to a continuing page if it went over 10,000 entries?

Whether you generate the cache files on every cron job, or delay it with cronplus, the server still has to manage the overhead of generating the cached files without PHP timing out in the process. It's important that the content of the URL list doesn't time out whether it's built in cron or live; either way, you still have to generate the content in a way that doesn't time out. With all the search bots from all the engines hitting the list maybe a half dozen times a day, the cache engine overhead isn't worth it.

It seems like you want to put a lot of effort into coding for an edge case that hasn't presented itself in the issue queue in any form other than hypothetical theory.

Dave Reid’s picture

If URLlist is going to be the 'lightweight' alternative to XML sitemap, it needs to be careful to keep itself distinguished instead of creeping in more and more features. I'm not a maintainer so I don't have any say, but I think it's something to keep in mind.

apaderno’s picture

@#3: I am sorry; I tend to repeat myself when I am not sure I was clear in my previous message.
Using hook_cron(), for example, it is still possible to avoid the timeout. Drupal core itself does that: it updates the search index without causing any timeouts, because the search module doesn't update the index when users search for something, but does so at other times.
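
As a rough sketch of what I mean, the cache file could be rebuilt a few hundred nodes per cron run, in the same spirit as the search module; the variable names, the batch size, and the file location are assumptions.

  <?php
  // Sketch only: append a limited number of node URLs to the cache file on
  // each cron run, so no single request has to build the whole list.
  // A real implementation would also need a way to rebuild the file when
  // nodes are deleted or unpublished.
  function urllist_cron() {
    $limit = variable_get('urllist_cron_limit', 500);
    $last_nid = variable_get('urllist_last_nid', 0);
    $file = file_directory_path() . '/urllist.txt';

    $handle = fopen($file, $last_nid ? 'ab' : 'wb');
    $result = db_query_range("SELECT nid FROM {node} WHERE status = 1 AND nid > %d ORDER BY nid", $last_nid, 0, $limit);
    while ($row = db_fetch_object($result)) {
      fwrite($handle, url('node/' . $row->nid, array('absolute' => TRUE)) . "\n");
      $last_nid = $row->nid;
    }
    fclose($handle);
    variable_set('urllist_last_nid', $last_nid);
  }
  ?>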

If what I propose is seen as an edge case, or it's not wanted because the module is not meant for sites with a lot of nodes, then we can simply set this report to "won't fix".
My idea was to make URL list an alternative to XML sitemap that could also be used by those who have a Drupal site with more than X nodes, and an alternative to using an RSS feed created with Views (which, in my case, is causing some problems because of a CCK field present in a content type I created).

@#4: URL list is a lightweight alternative to XML sitemap because it uses a simpler format that carries less information, and it creates only a node list (unlike XML sitemap, which populates its output with nodes, user profiles, taxonomy terms, menus, and custom links). It's not the use of a cache that makes the difference between URL list and XML sitemap.

deekayen’s picture

URL list has to distinguish itself from XML sitemap by more than just the output format. I'm sure that through a simple if statement somewhere, XML sitemap could skip the XML formatting and just output a list as URL list does.
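
Roughly something along these lines; the $plain_text flag and the link array keys here are hypothetical, not XML sitemap's actual code.

  <?php
  // Hypothetical sketch: the same loop could emit either plain URLs or
  // <url> elements depending on a single flag.
  foreach ($links as $link) {
    if ($plain_text) {
      print $link['loc'] . "\n";
    }
    else {
      print '<url><loc>' . check_plain($link['loc']) . '</loc></url>' . "\n";
    }
  }
  ?>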

Dave Reid’s picture

Yeah the data is all there. It'd be fairly easy to change the output format.

apaderno’s picture

I'm sure that through a simple if statement somewhere, XML sitemap could skip the XML formatting and just output a list as URL list does.

That is true, but then it would not be XML sitemap anymore.
As long as the output format doesn't change, the use of some kind of cache mechanism doesn't transform a module into another one.
That said, mine is a proposal; I could be completely wrong in what I am saying, and what I am suggesting is not something that absolutely must be implemented. If the changes I am suggesting are not wanted, then they simply will not be implemented.

deekayen’s picture

There are lots of popular contrib projects that include multiple modules (e.g. Views, Adsense, Workflow) as part of the package. Maybe, in preparation for those edge cases where caching might be helpful, an entirely optional urllist_cache.module would be something more acceptable and could be included in the URL List repo.

deekayen’s picture

Status: Active » Closed (won't fix)