One thing I ran into while trying to make a 'News sitemap' sub-module using the new context system is that there will be a conflict when several different modules try to provide different sitemap types that cannot be combined. We'll need an elegant way to handle a global 'type' context (for XML sitemap types: normal, news, video, mobile, etc.).

I'm also running into the problem that the XML delivery callback needs to accept an argument of contexts as well, so that a new menu item like 'sitemap-news.xml' would automatically have the 'news' type context be TRUE.

Comments

Dave Reid’s picture

Issue tags: +Release blocker
Dave Reid’s picture

This context should probably be provided by the base module itself. Other modules would then only need to alter the sitemap edit form to provide the additional type values.

Dave Reid’s picture

Another thing to consider is being able to provide a partial context array to the sitemap page delivery callback. So an xmlsitemap_news.module could add a new menu router item:

  $items['sitemap-news.xml'] = array(
    'page callback' => 'xmlsitemap_output_chunk',
    'page arguments' => array(array('type' => 'news')),
    'access arguments' => array('access content'),
    'type' => MENU_CALLBACK,
    'file' => 'xmlsitemap.pages.inc',
    'file path' => drupal_get_path('module', 'xmlsitemap'),
  );
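
For illustration only, here is a minimal sketch of how a delivery callback might turn such a partial context array into a full one. The helper name example_sitemap_resolve_context() and the default context keys are assumptions made for this example, not part of xmlsitemap.

```php
<?php
/**
 * Hypothetical helper: complete a partial context array with defaults.
 *
 * Keys that are not known contexts are dropped; everything else overrides
 * the default value for that context.
 */
function example_sitemap_resolve_context(array $partial, array $defaults) {
  // Keep only the keys that exist in the defaults (the known contexts).
  $known = array_intersect_key($partial, $defaults);
  // The + operator keeps values from the left-hand array on key collisions.
  return $known + $defaults;
}

// Assumed defaults; a real site would collect these from the context system.
$defaults = array('type' => 'normal', 'language' => 'en');
$context = example_sitemap_resolve_context(array('type' => 'news'), $defaults);
```

With the 'page arguments' above, the callback would receive array('type' => 'news') and could resolve the remaining contexts in this way before looking up the matching sitemap.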
Dave Reid’s picture

Priority: Normal » Major
willmoy’s picture

Sub but will try to be helpful in Oct if still hanging around

willmoy’s picture

I hope this is helpful...

There are seven types of sitemap:

Type   | No. of URLs                                                       | Content included | Separate file?
-------|-------------------------------------------------------------------|------------------|---------------
Normal | Millions                                                          | All-ish          | Yes
Mobile | ? can't find an example but in principle = normal sitemap         | All-ish          | Must
News   | Hundreds (even CNN = 232)                                         | Narrow range     | Should
Video  | Thousands (CNN = 3-4000, USA Today ~1400) but ~10m for Vimeo      | Node type?       | Can do
Image  | Millions ?                                                        | Node type ?      | No
Geo    | Millions ?                                                        | Millions ?       | No
Code   | Millions ?                                                        | Node type ?      | Can do

The point being that they have to integrate in very different ways.

  • News is the extreme case, having almost nothing to do with a normal sitemap; it need be no more complicated than the current 6.x module.
  • Mobile basically has to, and video and code all might, use the same infrastructure (own file, massive scale) as the main sitemap.
  • Image and geo must be able to, and video, code and even news also could, just extend the main sitemap.

Contexts, as implemented in i18n and domain_access, are simply extra constraints on the SELECT query on xmlsitemap. It is not obvious how that can be extended to handle these cases.

The blunt selection of entities in the module at the moment will do for a sitemap that we presume ought to have just about everything in it, but it isn't really flexible enough for news, images etc.

So I think the thing to do is:

  • Extend the xmlsitemap table with one column per type of supported sitemap, equivalent to the status (in or out) column for the general sitemap, so that each sitemap knows what urls it applies to (this is done by a xmlsitemap_specialist module)
  • Those extra status columns in xmlsitemap are populated through rules + actions (and VBO if necessary). This gives a simple and outsourced interface, flexibility when needed and simplicity for most use cases.
  • The xmlsitemap_specialist module defines any extra tables it needs to store metadata in and query_alters to JOIN them when necessary in generate_sitemap()
  • _specialist also implements hook_cron() and then passes off to xmlsitemap to do the actual generation
  • An extra hook_xmlsitemap_element_alter($link, &$element) to allow adding namespaced metadata before the element is sent to writeSitemapElement(), because xmlsitemap only knows about its own properties
  • Contexts will continue to work as they do now, and can work on all different kinds of sitemap, which they probably need to, so you can get your localised news sitemap without any changes needed to i18n module.
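
To make the proposed alter hook a bit more concrete, here is a rough sketch of what an implementation might look like for news. The hook itself does not exist yet; the 'status_news', 'news_title' and 'created' keys on $link and the nested array shape of $element are assumptions for the example. The news:title and news:publication_date tag names come from the Google News sitemap format.

```php
<?php
/**
 * Hypothetical implementation of the proposed hook_xmlsitemap_element_alter().
 *
 * Adds namespaced news metadata to an element before it is written out.
 */
function example_news_xmlsitemap_element_alter(array $link, array &$element) {
  // 'status_news' stands in for the per-type status column suggested above.
  if (empty($link['status_news'])) {
    return;
  }
  $element['news:news'] = array(
    'news:title' => $link['news_title'],
    // Dates are formatted in UTC so the output does not depend on server settings.
    'news:publication_date' => gmdate('Y-m-d', $link['created']),
  );
}

// Example link row and element, both made up for this sketch.
$link = array(
  'loc' => 'node/1',
  'status_news' => 1,
  'news_title' => 'Example headline',
  'created' => 0,
);
$element = array('loc' => 'http://example.com/node/1');
example_news_xmlsitemap_element_alter($link, $element);
```

Links that are not flagged for the news sitemap pass through unchanged, so the same hook could run during generation of the general sitemap without side effects.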

Imagine a site with news content containing significant images. You want perhaps three sitemaps:

  1. General one, containing all pages including news pages, with image detail and video detail
  2. News one, containing only news pages for the past two days, no other extensions
  3. Mobile sitemap, containing mobile content (quite possibly the same set of urls as the general one) but no other extensions

The general one happens more or less as now. When articles are saved, a rule checks if they contain relevant image/video content and if so adds the appropriate sitemap status and metadata. When the general sitemap is being generated, xmlsitemap_specialist notices (by testing $sitemap in query_alter), JOINs its metadata tables, and extends the elements with hook_xmlsitemap_element_alter where applicable.

The news one is its own sitemap, and its own file. Its cron kicks off on every run and it uses a fairly complex set of criteria to decide what goes in (blog posts with certain tags are news for this site), as well as news articles.

The mobile one is like the general one. It has blanket rules so it includes all content and a blanket setting that all mobile content is in, for argument's sake, WAP. It does a straight query off the xmlsitemap table and spits it out, no JOIN.

As I say, massively overkill for news but will work for the others too.

I should also say there may be better ways but it looked like an effort was in order, so here's mine.

digi24’s picture

Extend the xmlsitemap table with one column per type of supported sitemap, equivalent to the status (in or out) column for the general sitemap, so that each sitemap knows what urls it applies to (this done by a xmlsitemap_specialist module)

Wouldn't it be better to have a separate table for each sitemap type? Each table could store its relevant metadata, thus keeping up performance and joining on numerical ids is not any significant overhead for the db.

willmoy’s picture

2 reasons I didn't think so:
(1) I think this would be easier to integrate with the existing codebase, because it would just need a small change to the function that does the query on the _links table
(2) Several sitemap formats don't need any per-link metadata (mobile and news come to mind)

digi24’s picture

News does need additional data: title, keywords, type access for example, so do images in sitemaps. If you do not store this data in a separate table you will have to retrieve it for each sitemap generation. But then you could just use views instead of the xmlsitemap module to generate them on the fly. Please correct me if I am wrong.

willmoy’s picture

Sorry, haven't been able to look at this before now. I think I reckoned that for many use cases, the extra properties of news sitemaps would be the same for every article, so you could save the overhead by just specifying them once in the sitemap config—but that doesn't deal with the 'strongly recommended' title property.
http://www.google.com/support/news_pub/bin/answer.py?answer=116037

And you're definitely right about image.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=178636
(Perhaps the smoothest way to handle image sitemaps would be to parse nodes for img tags, and pick up src, title and alt attributes for loc, title and caption respectively?)
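
As a rough sketch of that idea, the img tags in a node's rendered body could be read with PHP's DOM extension. The function name example_extract_images() is made up for this example, and mapping src, title and alt to loc, title and caption simply follows the suggestion above.

```php
<?php
/**
 * Hypothetical helper: collect image sitemap data from an HTML fragment.
 *
 * Returns one array per <img> tag with a src attribute, keyed by the
 * image sitemap properties loc, title and caption.
 */
function example_extract_images($html) {
  $images = array();
  $doc = new DOMDocument();
  // Node bodies are rarely perfect HTML, so collect parser warnings
  // internally instead of emitting them.
  $previous = libxml_use_internal_errors(TRUE);
  $doc->loadHTML('<div>' . $html . '</div>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Skip images without a source; they cannot produce a loc value.
    if ($src === '') {
      continue;
    }
    $images[] = array(
      'loc' => $src,
      'title' => $img->getAttribute('title'),
      'caption' => $img->getAttribute('alt'),
    );
  }
  return $images;
}

$images = example_extract_images('<p><img src="/a.png" title="T" alt="A"> <img alt="no src"></p>');
```

Relative src values would still need to be turned into absolute URLs before they are written to the sitemap.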

The only one that would work is mobile, I think, and even then not for some use cases.

So yeah, separate tables joined on ID and type.
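
A minimal sketch of what that could look like for news, in Drupal schema-array form. The table name xmlsitemap_news, its columns, and the assumption that links in the xmlsitemap table are identified by id and type are all illustrative, not an agreed design.

```php
<?php
/**
 * Hypothetical schema for a per-type metadata table (news in this case).
 */
function example_news_schema() {
  $schema['xmlsitemap_news'] = array(
    'description' => 'Extra metadata for links that appear in the news sitemap.',
    'fields' => array(
      // id and type together point at one row of the main xmlsitemap table.
      'id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
      'type' => array('type' => 'varchar', 'length' => 32, 'not null' => TRUE),
      'title' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      'keywords' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
    ),
    'primary key' => array('id', 'type'),
  );
  return $schema;
}

// The join a news sitemap generator might add to the main query.
$sql = 'SELECT x.loc, n.title, n.keywords
  FROM {xmlsitemap} x
  INNER JOIN {xmlsitemap_news} n ON n.id = x.id AND n.type = x.type
  WHERE x.status = 1';
```

An INNER JOIN restricts the result to links that have a news row, so the metadata table doubles as the "in or out" flag for that sitemap type.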

Rhino’s picture

Subscribing.

BenK’s picture

Subscribing

70111m’s picture

Subscribing

bensnyder’s picture

Sub

anavarre’s picture

Subscribe

lemming’s picture

I think I have been posting:

http://drupal.org/node/451234#comment-6083094

to the wrong thread maybe? It might be more relevant here?

DamienMcKenna’s picture