At #448000-9: Big, Over-Arching Sitemap Architecture Discussion (tm) netaustin writes:

Any site with a respectable pagerank has its sitemap.xml file read frequently--I think this is partly to blame for quicksketch's above warning to his clients. I'd much prefer to serve an XML file that only changes on cron straight from Apache; no sense in having a PHP process open just to buffer a couple megabytes of XML.

Currently, because we serve the cache file from the Drupal path http://www.example.com/sitemap[X].xml, requesting the file incurs a full Drupal bootstrap. If you are a popular site and getting pounded by search engines daily, that's going to eat an awful lot of your CPU and database.

ImageCache module handles this in quite an interesting way:

/**
 * Implementation of hook_menu().
 */
function imagecache_menu() {
  $items = array();

  // standard imagecache callback.
  $items[file_directory_path() .'/imagecache'] = array(
    'page callback' => 'imagecache_cache',
    'access callback' => TRUE,
    'type' => MENU_CALLBACK
  );
  // private downloads imagecache callback
  $items['system/files/imagecache'] = array(
    'page callback' => 'imagecache_cache_private',
    'access callback' => TRUE,
    'type' => MENU_CALLBACK
  );

  return $items;
}

file_directory_path() points to your files directory (e.g. sites/default/files). This has the awesome side-effect of Apache serving the file if it exists at sites/default/files/xmlsitemap/sitemap.xml, and if it doesn't, Drupal will kick in and create it.

Now, I don't think XML Sitemap should move back to the model of "create the sitemap while you're looking at it" but I wonder if there's not something interesting we could use here.

Comments

apaderno writes:

The problem is with the links that can be put in a site map, which are directly tied to the site map URL; if the site map URL is http://example.com/system/files/sitemap.xml, then it cannot contain links like http://example.com/node/2, or http://example.com/this-is-an-alias-for-a-node.

That is the reason the site map is being served with a URL like http://example.com/sitemap (in the 1.x branch).
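
The scoping rule described here comes from the sitemap protocol: a sitemap may only list URLs on the same host that sit at or below the directory the sitemap itself is served from. A minimal sketch of that check in plain PHP follows. Note that sitemap_scope_allows() is a hypothetical helper written for illustration (it is not part of XML Sitemap or Drupal), and port numbers are ignored for brevity.

```php
<?php
/**
 * Hypothetical helper: TRUE if $url may be listed in a sitemap served
 * from $sitemap_url (same scheme and host, path under the sitemap's directory).
 */
function sitemap_scope_allows($sitemap_url, $url) {
  $s = parse_url($sitemap_url);
  $u = parse_url($url);
  if (!$s || !$u || !isset($s['scheme'], $s['host'], $u['scheme'], $u['host'])) {
    return FALSE;
  }
  // Scheme and host must match (compared case-insensitively).
  if (strtolower($s['scheme']) !== strtolower($u['scheme'])
    || strtolower($s['host']) !== strtolower($u['host'])) {
    return FALSE;
  }
  $s_path = isset($s['path']) ? $s['path'] : '/';
  $u_path = isset($u['path']) ? $u['path'] : '/';
  // Directory that contains the sitemap, including the trailing slash.
  $dir = substr($s_path, 0, strrpos($s_path, '/') + 1);
  // The listed URL's path must start with that directory.
  return strpos($u_path, $dir) === 0;
}
```

With the URLs from this comment, sitemap_scope_allows('http://example.com/system/files/sitemap.xml', 'http://example.com/node/2') is FALSE, while the same node URL checked against http://example.com/sitemap.xml is TRUE.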

apaderno writes:

The only way to make this feature possible is to create a sitemap.xml file directly in the Drupal base directory; if the front page is seen as http://example.com/dr6/, then the file should be dr6/sitemap.xml.

I am not sure the running PHP has write permissions on that directory, and that is the reason I didn't use that approach.

hass writes:

Title: Let sitemap.xml be served from the web server » Let Apache serve sitemap.xml file

Apache is not the only server we need to support. Don't forget IIS, please. The basedir is typically not writeable, but this could be solved with a rewrite rule pointing to the files directory. But editing a .htaccess file to get sitemap module running is something that is not user friendly and may not work on multi-site installations as we have different files directories and would add extra complexity to the .htaccess file with domain name detection.

apaderno writes:

Title: Let Apache serve sitemap.xml file » Let sitemap.xml be served from the web server

I guess that webchick meant the web server, not a specific web server.
I agree with hass that the workaround to make the proposed menu definition work is not user-friendly, and it would add complexity when it is not strictly needed.

webchick writes:

Title: Let Apache serve sitemap.xml file » Let sitemap.xml be served from the web server

Kiam is correct about the issue title.

And unfortunately, Kiam is also correct that the location of the sitemap dictates what links may be contained within it. Bummer. :( So we really do need the sitemap to be located in the Drupal root directory.

Giving PHP write access to Drupal's root directory is unacceptable from a security standpoint, and forcing users to edit .htaccess (or its IIS equivalent ;) ;)) to add a rewrite rule to point requests from sitemap.xml to sites/default/files/xmlsitemap/sitemap.xml is unacceptable from a usability standpoint.

I think this is won't fix (well, really more like "can't fix") unless someone else has any bright ideas.

Dave Reid writes:

Status: Active » Postponed

I am just going to leave this as postponed for now. If we can't come up with a solution by the time a stable 6.x-2.x release is ready, I'll come back and mark this won't fix.

apaderno writes:

Can a search engine follow a redirect when it is accessing the sitemap? In that case, which URL would it treat as the sitemap's URL (the original URL, or the redirected URL)?

gregarios writes:

Subscribing.

I also feel this module would work way better if it just saved a static file anywhere on the server. Let the (human) module installer make a root-level symlink to it if there are permissions problems. Maybe it would just need a little more documentation. (very little)

hass writes:

I'm not sure if "redirect" works, but "rewrite" will work for sure for advanced users. :-)

apaderno writes:

"rewrite" will work for sure for advanced users

See #5.

Anonymous writes:

I would think that /sitemap.xml would be a callback that just reads and prints a file from the files directory as is. No processing other than opening the file for read and printing its contents in full.

apaderno writes:

I would think that /sitemap.xml would be a callback that just reads and prints a file from the files directory as is.

In some cases that doesn't work, especially if the number of links per sitemap chunk is too high; in those cases, the PHP script could cause a timeout.
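
For reference, a "read and print" callback of the kind discussed here does not need to load the whole file into memory. The sketch below, in plain PHP, copies a file to an output stream in fixed-size chunks. The function name sitemap_stream_file() and the 8192-byte default are assumptions made for illustration; this keeps memory use flat but does not remove the cost of the PHP process or its execution time limit, which is the concern raised above.

```php
<?php
/**
 * Hypothetical helper: copies $path to the stream $out in chunks.
 * Returns the number of bytes written, or FALSE if the file cannot be opened.
 */
function sitemap_stream_file($path, $out, $chunk_size = 8192) {
  $in = @fopen($path, 'rb');
  if ($in === FALSE) {
    return FALSE;
  }
  $bytes = 0;
  while (!feof($in)) {
    // Read at most one chunk so memory use stays independent of file size.
    $buffer = fread($in, $chunk_size);
    if ($buffer === FALSE) {
      break;
    }
    $bytes += fwrite($out, $buffer);
  }
  fclose($in);
  return $bytes;
}
```

In a page callback this could be called as sitemap_stream_file($cache_file, fopen('php://output', 'wb')), where $cache_file stands for wherever the cached sitemap is stored.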

Anonymous writes:

What about creating a directory sitemap.xml in the root directory? This directory would be writable by the web server. Then /sitemap.xml would serve index.xml instead? A .htaccess in that directory would change the index file to index.xml and the rewrites would show the URL as entered.

gregarios writes:

@Kiam:

"See #5."

Rewrite can work for this. It can make it appear that the file is at the root no matter where it really is stored.

@webchick:

"And unfortunately, Kiam is also correct that the location of the sitemap dictates what links may be contained within it. Bummer. :( So we really do need the sitemap to be located in the Drupal root directory."

The sitemap file doesn't have to be anywhere in particular; it just has to appear to be at the root level of any directories listed in it. This could be done with a Rewrite in an .htaccess file, or with a Symlink, or with a directory named "sitemap.xml" with an index.html file in it.

apaderno writes:

In that case, the sitemap would appear as http://example.com/sitemap.xml/ (when using a directory with that name).
I am sure the search engines would not be fooled by that.

Dave Reid writes:

Status: Postponed » Closed (won't fix)

For high-performance sites, it's best to add a rewrite rule or a symbolic link to the cached file in the files directory. However, this basically won't work for multisite or multilingual installs. We need a rewrite guru to add to the XML sitemap documentation page "Tips for high-volume sites" at http://drupal.org/node/483476. As such, I am marking this as won't fix, since I don't see any solution that will work for the majority of module users besides the current method of using a simple callback and using file_transfer() to simply read and print the output of the cache file.
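
As a rough illustration of the rewrite-rule approach for a single-site Apache install, a fragment along the following lines could go in Drupal's root .htaccess, above Drupal's own catch-all rewrite rule. This is a sketch under assumptions, not tested documentation: it assumes Drupal sits in the web server's document root, and it reuses the example path sites/default/files/xmlsitemap/sitemap.xml mentioned earlier in this thread, which may not match where the module actually writes its cache file.

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on
  # Only rewrite when the static file exists; otherwise the request
  # falls through to Drupal's normal sitemap.xml callback.
  RewriteCond %{DOCUMENT_ROOT}/sites/default/files/xmlsitemap/sitemap.xml -f
  RewriteRule ^sitemap\.xml$ sites/default/files/xmlsitemap/sitemap.xml [L]
</IfModule>
```

Because this is an internal rewrite rather than a redirect, crawlers still see the file at /sitemap.xml. Multisite setups would need extra host-based conditions, which is the complexity raised earlier in the thread.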

chrisdfeld writes:

I added some Apache mod_rewrite code for this to the appropriate documentation page here: http://drupal.org/node/483476#comment-3069860

Feedback/improvements welcome.