The following are some tips for high-load sites that are using the XML sitemap module (2.x version).

  • Enable the following options in admin/settings/xmlsitemap:
    • Minimum sitemap lifetime: 1 day (or longer). The module is smart enough to update the cache files only if the sitemap data has changed. But if your site changes often, the sitemap may be rebuilt every time cron runs. Since generating the sitemap files is a very expensive operation, and since search engines request your site's sitemap.xml only every so often, it is best to set this to 1 day or higher.
  • Enable some mod_rewrite rules to avoid a full Drupal bootstrap.

Comments

SlyK

"Enable some mod_rewrite rules to avoid a full Drupal bootstrap."
Can you please tell me what these "some" rules are? I don't understand... is it a secret?

chrisdfeld

Ok, I'll step up to the plate on this one. I added the following rules to .htaccess to detect requests for sitemap.xml and serve them directly from the filesystem. This offers better performance than the default arrangement, which loads the Drupal bootstrap in order to serve the sitemap.

These rules go inside .htaccess above Drupal's default <IfModule mod_rewrite.c> section. You need to look in your /sites/default/files/xmlsitemap folder to find the name of the hash directory created by the xmlsitemap module, then paste the directory name in the code below where it says INSERT_HASH_DIRECTORY_NAME_HERE. It will be a string of 32 hex letters and digits.

<IfModule mod_rewrite.c>
  RewriteEngine on

  # The folder containing your sitemap XML files, relative to DOCUMENT_ROOT
  # This folder is created by the xmlsitemap Drupal module
  RewriteRule .* - [E=xmlsitemap_path:/sites/default/files/xmlsitemap/INSERT_HASH_DIRECTORY_NAME_HERE]

  # Handle requests for /sitemap.xml directly, without Drupal bootstrap
  # Multi-page sitemaps: point to index.xml
  RewriteCond %{QUERY_STRING} ^$
  RewriteCond %{REQUEST_URI} ^/*sitemap\.xml$
  RewriteCond %{DOCUMENT_ROOT}%{ENV:xmlsitemap_path}/index.xml -f
  RewriteRule .* %{ENV:xmlsitemap_path}/index.xml [L]

  # Single-page sitemaps: point to 1.xml (index.xml does not exist)
  RewriteCond %{QUERY_STRING} ^$
  RewriteCond %{REQUEST_URI} ^/*sitemap\.xml$
  RewriteCond %{DOCUMENT_ROOT}%{ENV:xmlsitemap_path}/1.xml -f
  RewriteRule .* %{ENV:xmlsitemap_path}/1.xml [L]

  # Serve a specific page as referenced from index.xml: /sitemap.xml?page=N
  RewriteCond %{REQUEST_URI} ^/*sitemap\.xml$
  RewriteCond %{QUERY_STRING} ^page=(\d+)$
  RewriteRule .* %{ENV:xmlsitemap_path}/%1.xml? [L]
</IfModule>

This code performs an internal rewrite rather than an HTTP redirect, so the sitemap content still appears to come from a file called /sitemap.xml. The physical location of the sitemap XML files is not exposed to search engine spiders.
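
To make the order of the rules easier to follow, here is a rough Python model of the decision they make. It is only a sketch of the logic, not something Apache runs. The function name resolve_sitemap and its parameters are made up for illustration, and sitemap_dir stands for the hash directory on disk.

```python
import re
from pathlib import Path
from typing import Optional


def resolve_sitemap(path: str, query: str, sitemap_dir: Path) -> Optional[Path]:
    """Return the file the rewrite rules would serve, or None if no rule
    matches and the request falls through to Drupal.

    Hypothetical helper: it only models the three rule groups above.
    """
    # Mirrors: RewriteCond %{REQUEST_URI} ^/*sitemap\.xml$
    if not re.fullmatch(r"/*sitemap\.xml", path):
        return None

    if query == "":
        # Multi-page sitemaps are tried first (index.xml), then single-page
        # sitemaps (1.xml). Both rules require the file to exist (-f).
        for name in ("index.xml", "1.xml"):
            candidate = sitemap_dir / name
            if candidate.is_file():
                return candidate
        return None

    # Mirrors: RewriteCond %{QUERY_STRING} ^page=(\d+)$
    match = re.fullmatch(r"page=([0-9]+)", query)
    if match:
        # The third rule has no -f check, so a missing page is not detected here.
        return sitemap_dir / (match.group(1) + ".xml")

    return None
```

For example, resolve_sitemap("/sitemap.xml", "page=2", some_dir) returns some_dir / "2.xml", while any other query string returns None and the request is left to Drupal.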

I tested this on a single-site Drupal 6.17 installation with xmlsitemap 6.x-2.0-beta1 using Apache 2.2. This code would need to be modified for a multi-site installation.
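
If you would rather not browse the files directory by hand, a small script along these lines could list candidate hash directory names. This is a sketch that assumes the default single-site path mentioned above and that the hash directory name is exactly 32 hex characters. The function name find_hash_directories is made up for illustration.

```python
import re
from pathlib import Path
from typing import List

# Assumption: the directory name is exactly 32 hexadecimal characters.
HASH_NAME = re.compile(r"[0-9a-f]{32}", re.IGNORECASE)


def find_hash_directories(xmlsitemap_root: Path) -> List[str]:
    """Return the names of subdirectories that look like hash directories."""
    return sorted(
        entry.name
        for entry in xmlsitemap_root.iterdir()
        if entry.is_dir() and HASH_NAME.fullmatch(entry.name)
    )


if __name__ == "__main__":
    # Assumed default location, relative to the Drupal root.
    root = Path("sites/default/files/xmlsitemap")
    if root.is_dir():
        for name in find_hash_directories(root):
            print(name)
    else:
        print("Directory not found:", root)
```

Run it from the Drupal root. If it prints more than one name, check which directory holds the current sitemap files before pasting a name into the rules.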

Exploratus

What about Domain Access and this technique?

premanup

I've successfully tested this config snippet for nginx/1.2.3:

location = /sitemap.xml {
  root SITEMAP_HASH_DIRECTORY;
  try_files /$arg_page.xml /index.xml =404;
}

SITEMAP_HASH_DIRECTORY is the full path to the hash directory created by the xmlsitemap module.