Last updated 31 August 2016. Created on 31 July 2005.
Edited by cilefen, mvc, eosrei, doitDave.

How to produce a static mirror of a Drupal website?

Note: You should only use this technique on your own sites.

Prepare the Drupal website

Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving. Consider including a link to the future versions of the site (e.g. if you are archiving a 2008 event, link to the URL of the next event).

Disable interactive elements which will be nonfunctional in the static HTML version. The Disable All Forms module can disable all forms at once; elements to disable include:

  • the login block
  • the "Who's online" block
  • user registration
  • anonymous commenting
  • links to the search module and/or any search boxes in the header
  • comment controls which allow the user to select the comment display format
  • Ajax requests such as Views pagers
  • Views exposed filters
  • Update all nodes by setting their comments to read only. This eliminates the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL command against the node table (see the drush sketch after this list):
    UPDATE node SET comment = '1';
  • It can also be a good idea to disable any third-party, dynamically generated blocks; once the site is archived, it would be difficult to remove these blocks if the third-party services are no longer available.
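If you have Drush available, some of these preparation steps can also be scripted. The following is a minimal sketch assuming a Drupal 7 site and Drush 7/8; the module machine name disable_all_forms is an assumption, so check it against the project page before running:

# Close comments on all existing nodes (1 = read only in Drupal 7).
drush sql-query "UPDATE node SET comment = 1;"
# Download and enable the Disable All Forms module (machine name assumed).
drush pm-download disable_all_forms && drush pm-enable -y disable_all_forms
# Clear caches so the crawler sees the changes.
drush cache-clear all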

Create a static clone

Wget (UNIX, Linux, OSX, ...)

Wget is generally available on almost any 'nix machine and can produce the mirror from the command line. However, wget seems to have problems converting the relative stylesheet URLs properly on many Drupal site pages. Modify your theme template to produce hardcoded absolute links to the stylesheets and try the following command:

wget -q --mirror -p --adjust-extension -e robots=off --base=./ -k -P ./ http://example.com

By default wget respects robots.txt, so it might not download some of the files in /sites/ or elsewhere; the -e robots=off option (included in the command above) disables this behavior.

wget keeps query strings in the saved filenames, for example an image file ending in "?itok=qRoiFlnG". Recursively strip the query strings with:

find . -type f -name "*\?*" | while read -r filename; do mv "$filename" "${filename%%\?*}"; done
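Be aware that if the mirror contains both a plain file and a query-string variant (for example style.css and style.css?a), the plain mv above will overwrite one with the other. A slightly more defensive variant of the same clean-up, as a sketch:

find . -type f -name "*\?*" | while read -r filename; do
  target="${filename%%\?*}"
  # Only rename when no file with the bare name exists yet.
  [ -e "$target" ] || mv "$filename" "$target"
done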

HTTrack (UNIX, Windows, and Mac/Homebrew)

HTTrack is available as a command-line tool and as a Windows GUI client; the GUI version will produce the mirror with almost no configuration on your part. One potential command to use is:

httrack http://2011.example.com -K -w -O . -%v --robots=0 -c1 -%e0

Note that the -K option creates absolute links; this is only useful if you are hosting a public mirror on the same domain. Otherwise, omit -K to produce relative links.

The -c1 option makes only one request at a time, so the mirror becomes rather slow. The default is -c10, so you might consider something closer to that value when archiving your own site.
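For example, a variant of the command above that keeps relative links and lets httrack use its default connection count could look like this (the same options as above, just without -K and -c1; adjust the URL to your site):

httrack http://2011.example.com -w -O . -%v --robots=0 -%e0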

With HTTrack properly configured, you don't have to hack on common.inc to get all of your stylesheets to work correctly. However, with the default robots.txt settings in Drupal 5 and the "good citizen" default HTTrack settings, you won't get any module or theme CSS files or JavaScript files.

If you're working from a local installation of Drupal and want to grab ALL of your files in a way that you can just copy them up to a server, try the following command:

httrack http://localhost/ -W -O "~/static_cache"  -%v --robots=0 
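To publish the result you could then copy the generated files to your web server. Here is a sketch using rsync, where the user, host, and paths are placeholders and the exact subdirectory under ~/static_cache depends on HTTrack's file-naming structure:

rsync -avz ~/static_cache/localhost/ user@example.com:/var/www/archive/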

Advanced httrack dumps

KarenS has created a very helpful description of how to "statify" a Drupal site with httrack, where she suggests the following command (on a Linux console):

httrack "http://${root_uri}" -O "$targetdir" -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0

(where $root_uri is the place to start grabbing, most likely your public Drupal root, and $targetdir is the target directory for the backup files.)
In the same article, she further suggests running a regex on all files to fix link issues with index.html; her regex can even be improved to something like:

find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe '/((?<![\'"])\/index.html|(?<=[\'"]\/)index.html)\b//g'

This way it leaves Drupal's no-trailing-slash paradigm intact and avoids "duplicate content" issues while preserving absolute paths. Note that this only works with a web server configured to add the necessary trailing slash again and resolve to the actual index.html file.

You can even copy the [dump_root]/[yourhost.tld]/index/index.html file to [dump_root]/[yourhost.tld]/index.html; in the copied file, all '../' must be removed from the source. If you do this, you can change your DOCUMENT_ROOT from [dump_root] to [dump_root]/[yourhost.tld]. This way you preserve even more of the former site structure and make sure that "/" requests will not fail. (The '../' clean-up could, then again, also be done by .htaccess rules.)
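A rough sketch of that copy step (the paths are the placeholders used above, and the sed pass assumes GNU sed and is deliberately blunt, so review the resulting file):

cp dump_root/yourhost.tld/index/index.html dump_root/yourhost.tld/index.html
# Strip the relative '../' prefixes so the copied page resolves paths from its new location.
sed -i 's~\.\./~~g' dump_root/yourhost.tld/index.html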

SiteSucker (OS X)

SiteSucker is a Mac GUI option for downloading a site.

Drupal modules

You can use a Drupal module to export some or all of your site as static HTML.

Verify that the offline version of your site works

Verify that the offline version of your site works in your browser. Test to make sure you properly turned off any interactive elements in Drupal that would otherwise confuse site users.
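Opening the files straight from disk can behave differently from a real web server, especially where absolute paths are involved, so it can help to serve the mirror locally. A quick sketch, assuming Python 3 is installed and that ./mirror/example.com is where your crawler put the files:

cd ./mirror/example.com
python3 -m http.server 8000
# Then browse to http://localhost:8000/ and click around.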

Why create a static site archive?

  • Perhaps over time your website has essentially become static. Because such a site still requires security administration, an administrator has to keep applying security patches or consider removing the site altogether.
  • You want to ensure that the site is preserved on Drupal.org infrastructure (without direct cost to you)
  • Alternatively, you may want to produce an offline copy for archiving, or for convenient reference when you don't have access to the Internet. Before simply removing a site, consider another alternative: maintain the Drupal site inside a firewall, then periodically cache its output to static HTML files and copy them to public servers (see the cron sketch after this list).
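As a sketch of that periodic workflow, assuming cron on the internal machine (host names, paths, and the schedule are placeholders):

# Refresh the static copy at 03:00 every night, then sync it to the public server.
0 3 * * * wget -q --mirror -p --adjust-extension -k -e robots=off -P /var/archive http://intranet.example.com && rsync -a /var/archive/ user@public.example.com:/var/www/archive/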


Comments

hughbris’s picture

I've always found wget pretty reliable when you figure out the right flags ;)

Before I found this documentation, I also had problems with site-relative stylesheet links, but not all of them. If there's a pattern, it seems wget successfully converted @href values in the "/sites/[mysite]" directory to "sites/...", but actually converted others to full URLs including protocol and domain (http://example.com/...). For example, "/modules/node/node.css?b" came out as "http://example.com/modules/node/node.css?b". Hmmph??! I guess the problem links came from sites/all on the filesystem, but I expect this to be irrelevant to a client (like wget).

Here are the flags I used, -Erkp are probably the only relevant ones:

wget -w 3 --random-wait --user-agent=hugh -Erkp http://example.com -o example.wget.log

I can work around this with other tools for now. Just on the off-chance, does anyone happen to know why this even happens? I think -np ("no parent") is default behaviour just in case this has anything to do with it, though it shouldn't.

greg.1.anderson’s picture

wget uses the full URL to save the file -- e.g. modules/node/node.css?b becomes the literal filename, including the ?b. When you try to fetch the page that contains a reference to that file, the browser will stop before the ?, and request the file modules/node/node.css, which will not match with modules/node/node.css?b. Your options are:

  1. Alter Drupal (temporarily?) to not include the ?b in css links
  2. Post-process the files downloaded by wget, renaming all foo?bar to foo
  3. Use a different tool, like httrack
tangobravo’s picture

I think it's because the robots.txt file disallows crawlers from going into the /sites/ directory, among others. I included -e robots=off in my command line and that seemed to pull in all the expected files. I've edited the wget section above to add this info.

My full command line was wget -w 1 -Erkp -e robots=off [URL] -o wget.log

krueschi’s picture

Thanks a lot for these command line options! This command works fine for me for delivering a local development version of the site as a zip archive to my client for viewing purposes only.

reptilex’s picture

If, like me, you had the problem that SiteSucker was not downloading CSS or the site was not being displayed correctly, try enabling the "ignore robots exclusions" option in the settings. After that the site was downloaded and displayed as expected.

joachim’s picture

Does anyone have any suggestions of how to deal with filtered views on a site to archive?

the_g_bomb’s picture

I would imagine they would have to be disabled. Httrack handles sorted table views but doesn't handle the exposed filters as they require a form response.

--
G

brentratliff’s picture

There most likely won't be anything to handle the callback. Ensure you disable Views Ajax pagers in particular.

jcisio’s picture

Is it possible to keep URLs unchanged? Currently, with the default httrack options, node/12 is changed to node/12.html, test is changed to test.html, the query string (the one added to JS files for versioning) is merged into the filename, etc.

Ideally, we should be able to:
- save node/12 as node/12/index.html instead of node/12.html
- remove simple query strings: misc/drupal.js?m4kqgj would be kept as misc/drupal.js instead of misc/drupald4c4.js. Currently we can use -N "%h%p/%n.%t"

yan’s picture

I also wanted to keep the URLs unchanged and I found this nice article by KarenS where she writes:

One of the biggest problems of transforming a dynamic site into static pages is that the urls must change. The 'real' url of a Drupal page is 'index.php?q=/news', or 'index.php?q=/about', i.e. there is really only one HTML page that dynamically re-renders itself depending on the requested path. A static site has to have one HTML page for every page of the site, so the new url has to be '/news.html' or '/news/index.html'. The good thing about the second option is that incoming links to '/news' will automatically be routed to '/news/index.html' if it exists, so that second pattern is the one I want to use.

The -N flag in the command will rewrite the pages of the site, including pager pages, into the pattern "/about/index.html". Without the -N flag, the page at "/about" would have been transformed into a file called "about.html".

I followed her instructions using
httrack http://example.com -O . -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0

And it worked, at least with the correction she also suggests:
find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe "s/\/index.html/\//g"

The downside: images and other files are also put to their respective [filename]/index.[filetype] directories, so their URLs do change.

doitDave’s picture

Thanks a lot for this important link; I have instantly added it to the document body because KarenS' solution is (in my eyes) closest to no-frills. I am not sure, however, what you are referring to with images. I really had no issue with that, or are you talking about image nodes?

I also added some hints on tweaking the regex and stuff. With that done, I managed to get a good-working static site dump of an averagely-built site with almost no issues.

It may also be worth a try to run a second regex to get rid of the lousy /index/index.html construct entirely (although this might be configurable in httrack directly). This should not even be too complicated, but I have to check side effects. Correction: This won't work. The /index folder is referred to by relative links (which makes sense), so it's probably better left as is. However, creating an /index.html is still not a bad idea.

hth

yan’s picture

It's been a while now, but regarding the images, I think I meant that it does work, but that their URL changes, i.e. if they were referenced somewhere, that might lead to a broken link. But I don't think that it's such a big problem.

chyatt’s picture

I researched the wget command and preferred the following instead:

wget -q --mirror -p --no-check-certificate --html-extension -e robots=off --base=./ -nd -k -P ./ <URL>

Here's what each argument means:

-q                      Don't write any wget output messages
--mirror                Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing
-p                      Download images, scripts, & stylesheets so that everything works offline
--no-check-certificate  Ignore certificate warnings
--html-extension        Append .html to downloaded HTML files so that they can be viewed offline. E.g. www.example.com/example becomes example.html
-e robots=off           Disable robot exclusion so that you get everything Drupal needs
--base=./               Set the base URL to best resolve relative links
-nd                     Do not create a hierarchy of directories
-k                      Convert links to make them suitable for local viewing
-P ./                   Download here

bwooster47’s picture

So I've followed the instructions above, and all user content-creation links are gone. Except for one.
After the above steps, the Forum module still showed "Post new forum topic" for anonymous users.
Easily fixed that - go into User Permissions and remove Create Topic from Anonymous users.
But then that link changed to "Login to post new content in the forum."!

Not yet sure how to disable that...

bwooster47’s picture

This was a very helpful page with helpful comments; I have now completed the transition of a site to an archive.
Here are some other useful links:
Park your old Drupal site
and
Creating a static Drupal site
The latter shows how to handle the tricky Acidfree module, as well as shows .htaccess rules to keep Drupal archive in a sub-dir and still allow some other package to be used at the root URL.

doitDave’s picture

Since my actual workflow for a larger site (60k nodes) was a mix of multiple links/howtos I found here, I'd like to share a log of it. Maybe it helps one or two of you.

Thanks again to all helpers I already had!

wangzb265’s picture

I tried the advanced solution and it changes all the URLs so they have no .html extension. How can we keep the original URLs with the .html extension?

jimafisk’s picture

Thanks for this guide! I used HTTrack to deploy a former D7 site to GitHub Pages. Here's a quick video tutorial I made in case it's helpful to someone: https://www.youtube.com/watch?v=SDEUW4UVS8c&list=UUpGmkFt8EgnMAaZ2eJ8mRi...