By Codeblind on
I'm looking for a module that will permanently archive older content as static files. Ideally, it would do the following if a node is n days old:
1. Make a static HTML rendering of the content.
2. Push the static file to a directory or external site for safe keeping.
3. Unpublish and (optionally) prune the cached node from the various node tables.
It would also redirect users to archived pages as needed.
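To make the idea concrete, here is a rough sketch of the three steps above in plain Python rather than as a Drupal module. Everything in it is an assumption for illustration: the `Node` class, `render_static`, `archive_old_nodes`, and the `node-<nid>.html` file naming are hypothetical and not part of Drupal or any existing module.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from html import escape
from pathlib import Path


@dataclass
class Node:
    """Hypothetical stand-in for a content node (not Drupal's schema)."""
    nid: int
    title: str
    body: str  # assumed to be already-rendered, trusted HTML
    created: datetime
    published: bool = True


def render_static(node):
    """Step 1: render a stripped-down static HTML page for the node."""
    title = escape(node.title)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n{node.body}\n</body></html>\n"
    )


def archive_old_nodes(nodes, archive_dir, max_age_days, now=None):
    """Archive every published node older than max_age_days.

    Returns a dict mapping nid -> archived file path, which a redirect
    layer could use to send visitors to the static copy.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    redirects = {}
    for node in nodes:
        if not node.published or node.created >= cutoff:
            continue  # skip unpublished nodes and nodes that are still fresh
        target = archive_dir / f"node-{node.nid}.html"
        # Step 2: push the static file to the archive directory.
        target.write_text(render_static(node), encoding="utf-8")
        # Step 3: unpublish; pruning the database rows is left out here.
        node.published = False
        redirects[node.nid] = target
    return redirects
```

The returned mapping is what the redirect piece would need: given a node ID, where does the static copy live?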
Comments
=-=
There is no module that does this.
I couldn't find one either.
I couldn't find one either. I'm also looking for anyone who may be interested in contributing labor or funds toward the development of such a module if nothing turns up.
Why?
I'm curious why you want to do this.
I have a news site where ~90%
I have a news site where ~90% of the traffic is for nodes under 1 month old. Meanwhile I have roughly 100 new nodes posted every day. The assorted node tables (and various caching mechanisms) grow very large, very quickly, with most of it being dead weight that seldom gets looked at. For these pages a static file served up by Apache and/or Varnish (or whatever) with a simple header and footer include would be fine.
Another reason is that, because it's a news site, it gets a base of several thousand page views an hour during the day, but if a story breaks we'll get a surge of tens of thousands of views an hour. So I was thinking all the functionality used for archiving older pages with little or no traffic could also be triggered to statically cache articles in extremely high demand and thereby keep the server from exploding. Granted, at first glance you'd think Boost cache and maybe a few servers would suffice, but with this proposed method we could quickly archive a very stripped-down version of the page layout to increase content delivery speed (thus getting the requests out of the queue quicker) and reduce the overall number of HTTP requests per page.
Or maybe there's a better way?
Proportion
I'd be careful about making assumptions about "very large." MySQL does suffer performance degradation with very large datasets, but very large means a LOT more than a couple of hundred nodes per day.
The major performance issues are related to writes and not reads. You don't want to go solving the "problem" before you're absolutely certain that it's a problem.
It can be. We initially loaded the Augusta Chronicle database with every story dating back to 1996, but in the course of performance-tuning the site, we chose to remove the 20th-century stuff and migrate it to an archive system.
The size of the database was one factor in making the editorial user interface too slow. Reads weren't so bad (we had some heinous SQL queries that needed rewrites, but that's another issue). Writes and updates were sucking, so we migrated very old material into a system that operates as a read-only database.
This trimming and migrating is more a problem that needs to be addressed every year or so, not frequently. Given the pace of Drupal core development you might wind up upgrading your module more often than you'd use it!
I believe our techs handle this issue by copying the entire database and doing the extraction offline, then cleaning up the production database with custom scripts that hit the database directly.
If you're thinking about going down this road you should plan ahead (and not do what we did): craft your path alias rules so that you can distinguish between datasets by looking at the URL.
If you move all your 2007 data to some other environment (flat files or whatever), you should be able to proxy passthrough to that environment based on the URL (using Apache mod_proxy, Squid or Varnish). That will let you export those resources, move them to an archive server, and still maintain all the URLs that are known to the search engines.
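As a small illustration of why the year needs to be visible in the URL, here is a sketch of the routing decision in Python. In practice this logic would live in the Apache mod_proxy, Squid or Varnish configuration rather than in application code. The `/news/<year>/...` alias pattern, the `backend_for` function and both hostnames are made up for the example.

```python
import re

# Assumed alias pattern: /news/<4-digit year>/<rest of path>
YEAR_IN_PATH = re.compile(r"^/news/(\d{4})/")


def backend_for(path, archive_before_year,
                live="http://drupal.internal",
                archive="http://archive.internal"):
    """Pick the backend that should serve this URL path.

    Paths whose year is older than archive_before_year go to the archive
    server; everything else (including paths with no year) stays on the
    live site.
    """
    match = YEAR_IN_PATH.match(path)
    if match and int(match.group(1)) < archive_before_year:
        return archive
    return live
```

With aliases like this, `backend_for("/news/2007/03/some-story", 2008)` picks the archive backend while anything from 2008 onward, or any path without a year, stays on the live site. If the aliases carry no date, there is nothing for the proxy to key on.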
I just saw this. The new
I just saw this. The new Drupal.org dashboard is very helpful. Thanks for your insight. I think processing the archive outside of Drupal might be a great idea. It certainly reduces the problem domain a great deal.
An archive module is a must
An archive module is a must-have feature, especially when running multiple domains under a single database. Imagine 20 websites under one database: that's a lot of content. It would be very handy to be able to archive and remove, say, 2-year-old content to keep the database snappy.
Here on Drupal.org there are old nodes dating back several years, which is insane. Sure, it's great for SEO, but I couldn't care less about something posted back in 2003; it's no longer relevant today. At the very least there should be a way to separate old content into its own database for storage.
I am also looking for an
I am also looking for an archive solution via a module instead of dealing with Apache, MySQL, etc.
My database got so large that the ISP is complaining, and twice my site went down. I am thinking of getting a dedicated server, but then my workload will increase, and I'm not sure I won't end up in the same situation anyway.
An archive could move old content into a secondary database, so the main one would be faster to load and less prone to disaster.