Currently all cached pages are served to anonymous visitors with the date of the creation of that cached copy. This lets a lot of search engines believe that you have a lot of pages of fresh content. While this can be welcome for some people it gets a problem for sites such as drupal.org. We currently serve well over 50GB to Google and friends and the month isn't over yet. The attached patch tries to outline a way to set a more reasonable date for pages.

The patch is a study, it lacks a database update. The new field could use a better name than "timestamp".

The patch works for nodes with and without comments. Every loaded node will set the cache date to its "changed" date, comments will set the date base on the data in node_comment_statistics.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

kbahey’s picture

Adding link to previous discussion on infrastructure list.

killes@www.drop.org’s picture

Vlado has pointed out to me that this could be integrated into the upcomign theme engine rewrite that Adrian is working on.

killes@www.drop.org’s picture

I also want to point out that about 25% of the drupal.org bandwidth (or 70GB) are gobbled up by search engine bots, so it would be good to do something about it. This issue can't be resolved by invalidating the cache less often, since this would only lessen the CPU load but not the bandwith requirement.

dikini’s picture

well, if the thing is accepted :)

but, the issue is that we need a good mechanism to identify the age of a page. And a page is not only the node proper, quite a few things contribute to the age, and we need to be able to determine which of all should be considered important or not. And keep things simple while doing it.

We can only know all contributing factors just before a page is rendered, when all of the information is assembled - i.e. nodes, blocks, etc... By the way that is valid for all rendering, not only html pages but various feeds and similar animals.

This is the essence of what I was showing to killes earlier.

killes@www.drop.org’s picture

The cached age of the page is important for the anonymous users. Since they get a cached page, the place in the workflow where we set the final data does not matter much. It is either taken from the cached copy or we will need to process everything in order to put the date into the cache.

kbahey’s picture

but, the issue is that we need a good mechanism to identify the age of
a page. And a page is not only the node proper, quite a few things
contribute to the age, and we need to be able to determine which of all
should be considered important or not. And keep things simple while
doing it.

We can have an automatic or semi-automatic way for each piece of content
(e.g. node, comments, blocks, ...etc.) to "push" their date up a "register".

For example.

Module 1 pushes its date 2005-05-01 14:00:00, there is nothing in the register yet, so the common function/hook that does that just places this here.

Module 2 pushes its date 2005-04-29, the common function finds that this is older than what is there, and discards it.

Module 3 pushes date 2005-05-05, the common function picks that up since it is newer.

This process repeats until we end up with the date of the page being the date of the newest "piece" of content displayed on it (e.g. comment, node, ...etc.)

Some stuff has to be exempt from this (e.g. banners), and some can be elective/optional. But for the main content stuff, this overall solution seems workable. The question is: should this be an explicitly called function? If so, then where? Or can it be automated somewhere?

Gerhard Killesreiter’s picture

kbahey, my patch implements what you describe, only it isn't complete. I will complete it once we agree the approach is a good idea.

Dries’s picture

Recent comment blocks, forum topic blocks, news aggregator blocks, popular content
... all have to be sprinkled with drupal_set_cache_date()? Also some blocks look incompatible; the recent poll block shows the votes. We'd need to add timestamps to votes?

This approach looks prone to errors.

Dries’s picture

 function comment_nodeapi(&$node, $op, $arg = 0) {
   switch ($op) {
     case 'load':
-      return db_fetch_array(db_query("SELECT last_comment_timestamp, last_comment_name, comment_count FROM {node_comment_statistics} WHERE nid = %d", $node->nid));
+      $comment = db_fetch_array(db_query("SELECT last_comment_timestamp, last_comment_name, comment_count FROM {node_comment_statistics} WHERE nid = %d", $node->nid));
+      drupal_set_cache_date($comment['last_comment_timestamp']);
+      return $comment;
       break;

This looks wrong. Just because we load a node, doesn't mean we're going to display it ... ?

killes@www.drop.org’s picture

Dries, thanks for the review.

Yes, blocks would need to set their cache age. the recent votes block could just set it to "now" which would give the same behaviour as we have now.

If we load a node we will maybe not display it, but I think we can assume that the cache age should not be older than the "last changed" age of a node. This is what the function call in comment_nodeapi does.

adixon’s picture

This patch seems very worthwhile to me. I can think of some sites I work with that have lots of old content, it'd be nice to save all that wasted bandwidth and cpu...

Re: the concerns about blocks updating the cache date:

I think the "date" you're trying to fix made complete sense in the days of static html, but doesn't anymore for many pages, and isn't always respected. For example, I'm pretty sure that most browser's default settings are to reload pages regardless of the date at the beginning of a new session. I guess it's still quite important for images and maybe css files.

My point is that it doesn't need to be 'perfect' in the sense that there could be a different page sent and still you don't want to update the date (the ad banners is the example already cited).

Instead, I think the goal should really be: when has the content changed significantly enough to require a new date? In that context, only modules that are essentially changing the content of the page need to care about this. And maybe it even depends on the particular site? For example, many newspapers run their comments on a separate server and perhaps only include the list of comments via some javascript - in their case, comments don't change the date, and they probably don't want them to. But of course on drupal.org, where comments often have important information on the issue at hand, you would want google to respider a page when it gets a new comment.

With that perspective, I think it's a worthwhile patch. Some administrator options to allow a site to disable the cache-date-update for specific modules would be nice as well.

m3avrck’s picture

Part of this bandwith problem can be solved by a sensible robots.txt as well: http://drupal.org/node/75916

Some combination of this patch + a robots.txt would really go a long way in saving bandwith to crawlers.

killes@www.drop.org’s picture

Unfortunately, most robots don't repect the time delay setting. I've tried it on drupal.org and no decline in eg googlebots activity was found.

killes@www.drop.org’s picture

FileSize
5.88 KB

Patch updated.

This patch is against 4.7 to facilitate testing.

Changes:
- moved setting of cache date to view hook instead of load (thanks chx)
- added support for tracker page.

To test, you need to run

alter table cache add timestamp int(11) NOT NULL default '0';

killes@www.drop.org’s picture

FileSize
7.54 KB

added date logic to comment_render

moshe weitzman’s picture

I gave this one a good testing and it works as advertised. I had to fix one SQL bug. New patch for 4.7 attached.

Yes, various blocks are going to have to set cache age. They can do set age to time() if thats what makes sense. For something like polls, perhaps we bump the last_modified of the node upon each vote. that would solve the need.

I would call the column 'last_modified' and the function drupal_set_cache_last_modified

Looks like a good idea to me, in light of the mammoth resources consumed by crawlers. At leatsd I cannot think of an alternative.

BioALIEN’s picture

Version: x.y.z » 4.7.x-dev

Subscribing to this as it seems interesting.

Whats being done about this and whats the conclusion?

killes@www.drop.org’s picture

Version: 4.7.x-dev » 6.x-dev

I'll probably work again on it once development for Drupal 6 opens up again. Note that the development of a common standard for search engine maps may make this unneccessary,

moshe weitzman’s picture

@killes (or anyone else) - make sense to revive this one for D6?

kbahey’s picture

For cached pages, there is a related patch that does better headers for caching.

It is fairly simple, and would have a positive impact on a site.

Check it here http://drupal.org/node/147310

killes@www.drop.org’s picture

Yes, I think this patch still makes sense. I'd be willing to work on it if I knew I wouldn't work on something that is deemed unimportant and thus won't nd up in core...

moshe weitzman’s picture

Version: 6.x-dev » 7.x-dev

sigh - move to 7

Chris Johnson’s picture

Sigh, indeed. This should have made it into D5 and then D6.

ChrisKennedy’s picture

Status: Needs review » Needs work

This is an old issue that doesn't have a working implementation... why should it have gotten into d5 and d6? Moshe's update in #16 is also missing the revised patch.

Chris Johnson’s picture

I didn't say an out of date patch should have gotten applied to D5 and D6. I meant that correct, fixed last-changed dates should have been in D5 and D6. This issue and a suggested fix was raised back in 4.7 days, so there was plenty of time to incorporate it into core. Sheesh.

killes@www.drop.org’s picture

Category: feature » bug
Status: Needs work » Needs review
FileSize
10.57 KB

Seems this patch has fans. I've updated it and added a DB upgrade path.

I am reclassifying this as a bugfix.

moshe weitzman’s picture

I still think this language is clearer: db column: 'last_modified' and the function: drupal_set_cache_last_modified.

We now have block caching. I think that we can incorporate blocks into this system with just a single change. We don't have to change every block anymore.

Gerhard Killesreiter’s picture

I've changed the name of the column on the cache table to last_modified and the function name to drupal_get/set_page_last_modified. The function is used by the cache but also in general for generating pages.

Gerhard Killesreiter’s picture

FileSize
9.63 KB

I've changed the name of the column on the cache table to last_modified and the function name to drupal_get/set_page_last_modified. The function is used by the cache but also in general for generating pages.

I'll investicate to coupe it with the block cache.

Gerhard Killesreiter’s picture

FileSize
11.2 KB

Ok, patch updated, some bugs fixed.

If a block is loaded from cache, $cache->last_modified is used to set the page's page_last_modified value.

Gerhard Killesreiter’s picture

Version: 7.x-dev » 6.x-dev
FileSize
12.04 KB

Ok, another update.

Seing this is now a bugfix, I am putting this in the D6 queue again.

If you are a core committer, feel free to move it back.

m3avrck’s picture

Relatedly... last night I discovered another benefit of this patch. When you grab an RSS feed from Drupal, the LAST-MODIFIED always has the date you grabbed the feed, which is bad. Feed readers think there is always new content, even if there isn't, so it has to double check everything again and grab the whole feed again.

On patch review... drupal_set_page_last_modified() why do we compare if greater than? What if the last_modified is set to say "10" and then I want to set it to "8" -- it would be less than and fail. Sure that probably won't happen, but wouldn't isset() be better?

Some debug code in there there watchdog('debug', "block cid $block->module $block->delta $cid"); (2 instances)

catch’s picture

Status: Needs review » Needs work

This looks like it'd have a lot of benefits, I know a load of my traffic is from search engine crawlers. Given m3avrck's review setting to CNW.

chx’s picture

m3avrck, if I have some piece of content that was modified a week ago then there should be no way for a further call to set the page age a month old. That's why we never set the time back.

killes@www.drop.org’s picture

Status: Needs work » Needs review
FileSize
10.94 KB

Indeed. and here's the updated patch.

dmitrig01’s picture

Status: Needs review » Needs work
       'created'    => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
+      'last_modified'  => array('type' => 'int', 'not null' => TRUE, 'default' => 0),

That's a code style error...

chx’s picture

Status: Needs work » Needs review
FileSize
10.51 KB

Added spaces (do not credit me).

keith.smith’s picture

Status: Needs review » Needs work

Two, two, two! hunks failed...

$ patch -p0 < reasonable_age-65017-37.patch
patching file includes/bootstrap.inc
Hunk #1 succeeded at 566 (offset -1 lines).
Hunk #3 succeeded at 629 (offset 1 line).
patching file includes/cache.inc
patching file modules/block/block.module
Hunk #1 succeeded at 441 (offset 3 lines).
patching file modules/comment/comment.module
Hunk #1 succeeded at 592 (offset 19 lines).
Hunk #2 succeeded at 914 (offset -79 lines).
Hunk #3 succeeded at 1030 (offset 19 lines).
Hunk #4 succeeded at 996 (offset -86 lines).
Hunk #5 succeeded at 1142 (offset 19 lines).
Hunk #6 succeeded at 1325 with fuzz 2 (offset 16 lines).
patching file modules/node/node.module
Hunk #1 succeeded at 974 (offset 69 lines).
patching file modules/system/system.install
Hunk #1 FAILED at 344.
Hunk #2 succeeded at 2385 with fuzz 1 (offset -1608 lines).
1 out of 2 hunks FAILED -- saving rejects to file modules/system/system.install.rej
patching file modules/tracker/tracker.pages.inc
Hunk #1 succeeded at 36 (offset -12 lines).

kbahey’s picture

Version: 6.x-dev » 7.x-dev

And this needs to go into 7.x now.

beginner’s picture

subscribing

Jaza’s picture

Version: 7.x-dev » 8.x-dev

Bump. +1.

catch’s picture

Version: 8.x-dev » 7.x-dev

This is a bug, and not really an API change, leaving at 7.x for now.