When the cloud module is configured to retrieve content for a specific website every hour, in reality it fetches the syndicated page every 15 minutes.

Comments

al’s picture

Are you sure about this? I have a crontab set up to access cron.php every fifteen minutes, and a BBC feed which is supposed to update every hour.

Looking at the admin page I see this:
BBC Headlines (UK) | news, bbc | 15 items 31 min 56 sec ago | 28 min 4 sec left

Which seems to imply that everything's fine?

flow’s picture

I was referring to the cloud module, not the import module.

I received a complaint from a site in my blogrolling list. Apparently the blogrolled page was being fetched every 15 minutes, instead of every hour as it was configured to do.

Dries’s picture

I have looked at this and it currently works as follows: Drupal checks the site on every cron run unless an update was detected within the specified interval. Thus, if you set the interval to 6 hours, Drupal will keep checking the site on every cron run until an update is detected; once an update has been detected, the site won't be checked again for the next 6 hours.
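In pseudo-code (hypothetical function and column names, not the actual module code), the logic amounts to this:

<?php
// Sketch of the current behavior: the interval only suppresses
// checks after a change was seen, so an unchanged site is fetched
// on every single cron run.
function cloud_check_site($site) {
  if (time() - $site['changed'] < $site['interval']) {
    return;  // an update was seen recently, so honor the interval
  }
  $page = fetch_site($site['url']);  // hypothetical fetcher
  if (md5($page) != $site['checksum']) {
    // Update detected: record it, which starts the quiet period.
    record_update($site['id'], md5($page), time());
  }
  // No update: nothing is recorded, so the next cron run fetches again.
}
?>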

I'm afraid it has always been like this. Fixing this behavior requires database changes. The proper fix is probably to use the "ETag" and "Last-Modified" headers. We could then drop the "update interval" setting as well as the "change threshold" setting. (Then again, dynamic websites don't always support these headers. Not our problem?)
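A conditional GET along those lines would look roughly like this (a plain PHP sketch, not existing module code); a "304 Not Modified" reply means the page is unchanged and no body is downloaded:

<?php
// Send the validators saved from the previous fetch; a server that
// supports them answers 304 with an empty body when nothing changed.
$context = stream_context_create(array(
  'http' => array(
    'method' => 'GET',
    'header' => "If-Modified-Since: " . $site['last_modified'] . "\r\n"
              . "If-None-Match: " . $site['etag'] . "\r\n",
    'ignore_errors' => TRUE,  // don't treat non-2xx responses as failures
  ),
));
$page = file_get_contents($site['url'], FALSE, $context);
$not_modified = strpos($http_response_header[0], ' 304 ') !== FALSE;
?>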

Of course, you can also change your setup to run cron only once every hour, or increase the update intervals now that you know what they do.
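For example, assuming cron.php is triggered with wget from crontab, the entry would change from every 15 minutes to hourly:

# Run Drupal's cron once an hour instead of every 15 minutes.
0 * * * * wget -q -O /dev/null http://www.example.com/cron.php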

flow’s picture

If it's necessary, then a change in the database is definitely what is required.

Somebody running the cloud module against Slashdot, with a cron job that runs every 15 minutes, will find his IP banned from the site in no time. I happen to know the person who complained about my excessive pinging, but it's not hard to imagine that not everybody will be that enthusiastic about this waste of bandwidth.

I think the way the cloud module works now is wrong. The new site configuration clearly states "The refresh interval indicating how often you want to check this site for updates", suggesting that a site will only be contacted once every specified interval.

al’s picture

Assigned: Unassigned » al

As reflected by the CRITICAL status, I think we should fix this properly for 4.2. Dynamic sites don't generally honour the Last-Modified header, and if they have enough of a clue to do that, they probably have an RSS feed, which kind of negates the point of the cloud module in the first place.

Surely this is just a trivial patch to the database to add a field for when the site should next be checked? I.e. an "expires" field, which is set to time() + $checkinterval every time you check the site. You then only check sites where $site["expires"] has already passed.
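Something along these lines (table, column and function names are only illustrative, not the committed fix):

-- Record when each site is next due to be checked.
ALTER TABLE site ADD expires int NOT NULL DEFAULT 0;

<?php
// On each cron run, fetch only the sites that are due, then push
// "expires" forward by the interval whether or not anything changed.
$result = db_query("SELECT * FROM site WHERE expires <= " . time());
while ($site = db_fetch_array($result)) {
  cloud_update_site($site);  // hypothetical: fetch page, compare checksum
  db_query("UPDATE site SET expires = " . (time() + $site['checkinterval']) .
           " WHERE sid = " . $site['sid']);
}
?>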

Anonymous’s picture

Assigned: al » Kjartan

Fixed in CVS