Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue category: Bug, because the line's intended behavior is misleading and it is no longer needed.
Issue priority: Normal
Unfrozen changes: Unfrozen because it only changes "markup".
Prioritized changes: The main goal of this issue is to remove now-unnecessary "code", which can be seen as reducing fragility and/or improving usability/performance.

The Crawl-delay: 10 directive was introduced in the robots.txt file a long time ago; the following issue from 2004 outlines some of the background: #14177: Introduce crawl delay in robots.txt in help pages.
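
For context, this is roughly where the directive sits in the shipped robots.txt (an illustrative excerpt; the surrounding comment lines are omitted):

```
User-agent: *
Crawl-delay: 10
```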

Many external factors such as server capacity, caching mechanisms and crawler "intelligence" have changed since then, and the intended use of the delay instruction is now obsolete. It has certainly been discussed before (e.g. #528926: Remove Crawl Delay Line from Robots.txt file - Its a Death Trap), but the following statement from the official Google Webmaster help section sums up the argument for finally removing it.

From https://support.google.com/webmasters/answer/48620?hl=en on the topic of manually setting Crawl Rate in the Webmaster Tools admin:

Changing the crawl rate can cause some problems (for example, Google will not be able to crawl at a faster rate than the custom rate you set), so don't do this unless you are noticing specific problems caused by Googlebot accessing your servers too often.

In fact, if you scan Drupal's robots.txt for errors with Google Webmaster Tools, the results say "Rule ignored by Googlebot" for the Crawl-delay line.

Setting a manual crawl delay should thus only be done in "specific" circumstances, not as a default solution. Google, and most likely any other reputable crawler, has sophisticated enough methods to manage its crawl rate itself, without the average webmaster needing to set it explicitly. "Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth." (from https://support.google.com/webmasters/answer/182072) Further, that article implies that the preferred method of "requesting" a slower crawl rate is now the Google Webmaster Tools admin UI.

I have not looked into the policies of the other major search engines, and there are certainly other robots.txt-compliant crawlers out there too, yet I believe that forcing this setting in Drupal core is the wrong approach, both for the remainder of the 7.x life-cycle and for future releases.

Comment  File               Size       Author
#11      2492191-11.patch   251 bytes  Sivaji_Ganesh_Jojodae
#3       robots.patch       247 bytes  droplet

Comments

cilefen’s picture

Version: 7.x-dev » 8.0.x-dev
Issue tags: +Needs backport to D7
Related issues: +#180379: Fix path matching in robots.txt

According to policy, this must be fixed in 8.x then backported.

cilefen’s picture

Issue summary: View changes

In fact, if you scan Drupal's robots.txt for errors with Google Webmaster Tools, the results say "Rule ignored by Googlebot" for the Crawl-delay line.

droplet’s picture

Status: Active » Needs review
FileSize
247 bytes

If your server cannot handle both search engines and visitors, you should get a better one instead.

We had a recent example where Google was sending huge traffic to a client's website and our server response time slowed down a bit. Within just a few hours, our traffic dropped and our Google ranking dropped at the same time. During that time we finished our server upgrades, and after a few more hours our ranking came back and Google kept sending large crawl traffic to the site.

"If a site has 5000 pages at 10 seconds per crawler request it would take 50k seconds to crawl the site, or 13.89 hours. Not that bad, really."

Making assumptions like this is totally wrong! Sometimes even 500s can affect your ranking.

klase’s picture

Status: Needs review » Reviewed & tested by the community

Great! Thanks for the quick edits and the patch. I applied it to both drupal-8.0.x-dev and drupal-8.0.0-beta10 with success. Since it is so simple, I have changed the status of this issue to RTBC. What is the next step to get it into 7.x?

cilefen’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs beta evaluation

A beta evaluation is needed. Use dreditor or https://www.drupal.org/node/2373483.

klase’s picture

Issue summary: View changes
Status: Needs work » Reviewed & tested by the community

Added a Drupal 8 beta phase evaluation template. Please review/update as needed, as someone with more knowledge of this process might have a different view. Overall, I feel strongly about this change, as it will reduce technical debt, adhere to "less is more" and better reflect the current crawler behavior.

alexpott’s picture

Version: 8.0.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

I agree that we should remove this. And not just for Google - read the following 2009 article from Bing: https://blogs.bing.com/webmaster/2009/08/10/crawl-delay-and-the-bing-cra... - they say that a value of 10 is extremely slow.

As robots.txt is outside ./core, sites should customise it as they see fit - just like .htaccess.
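
For example, a site that still wants to throttle a particular crawler could keep a per-bot block in its own copy (a sketch only; "bingbot" is used here simply because Bing documents support for the directive, and the value of 5 is arbitrary):

```
# Site-specific override in a customised robots.txt
User-agent: bingbot
Crawl-delay: 5
```

Note that most crawlers obey only the most specific matching User-agent group, so such a block may also need to repeat the relevant Disallow rules.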

Committed 7d8eb15 and pushed to 8.0.x. Thanks!

Thanks for adding the beta evaluation.

  • alexpott committed 7d8eb15 on 8.0.x
    Issue #2492191 by droplet: Remove "Crawl-delay" in robots.txt
    
hass’s picture

We have seen crawlers (not Google, if I remember correctly) overloading sites that had no crawl delay. One was hitting the site more than once every second. And this was not years ago - I think it was in February this year. Therefore I think this line is better left in.

droplet’s picture

@hass,

block bad bots :)

Sivaji_Ganesh_Jojodae’s picture

Status: Patch (to be ported) » Needs review
FileSize
251 bytes

Patch for Drupal 7.
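
For reference, a sketch of the kind of change these patches make (the abbreviated diff below is illustrative, not the attached patch verbatim; the exact context lines in core's robots.txt may differ):

```diff
--- a/robots.txt
+++ b/robots.txt
 User-agent: *
-Crawl-delay: 10
```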

hass’s picture

I'm not aware that Drupal blocks bad robots by default.

As an aside - one developer tried blocking them in .htaccess, which ended up blocking Yandex on .ru pages :-(((
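
For what it's worth, a narrowly scoped user-agent block avoids that kind of collateral damage (a sketch only, assuming Apache 2.4 with mod_setenvif, mod_authz_core and the relevant AllowOverride permissions; "ExampleBadBot" is a placeholder token, not a real crawler):

```apache
# Flag requests whose User-Agent contains a specific bad bot token.
# Match the exact token rather than a generic word like "bot", which would
# also catch legitimate crawlers such as YandexBot.
BrowserMatchNoCase "ExampleBadBot" bad_bot

# Deny only the flagged requests, allow everything else.
<RequireAll>
  Require all granted
  Require not env bad_bot
</RequireAll>
```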

klase’s picture

Status: Needs review » Reviewed & tested by the community

I have tested the patch in #11 on both Drupal 7.38 and the latest dev and it applies without problems. Since it is such a small change, I'll go ahead and change the status to RTBC.

David_Rothstein’s picture

Status: Reviewed & tested by the community » Needs review

I think @hass's concerns need more discussion here. The links provided in this issue so far are about Google and Bing, but robots.txt is for a lot more than that.

I'm not sure that for a typical Drupal site (considering size and server resources) removing this is really the right thing to do. We might at least consider lowering it though?

  • alexpott committed 7d8eb15 on 8.1.x
    Issue #2492191 by droplet: Remove "Crawl-delay" in robots.txt
    

  • alexpott committed 7d8eb15 on 8.3.x
    Issue #2492191 by droplet: Remove "Crawl-delay" in robots.txt
    

Antti J. Salminen’s picture

Google and the other major search engines are not the only bots that crawl your site. Google has completely ignored Crawl-delay for a long time: with a quick search I found a post that seems to indicate Google started doing this around 2008, so using Google's recommendations as a reason for removing the directive seems misguided. Google's behavior doesn't make the setting obsolete in any way, and it should not be taken to mean the setting should now be removed, because there are many other bots still parsing and obeying it.

Even when we are only talking about search engines other than Google that do honor Crawl-delay, I think in general you won't be affected by it much unless your site is fairly large. This is shown pretty well by the example in the previous issue of a 5000-page site still taking just under 14 hours to crawl with a 10-second delay. If you have a site that changes at a rate sufficient for that to become a problem, it seems to me you're only risking some of the less prominent pages being out of date due to delays in crawling them.

As for the numerous other bots doing the crawling, I have much less confidence in them behaving well, since I've seen even Googlebot get confused by faceted search interfaces and crawl a basically infinite number of pages very fast on an otherwise small site. I don't think you can just dismiss this with "get a better server", because these kinds of crawling glitches can result in a site that would otherwise see only moderate traffic suddenly getting a huge number of requests from someone with far more resources than you. Much more common is that the crawler simply isn't doing any rate limiting, or is doing it very poorly. Depending on your hosting plan this might become expensive even if the infrastructure can handle it. Creating specific blocks for these kinds of bots doesn't seem like the best strategy either, since the list is constantly changing, whereas Crawl-delay takes care of all past, present and future ones with (in my opinion) very little downside.

In summary, I'm not seeing any evidence that having the directive there does much harm, and I know for a fact it can be useful in some cases. I would like to see more concrete information than is given in the issue summary; right now it quotes a search engine that doesn't even honor the setting, and I don't see how that makes the directive obsolete either. In general, the whole robots.txt situation is a bit of a mess, of course, since there's no real standard for it.

  • alexpott committed 7d8eb15 on 8.4.x
    Issue #2492191 by droplet: Remove "Crawl-delay" in robots.txt
    

longwave’s picture

Status: Needs review » Fixed

This issue should have been marked fixed a long time ago; this was committed in 7d8eb15a2d.

longwave’s picture

Status: Fixed » Reviewed & tested by the community

Oh whoops, I didn't see this was for backport to D7.

For parity with D8/9 this should probably be committed to D7. Then again, the line has been in place since robots.txt was initially added to Drupal, and taking it out of D7 now seems like a big change at such a late stage; users who want to change Crawl-delay will already have done so, and nobody should be building new sites on D7.

The patch itself is trivial, so I am marking this RTBC for a D7 core committer to decide whether to commit or close it.

izmeez’s picture

Added the issue to #3207851: [meta] Priorities for 2021-06-02 release of Drupal 7 for greater visibility.

mcdruid’s picture

Status: Reviewed & tested by the community » Closed (won't fix)

It's tempting to commit this to align with D8+, but I think there are good reasons not to. See e.g. #2837128: Reintroduce "Crawl-delay" in robots.txt.

As has been mentioned here already, any D7 sites that really need a different value will have customised their robots.txt already.

I don't see much benefit to changing this for existing D7 sites now, and there are some potential downsides.

So I'm going to close this issue as "won't fix".