Problem/Motivation
The Crawl-delay setting was removed from Drupal 8's default robots.txt after having been in place for a long time in previous versions. This means the site places no limit on crawl rate and crawler bots decide it entirely by themselves. In practice there have been cases where poorly behaving bots keep hammering a site with many requests per second from many remote addresses long after it is clear that the site is struggling, leaving it unresponsive or consuming a significant amount of resources for very little benefit.
We should revisit the removal of Crawl-delay from Drupal 8's default robots.txt and discuss reinstating it. The decision to remove it wasn't thought through, and the arguments made to support it may not stand up to scrutiny. Experience shows that there are real cases where having a default delay is useful. Hass also raised this issue when the change was being considered for Drupal 7. There are many bots around that can't be trusted to crawl at a sensible rate but that do respect Crawl-delay. Below are the arguments made when deciding to remove Crawl-delay and an examination of each of them.
These arguments were presented as reasons for removal:
1. Google doesn't support Crawl-delay anymore thus making it obsolete
2. Google recommends not adjusting GoogleBot's crawl rate through Google Webmaster Tools
3. Lowering crawl rate for Google will make your search ranking drop
4. Server capacity has changed
5. Caching mechanisms have changed
6. Crawler "intelligence" has changed
7. Bing calls a 10 second Crawl-delay value "extremely slow"
8. robots.txt is supposed to be customised
Examination of the arguments
1. Google doesn't support Crawl-delay anymore thus making it obsolete
There has been no recent change in Google's behavior regarding Crawl-delay: it has completely ignored it for a long time. Here's Google's Matt Cutts in 2006 stating they ignore it. Here's a forum thread from 2004 mentioning GoogleBot doesn't support it. That would seem to predate the earlier decision to include it in Drupal's default robots.txt. As far as I can tell GoogleBot may never have supported Crawl-delay, so it's not all that relevant to the discussion.
The reason for Google's lack of support is also not because they consider it obsolete. Here's what Matt Cutts gave as the reason in an interview:
And, the reason that Google doesn’t support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.
2. Google recommends not adjusting GoogleBot's crawl rate through Google Webmaster Tools
Google's recommendation not to change the crawl rate is not meant as a general rule applied to all bots; after all, it isn't even talking about Crawl-delay. Google may be smart enough not to flood your site with requests when it can't handle them, but you can't assume the same is true for every single crawler bot out there on that basis. In a previous issue it was already estimated that even the 10 second Crawl-delay that has been the Drupal default for years should not be a problem unless your site is fairly large. The example from the previous issue shows this well: crawling 5,000 pages with a 10 second delay still takes just under 14 hours. If your site changes at a rate sufficient for that to become a problem, it seems to me you're only risking some of the less prominent pages being out of date due to delays in crawling them.
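(For reference, that figure is simply 5,000 pages × 10 s = 50,000 s, or roughly 13.9 hours, for a full pass.)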
3. Lowering crawl rate for Google will make your search ranking drop
Again, this doesn't even concern Google, but even if it did, no evidence was shown that this is the case. The mentioned case where the SERP ranking on Google had dropped actually sounded like a situation where Crawl-delay might have helped (had Google supported it), if the client site was being penalised because of poor response times.
4. Server capacity has changed
There has been no change relative to the resources available to search engines and other crawlers. You can still be heavily outmatched by a crawler, and it's typically much cheaper to fire off HTTP requests in rapid succession than it is to serve them. Having the capacity doesn't always mean you want to spend it on serving a bunch of poorly behaving bots that provide little to no benefit to you either.
5. Caching mechanisms have changed
Drupal 8's caching mechanisms may have taken some important steps forward, but I don't think that really changes the overall picture much. Crawlers are likely to find the parts of your site that are not cached and cause more load than a typical visitor.
6. Crawler "intelligence" has changed
The claim that crawler "intelligence" has improved requires some evidence. There are definitely "dumb" bots still out there, and it is not obvious that improvement has happened across the board.
The important thing to realize here is that we are not talking only about reputable crawlers like Bing; we are talking about all of them. It seems unlikely that all bots out there even want to behave well by default, or are sophisticated enough to do so. Even GoogleBot has gotten confused by faceted search interfaces and crawled an essentially infinite number of pages very fast on an otherwise small site, and other bots (for example AHRefsBot and SemRushBot) have brought sites down repeatedly until they were blocked, when no Crawl-delay was in place. You can't just dismiss this with "get a better server" either, because a crawler that does no rate limiting, or does it very poorly, can mean a site that would otherwise see only moderate traffic suddenly gets a huge number of expensive requests from someone with far more resources than you have. Even if it is manageable, it may not be something you're particularly interested in spending your resources on. Creating specific blocks for these kinds of bots doesn't seem like the best strategy either, since the list is constantly changing, while Crawl-delay takes care of all past, present and future ones with (in my opinion) very little downside. I'd much rather have exceptions for the few bots we really care about, as sketched below, if there's enough concern to warrant that.
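As a sketch of what such exceptions could look like: robots.txt lets you keep a conservative default while granting a faster rate to specific bots, because well-behaved crawlers follow only the most specific User-agent group that matches them. The 1 second value for Bingbot below is purely illustrative, not a researched recommendation.

User-agent: *
Crawl-delay: 10

User-agent: Bingbot
Crawl-delay: 1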
7. Bing calls a 10 second Crawl-delay value "extremely slow"
This doesn't seem like an argument for removal. It could be an argument for reducing the value (at least for Bing), and maybe that should indeed be done. However, going by a label they apply with no reasoning behind it isn't particularly convincing. Crawl-delay doesn't seem to be considered a SERP ranking killer in SEO circles anywhere; rschwab likewise stated, when rejecting the change for Drupal 6, that he hadn't found any claims it would kill rankings.
8. robots.txt is supposed to be customised
Customisation can remove a Crawl-delay just as easily as add one. Drupal's responsibility is simply to provide safe and sensible defaults, and those defaults should probably be geared towards smaller sites, since they are much less likely to change these settings.
Proposed resolution
Add a Crawl-delay line back to Drupal's robots.txt.
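For illustration, the reinstated directive would look roughly as it did in earlier versions (the 10 second value is the historical default discussed above, not a newly derived recommendation):

User-agent: *
Crawl-delay: 10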
Remaining tasks
Discuss.
Comments
Comment #2
cilefen CreditAttribution: cilefen commented

What is the effective delay if crawl-delay is unset? Details around that seem missing from the issue summary.
What are the deleterious effects of leaving it unset?
Comment #3
droplet CreditAttribution: droplet commented

I'd like to talk about real-world usage instead of research from the internet.
We manage quite a lot of websites (including Drupal, complicated WordPress sites, WooCommerce and Magento), and many of them are hosted only on cheap ($5 ~ $10 per month) shared hosting. None of them has a "Crawl-delay", and none of them has been taken down by search bot traffic. (In fact, in my own experience of all these popular CMSes and PHP frameworks, Drupal is almost the only one that ships a robots.txt with "Crawl-delay". If this were really a problem, many sites should be going down because of it.)
But we believe, and have confirmed, that crawling speed affected our ranking. It is not just about the page's response time; it's also about REAL TIME search. We expect our new content to appear in the search engines, and everywhere else, within an hour or so.
Is anyone experiencing this problem where ONLY changing Crawl-delay solved it? Please share your real experience and identify which bots were involved. (If you have only had one site in your whole life, that's not a good example. We have sites where people coded their own bots to steal our content while pretending to be Google, etc.)
Cheers!
Comment #4
Antti J. Salminen CreditAttribution: Antti J. Salminen as a volunteer commented

@cilefen:
I believe the default is to give no limits and leave it up to the crawler to decide what is a good crawl rate. The benefit of setting a crawl-delay is that you're specifying a safe limit for the bots that vary wildly in their implementations and have a tendency to default to crawling faster rather than slower. Sites that have limited resources and sites already experiencing problems won't be as likely to experience further performance degradation. This is a soft measure that seems to me to have very little downside.
@droplet:
How did you confirm that limiting crawl speed via Crawl-delay affected your SERP ranking? In the previous issue you mentioned the ranking on Google dropping because of GoogleBot causing slowdown and being restored after adding resources to the server. Even if we ignore the fact that GoogleBot doesn't even care about Crawl-delay, a case like that is fundamentally different from specifically asking the search engine to slow down its crawl rate, because it means the search bot has experienced slow response times from the site. That might lead it to lower the site's ranking. Indeed, Google does say page speed affects ranking, and this article suggests that "time to first byte" might be what they really mean by that. A Drupal website being overloaded would be slow at serving the first byte, and a Crawl-delay could actually help with that.
As for whether site slowdown from this is a real-world threat, I'd say it absolutely is. Hass mentioned having seen crawlers cause issues with a too-fast crawl rate in the previous issue. I've personally seen it cause issues for multiple sites. While I don't doubt that bot traffic has contributed to a fair number of small or already-struggling sites going down, it's not a particularly visible issue that you'd hear about all the time, and not every website owner will be able to pinpoint bots as the cause. The consequences are often also less dramatic, such as just creating unnecessary load and slowdown. When around half of all website traffic is caused by bots according to some reports, it doesn't seem implausible that they could also create too much traffic at least some of the time.
I'd expect frameworks not to include a robots.txt at all, and it's not surprising if many projects don't, since this is such a small detail with a limited (but in my opinion positive) effect.
Comment #5
droplet CreditAttribution: droplet commented

You can remove it and monitor your site's performance. Again, we should not only focus on SERP ranking (position 1 or 2); the timing also matters. Take today's news: we don't want the bots crawling it tomorrow night. Bots are not only Google-like search engines.
Nope, that was REAL traffic, not bot traffic. That's why I'd like to hear real stories in this issue rather than other articles. A lot of important info was hidden.
What did you do then? Just adjust the Crawl-delay? And what bots were they?
Comment #6
Antti J. Salminen CreditAttribution: Antti J. Salminen as a volunteer commented

@droplet:
Ok, I misunderstood that part of the case. The points I made about it still apply, though; I really don't understand how this example has anything to do with specifying an explicit Crawl-delay.
I gave examples of Google itself (not because Crawl-delay helps with Google, but to demonstrate that even GoogleBot is not as "intelligent" as they'd like you to believe), AHRefsBot and SemRushBot having caused problems. Those are only the ones I happen to remember.
Google has done it to two sites of mine; both included faceted search, which caused GoogleBot to go on an infinite crawling spree. Modules implementing faceted search for Drupal have attempted to combat this by setting rel="nofollow" on facet links, but for some reason this didn't help in the cases I've observed. For Google the solution obviously could not be Crawl-delay, so I blocked GoogleBot from some of the facet pages in robots.txt (along the lines of the example below).
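As a rough illustration of that kind of block (the URL pattern here is made up; the actual facet paths and query parameters depend on the site and the faceted search module in use):

User-agent: Googlebot
Disallow: /*?f%5B0%5D=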
The other bots I've indeed dealt with by adding a Crawl-delay, although it's not a bad idea to also have harder measures, such as denyhosts/fail2ban kicking into action, if something is being particularly obnoxious or completely ignoring Crawl-delay.
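As one hedged sketch of such a harder measure, fail2ban ships an apache-badbots filter that bans clients whose User-Agent matches a list of known bad bots; a minimal jail could look roughly like this (the log path and ban time are assumptions that vary per server, and this complements rather than replaces Crawl-delay since it bans outright instead of rate limiting):

[apache-badbots]
enabled  = true
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400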
Comment #7
hass CreditAttribution: hass commented

For us it was Yandex, if I remember correctly. Over 10-20 hits per second. I also believe there is a difference in how many pages your site has.
The picture may not be the same with 500 pages compared to 4,000,000 pages.
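(For scale, using those numbers and the 10 second delay discussed above: a full pass over 500 pages takes roughly 1.4 hours, while 4,000,000 pages would take roughly 463 days, so a large site would clearly want a smaller value or per-bot exceptions.)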
Comment #9
Antti J. Salminen CreditAttribution: Antti J. Salminen as a volunteer commented