I've recently been using Nutch (a web crawler) to build a Drupal-specific search engine, so I get to watch how web crawlers behave when they look at Drupal sites. It is appalling. We let crawlers crawl our login page, our request-new-password page, differently sorted views of our tables, and lots of other stuff that is just wasteful. Furthermore, if you ever sit and watch your Apache logs roll by, you'll know that search-engine traffic is a very large percentage of all traffic for many sites. The situation could be fixed by including a robots.txt file.

Or could it? What about multisite configurations? Well, just putting a robots.txt file in the top level directory locks every site into using one file, which clearly won't suffice.

Thus I propose that we adopt for core the strategy I used in my robotstxt module: we add a "robots.txt" path, add a variable of the same name, and output the variable's contents when robots.txt is requested. Administrators can edit the variable in a textarea, and we ship sensible defaults tailored to Drupal sites.

I'll roll a patch and it won't be more than about 10 LOC (minus the actual robots.txt we ship with), but I want to know from a core committer that there is interest.
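
Roughly, the idea looks like this (a sketch only, loosely modeled on the contributed robotstxt module; function and variable names are illustrative, and it assumes the web server hands requests for robots.txt to Drupal instead of serving a physical file):

function robotstxt_menu($may_cache) {
  $items = array();
  if ($may_cache) {
    // Register the robots.txt path so Drupal answers requests for it.
    $items[] = array(
      'path' => 'robots.txt',
      'callback' => 'robotstxt_page',
      'access' => TRUE,
      'type' => MENU_CALLBACK,
    );
  }
  return $items;
}

function robotstxt_page() {
  // Serve the administrator-editable variable as plain text.
  drupal_set_header('Content-Type: text/plain; charset=utf-8');
  print variable_get('robotstxt', "User-agent: *\nDisallow: /admin/\n");
  exit();
}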


Comments

robertDouglass’s picture

My thinking has changed: I now think that Drupal should ship with a default robots.txt and let the robotstxt module suffice for people with multisite needs. The search is on for the optimal robots.txt file. Here's a start:

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
robertDouglass’s picture

The kind reviewers of this "patch" will need to create the file with the above text rather than apply a patch. The file should be called robots.txt and be in the root directory.

Chris Johnson’s picture

I think this is a very good idea.

Here is the robots.txt file I've been using with my 4.5 site. Obviously, some paths have changed in 4.6, 4.7 and 4.8. But perhaps it will give you a couple more ideas.

User-agent: *

Crawl-Delay: 10

Disallow: */add/
Disallow: /?q=admin
Disallow: /admin/
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: /xmlrpc.php
Disallow: ?q=admin
Disallow: cron.php
Disallow: error.php
Disallow: xmlrpc.php
Dries’s picture

Your patch assumes that clean URLs are enabled?

mariagwyn’s picture

I don't know much about robots.txt, but this is what I use, partly as a result of the threads on hiding feeds and print pages:
Disallow: /node/feed
Disallow: /blog/feed
Disallow: /aggregator/sources
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /archive
Disallow: /trackback

I added the recommended pieces above since I didn't have any of them in my file.
Maria

robertDouglass’s picture

It just goes to show that the above approach is all wrong. Dries, what do you think of building this into the menu system, so that modules can do this:


hook_menu()...

$items[] = array(
  'path' => 'some/path',
  'crawl' => FALSE,
);

That way we could generate robots.txt dynamically and take into account all modules' paths, as well as things like clean URLs.
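
A purely hypothetical sketch of how such a generator might work, assuming a 'crawl' flag like the one above existed (drupal_set_header() and module_invoke_all() are real functions; everything else here is made up for illustration):

function robotstxt_generate() {
  $output = "User-agent: *\n";
  // Ask every module for its menu items and pick out the ones flagged as
  // not crawlable via the hypothetical 'crawl' key.
  foreach (module_invoke_all('menu', TRUE) as $item) {
    if (is_array($item) && isset($item['path']) && isset($item['crawl']) && $item['crawl'] === FALSE) {
      // Emit both the clean-URL and the non-clean-URL form of the path.
      $output .= 'Disallow: /' . $item['path'] . "\n";
      $output .= 'Disallow: /?q=' . $item['path'] . "\n";
    }
  }
  drupal_set_header('Content-Type: text/plain; charset=utf-8');
  print $output;
}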

bertboerland’s picture

Please take a look at an old book page I wrote up at http://drupal.org/node/22265.

Especially the option to use:

User-agent: *
Crawl-Delay: 10

Dries was against including robots.txt functionality in 2005 (http://drupal.org/node/14177), but I think it is very much standard to ship with a default robots.txt, and we should. In fact, I would rather ship Drupal with a robots.txt than a favicon.

see also http://cvs.drupal.org/viewcvs/drupal/drupal/robots.txt?hideattic=0&rev=1...

Dries’s picture

I'm OK with a _simple_ robots.txt.

1. Keep it short and simple.
2. Add some documentation so people can extend it as they see fit.

bertboerland’s picture

how about:

# small robots.txt
# more information about this file can be found at
# http://www.robotstxt.org/wc/robots.html
# lines beginning with the pound ("#") sign are comments and can be deleted.

# In case your Drupal site is in a directory
# lower than your docroot (e.g. /drupal),
# please add that directory in front of the paths below.

# to stop a polite robot indexing an exampledir
# add a line like (without the #'s)
# user-agent: polite-bot
# Disallow: /exampledir/

# a list of known bots can be found at
# http://www.robotstxt.org/wc/active/html/index.html
# see http://www.sxw.org.uk/computing/robots/check.html
# for syntax checking

User-agent: *
Crawl-Delay: 10
Disallow: /comment/reply
Disallow: /node/add
Disallow: /files
Disallow: /search
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
Disallow: /?q=admin
Disallow: /xmlrpc.php
Disallow: /?q=admin
Disallow: /cron.php
Disallow: /error.php
Disallow: /xmlrpc.php

It might be a bit long, but it covers most basic options and it is documented. Note that wildcards in robots.txt don't work, so lines like */add/ won't work.

Rewted’s picture

You've got Disallow: /?q=admin in there twice.

robertDouglass’s picture

# robots.txt
#
# This file aims to prevent the crawling and idexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
# 
# For more information about the robots.txt standard, see:
#    http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add a uncommented line (without the #'s), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of know 'bots can be found at:
#   http://www.robotstxt.org/wc/active/html/index.html
# 
# See this site for syntax checking:
#  http://www.sxw.org.uk/computing/robots/check.html

User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout
figaro’s picture

These are usually added as a standard:

# W3C Link checker
User-agent: W3C-checklink
Disallow:

# Exclude stress-testing tools
User-Agent: stress-agent
Disallow: /
Dries’s picture

There are some typos in Robert's text.

robertDouglass’s picture

FileSize
1.65 KB

# robots.txt
#
# This file aims to prevent the crawling and indexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add an uncommented line (without #), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of known 'bots can be found at:
# http://www.robotstxt.org/wc/active/html/index.html
#
# See this site for syntax checking:
# http://www.sxw.org.uk/computing/robots/check.html


User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout

kbahey’s picture

I have been using this since 4.5 or so.

The ideas above are great (clean vs. regular URLs, excluding feeds, print pages, etc.).

User-agent: *
  Crawl-Delay: 10
  Disallow: /database
  Disallow: /includes
  Disallow: /modules
  Disallow: /scripts
  Disallow: /themes
  Disallow: /aggregator
  Disallow: /tracker
  Disallow: /comment/reply
  Disallow: /node/add
  Disallow: /search
robertDouglass’s picture

FileSize
1.7 KB

Added aggregator.

Dries’s picture

Status: Needs review » Needs work

Some lines end with a trailing slash while others don't. Is that intentional?

The robots.txt file doesn't validate at all. Test with http://www.sxw.org.uk/computing/robots/check.html.

robertDouglass’s picture

FileSize
1.71 KB

Crawl-delay is non-standard but obeyed by at least a couple of major spiders. I removed line breaks per the validator's suggestion. I also read that directories must be followed by a trailing slash, so I added that to both the clean and non-clean URL sections, though it is an open question (and probably not consistent) how spiders will handle the non-clean directives.
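
So the pattern in this revision is roughly: directory-style paths carry a trailing slash, file paths do not, and each clean-URL directive gets a /?q= counterpart, e.g.:

Disallow: /admin/
Disallow: /?q=admin/
Disallow: /xmlrpc.php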

robertDouglass’s picture

Status: Needs work » Needs review
Dries’s picture

Status: Needs review » Fixed

Committed to CVS HEAD. Thanks.

ideviate’s picture

Should we add the lines Disallow: /user/login and Disallow: /?q=user/login?

robertDouglass’s picture

Status: Fixed » Active

Yeah, good catch. Dries, do you need a patch?

bertboerland’s picture

If we protect *.txt files, we don't need to list them here anymore.

see http://drupal.org/node/79018

knugar’s picture

I agree with Robert (http://drupal.org/node/75916#comment-123192): it would be great if robots.txt could be created automatically as part of the menu system.

For now, manually editing my robots.txt is just fine, but letting modules define default crawl/no-crawl settings for their menu paths seems like a good idea.

Maybe the crawlability of all such paths could be administered on a special admin page, for people who don't like the defaults.

pcwick’s picture

Version: x.y.z » 5.x-dev
FileSize
602 bytes

This patch adds the lines suggested in #21.

I have very little experience making patches. Hope it works :-)

pcwick’s picture

Category: feature » task
Status: Active » Needs review
Dries’s picture

Status: Needs review » Fixed

Committed to CVS HEAD. Thanks.

sillygwailo’s picture

Status: Fixed » Needs review
FileSize
462 bytes

What's the rationale for disallowing the aggregator? I consider that content, not an administrative function like the other items.

drumm’s picture

Version: 5.x-dev » 6.x-dev

I would like to see this go in the development version first.

cooperaj’s picture

I've just been using Google Webmaster Tools to test out various aspects of the site, including the robots.txt file, and I've come to a startling conclusion.

Disallow: /user/password != Disallow: /user/password/

and

Disallow: /user/password/ *does not include* Disallow: /user/password

I'm running a 5.1 site and I noticed that all the things that shouldn't be indexed are being indexed, e.g. /contact and /user/login.

To properly protect certain paths it is necessary to list both forms:

Disallow: /admin
Disallow: /admin/
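
Applied to the other paths I mentioned, that means listing both forms for each of them as well, for example:

Disallow: /contact
Disallow: /contact/
Disallow: /user/login
Disallow: /user/login/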

Gábor Hojtsy’s picture

Status: Needs review » Patch (to be ported)

Well, aggregator could be content you would like to get indexed (like content gathered from your subsites), or foreign content you would not like to have indexed. I changed the default now to let it be indexed, as you suggest, but this decision differs from site to site. I am not entirely sure this should be ported back, but I am setting the issue to that state as drumm indicated.
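
Sites that do want to keep aggregator content out of the index can simply add the lines back to their own copy of the file, for example:

Disallow: /aggregator
Disallow: /?q=aggregator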

yaph’s picture

FileSize
1.36 KB

Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391#comment-15648

Drupalzilla.com’s picture

There is another patch for robots.txt here:
http://drupal.org/node/180379

Someone recommended that I open a new issue for it. It's my first submitted patch -- I hope I did it right...

Freso’s picture

Version: 6.x-dev » 5.x-dev
Assigned: robertDouglass » Unassigned
Status: Patch (to be ported) » Reviewed & tested by the community
FileSize
697 bytes

The attached patch removes the aggregator entries from the robots.txt in Drupal 5. It would seem that the patch has, in all other respects, already been applied to Drupal 6, except for the trailing-slashes issue, which I'd say is more at home in #180379: Fix path matching in robots.txt. This bug is about providing a default robots.txt, and that very robots.txt is now available in D5, D6, and D7. As soon as D5 has been updated to match the robots.txt that D6 ended up with from this issue, please mark this fixed and/or closed.

(#28 still applies as well, though with a wee bit of fuzz.)

Edit: Updated patch. Had some old stuff in it.

drumm’s picture

Status: Reviewed & tested by the community » Fixed

Committed to 5.x.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.