I've recently been using Nutch (a web crawler) to build a Drupal-specific search engine, which lets me watch how web crawlers behave when they look at Drupal sites. It is appalling. We let crawlers index our login page, our request-new-password page, differently sorted views of our tables, and lots of other stuff that is just wasteful. Furthermore, if you ever sit and watch your Apache logs roll by, you'll know that search-engine traffic is a very large percentage of all traffic for many sites. The situation could be fixed by including a robots.txt file.
Or could it? What about multisite configurations? Just putting a robots.txt file in the top-level directory locks every site into using one file, which clearly won't suffice.
Thus I propose that we adopt for core the strategy I used in my robotstxt module. We add a path alias "robots.txt", add a variable of the same name, and output its contents whenever robots.txt is requested. Administrators can edit the variable in a textarea, and we ship sensible defaults tailored to Drupal sites.
I'll roll a patch and it won't be more than about 10 LOC (minus the actual robots.txt we ship with), but I want to know from a core committer that there is interest.
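For concreteness, here is a rough sketch of that approach using the Drupal 4.7-era hooks. The `robotstxt` variable name follows the contrib module; the default content is purely illustrative:

```php
<?php
// Sketch only: register the path "robots.txt" so Drupal answers the
// request itself (the physical file must not exist, and the webserver
// must hand the request to Drupal).
function robotstxt_menu($may_cache) {
  $items = array();
  if ($may_cache) {
    $items[] = array(
      'path' => 'robots.txt',
      'callback' => 'robotstxt_page',
      'access' => TRUE,
      'type' => MENU_CALLBACK,
    );
  }
  return $items;
}

// Serve the admin-editable variable as plain text.
function robotstxt_page() {
  drupal_set_header('Content-Type: text/plain; charset=utf-8');
  print variable_get('robotstxt', "User-agent: *\nDisallow: /user/login\n");
  exit;
}
?>
```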
Comment | File | Size | Author
---|---|---|---
#35 | robots.txt.d5.allow_aggregator.patch | 697 bytes | Freso
#33 | robots-patch.1.9_0.txt | 1.36 KB | yaph
#32 | robots-patch.1.9.txt | 1.36 KB | yaph
#28 | allow-aggregator-robots-txt.patch.txt | 462 bytes | sillygwailo
#25 | robots.txt_3.patch | 602 bytes | pcwick
Comments
Comment #1
robertDouglass commented:
A change in thinking has occurred. I now think that Drupal should ship with a default robots.txt and let the robotstxt module suffice for people with multisite needs. The search is on for the optimal robots.txt file. Here's a start:
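A minimal illustrative starting point along the lines discussed in this thread (not the exact file posted) might be:

```
User-agent: *
Disallow: /user/login
Disallow: /user/register
Disallow: /user/password
Disallow: /comment/reply
Disallow: /node/add
Disallow: /search
Disallow: /admin
```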
Comment #2
robertDouglass commented:
The kind reviewers of this "patch" will need to create the file with the above text rather than apply a patch. The file should be called robots.txt and be in the root directory.
Comment #3
Chris Johnson commented:
I think this is a very good idea.
Here is the robots.txt file I've been using with my 4.5 site. Obviously, some paths have changed in 4.6, 4.7 and 4.8. But perhaps it will give you a couple more ideas.
Comment #4
Dries commented:
Your patch assumes that clean URLs are enabled?
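(For context: with clean URLs disabled, Drupal paths are reached through the q query parameter, so each rule generally needs a second, query-string form. An illustrative pair:)

```
# Clean URLs enabled:
Disallow: /node/add
# Clean URLs disabled (query-string form):
Disallow: /?q=node/add
```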
Comment #5
mariagwyn commented:
I don't know much about robots.txt, but this is what I use, partly as a result of the threads on hiding feeds and print pages:
```
Disallow: /node/feed
Disallow: /blog/feed
Disallow: /aggregator/sources
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /archive
Disallow: /trackback
```
I added the recommended pieces above since I didn't have any of them in my file.
Maria
Comment #6
robertDouglass commented:
It just goes to show that the above approach is all wrong. Dries, what do you think of building this into the menu system, so that modules can do this:
That way we could generate robots.txt dynamically and take into account all modules' paths, as well as things like clean URLs.
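A purely hypothetical sketch of what such a declaration could look like; the 'robots' key below is invented for illustration and has never been part of hook_menu:

```php
<?php
// Hypothetical: a per-item flag that a robots.txt generator could read.
// The 'robots' key and the mymodule names are invented for illustration.
function mymodule_menu($may_cache) {
  $items = array();
  if ($may_cache) {
    $items[] = array(
      'path' => 'mymodule/private',
      'title' => t('Private listing'),
      'callback' => 'mymodule_private_page',
      'access' => TRUE,
      // A robots.txt builder would emit "Disallow: /mymodule/private".
      'robots' => FALSE,
    );
  }
  return $items;
}
?>
```

A robots.txt callback could then walk the registered menu items, emit a Disallow line for every path flagged this way, and add the /?q= variant automatically when clean URLs are off.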
Comment #7
bertboerland commented:
Please take a look at an old book page I wrote at http://drupal.org/node/22265,
especially the option to use:
Dries was against including robots.txt functionality in 2005 (http://drupal.org/node/14177), but I think it is very much standard to ship with a default robots.txt, and we should. In fact, I would rather ship Drupal with a robots.txt than with a favicon.
See also http://cvs.drupal.org/viewcvs/drupal/drupal/robots.txt?hideattic=0&rev=1...
Comment #8
Dries commented:
I'm OK with a _simple_ robots.txt.
1. Keep it short and simple.
2. Add some documentation so people can extend it as they see fit.
Comment #9
bertboerland commented:
How about:
It might be a bit long, but it covers most basic options and it is documented. Note that wildcards in robots.txt don't work, so lines like */add* won't work.
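For example (under the original robots.txt standard, Disallow values are plain path prefixes, not patterns):

```
# Not standard; crawlers following the original spec ignore the pattern:
Disallow: */add*
# Standard prefix match; covers /node/add and everything beneath it:
Disallow: /node/add
```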
Comment #10
Rewted commented:
You've got Disallow: /?q=admin in there twice.
Comment #11
robertDouglass commented:
Comment #12
figaro commented:
These are usually added as a standard:
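Typical "standard" additions of this sort exclude core files and directories, for example:

```
# Illustrative; not necessarily the exact list posted.
Disallow: /CHANGELOG.txt
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /themes/
```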
Comment #13
Dries commented:
There are some typos in Robert's text.
Comment #14
robertDouglass commented:
Comment #15
kbahey commented:
I have been using this since 4.5 or so.
The ideas above are great (clean vs. regular URLs, excluding feeds, print pages, etc.).
Comment #16
robertDouglass commented:
Added aggregator.
Comment #17
Dries commented:
Some lines end with a trailing slash while others don't. Is that intentional?
The robots.txt file doesn't validate at all. Test with http://www.sxw.org.uk/computing/robots/check.html.
Comment #18
robertDouglass commented:
Crawl-delay is non-standard but obeyed by at least a couple of major spiders. I removed the line breaks per the validator's suggestion. I also read that directories must be followed by a trailing slash, so I added that to both the clean and non-clean URL sections, though it is an open question (and probably not consistent) how spiders will handle the non-clean directives.
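The resulting layout, sketched with illustrative paths:

```
# Paths (clean URLs):
Disallow: /admin/
Disallow: /user/login/
# Paths (no clean URLs; "?" matching is undefined in the original
# standard, so crawler support varies):
Disallow: /?q=admin/
Disallow: /?q=user/login/
```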
Comment #19
robertDouglass commented:
Comment #20
Dries commented:
Committed to CVS HEAD. Thanks.
Comment #21
ideviate commented:
Should we add the lines Disallow: /user/login and Disallow: /?q=user/login?
Comment #22
robertDouglass commented:
Yeah, good catch. Dries, do you need a patch?
Comment #23
bertboerland commented:
If we protect *.TXT files, we don't need to list them here anymore.
See http://drupal.org/node/79018
Comment #24
knugar commented:
I agree with Robert (http://drupal.org/node/75916#comment-123192); it would be great if the robots.txt could be created automatically as part of the menu system.
For now, manually editing my robots.txt is just fine, but letting modules define default crawl/no-crawl rules for menu paths seems like a good idea.
Maybe the crawlability of all such paths could be administered on a special admin page if people don't like the defaults.
Comment #25
pcwick commented:
Patch adds the lines suggested in #21.
I have very little experience making patches. Hope it works :-)
Comment #26
pcwick commented:
Comment #27
Dries commented:
Committed to CVS HEAD. Thanks.
Comment #28
sillygwailo commented:
What's the rationale for disallowing the aggregator? I consider that content, not an administrative function like the other items.
Comment #29
drumm commented:
I would like to see this go in the development version first.
Comment #30
cooperaj commented:
I've just been using the Google Webmaster Tools to test various aspects of the site, including the robots.txt file, and I've come to a startling conclusion.
Disallow: /user/password is not the same as Disallow: /user/password/, and Disallow: /user/password/ does *not* cover /user/password.
I'm running a 5.1 site, and I noticed that all the things that shouldn't be indexed are being indexed, e.g. /contact and /user/login.
To properly protect certain paths, it is necessary to list both forms:
Disallow: /admin
Disallow: /admin/
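Applied across a few illustrative paths, that belt-and-braces pattern looks like:

```
# List each path both without and with the trailing slash:
Disallow: /contact
Disallow: /contact/
Disallow: /user/login
Disallow: /user/login/
```

(Later patches in this thread, #32 and #33, instead removed all trailing slashes, so spider behavior on this point was clearly contested.)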
Comment #31
Gábor Hojtsy commented:
Well, the aggregator could be content you would like to get indexed (like content gathered from your subsites), or foreign content you would not like to have indexed. I have now changed the default to let it be indexed, as you suggest, but this decision differs from site to site. I am not entirely sure this should be backported, but I am setting it to that state in the development version, as drumm indicated.
Comment #32
yaph commented:
Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391#comment-15648
Comment #33
yaph commented:
Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391
Comment #34
Drupalzilla.com commented:
There is another patch for robots.txt here:
http://drupal.org/node/180379
Someone recommended that I open a new issue for it. It's my first submitted patch -- I hope I did it right...
Comment #35
Freso commented:
The attached patch removes the aggregator entries from the robots.txt in Drupal 5. It would seem that the patch has, in all other respects, already been applied to Drupal 6, except for the trailing-slashes issue, which I'd say is more at home in #180379: Fix path matching in robots.txt. This bug is about providing a default robots.txt, and that very robots.txt is now available in D5, D6, and D7. As soon as D5 has been updated to match the robots.txt that this issue ended up with for D6, please mark this fixed and/or closed.
(#28 still applies as well, though with a wee bit of fuzz.)
Edit: Updated patch. Had some old stuff in it.
Comment #36
drumm commented:
Committed to 5.x.
Comment #37
Anonymous (not verified) commented:
Automatically closed -- issue fixed for two weeks with no activity.