In short: robots.txt disallows

/sites/

which, in turn, prevents Google (among other search engines) from indexing, for example, the images directory under the files directory. As an unwanted side effect, images from Drupal sites are left out of search results.

One of the problems with that patch is that it doesn't handle multisite environments, or any environment where the site directory is not "default".

See here #364781: seo implications / google images for a related issue.

Files:
Comment | File | Size | Author | Test result
#18 | robots-2.patch | 811 bytes | c960657 | PASSED: [[SimpleTest]]: [MySQL] 23,312 pass(es).
#9 | robots.txt.patch | 444 bytes | z.stolar | Failed: Failed to apply patch.
#7 | robots.txt.patch | 443 bytes | z.stolar | Failed: 11558 passes, 0 fails, 1 exception
#5 | robots.txt.patch | 447 bytes | z.stolar | Failed: Failed to run tests.
- | robots.txt.patch | 547 bytes | z.stolar | Failed: Failed to apply patch.

Comments

steff2009’s picture

Hi, thank you for your suggestion.

I also had noticed that none of my images are indexed in Google, so I added the following line in my robots.txt:

Allow: /sites/default/files/images

Google webmaster tool checking robots.txt says that now the directory is allowed.

I only have one concern: what's Drupal's reason to disallow /sites directory? Maybe a security reason? I then left "Disallow: /sites/" in my robots.txt, so that only the image directory is crawled.

What's your opinion?

Tks!

z.stolar’s picture

Well, I've seen people mentioning "Allow" rules, but I was under the impression it's not a standard robots.txt rule.
I'll add it and wait for results.

steff2009’s picture

Hi, actually Google Webmaster Tools mentions more than once that a recommended robots file should "Allow all". However, since Drupal stores a lot of information that is not relevant to users' searches, it is advisable to disallow some directories and files.

You can do this to test the behavior of your robots in Google:

1. Go to Google > Webmaster tools > Site configuration > Crawler access > Test robots.txt,
2. Copy and paste the text of your robots file in the related field (the field is pre-filled with the robots.txt currently on your website; if you have recently updated it, it might not show the latest version yet)
3. Specify which directory (for example, the one containing images) you would like to be indexed
4. Click on Test.

In the test results you will see if that specific directory is allowed and supposedly Google should crawl it soon (by the way, how often does Google crawl your site?).

Cheers.

z.stolar’s picture

I tried - Google does allow the "Allow" rule.
Adding Allow: /sites/default/files/images achieves the desired result (for Google!)

I'll submit a new patch.

z.stolar’s picture

Status: Active » Needs review
FileSize
447 bytes
Failed: Failed to run tests. View

Attached.

Dries’s picture

I don't think this is sufficient because on my site, images go under sites/buytaert.net/files/images. I recommend we remove the sites rule.
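To make the gap concrete, here is a minimal Python sketch of the longest-match precedence that Google documents for robots.txt: the most specific matching rule wins, and Allow wins a tie. The `is_allowed` helper and the sample file names are assumptions made for this illustration, not Drupal code, and other crawlers may resolve conflicts differently.

```python
# Minimal sketch of longest-match rule precedence (prefix rules only, no wildcards).
# is_allowed() is a hypothetical helper written for this illustration.

RULES = [
    ("disallow", "/sites/"),
    ("allow", "/sites/default/files/images"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be fetched under the given rules."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        # The longer (more specific) prefix wins; on a tie, allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


print(is_allowed("/sites/default/files/images/photo.jpg"))       # True
print(is_allowed("/sites/buytaert.net/files/images/photo.jpg"))  # False
print(is_allowed("/node/1"))                                     # True
```

Under these assumptions the Allow line only helps sites whose files live under sites/default; a multisite path is still caught by the broader Disallow.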

z.stolar’s picture

FileSize
443 bytes
Failed: 11558 passes, 0 fails, 1 exception View

Here's a new patch

Status: Needs review » Needs work

The last submitted patch failed testing.

z.stolar’s picture

Status: Needs work » Needs review
FileSize
444 bytes
Failed: Failed to apply patch. View

I don't understand why it failed. In any case, I re-checked-out latest CVS version, and recreated the patch.

z.stolar’s picture

@Dries: just a reminder... :-)

yhager’s picture

Status: Needs review » Reviewed & tested by the community

+1 for removing the /sites directory. Not sure what it was doing there in the first place.

Ozeuss’s picture

+1 for this patch.

c960657’s picture

Not sure if this requires a separate bug, but I think we should also remove /modules, /themes, and /misc from robots.txt. There is no reason why a robot should not fetch files in here, in particular for stylesheets and images.

Note that robots.txt is not only used by Googlebot et al. but also by various other web spiders, e.g. software that spiders a website for making an offline copy.

For instance, on the Wayback Machine, note how the archived version of the CNN.com contains stylesheets etc., while the Drupal.org and Ubuntu.com versions do not.
http://web.archive.org/web/20071127060255rn_1/www.cnn.com/
http://web.archive.org/web/20071125103616/http://drupal.org/
http://web.archive.org/web/20080212114445/http://www.ubuntu.com/

On Ubuntu.com, note the four images below “Ubuntu Editions” on the right. Only the last one is displayed in the archive. This is because it is saved in /files, while the others reside in /themes/.

WRT performance, serving static files is very cheap compared to serving pages generated by Drupal. I assume that search engine spiders like Googlebot are clever enough not to fetch your js and css files with the same frequency as they fetch the front page and other volatile parts of your site.

samj’s picture

It would be interesting to hear the justification for inclusion in the first place as excluding path segments via robots.txt is a fairly drastic action, especially as a default. The unintended consequences for e.g. archiving are hard to grasp until they manifest themselves as problems for users. Sure you could exclude e.g. admin interfaces but these are authenticated anyway, while otoh doing so can clean up search results.

In summary I think there should be fairly solid justification for adding anything to robots.txt

FlemmingLeer’s picture

Issue tags: +google, +images, +robots.txt

I recommend that /sites/ be kept in robots.txt

And that you add this in robots.txt

User-agent: Googlebot-Image
Disallow:
Allow: /*

This will index all images no matter where they are.
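The suggestion relies on crawlers obeying only the most specific User-agent group that matches them and ignoring the `*` group once a named group applies. A rough Python sketch of that selection follows. The `select_group` and `is_allowed` helpers are hypothetical, the substring match on the user-agent name is a simplification, and only the empty `Disallow:` line is modelled, since an empty Disallow already blocks nothing.

```python
# Sketch of per-crawler group selection; helper names are assumptions.

GROUPS = {
    "*": [("disallow", "/sites/")],
    "googlebot-image": [("disallow", "")],  # an empty Disallow blocks nothing
}


def select_group(user_agent, groups=GROUPS):
    """Pick the rules of the most specific group matching this crawler."""
    ua = user_agent.lower()
    named = [token for token in groups if token != "*" and token in ua]
    if named:
        # A named group replaces the "*" group entirely.
        return groups[max(named, key=len)]
    return groups.get("*", [])


def is_allowed(user_agent, path):
    for kind, prefix in select_group(user_agent):
        if kind == "disallow" and prefix and path.startswith(prefix):
            return False
    return True


print(is_allowed("Googlebot-Image", "/sites/default/files/a.png"))  # True
print(is_allowed("Googlebot", "/sites/default/files/a.png"))        # False
```

Note that in this sketch the image crawler also escapes every other Disallow line in the `*` group, so the extra group opens up more than just images under /sites/.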

(Some of us still use HTMLarea under the /HTMLarea/ folder .... :/)

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Committed to HEAD, since this fixes the bug in the initial thread.

However, we might need to do some follow-up work here, per c960657 in #13. Since lots of interested parties are already subscribed to this issue, marking this one down to "needs work."

z.stolar’s picture

There should probably be a good reason to include something in robots.txt, so just listing all of Drupal's directories there is not the best approach. This file should prevent files and web pages from being indexed, so your site will be better indexed (as @samj says: no reason to index icons of buttons in FCK editor & Co.).
However, samj's solution isn't ideal, since it prevents CSS files from being cached, and other file types from being indexed (PDF etc). If I understand it correctly, no module.info or INSTALL.txt would get indexed anyway, since there are no pointers to those files from a working site. One would have to look for them actively.

Perhaps the thing to do is to remove as much as possible from robots.txt, but allow modules to add entries to the file, so that, for example, rich-text editors have a chance to say: "Don't index my button icons or background images".

Can any SEO expert tell what will be the effect of having background and UI images from themes and modules, indexed by search engines? Will it be on Drupal's side, or will it damage a site's overall ranking (or will it have no effect at all)?

c960657’s picture

Status: Needs work » Needs review
FileSize
811 bytes
PASSED: [[SimpleTest]]: [MySQL] 23,312 pass(es). View

This patch also removes misc/, modules/, and themes/ as suggested in #13.

Can any SEO expert tell what will be the effect of having background and UI images from themes and modules, indexed by search engines? Will it be on Drupal's side, or will it damage a site's overall ranking (or will it have no effect at all)?

I doubt it will have any effect at all. Also, note that robots.txt does not target search engines exclusively but is used for all non-human agents fetching stuff from your server.

cburschka’s picture

Status: Needs review » Reviewed & tested by the community

This is a very good idea - PHP scripts are already protected by .htaccess, no search engine can index them anyway.

cosmicdreams’s picture

Status: Reviewed & tested by the community » Needs review

There doesn't seem to be much discussion about this change. In my opinion, this patch is not helpful, since none of these directories "should" contain images that I would want indexed. The thought I keep going back to when considering having the misc, modules, and themes directories indexed is a giant haystack of "garbage" images that I wouldn't want to find in an image search. Imagine if a default theme had an image named drupal.jpg and you did an image search for drupal. You'd get spammed by all those dummy images.

So in short, I think the patch that was applied in #16 is sufficient and the patch in #18 should not be committed. To open the discussion a bit more I'll drop the status down to needs review.

JohnForsythe’s picture

This issue needs attention. None of my uploaded product images were getting indexed, and I had no idea why. Fortunately, I took another look at my robots.txt file, and was surprised to see that everything in /sites/ is blocked by default. I just wrote an article to let other people know about the problem:

http://blamcast.net/articles/drupal-seo-mistake

There's no reason to block /sites/ by default. #13 also makes some good points about /modules/ and the others, but /sites/ is especially critical as the default location for uploaded images.

rszrama’s picture

Just as a heads up, John - this has been fixed in D7, but what you want is a backport of the change to D6. It looks like the reason this is still unresolved is to get the proper rules in D7, so perhaps you should open an alternate issue to backport the initial fix to D6?

adrianmak’s picture

subscribe

geerlingguy’s picture

Should we open a new issue for D6, or shall we mark this as patch (to be ported), since there was a patch fixing the original/title issue for this thread...

I think the /sites/ rule should be removed altogether.

DamienMcKenna’s picture

An alternative solution - change where the files are stored.

In almost every site I build you'll find the following in the settings.php file:

$conf['file_directory_path'] = 'files';

For some sites, e.g. true multisite installs, I do variations, e.g.

$conf['file_directory_path'] = 'files/public';
// or
$conf['file_directory_path'] = 'files/intranet';

Scott Reynolds’s picture

An alternative solution - change where the files are stored.

This solution does not work for all instances. For example, I have a bunch of images that are important that I track outside the files directory in source control. These include important badges and site icons.

geerlingguy’s picture

Drupal, up to version 5 or 6 (I can't remember which), used the 'files' directory for file storage. With 5 or 6, Drupal switched to using the 'sites' directory instead, which could've introduced this bug originally. The sites folder is supposed to be there for ease of maintenance; a site owner can wipe and reupload/upgrade the rest of the files, but as long as the sites folder remains, all the site files, settings, modules, etc. are preserved...

Of course, on a few sites I run, we have the files in /files just to keep file paths more sane (and sometimes, we have a symbolic link set up to use /files anyways). But Drupal's documentation encourages people to put their files in sites/example.com/files, so our robots.txt shouldn't restrict crawler access there.

:)

jooplaan’s picture

#18: robots-2.patch queued for re-testing.

wik’s picture

subscribing

idflood’s picture

subscribing

pcoucke’s picture

I've added this at the bottom of robots.txt. It uses wildcards, which are supported by Google:

# Allow images to be crawled
Allow: /sites/*.jpg
Allow: /sites/*.png

You can test this in Google Webmaster tools under site configuration > crawler access

This solution still blocks the /sites/ directory but allows the jpg and png extensions.
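Here is a rough Python sketch of how such wildcard rules could be evaluated, assuming Google-style matching where `*` matches any run of characters, a trailing `$` anchors the end of the URL, and the longest matching pattern wins. The `pattern_to_regex` and `is_allowed` helpers and the sample paths are made up for this illustration.

```python
import re

# Sketch of wildcard rule matching; helper names and sample paths are assumptions.

RULES = [
    ("disallow", "/sites/"),
    ("allow", "/sites/*.jpg"),
    ("allow", "/sites/*.png"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    best_len, allowed = -1, True  # a path that matches no rule is allowed
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            # Pattern length stands in for specificity; allow wins a tie.
            if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
                best_len, allowed = len(pattern), kind == "allow"
    return allowed


print(is_allowed("/sites/default/files/photo.jpg"))       # True
print(is_allowed("/sites/all/modules/views/README.txt"))  # False
print(is_allowed("/sites/default/files/report.pdf"))      # False
```

Because the patterns have no trailing `$`, they also match URLs that merely contain `.jpg` or `.png` after /sites/, and other upload types such as PDFs stay blocked.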

fabianderijk’s picture

subscribing

robertDouglass’s picture

Priority: Normal » Major
Status: Needs review » Needs work

#31 looks like a good approach, but wouldn't it be more efficient to make a blacklist using that technique? What we really want to avoid being indexed, at all costs, is *.php. *.info, *.module, *.theme, *.inc also belong on the list. *.css too.

As this has a huge effect on SEO I'm bumping to major. http://blamcast.net/articles/drupal-seo-mistake

Damien Tournoud’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs work » Patch (to be ported)

Let's consider 7.x fixed.

There is no real point in indexing anything in modules/, misc/ and themes/.

Back to D6 to consider a backport.

z.stolar’s picture

@Robert: the files you mentioned shouldn't normally be indexed, since search engines don't crawl directories, they crawl websites. Unless a php/module/info/etc file is directly linked from a web page, there is no risk for it to be indexed at all.

droplet’s picture

Version: 6.x-dev » 7.x-dev
Status: Patch (to be ported) » Needs work

Disallow: /contact/

I think we should remove this too.
There is no reason to block the contact form. Is it to prevent spamming? We may have some important messages for users on the contact page.

robertDouglass’s picture

Version: 7.x-dev » 6.x-dev
Priority: Major » Normal
Status: Needs work » Patch (to be ported)

@droplet: I suggest you open a new issue for /contact/ and let this one remain for sites and D6

Here's what we've got in D7 for anyone who wants to review: http://drupalcode.org/viewvc/drupal/drupal/robots.txt?view=markup

grendzy’s picture

Status: Patch (to be ported) » Reviewed & tested by the community

#9 is the patch that webchick committed to HEAD, and the same patch applies cleanly to 6.x.

fenstrat’s picture

RTBC x 2 for #9. Applies cleanly to 6.x

butler360’s picture

Subscribing.

genox’s picture

shubshcribe

BeatnikDude’s picture

`(Òvó) - subscribing

milkmiruku’s picture

sub

sunward’s picture

I manually changed the file and will make a note not to replace on the next update.

I do see a parse error from google:

Line 21: Crawl-delay: 10 - Rule ignored by Googlebot

Now to get the directory re-indexed.

mikl’s picture

I hope this will be included in 6.20. This seems to be a major problem for Drupal-sites.

droplet’s picture

@sunward,
Bing & Yahoo support Crawl-delay
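As an example of a client that does read the directive, Python's standard-library `urllib.robotparser` exposes it through `crawl_delay()`. The sketch below is only an illustration: the three-line robots.txt combines the Crawl-delay line quoted above with one Disallow rule, and `examplebot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# A cut-down robots.txt for illustration; "examplebot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay in seconds requested for this crawler, or None if the file sets none.
print(parser.crawl_delay("examplebot"))                   # 10
print(parser.can_fetch("examplebot", "/admin/settings"))  # False
print(parser.can_fetch("examplebot", "/node/1"))          # True
```

A crawler that ignores the directive, as Google's report says Googlebot does, still applies the Allow and Disallow rules in the same group.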

Cyberwolf’s picture

Subscribing.

FlowerOS’s picture

Please do it already. I get tired of manually removing it on every release, on every site.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Thanks all! #9 now committed to Drupal 6 too. Will be included with the next release.

mike dodd’s picture

I am not 100% sure about allowing robots complete access to sites/. Access to the user-uploaded files area is fine, but this change means that all of the images contained within all of your modules will be indexed. That means hundreds of extra images being indexed. Granted, these will link to your site, and that can be seen as an SEO boost; however, from a user-experience standpoint, do you really want all of these graphics included as well? Would it not make more sense to allow access to /sites/default/files (or wherever your files are located), or to restrict access to /sites/all/ (or whichever site directories you have that don't contain the user-uploaded files)?

I don't really have an answer, but I am slightly reluctant to have all of my modules' images indexed.

I realize that it may be helpful for some people, and that it will index the ones that are directly called from the page, which may be only a dozen or so images. Still, I just wanted to flag this as a potential issue.

JohnForsythe’s picture

Glad this is fixed, thanks everyone.

teezee’s picture

A side note maybe, but why is /contact/ in robots.txt?

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/

I mean, the contact module is core, but on most sites something like a contact section (contact press department, contact webmasters) with important address information will not be indexed if you have aliases like /contact/press or /contact/webmaster.

Is it really up to Drupal core to enforce /contact/ as a disallowed path just because core contains a module that uses /contact ?
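A small Python sketch of plain prefix matching shows which aliases the rule catches. The `is_blocked` helper is hypothetical, and `/contact-us` is a made-up alias added for contrast.

```python
# Prefix matching against the clean-URL Disallow rules quoted above.
# is_blocked() is a hypothetical helper written for this illustration.

DISALLOWED = [
    "/admin/",
    "/comment/reply/",
    "/contact/",
    "/logout/",
    "/node/add/",
    "/search/",
]


def is_blocked(path):
    """Return True if any Disallow prefix matches the start of the path."""
    return any(path.startswith(prefix) for prefix in DISALLOWED)


print(is_blocked("/contact/press"))      # True
print(is_blocked("/contact/webmaster"))  # True
print(is_blocked("/contact"))            # False
print(is_blocked("/contact-us"))         # False
```

Because the rule ends in a slash, the bare /contact page stays crawlable in this sketch while everything beneath it is blocked.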

geerlingguy’s picture

@teezee - Please open up a new issue (one might already exist, too) for that.

Fabianx’s picture

Subscribing

hedac’s picture

I prefer to add the Allow for my specific folders I want to be indexed. such as imagecache or so... instead of removing the disallow and let google index everything under sites... which I don't want.

geerlingguy’s picture

Well, this patch takes care of the main issue. If you have files in /sites/* in particular that you'd like to hide, you can modify your robots.txt file to disallow them... this patch simply sets a new sensible default, which will apply to thousands, if not millions, of websites.

sunward’s picture

Title: Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt » time

I changed the file and went to google webmaster to resubmit the sitemap to try to get the site crawled again.

The site has been crawled by Googlebot and the new robots.txt is loaded. The images have not been indexed yet, so this will take time to affect websites.

rszrama’s picture

Title: time » Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt

droplet’s picture

hedac’s picture

Interesting article about robots.txt and Drupal. It mentions some other errors in robots.txt:
http://tips.webdesign10.com/robots-txt-and-drupal

betancourt’s picture

Will it be solved for Drupal 7?

I am just wondering if it's completely safe to allow /sites to be crawled. Are there any downsides or risks?

Is there any official response/fix from Drupal to this issue?

Many Thanks

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

fgm’s picture

@betancourt: see Ryan's confirmation in #22 : this has been solved for D7.

ANDRZEJ SOSNOWSKI’s picture

PLEASE INCLUDE THIS!!!

VisualFox’s picture

This is the robots.txt I am using for Drupal 6. It works for me. It's a little annoying that I have to copy it every time I update Drupal core, but I guess that's what batch/sh is for...

This version uses a lot of tips and fixes I found in the various threads about this issue around the Drupal website.

http://www.visualfox.me/blog/drupal-6x-robottxt

Enjoy!

anusornwebsite’s picture

Title: Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt » Problem 404 (Not found) and Duplicate title tags in Google Webmaster Tools
Version: 6.x-dev » 7.0-rc4
Assigned: Unassigned » anusornwebsite
Priority: Normal » Major

- I use an alias (sitename/oldalias/x) instead of the URL path (sitename/node/x), and I changed the alias (sitename/oldalias/x) to a new alias (sitename/newalias/x). Google doesn't find my page in its search engine or in Webmaster Tools. I want to solve this problem.

- I have Duplicate title tags for example,
newalias/1 - this is what I want to use but below not
newalias/1?language=en - Duplicate title tags
newalias/1?language=th - Duplicate title tags
oldalias/1 - 404 (Not found)
oldalias/1?language=en - 404 (Not found)
oldalias/1?language=th - 404 (Not found)
node/1 - Duplicate title tags
node/1?language=en - Duplicate title tags
node/1?language=th - Duplicate title tags

If I go to robots.txt and write this:
Disallow: /oldalias/
Disallow: /node/
will the URL paths I don't want disappear?

grendzy’s picture

Title: Problem 404 (Not found) and Duplicate title tags in Google Webmaster Tools » Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt

This queue is for issues in the Drupal core code. Please visit http://drupal.org/support to see what your support options are if you need more assistance.

anusornwebsite’s picture

Thank you for helping me. I think aliases involve the URL alias module, and robots.txt is created by Drupal. I'm sorry for the misunderstanding. Thanks again.