Drupal makes the default files directory /sites/default/files. This directory is blocked from search engines by robots.txt, via the rule that blocks everything in the sites directory, so none of the uploaded files will be read by search engines or show up in search results.
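
For reference, the directory rules in the stock robots.txt that ships with Drupal look roughly like this (quoting from memory, so the exact list may differ slightly); the /sites/ line is the one that catches the files directory:

User-agent: *
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /sites/
Disallow: /themes/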

Thinking through this briefly: the Allow directive in robots.txt isn't respected by all search engines, so that isn't a fix. Shouldn't we leave the sites directory block in place, so all multi-site install files stay properly blocked, and instead make the files directory /files/default/ or /files/example.com?

I know this may be revisiting the issue, but it's really not search engine friendly, and your average user isn't going to know to update their robots.txt file.

Comments

ScoutBaker’s picture

Priority: Critical » Normal

There was a lot of discussion over where to put the default files directory over at http://drupal.org/node/194369. This particular point did not come up, as far as I remember.

Based on my understanding of the priority levels, I don't think this is critical as it does not break Drupal. Someone else can change the priority back if they disagree.

AFAIK, Allow is a Google extension, so I agree that's not a solution. Based on the discussion in the issue noted above, the current location of sites/default/files was the best option anyone came up with. I don't have a better idea for a solution right now, but I'll think on it.

catch’s picture

Priority: Normal » Critical

There were plenty of reasons to go with sites/default/files. Original issue is here: http://drupal.org/node/191310

Changing the directory to something else would require at least one additional step in the installer, since users would have to grant write permissions to their root Drupal directory for it to write directly to /files, or create the files directory manually and grant permissions to that.

I'd support documenting this in INSTALL.txt, and possibly a change to robots.txt if we don't mind an invalid one, since Google and Yahoo at a minimum respect the "Allow" directive. But I don't agree this is critical.

catch’s picture

Priority: Critical » Normal

Cross-posted, back to normal.

ScoutBaker’s picture

Since Allow is not part of the robots.txt standard, I would rather not use it in the robots.txt supplied with the installation.

We still need to document this and have the person setting up the site make changes to support all of the other search engines. It might confuse them to say that Google and Yahoo will work, but go make changes so all the others will too.

+1 for updating INSTALL.txt with the appropriate information on this.

catch’s picture

Status: Active » Needs review
FileSize
791 bytes

OK here's a first run.

mfer’s picture

I think this is a bigger issue for newbie Drupal users. This is a usability problem.

Basically, we are saying that for a newbie user to have the files they added via Drupal show up in search engines, they have to edit the robots.txt file.

We will get a lot of support requests about this and it will annoy a lot of people. Who reads the install file? Who reads the instructions? Most newbie users don't. They download Drupal, unpack it, and go to the directory to see what happens when Drupal loads. And Drupal should work well like this.

I'll read up on the other issue and review the patch in a bit.

catch’s picture

mfer, it was a critical usability issue that led to the directory being created in sites/default/files in the first place. The current open issue looking at alternatives is here: http://drupal.org/node/98824 (also long).

Also, I should mention this location is specified in the RC2 announcement as of last night, so there's no way we can change it for D6.

If Disallow: /sites/default is only there to save bandwidth, would it be too serious to remove the restriction? It's not a security issue, and the main reason for excluding it appears to be to avoid search engines unnecessarily crawling that directory when there's nothing worth looking at in there; that assumption no longer holds. We could still disallow sites/default/modules and sites/default/themes explicitly.

JirkaRybka’s picture

You mean sites/all/modules and sites/all/themes, right?

catch’s picture

I meant both actually - I was under the impression some people use sites/default/themes (or modules) as well.

So:

Disallow: /sites/default/themes/
Disallow: /sites/default/modules/
Disallow: /sites/all/modules/
Disallow: /sites/all/themes/

??

If someone's setting up multisite, then they can sort out their own robots.txt file (or use the robots module).

mfer’s picture

I just did a little reading to try and get caught up on the issue...
http://drupal.org/node/98824
http://drupal.org/node/166169
http://drupal.org/node/191310
http://drupal.org/node/190283
http://drupal.org/node/194369

Personally, until a few days ago I was a fan of /sites/default/files and the equivalents for multi-site installs. But a newbie user ran into a search engine friendliness (SEF) issue when doing this setup on D5, and it prompted me to rethink things.

In the 5 issues above, the only mention of search engines I saw was one comment saying there might be an SEO issue. Well, here it is.

I currently see 2 possibilities:
1) We change the structure from /sites/default/files and /sites/example.com/files to /files/default and /files/example.com.
Pro: This would leave files available to be indexed by search engines no matter how many domains you have in a multi-site install.
Con: There is a chance someone using OS X will wipe out the files directory when they upgrade.

2) We modify the robots.txt file and update the documentation for the default install and multi-site setups. We start by changing the robots.txt file to remove Disallow: /sites/. Then we add:

Disallow: /sites/default/themes/
Disallow: /sites/default/modules/
Disallow: /sites/default/settings.php
Disallow: /sites/default/default.settings.php
Disallow: /sites/all/modules/
Disallow: /sites/all/themes/

In the instructions for a multi-site setup, we add documentation explaining how to add additional entries to the robots.txt file for each site in the setup.
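
For a hypothetical site at sites/example.com, those per-site additions would look something like this (example.com is just a placeholder):

Disallow: /sites/example.com/themes/
Disallow: /sites/example.com/modules/
Disallow: /sites/example.com/settings.php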

The big con of this style is you have to update your robots.txt file every time you add a new site to a multi-site setup.

I'm sure there is more to this. But I don't think we can rely on documentation to solve this problem. If you haven't read the documentation, or are a new user, this is a usability shortcoming.

Personally, I think this is a critical regression. Drupal 6 will be less SEF than Drupal 5 out of the box if we go forward with the current situation.

mfer’s picture

A 3rd option:

Use the Allow directive in the robots.txt file. http://en.wikipedia.org/wiki/Robots.txt#Allow_directive

Both Google and Yahoo support this directive. But it's not standard, and I'm sure there are search engines that don't support it. I couldn't find anything on Microsoft supporting it (though I only searched for a few minutes). Personally, I'd rather stick with the standard. But this is an option.
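
To sketch what that would mean for our robots.txt (untested, and ordering may matter, since some parsers honor the first matching rule while Google picks the most specific one):

Allow: /sites/default/files/
Disallow: /sites/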

catch’s picture

Status: Needs review » Needs work

With multisite you can still set your file directories to something like files/sitekey; it's just not documented very much. At the moment, I'd go with option 2. Marking as needs work for those (sensible, IMO) changes to robots.txt.
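
For anyone who hasn't seen it, that per-site override goes in the site's settings.php; if I remember the variable name right, it's:

$conf['file_directory_path'] = 'files/sitekey';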

ScoutBaker’s picture

Status: Needs work » Needs review

#10 is the alternative that I had thought of suggesting; mfer beat me to the punch.

-1 to #11; my thoughts on this are in #4. Also, I feel this discrepancy would generate numerous support requests. If someone's going to notice search engines not seeing their download directory, they'll be just as upset if it only works partially.

If we went with the patch in #5, we'd need an accompanying documentation page to refer to with more information, or updates to an existing page if there's something appropriate already. I don't have time to go look right now.

catch’s picture

FileSize
818 bytes

Or I could just roll a patch. I also took out the INSTALL.txt changes, since they wouldn't be valid with the robots.txt changes.

Gábor Hojtsy’s picture

Why do we block the sites folder at all in the first place?

catch’s picture

robots.txt says:

# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.

I guess it's to prevent 403 errors and hits on theme PNGs and JS files? Maybe?

keith.smith’s picture

This seems to be the original issue, where /sites just sorta' appeared in toto in a list with the other directories off of a root Drupal install.

catch’s picture

FileSize
425 bytes

Well I can't think of a good reason for it to be in there, so here's one to take it out. If someone really wants to restrict it from crawlers, there's nothing stopping them.

ScoutBaker’s picture

FileSize
914 bytes

Since the standard is to place contrib modules and themes under the sites directory structure, why would we want to just open up the entire thing?

mfer's option to disallow the default folders included with core in #10 seems a better solution. It keeps the crawlers out of the recommended directories, but lets the files be picked up. This solution would require documenting the additional edits for multisites.

Alternative patch attached.

keith.smith’s picture

@ScoutBaker: I think the .htaccess directives protect most of the stuff in sites/ already (with the exception of images and such).
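
For reference, I mean the FilesMatch block near the top of the stock .htaccess, which (from memory, so check the shipped file for the exact pattern) looks roughly like:

<FilesMatch "\.(engine|inc|info|install|module|profile|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)$">
  Order allow,deny
</FilesMatch>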

mfer’s picture

.htaccess doesn't work on all systems (IIS, for example). I don't think we should rely on .htaccess files for this.

ScoutBaker’s picture

@keith.smith: Aren't there hosts that restrict .htaccess? In that case, /sites would lose even that protection. And even where it works, I don't think having the entries in robots.txt hurts.

If we're going to remove robots.txt coverage for contrib modules and themes, then why are we protecting core using robots.txt? I don't know what the original reasoning to add the core directories was, but it seems likely that, if the reason is still valid, it would apply to contrib directories as well.

(Edit: cross-posted with mfer)

keith.smith’s picture

Granted. But if your .htaccess file isn't working to protect those files (and clearly it does not on some systems), then wasn't your only protection from this stuff being indexed previously whether or not a bot obeyed the robots.txt file? Or is there a piece of the puzzle I'm missing here?

(Edit: And I cross-posted with ScoutBaker)

catch’s picture

Thing is, all .inc, .php, etc. files are going to return 403 for crawlers anyway. AFAIK this is to avoid unnecessary 403 errors rather than for security (and possibly to avoid some traffic on JS and images).

JirkaRybka’s picture

Not everywhere, as said above. I'm on a host where the .htaccess protection doesn't work (if I uncomment that in .htaccess, I get a WSOD with an internal server error). If I navigate (as an anonymous user) to, say, includes/theme.inc on my site, I get the code as plain text on my screen. Nothing stops robots from indexing it once they get a link (not likely, but anyone can post one anywhere on the web). robots.txt is a chance to stop the robot: not a guarantee, but likely to work.

catch’s picture

Ok I didn't know that :)

In that case #19 looks like a good option to me.

keith.smith’s picture

FileSize
3.36 KB

Attached is a patch with an initial attempt at documenting what I understand to be the behavior in #19. Note that this is likely to be pretty fragile, in that we're depending on people to add these entries.

And though large multi-site configurations likely don't need "one more thing" to worry about, I suspect most of them can or will automate this. It's the people like me, who just do these every now and again, that will almost certainly forget.

On the other hand, I'm not certain the world comes to an end if they (or I) don't do it.

ScoutBaker’s picture

+1 for #27. The additional text describing the updates is just what we needed.

keith.smith’s picture

FileSize
3.54 KB

Bleh. Even though it looks better the other way, this patch eliminates the extra leading space I had between the "#" and the directives in the commented-out example guidelines in robots.txt.

gpk’s picture

For users whose .htaccess file works as intended, could this be a bit "belt-and-braces"? For them, this is surely only necessary if they don't want theme files (and some module files, e.g. CSS and images) crawled. On the other hand, is it possible some users might actually want e.g. theme images to be crawled? Maybe it would be helpful to qualify this statement a bit: "As you add sites to your configuration, you should edit the robots.txt file to include your site-specific configuration files and any module and theme directories." to explain the ins and outs...

It's also advertising the existence of the settings.php file, which in most cases is invisible/inaccessible anyway. Is this wise? If it can be accessed directly then you've probably got a security problem anyway.

Sorry, my understanding of all the issues is too limited to do more than throw in these half thought-through comments!

catch’s picture

Status: Needs review » Needs work

gpk's right about the settings.php file, no way that should be listed.

ScoutBaker’s picture

Version: 6.x-dev » 7.x-dev

This issue still applies to Drupal 7. The patch in #29 needs a reroll and to address gpk and catch's concerns in #30 and #31.

Owen Barton’s picture

Status: Needs work » Needs review
FileSize
699 bytes
642 bytes

Here are 2 patches. The first is a very simple one that uses the '*' wildcard syntax, accepted by at least the major search engines, to exclude just the modules and themes directories. This fixes the immediate problem with uploaded files (e.g. PDFs) not being searchable.
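
To sketch the idea (the attached patch has the actual lines, which may differ), the wildcard version replaces the blanket /sites/ block with something like:

Disallow: /sites/*/modules/
Disallow: /sites/*/themes/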

The second is a more radical variant that opens things up to the degree I think is appropriate. I can't see any reason for limiting access to the theme and module directories, since allowing them lets search engines index all the images that make up the page (and image search is *very* popular - it's still the #2 link among Google's tools). I would not be surprised if some search engines are building a rough idea of the page layout behind the scenes so they can better determine relevancy, and allowing images helps them do that. If your PHP source code is visible then you have a serious security problem, so I don't think that is something to target here.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

I think we can all agree on the non-radical patch, so I am setting RTBC for that.

mfer’s picture

This change opens up settings.php to crawlers. I'm not sure I like listing it in the robots.txt file either, though. Does this matter? Thoughts?

If you try to request settings.php directly, you'll find it just returns an empty page. I wonder how bots respond to that, since they get a legitimate "page found" header.

moshe weitzman’s picture

IMO, that does not matter. If a robot ever got there (how would it discover that URL? Very rare), it would see an empty page and probably not record it in its index. What's your point? We already said that if you are disclosing your PHP you have bigger, out-of-scope problems.

mfer’s picture

My point was full disclosure; that's why I left it at RTBC. I'm happy with this change. I checked it against the robots.txt spec and how the major players deviate from it. Looks good to me.

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

codecowboy’s picture

Status: Needs work » Reviewed & tested by the community

I think we're happy with this one. It's been bikeshedded enough. How changing robots.txt broke the tests, I don't know.

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

sun’s picture