Problem/Motivation
This is also an issue on Drupal 6 and Drupal 7.
Some of the paths in robots.txt appear to be incorrect, resulting in some pages being indexed when they should not be. They all currently have a trailing /, which I believe is incorrect, and which the Google robots.txt Tester indicates is not working as expected (i.e. it reports pages such as /user/password as "allowed" rather than "blocked").
Disallow: /node/add/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
Disallow: /index.php/node/add/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
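To illustrate why the trailing / matters, here is a minimal sketch using Python's standard urllib.robotparser, which applies plain prefix matching. It is not Google's parser, and the helper name is_allowed and the two rule lists are made up for this example. It only shows that a rule ending in / does not match the page path without the slash.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules, path, agent="*"):
    """Return True if `agent` may fetch `path` under the given robots.txt lines.

    Hypothetical helper for illustration; uses simple prefix matching
    as implemented by the standard library parser.
    """
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, path)


# Rule as currently shipped (with trailing slash).
current = ["User-agent: *", "Disallow: /user/password/"]
# Rule as proposed in this issue (without trailing slash).
proposed = ["User-agent: *", "Disallow: /user/password"]

for label, rules in (("current", current), ("proposed", proposed)):
    for path in ("/user/password", "/user/password/"):
        verdict = "allowed" if is_allowed(rules, path) else "blocked"
        print(f"{label:9} {path:17} {verdict}")
```

With the trailing slash, only /user/password/ is blocked; without it, both forms are blocked, because the rule is a prefix of both paths.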
Related to this, the robots.txt file includes the following lines in the header, pointing to a reference which I don't think is definitive:
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
Proposed resolution
- Remove the trailing / from these paths, so they specify pages, and not directories.
- Add Disallow: /user (no trailing space)
- Change in the header:
From: "http://www.robotstxt.org/robotstxt.html"
To: "https://developers.google.com/webmasters/control-crawl-index/docs/robots..."
Remaining tasks
Should be reviewed by someone more familiar with this matter than me!
User interface changes
None.
API changes
None.
Data model changes
None.
Original report by [username]
n/a
| Comment | File | Size | Author |
|---|---|---|---|
| #16 | robots_paths_incorrect-2581637-16.patch | 717 bytes | spoit |
Comments
Comment #2
iantresman commented
Relevant issues:
Relevant resources:
Comment #3
firoz2456 commented
Comment #4
iantresman commented
@firoz2456
Can you just mention which issue this is a duplicate of? The other relevant issues are different.
Comment #5
firoz2456 commented
I think this issue is similar to https://www.drupal.org/node/1032234
Comment #6
iantresman commented
I think I would argue that that is a different issue. This one is trivially fixable, and doesn't require any changes of functionality. I agree that in the long term, using meta tags is a better solution, but until then, doesn't it make sense to fix the leak, before replacing the roof?
Comment #7
firoz2456 commented
Agreed. Re-open this issue
Comment #8
iantresman commented
Comment #9
no2e commented
About changing the header
We should not link to https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en. It is Google’s own interpretation of the robots.txt standard, and it includes extensions that are not part of the original spec (and therefore not supported by all bots).
If consumer documentation is considered important, maybe we could link to a documentation page on drupal.org instead? On this page we could provide all related links: to the original specification, to documentation on how search engines and other services parse it, to extensions etc.
About blocking /user
Having Disallow: /user could lead to blocks of various user-created URLs, as this would block URLs like
- http://example.com/users
- http://example.com/user-stories
- http://example.com/user-meeting-2015
Not to mention the default user profile URLs (/user/1, etc., unless they have changed with D8?) which many sites may want to get crawled.
Comment #10
iantresman commented
I think that the Google robots.txt spec is good for Google, Bing, Yahoo, and Ask, which I would argue is a significant proportion of web robots, and more important than other interpretations. There does not appear to be an official standard.
Blocking /user does seem to have its drawbacks. There seem to me to be three solutions:
Any of these solutions is better than not implementing any of them! Frankly, I don't care about other robots.
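The drawback of a bare /user rule can be checked with a small sketch, again using Python's standard urllib.robotparser (plain prefix matching, not any particular search engine's parser). The helper blocked_paths and the sample paths are assumptions for illustration; /node/1 is included only as a control.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rules, paths, agent="*"):
    """Return the subset of `paths` that `agent` may NOT fetch.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(rules)
    return [p for p in paths if not parser.can_fetch(agent, p)]


paths = [
    "/users",
    "/user-stories",
    "/user-meeting-2015",
    "/user/1",
    "/user/register",
    "/node/1",  # control path, unrelated to /user
]

# Broad rule: every path starting with "/user" is caught.
broad = ["User-agent: *", "Disallow: /user"]
# Narrow rule: only the registration page and anything below it.
narrow = ["User-agent: *", "Disallow: /user/register"]

print("broad :", blocked_paths(broad, paths))
print("narrow:", blocked_paths(narrow, paths))
```

Under these assumptions the broad rule catches every sample path except /node/1, while the narrow rule catches only /user/register.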
Comment #11
catch
This could probably be changed in a patch release, unless we add dynamic robots.txt which is covered by other issues.
Comment #15
spoit commented
Regardless of specs, I feel that at least these should be added
As it is now, Google as well as Bing are indexing those. While it is true that by disallowing /user/register every other URL like /user/register/for-real will be disallowed too, I feel it's then up to the developer to adjust the robots.txt if needed.
Comment #16
spoit commented
Comment #17
spoit commented
Comment #21
philsward commented
Comment #22
cilefen commented
Yes, this seems to be a duplicate of #180379: Fix path matching in robots.txt.