Problem/Motivation
This is also an issue on Drupal 6 and Drupal 7.
Some of the paths in robots.txt appear to be incorrect, resulting in some pages being indexed when they should not be. They all currently have a trailing /, which I believe is incorrect, and the Google robots.txt Tester indicates that the rules are not working as expected (i.e. it reports pages such as /user/password as "allowed" rather than "blocked").
Disallow: /node/add/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
Disallow: /index.php/node/add/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
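To illustrate why the trailing / matters, here is a minimal sketch that assumes plain prefix matching, which is how the original robots.txt convention treats Disallow values. The helper name is_disallowed and the shortened rule lists are hypothetical and only for illustration. Real crawlers may apply extra rules on top of this.

```python
# Hypothetical helper: assumes a Disallow value blocks any path that starts with it.
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow value is a prefix of the requested path."""
    return any(path.startswith(rule) for rule in disallow_rules)


# Rules as currently shipped (trailing slash) and as proposed (no trailing slash).
current_rules = ["/user/password/", "/user/login/"]
proposed_rules = ["/user/password", "/user/login"]

# The page itself is requested without a trailing slash.
print(is_disallowed("/user/password", current_rules))   # False: "/user/password/" is not a prefix
print(is_disallowed("/user/password", proposed_rules))  # True: the rule now matches the page
```

Under that assumption, the current rules only block URLs below /user/password/, never the page /user/password itself.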
Related to this, the robots.txt file includes the following lines in the header, which point to a resource that I don't think is definitive:
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
Proposed resolution
- Remove the trailing / from these paths, so they specify pages, and not directories.
- Add
Disallow: /user
(no trailing space)
- Change in the header:
From: "http://www.robotstxt.org/robotstxt.html"
To: "https://developers.google.com/webmasters/control-crawl-index/docs/robots..."
Remaining tasks
Should be reviewed by someone more familiar than me on this matter!
User interface changes
None.
API changes
None.
Data model changes
None.
Original report by [username]
n/a
Comment | File | Size | Author |
---|---|---|---|
#16 | robots_paths_incorrect-2581637-16.patch | 717 bytes | spoit |
Comments
Comment #2
iantresman commented
Relevant issues:
Relevant resources:
Comment #3
firoz2456 (as a volunteer) commented
Comment #4
iantresman commented
@firoz2456
Can you mention which issue this is a duplicate of? The other relevant issues are different.
Comment #5
firoz2456 (as a volunteer) commented
I think this issue is similar to https://www.drupal.org/node/1032234
Comment #6
iantresman commented
I think I would argue that that is a different issue. This one is trivially fixable, and doesn't require any changes of functionality. I agree that in the long term, using meta tags is a better solution, but until then, doesn't it make sense to fix the leak, before replacing the roof?
Comment #7
firoz2456 (as a volunteer) commented
Agreed. Re-open this issue
Comment #8
iantresman commented
Comment #9
no2e commented
About changing the header
We should not link to https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en. It is Google’s own interpretation of the robots.txt standard, and it includes extensions that are not part of the original spec (and are therefore not supported by all bots).
If consumer documentation is considered important, maybe we could link to a documentation page on drupal.org instead? On this page we could provide all related links: to the original specification, to documentation on how search engines and other services parse it, to extensions, etc.
About blocking /user
Having
Disallow: /user
could block various user-created URLs, because it would also match URLs like
http://example.com/users
http://example.com/user-stories
http://example.com/user-meeting-2015
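To illustrate the over-blocking, here is a small sketch using Python's standard urllib.robotparser. It is used only as one example of a prefix-matching parser, and other crawlers may interpret the rule differently. The example.com URLs are the placeholders from above, plus /user/1 and /node/1 for comparison.

```python
from urllib.robotparser import RobotFileParser

# Parse a minimal robots.txt containing only the proposed rule.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /user"])

urls = [
    "http://example.com/users",
    "http://example.com/user-stories",
    "http://example.com/user-meeting-2015",
    "http://example.com/user/1",
    "http://example.com/node/1",
]

# Map each URL to whether a generic crawler ("*") may fetch it.
results = {url: parser.can_fetch("*", url) for url in urls}

for url, allowed in results.items():
    print(url, "allowed" if allowed else "blocked")
```

With this parser, every URL whose path starts with /user is reported as blocked, and only /node/1 stays allowed.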
Not to mention the default user profile URLs (/user/1, etc., unless they have changed with D8?), which many sites may want to get crawled.
Comment #10
iantresman commented
I think that Google's robots.txt spec is good for Google, Bing, Yahoo, and Ask, which I would argue account for a significant proportion of web robots and are more important than other interpretations. There does not appear to be an official standard.
Blocking /user does seem to have its drawbacks. There seem to me to be three solutions:
Any of these solutions is better than not implementing any of them! Frankly, I don't care about other robots.
Comment #11
catch
This could probably be changed in a patch release, unless we add dynamic robots.txt which is covered by other issues.
Comment #15
spoit (at Wieni) commented
Regardless of specs, I feel that at least these should be added
As it is now, Google as well as Bing are indexing those. It is true that disallowing
/user/register
also disallows every other URL beneath it, such as
/user/register/for-real
but I feel it is then up to the developer to adjust the robots.txt if needed.
Comment #16
spoit (at Wieni) commented
Comment #17
spoit (at Wieni) commented
Comment #21
philsward commented
Comment #22
cilefen (at Institute for Advanced Study) commented
Yes, this seems to be a duplicate of #180379: Fix path matching in robots.txt.