robots.txt paths incorrect [#2581637]

Problem/Motivation

This is also an issue on Drupal 6 and Drupal 7.

Some of the paths in robots.txt appear to be incorrect, so some pages that should be blocked are being crawled and indexed. The paths below all currently have a trailing /, which I believe is incorrect, and the Google robots.txt Tester confirms the rules are not working as expected (i.e. it reports pages such as /user/password as "allowed" rather than "blocked").

Disallow: /node/add/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
Disallow: /index.php/node/add/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
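
To illustrate why the trailing slash matters, here is a minimal sketch using Python's standard-library urllib.robotparser, which compares rules as path prefixes. The two rule sets and the variable names are made up for this illustration; this is not how Google's tester is implemented, only a way to reproduce the same "allowed" result locally.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# Rule as it currently ships, with a trailing slash.
with_slash = build_parser("User-agent: *\nDisallow: /user/password/\n")
# Rule as proposed, without the trailing slash.
without_slash = build_parser("User-agent: *\nDisallow: /user/password\n")

# "/user/password/" is not a prefix of "/user/password", so the page stays crawlable.
print(with_slash.can_fetch("*", "/user/password"))     # True  (allowed)
print(without_slash.can_fetch("*", "/user/password"))  # False (blocked)
```

With the trailing slash, only URLs beneath /user/password/ are matched, which is why the page itself is reported as allowed.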

Related to this, the robots.txt file includes the following lines in its header, which point to a reference that I don't think is definitive:

# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

Proposed resolution

Remove the trailing / from these paths, so they specify pages, and not directories.
Add Disallow: /user (no trailing slash)
Change in the header:
From: "http://www.robotstxt.org/robotstxt.html"
To: "https://developers.google.com/webmasters/control-crawl-index/docs/robots..."

Remaining tasks

Should be reviewed by someone more familiar with this matter than I am!

User interface changes

None.

API changes

None.

Data model changes

None.

Original report by [username]

n/a

Comment	File	Size	Author
#16	robots_paths_incorrect-2581637-16.patch	717 bytes	spoit

Comments

Comment #1

7 October 2015 at 09:07

iantresman created an issue. See original summary.

Comment #2

iantresman commented 7 October 2015 at 10:37

Relevant issues:

Relevant resources:

Robots.txt Specifications (Google, Bing, Yahoo, and Ask)

Comment #3

firoz2456 commented 7 October 2015 at 09:25

Status:

Active

» Closed (duplicate)

Comment #4

iantresman commented 7 October 2015 at 09:38

@firoz2456
Can you mention which issue this is a duplicate of? The other relevant issues are different.

Comment #5

firoz2456 commented 7 October 2015 at 09:55

I think this issue is similar to https://www.drupal.org/node/1032234

Comment #6

iantresman commented 7 October 2015 at 10:18

I think I would argue that that is a different issue. This one is trivially fixable, and doesn't require any changes of functionality. I agree that in the long term, using meta tags is a better solution, but until then, doesn't it make sense to fix the leak, before replacing the roof?

Comment #7

firoz2456 commented 7 October 2015 at 10:37

Status:

Closed (duplicate)

» Active

Agreed. Reopening this issue.

Comment #8

iantresman commented 7 October 2015 at 10:43

Issue summary:

View changes

Comment #9

no2e commented 11 October 2015 at 01:44

About changing the header

We should not link to https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en. It is Google’s own interpretation of the robots.txt standard, and it includes extensions that are not part of the original spec (and therefore not supported by all bots).

If consumer documentation is considered important, maybe we could link to a documentation page on drupal.org instead? On this page we could provide all related links: to the original specification, to documentation on how search engines and other services parse it, to extensions, etc.

About blocking `/user`

Having Disallow: /user could block various user-created URLs, because the rule matches every path that starts with /user. It would block URLs like

http://example.com/users
http://example.com/user-stories
http://example.com/user-meeting-2015
etc.

Not to mention the default user profile URLs (/user/1, etc., unless they have changed with D8?) which many sites may want to get crawled.
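
The over-blocking described above can be sketched in a few lines of Python. The function name blocked_by and the sample path list are hypothetical and only model plain prefix matching as described in the original robots.txt convention; real crawlers may add their own extensions.

```python
def blocked_by(rule_path: str, url_path: str) -> bool:
    """Plain prefix matching: a rule blocks every path that starts with it."""
    return url_path.startswith(rule_path)


# Sample paths, taken from the examples in this comment plus one unrelated node.
paths = [
    "/user",
    "/user/1",
    "/user/login",
    "/users",
    "/user-stories",
    "/user-meeting-2015",
    "/node/1",
]

# Everything except /node/1 starts with "/user" and is therefore blocked.
blocked = [p for p in paths if blocked_by("/user", p)]
print(blocked)
```

Only /node/1 survives, so a bare Disallow: /user also hides profile pages and any alias that happens to begin with "user".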

Comment #10

iantresman commented 11 October 2015 at 13:31

I think that the Google robots.txt spec is good for Google, Bing, Yahoo, and Ask, which I would argue account for a significant proportion of web robots, and is more important than other interpretations. There does not appear to be an official standard.

Blocking /user does seem to have its drawbacks. There seem to me to be three solutions:

Change the path /user to /user.htm (or some other unique URL)
Use the Google (Bing, Yahoo, and Ask) standard, e.g. use Disallow: /user$
Specify the user-agent

Any of these solutions is better than not implementing any of them! Frankly, I don't care about other robots.
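
The second option above relies on the wildcard extensions that Google documents: * matches any run of characters and a trailing $ anchors the rule to the end of the URL. The following is a rough sketch of that matching, assuming only those two special characters; the function name google_style_match is made up for illustration, and crawlers that do not implement the extension may treat $ as a literal character.

```python
import re


def google_style_match(pattern: str, path: str) -> bool:
    """Match a path against a rule using '*' wildcards and an optional trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # An anchored rule must cover the whole path; otherwise a prefix match is enough.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(google_style_match("/user$", "/user"))    # True:  exact page is blocked
print(google_style_match("/user$", "/users"))   # False: other "user..." paths stay crawlable
print(google_style_match("/user$", "/user/1"))  # False: profile pages stay crawlable
print(google_style_match("/user", "/users"))    # True:  unanchored rule is a prefix match
```

Under these assumptions, Disallow: /user$ blocks only the /user page itself and leaves /users, /user-stories and /user/1 untouched.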

Comment #11

catch commented 18 November 2015 at 10:27

Version:	8.1.x-dev	» 8.0.x-dev
Priority:	Major	» Normal

This could probably be changed in a patch release, unless we add dynamic robots.txt which is covered by other issues.

Comment #12

18 November 2015 at 10:27

Version:

8.0.x-dev

» 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #13

18 November 2015 at 10:27

Version:

8.1.x-dev

» 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #14

18 November 2015 at 10:27

Version:

8.2.x-dev

» 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #15

spoit commented 24 March 2017 at 11:51

Regardless of specs, I feel that at least these should be added

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /user/logout

Disallow: /index.php/user/password
Disallow: /index.php/user/register
Disallow: /index.php/user/login
Disallow: /index.php/user/logout

As it is now, Google as well as Bing are indexing those. It is true that disallowing /user/register also disallows every URL beneath it, such as /user/register/for-real, but I feel it is then up to the developer to adjust robots.txt if needed.

Comment #16

spoit commented 24 March 2017 at 11:59

Status	File	Size
new	robots_paths_incorrect-2581637-16.patch	717 bytes

Comment #17

spoit commented 24 March 2017 at 12:01

Status:

Active

» Needs review

Comment #18

24 March 2017 at 12:01

Version:

8.3.x-dev

» 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #19

24 March 2017 at 12:01

Version:

8.4.x-dev

» 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #20

24 March 2017 at 12:01

Version:

8.5.x-dev

» 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #21

philsward commented 15 January 2019 at 06:21

Comment #22

cilefen commented 1 July 2019 at 14:24

Status:

Needs review

» Closed (duplicate)

Yes, this seems to be a duplicate of #180379: Fix path matching in robots.txt.
