Problem/Motivation

This is also an issue on Drupal 6 and Drupal 7.

Some of the paths in robots.txt appear to be incorrect, which results in some pages being indexed when they should not be. They all currently have a trailing /, which I believe is incorrect and which the Google robots.txt Tester indicates is not working as expected (i.e. it reports pages such as /user/password as "allowed" rather than "blocked").

Disallow: /node/add/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
Disallow: /index.php/node/add/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
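
The mismatch can be reproduced with a minimal sketch of prefix matching, where a Disallow rule applies to any URL path that starts with the rule's value. The function name is_blocked is hypothetical, and real crawlers may normalise paths differently, so treat this as an illustration rather than a description of any particular bot.

```python
def is_blocked(path, disallow_rules):
    """Return True if any non-empty Disallow value is a prefix of the URL path."""
    return any(path.startswith(rule) for rule in disallow_rules if rule)


# Rule as currently shipped (trailing slash) versus the proposed form.
current = ["/user/password/"]
proposed = ["/user/password"]

print(is_blocked("/user/password", current))   # False: the page itself is not matched
print(is_blocked("/user/password", proposed))  # True
```

With the trailing slash, only paths below /user/password/ are matched, which is consistent with the Tester reporting /user/password as "allowed".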

Related to this, the robots.txt file includes the following lines in its header, which point to a source that I don't think is definitive:

# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

Proposed resolution

  1. Remove the trailing / from these paths, so they specify pages, and not directories.
  2. Add Disallow: /user (no trailing slash)
  3. Change in the header:
    From: "http://www.robotstxt.org/robotstxt.html"
    To: "https://developers.google.com/webmasters/control-crawl-index/docs/robots..."

Remaining tasks

This should be reviewed by someone more familiar with this matter than I am!

User interface changes

None.

API changes

None.

Data model changes

None.

Original report by [username]

n/a


Comments

iantresman created an issue. See original summary.

iantresman’s picture

firoz2456’s picture

Status: Active » Closed (duplicate)
iantresman’s picture

@firoz2456
Can you mention which issue this is a duplicate of? The other relevant issues are different.

firoz2456’s picture

I think this issue is similar to https://www.drupal.org/node/1032234

iantresman’s picture

I think I would argue that that is a different issue. This one is trivially fixable, and doesn't require any changes of functionality. I agree that in the long term, using meta tags is a better solution, but until then, doesn't it make sense to fix the leak, before replacing the roof?

firoz2456’s picture

Status: Closed (duplicate) » Active

Agreed. Re-opening this issue.

iantresman’s picture

Issue summary: View changes
no2e’s picture

About changing the header

We should not link to https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en. It is Google’s own interpretation of the robots.txt standard, and it includes extensions that are not part of the original spec (and therefore not supported by all bots).

If consumer documentation is considered important, maybe we could link to a documentation page on drupal.org instead? On this page we could provide all related links: to the original specification, to documentation on how search engines and other services parse it, to extensions, etc.

About blocking /user

Having Disallow: /user could block various user-created URLs, because it matches any path that starts with /user, such as:

  • http://example.com/users
  • http://example.com/user-stories
  • http://example.com/user-meeting-2015
  • etc.

Not to mention the default user profile URLs (/user/1, etc., unless they have changed with D8?) which many sites may want to get crawled.
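
The over-blocking described above can be sketched as follows, assuming plain prefix matching. The helper blocked_paths and the sample path list are hypothetical and only serve to illustrate the point.

```python
def blocked_paths(rule, paths):
    """Return the paths that a single Disallow value matches by prefix."""
    return [p for p in paths if p.startswith(rule)]


# Sample paths, using the example URLs from this comment plus two others.
paths = ["/user", "/user/1", "/users", "/user-stories", "/user-meeting-2015", "/node/1"]

print(blocked_paths("/user", paths))
# ['/user', '/user/1', '/users', '/user-stories', '/user-meeting-2015']
print(blocked_paths("/user/", paths))
# ['/user/1']
```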

iantresman’s picture

I think that the Google robots.txt spec is good for Google, Bing, Yahoo, and Ask, which I would argue make up a significant proportion of web robots, and that it is more important than other interpretations. There does not appear to be an official standard.

Blocking /user does seem to have its drawbacks. There seem to me to be three solutions:

  1. Change the path /user to /user.htm (or some other unique URL)
  2. Use the Google (Bing, Yahoo, and Ask) standard, eg. use disallow: /user$
  3. Specify the user-agent

Any of these solutions is better than not implementing any of them! Frankly, I don't care about other robots.
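
To illustrate option 2, here is a rough sketch of matching with the * wildcard and the trailing $ anchor that Google documents for robots.txt. The function google_style_match is a hypothetical name and a simplified approximation, not Google's implementation, and bots that follow only the original convention would not interpret $ this way.

```python
import re


def google_style_match(pattern, path):
    """Match a path against a rule using '*' (any characters) and a trailing '$' (end of path)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the rule literally, then turn the escaped '*' back into a wildcard.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, which gives prefix matching when not anchored at the end.
    return re.match(regex, path) is not None


for path in ["/user", "/users", "/user/1", "/user-stories"]:
    print(path, google_style_match("/user$", path), google_style_match("/user", path))
```

In this sketch, /user$ matches only /user itself, while the unanchored /user matches all four sample paths.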

catch’s picture

Version: 8.1.x-dev » 8.0.x-dev
Priority: Major » Normal

This could probably be changed in a patch release, unless we add dynamic robots.txt which is covered by other issues.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

spoit’s picture

Regardless of specs, I feel that at least these should be added:

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /user/logout

Disallow: /index.php/user/password
Disallow: /index.php/user/register
Disallow: /index.php/user/login
Disallow: /index.php/user/logout

As it is now, Google as well as Bing are indexing those. It is true that disallowing /user/register will also disallow every URL below it, such as /user/register/for-real, but I feel it's then up to the developer to adjust robots.txt if needed.
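
Since each rule above appears twice, once for clean URLs and once under /index.php, the list could be generated rather than maintained by hand. This is only a sketch; disallow_lines is a hypothetical helper and not part of Drupal.

```python
def disallow_lines(paths, prefixes=("", "/index.php")):
    """Build one Disallow line per path under each URL prefix."""
    return [f"Disallow: {prefix}{path}" for prefix in prefixes for path in paths]


paths = ["/user/register", "/user/password", "/user/login", "/user/logout"]
print("\n".join(disallow_lines(paths)))
```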

spoit’s picture

spoit’s picture

Status: Active » Needs review

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

philsward’s picture

cilefen’s picture

Status: Needs review » Closed (duplicate)

Yes, this seems to be a duplicate of #180379: Fix path matching in robots.txt.