Problem/Motivation

The log-in pages for my Drupal sites are being indexed by Google. I'm not sure whether this is due to recent changes in how Google handles the robots.txt file or in how it crawls sites, but it seems I am not alone in this problem: https://www.drupal.org/forum/general/general-discussion/2012-06-27/is-it...

Steps to reproduce

1) Google the domain name of one of your live sites.
2) See if Google lists the Log In page among the pages shown beneath the search result for the home page.

Proposed resolution

Remove the trailing slashes from the paths in the robots.txt file (or add alternatives that don't include the trailing slashes).

From How Google interprets the robots.txt specification > URL matching based on path values:

Example path matches

/

Matches the root and any lower level URL.

/*

Equivalent to /. The trailing wildcard is ignored.

/$

Matches only the root. Any lower level URL is allowed for crawling.

/fish

Matches any path that starts with /fish. Note that the matching is case-sensitive.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn't match:

  • /Fish.asp
  • /catfish
  • /?id=fish
  • /desert/fish

/fish*

Equivalent to /fish. The trailing wildcard is ignored.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn't match:

  • /Fish.asp
  • /catfish
  • /?id=fish
  • /desert/fish

/fish/

Matches anything in the /fish/ folder.

Matches:

  • /fish/
  • /fish/?id=anything
  • /fish/salmon.htm

Doesn't match:

  • /fish
  • /fish.html
  • /animals/fish/
  • /Fish/Salmon.asp

/*.php

Matches any path that contains .php.

Matches:

  • /index.php
  • /filename.php
  • /folder/filename.php
  • /folder/filename.php?parameters
  • /folder/any.php.file.html
  • /filename.php/

Doesn't match:

  • / (even if it maps to /index.php)
  • /windows.PHP

/*.php$

Matches any path that ends with .php.

Matches:

  • /filename.php
  • /folder/filename.php

Doesn't match:

  • /filename.php?parameters
  • /filename.php/
  • /filename.php5
  • /windows.PHP

/fish*.php

Matches any path that contains /fish and .php, in that order.

Matches:

  • /fish.php
  • /fishheads/catfish.php?parameters

Doesn't match:

  • /Fish.PHP
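
The matching rules quoted above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Google's implementation: rule_matches is a hypothetical helper name, and it assumes that * matches any run of characters, that a trailing $ anchors the end of the path, and that everything else is a case-sensitive "starts with" match.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Hypothetical helper: a simplified reading of the rules quoted above,
    not Google's actual matcher.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; every other character is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, which gives the "starts with" behaviour.
    return re.match(regex, path, flags=re.DOTALL) is not None


# Rows taken from the examples quoted above.
assert rule_matches("/fish", "/fish/salmon.html")
assert rule_matches("/fish", "/fishheads")
assert not rule_matches("/fish", "/catfish")
assert not rule_matches("/fish/", "/fish")
assert rule_matches("/fish/", "/fish/salmon.htm")
assert rule_matches("/*.php", "/folder/any.php.file.html")
assert not rule_matches("/*.php$", "/filename.php?parameters")
assert not rule_matches("/fish*.php", "/Fish.PHP")
```

The row that matters for this issue is /fish/ versus /fish: a rule with a trailing slash does not match the same path without the slash.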

Remaining tasks

Update the robots.txt file.

User interface changes

none

API changes

none

Data model changes

none

Release notes snippet

TBD

Comments

jenlampton created an issue. See original summary.

jenlampton’s picture

In the "Paths" section of our robots.txt file we have a mix of URLs, some written like paths and some like directories. Most of them end in a trailing slash, like a directory. Only filter/tips has no trailing slash, like a path.

Since removing the trailing slash seems to keep the pages out of the Google index, I wonder if adding the trailing slash tells Google "block all pages in this directory" without including the parent page, that is, the URL without the trailing slash.

Since this is my leading theory, I've added a new section to the robots.txt file called Pseudo-Directory. It includes all the items we want treated as though they were directories (all paths below them blocked), and I've moved the items with trailing slashes there.

I've also removed the trailing slash from pages that we do not want treated as directories, where we want only the path itself blocked from Google. Where the parent path is also a page we want blocked, I've duplicated the entry from the Pseudo-Directory section into the Paths section.
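
The theory can be checked locally with Python's standard-library urllib.robotparser. This is a sketch under stated assumptions: that parser does plain prefix matching and is not Google's crawler, is_crawlable is a hypothetical helper name, and /search and /search/node are used purely as illustrative paths.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(rules: str, path: str) -> bool:
    """Hypothetical helper: parse robots.txt text and test one path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)


with_slash = "User-agent: *\nDisallow: /search/\n"
without_slash = "User-agent: *\nDisallow: /search\n"

# With the trailing slash, only paths below /search/ are blocked.
assert is_crawlable(with_slash, "/search")
assert not is_crawlable(with_slash, "/search/node")

# Without the trailing slash, the page and everything below it are blocked.
assert not is_crawlable(without_slash, "/search")
assert not is_crawlable(without_slash, "/search/node")
```

Under this parser, at least, the trailing slash leaves the parent page crawlable, which is consistent with the theory.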

Patches for review.

jenlampton’s picture

Status: Active » Needs review

Forgot to change the issue status to Needs review (NR).

The last submitted patch, 2: core-robots_update-3167542-3.patch, failed testing.

chi’s picture

larowlan’s picture

Status: Needs review » Closed (duplicate)
Issue tags: +Bug Smash Initiative

chi’s picture

This still needs D7 backport.

douggreen’s picture

Version: 8.8.x-dev » 9.1.x-dev
Component: other » base system
Status: Closed (duplicate) » Needs review
Attached file: status new, size 1.45 KB

I'm re-opening because #3123285: Actually exclude user register, login, logout, and password pages from search results in robots.txt (current rules are broken) addressed only the /user/ routes. We still have the admin, search, and node/add routes, which are both routes in their own right and prefixes of longer paths, and thus should be listed both with and without a trailing slash.

douggreen’s picture

Let's see if a new comment results in testing against the right branch. (I wonder if the testing system didn't recognize that the branch changed in the same comment that had a new patch)

quietone’s picture

Starting a test for 9.1.x

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

anybody’s picture

Re-running test against 8.9.x. This should be committed to 9.x and 8.x.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

rescandon’s picture

Status: Needs review » Needs work

The patch does not apply to 9.3.x-dev:

- Installing drupal/core (9.3.x-dev e632ada): Cloning e632ada52e from cache
- Applying patches for drupal/core
https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch (Removing trailing slashes from robots.txt)
Could not apply patch! Skipping. The error was: Cannot apply patch https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch

[Exception]
Cannot apply patch Removing trailing slashes from robots.txt (https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch)!

dhirendra.mishra’s picture

Attached file: status new, size 227.99 KB

The patch from #8 applies on 9.3.x.

Please find screenshot below:

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

abhaypai’s picture

Landed here from the Bug Smash Initiative.

+1. The patch from #8 also applies successfully on 11.x-dev. Starting a test run for 11.x-dev.

anybody’s picture

Tests are failing for #8. @abhaypai could you perhaps turn this into a working MR against 11.x?

Prashant.c made their first commit to this issue’s fork.

prashant.c’s picture

Status: Needs work » Needs review

@Anybody
Created MR against 11.x by taking the changes from #8.

Thank you.

anybody’s picture

Status: Needs review » Reviewed & tested by the community

Thanks @Prashant.c! All tests are passing green, so I think we should get this out of the way! Marking this RTBC.

poker10’s picture

Status: Reviewed & tested by the community » Needs review

According to the Google robots.txt description here: https://developers.google.com/search/docs/crawling-indexing/robots/robot..., the rule /search should match any path that starts with /search. So I do not think we need both /search and /search/ (and the same goes for the second change). The rules in robots.txt are "starts with" rules.

The second question is: we have /admin/ in robots.txt, and the /admin path is valid as well, so why are we not changing this too?
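
The redundancy argument can be sketched as a small pruning step. This is a hedged illustration only: drop_redundant_prefixes is a hypothetical helper, the rule list is made up from paths mentioned in this issue, and it assumes plain "starts with" rules with no * or $ in them.

```python
def drop_redundant_prefixes(paths: list[str]) -> list[str]:
    """Remove prefixes that a shorter prefix in the list already covers."""
    kept: list[str] = []
    # Visit shorter rules first so each longer rule is checked against them.
    for path in sorted(set(paths), key=len):
        if not any(path.startswith(shorter) for shorter in kept):
            kept.append(path)
    return sorted(kept)


# Illustrative rule list, not the actual contents of core's robots.txt.
rules = ["/search", "/search/", "/admin/", "/node/add", "/node/add/"]
pruned = drop_redundant_prefixes(rules)
assert pruned == ["/admin/", "/node/add", "/search"]
```

Note that /admin/ survives because nothing shorter covers it, so /admin itself stays crawlable until that slash is removed too. The trade-off is that a slashless rule such as /search is broader: it also matches any unrelated path that merely starts with those characters, for example /searching.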

smustgrave’s picture

Status: Needs review » Needs work

Can the issue summary document which paths were chosen and why?

ressa made their first commit to this issue’s fork.

ressa’s picture

Title: Removing trailing slashes from robots.txt » Remove trailing slashes from robots.txt

I agree @poker10, we should remove all trailing slashes, and I have updated the MR to reflect this.

I have added the path-matching table from How Google interprets the robots.txt specification > URL matching based on path values to the Issue Summary. Is that sufficient documentation @smustgrave, or do we need some more?

PS. Personally, I would add Disallow: /node, since:

  1. The vast majority of sites use Pathauto, installs January 2025:

      Drupal core:	723,408
      Pathauto:	514,780

    From https://www.drupal.org/project/usage

  2. Getting paths such as /node/100 indexed instead of the human readable URL alias /my-alias is bad for SEO ...

... but that's for another issue :)

ressa’s picture

I created an issue about disallowing /node.

ressa’s picture

Issue summary: View changes

The table outlining the rules, which I wanted to add to the Issue Summary, got lost. It has now actually been added.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.