Remove trailing slashes from robots.txt [#3167542]

Problem/Motivation

The login pages for my Drupal sites are being indexed by Google. I'm not sure whether this is due to recent changes in how Google handles the robots.txt file or in how it spiders sites, but it seems I am not alone in this problem: https://www.drupal.org/forum/general/general-discussion/2012-06-27/is-it...

Steps to reproduce

1) Google the domain name of one of your live sites.
2) Check whether Google lists the Log In page among the pages shown beneath the search result for the home page.

Proposed resolution

Remove trailing slashes from the robots.txt file (~~or, add alternatives that don't include the trailing slashes~~)

From How Google interprets the robots.txt specification > URL matching based on path values:

Example path matches
`/`	Matches the root and any lower level URL.
`/*`	Equivalent to `/`. The trailing wildcard is ignored.
`/$`	Matches only the root. Any lower level URL is allowed for crawling.
`/fish`	Matches any path that starts with `/fish`. Note that the matching is case-sensitive. Matches: `/fish` `/fish.html` `/fish/salmon.html` `/fishheads` `/fishheads/yummy.html` `/fish.php?id=anything` Doesn't match: `/Fish.asp` `/catfish` `/?id=fish` `/desert/fish`
`/fish*`	Equivalent to `/fish`. The trailing wildcard is ignored. Matches: `/fish` `/fish.html` `/fish/salmon.html` `/fishheads` `/fishheads/yummy.html` `/fish.php?id=anything` Doesn't match: `/Fish.asp` `/catfish` `/?id=fish` `/desert/fish`
`/fish/`	Matches anything in the `/fish/` folder. Matches: `/fish/` `/fish/?id=anything` `/fish/salmon.htm` Doesn't match: `/fish` `/fish.html` `/animals/fish/` `/Fish/Salmon.asp`
`/*.php`	Matches any path that contains `.php`. Matches: `/index.php` `/filename.php` `/folder/filename.php` `/folder/filename.php?parameters` `/folder/any.php.file.html` `/filename.php/` Doesn't match: `/` (even if it maps to /index.php) `/windows.PHP`
`/*.php$`	Matches any path that ends with `.php`. Matches: `/filename.php` `/folder/filename.php` Doesn't match: `/filename.php?parameters` `/filename.php/` `/filename.php5` `/windows.PHP`
`/fish*.php`	Matches any path that contains `/fish` and `.php`, in that order. Matches: `/fish.php` `/fishheads/catfish.php?parameters` Doesn't match: `/Fish.PHP`
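
The rows above can be checked with a small matcher. This is a minimal sketch that assumes the simplified model described in the table (case-sensitive prefix matching, `*` as a wildcard, a trailing `$` as an end anchor); the function name `rule_matches` is hypothetical and this is not Google's actual implementation.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern covers the given URL path.

    Simplified model of the table above: patterns match from the start of
    the path, '*' matches any run of characters, and a trailing '$' pins
    the match to the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start only, which gives "starts with" semantics.
    return re.match(regex, path) is not None


# The case this issue is about: a trailing slash excludes the parent page.
print(rule_matches("/fish/", "/fish"))           # rule with trailing slash
print(rule_matches("/fish", "/fish"))            # rule without trailing slash
print(rule_matches("/fish", "/fish/salmon.html"))
```

Under this model, `/fish/` does not cover `/fish` itself, while `/fish` covers both the page and everything below it.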

Remaining tasks

update the robots.txt file

User interface changes

none

API changes

none

Data model changes

none

Release notes snippet

TBD

Comment	File	Size	Author
#15	Screenshot from 2021-07-22 18-02-33.png	227.99 KB	dhirendra.mishra
#8	robots-txt-3167542-8.patch	1.45 KB	douggreen
#2	core-robots_update-3167542-3-D7-do-not-test.patch	1.59 KB	jenlampton
#2	core-robots_update-3167542-3.patch	1.37 KB	jenlampton

Issue fork drupal-3167542

3167542-removing-trailing-slashes changes, plain diff MR !5878

Comments

Comment #1

26 August 2020 at 18:55

jenlampton created an issue. See original summary.

Comment #2

jenlampton

she / her

commented 26 August 2020 at 19:21

Status	File	Size
new	core-robots_update-3167542-3.patch	1.37 KB
new	core-robots_update-3167542-3-D7-do-not-test.patch	1.59 KB

In our robots file we have a mix of URLs that look like both paths and directories in the "Paths" section -- most of these end in a trailing slash, like a directory. Only filter/tips does not include a trailing slash, like a path.

Since removing the trailing slash seems to work for keeping the pages out of the Google index, I wonder if adding the trailing slash tells Google "Block all pages in this directory" but does not cover the parent page, i.e. the URL without the trailing slash.

Since this is my leading theory, I've added a new section to the robots page called Psuedo-Directory that includes all the items we want treated as though they were directories (all paths below blocked) and moved the items with trailing slashes here.

I've also removed the trailing slash from pages that we do not want treated as directories, where we want only the path itself blocked from google. I've added duplicates into the Path section from the Psuedo-Directory section, where the parent path is also a page we want blocked.

Patches for review.

Comment #3

jenlampton

she / her

commented 26 August 2020 at 19:21

Status:

Active

» Needs review

forgot to change issue status: NR.

Comment #4

26 August 2020 at 20:39

The last submitted patch, 2: core-robots_update-3167542-3.patch, failed testing. View results

Comment #5

chi commented 27 August 2020 at 17:24

That was fixed just yesterday.

Comment #6

larowlan

🇦🇺🏝.au GMT+10

commented 28 August 2020 at 04:35

Status:	Needs review	» Closed (duplicate)
Issue tags:		+Bug Smash Initiative

Comment #7

chi commented 28 August 2020 at 07:33

This still needs a D7 backport.

Comment #8

douggreen commented 15 September 2020 at 19:09

Version:	8.8.x-dev	» 9.1.x-dev
Component:	other	» base system
Status:	Closed (duplicate)	» Needs review

Status	File	Size
new	robots-txt-3167542-8.patch	1.45 KB

2 files were hidden/shown/deleted

Status	File	Size
hidden	core-robots_update-3167542-3.patch	1.37 KB
hidden	core-robots_update-3167542-3-D7-do-not-test.patch	1.59 KB

I'm re-opening because #3123285: Actually exclude user register, login, logout, and password pages from search results in robots.txt (current rules are broken) addressed the /user/ routes. But we still have the admin, search, and node/add routes which are routes by themselves as well as leading paths, and thus should be listed with both a slash and without one.

Comment #9

douggreen commented 22 September 2020 at 13:09

Let's see if a new comment results in testing against the right branch. (I wonder if the testing system didn't recognize that the branch changed in the same comment that had a new patch)

Comment #10

quietone commented 29 September 2020 at 02:38

Starting a test for 9.1.x

Comment #11

29 September 2020 at 02:38

Version:

9.1.x-dev

» 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Comment #12

anybody

German

Porta Westfalica

commented 9 March 2021 at 08:53

Re-running test against 8.9.x. This should be committed to 9.x and 8.x.

Comment #13

9 March 2021 at 08:53

Version:

9.2.x-dev

» 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #14

rescandon commented 23 June 2021 at 23:00

Status:

Needs review

» Needs work

The patch is not working for 9.3.x-dev

- Installing drupal/core (9.3.x-dev e632ada): Cloning e632ada52e from cache
- Applying patches for drupal/core
https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch (Removing trailing slashes from robots.txt)
Could not apply patch! Skipping. The error was: Cannot apply patch https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch

[Exception]
Cannot apply patch Removing trailing slashes from robots.txt (https://www.drupal.org/files/issues/2020-09-15/robots-txt-3167542-8.patch)!

Comment #15

dhirendra.mishra commented 22 July 2021 at 12:33

Status	File	Size
new	Screenshot from 2021-07-22 18-02-33.png	227.99 KB

Patch from #8 gets applied on 9.3.x

Please find screenshot below:

Comment #16

22 July 2021 at 12:33

Version:

9.3.x-dev

» 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #17

22 July 2021 at 12:33

Version:

9.4.x-dev

» 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #18

22 July 2021 at 12:33

Version:

9.5.x-dev

» 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #19

22 July 2021 at 12:33

Version:

10.1.x-dev

» 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #20

abhaypai commented 19 December 2023 at 08:44

Landed here from Bug smash initiative.

+1. The patch from #8 also applies successfully to 11.x-dev; starting a test run for 11.x-dev.

Comment #21

anybody

German

Porta Westfalica

commented 19 December 2023 at 14:56

Tests are failing for #8. @abhaypai could you perhaps turn this into a working MR against 11.x?

Comment #22

19 December 2023 at 15:01

Prashant.c made their first commit to this issue’s fork.

Comment #23

19 December 2023 at 15:12

Prashant.c opened merge request !5878

Comment #24

prashant.c

he/him

English

Dharamshala

commented 19 December 2023 at 15:13

Status:

Needs work

» Needs review

@Anybody
Created MR against 11.x by taking the changes from #8.

Thank you.

Comment #25

anybody

German

Porta Westfalica

commented 19 December 2023 at 16:18

Status:

Needs review

» Reviewed & tested by the community

Thanks @Prashant.c! All tests are passing green, so I think we should get this out of the way! Marking this RTBC.

Comment #26

poker10 commented 19 December 2023 at 20:43

Status:

Reviewed & tested by the community

» Needs review

According to the Google robots.txt description here: https://developers.google.com/search/docs/crawling-indexing/robots/robot... , the rule /search should match any path that starts with /search. So I do not think we need both /search and /search/ (and similarly for the second change). The rules in robots.txt are "starts with" rules.

The second question: we have /admin/ in robots.txt. The /admin path is valid as well, so why are we not changing this too?
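
For illustration, the "starts with" behaviour can be tried out with Python's standard-library `urllib.robotparser`. This is only a sketch: that parser is not Google's crawler, the helper name `allowed` is hypothetical, and the two rule sets are reduced examples rather than the actual Drupal robots.txt.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str) -> bool:
    """Return True if a generic crawler may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)


# Reduced example rule sets (assumptions, not the shipped robots.txt).
WITH_SLASH = "User-agent: *\nDisallow: /search/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /search\n"

for path in ("/search", "/search/node", "/searching-tips"):
    print(path, allowed(WITH_SLASH, path), allowed(WITHOUT_SLASH, path))
```

With this parser, `Disallow: /search/` leaves `/search` itself crawlable, while `Disallow: /search` blocks `/search` and everything below it. It also blocks unrelated paths that merely share the prefix, such as the made-up `/searching-tips`, which is worth keeping in mind when choosing which slashes to drop.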

Comment #27

smustgrave commented 19 December 2023 at 21:26

Status:

Needs review

» Needs work

Can it be documented in the issue summary which paths were chosen and why?

Comment #28

26 January 2025 at 12:41

ressa made their first commit to this issue’s fork.

Comment #29

ressa

he/him

commented 26 January 2025 at 13:01

Title:

Removing trailing slashes from robots.txt

» Remove trailing slashes from robots.txt

I agree @poker10, we should remove all trailing slashes, and I have updated the MR to reflect this.

I have added the table below in the Issue Summary, is that sufficient documentation @smustgrave, or do we need some more?

From How Google interprets the robots.txt specification > URL matching based on path values:

Example path matches
`/`	Matches the root and any lower level URL.
`/*`	Equivalent to `/`. The trailing wildcard is ignored.
`/$`	Matches only the root. Any lower level URL is allowed for crawling.
`/fish`	Matches any path that starts with `/fish`. Note that the matching is case-sensitive. Matches: `/fish` `/fish.html` `/fish/salmon.html` `/fishheads` `/fishheads/yummy.html` `/fish.php?id=anything` Doesn't match: `/Fish.asp` `/catfish` `/?id=fish` `/desert/fish`
`/fish*`	Equivalent to `/fish`. The trailing wildcard is ignored. Matches: `/fish` `/fish.html` `/fish/salmon.html` `/fishheads` `/fishheads/yummy.html` `/fish.php?id=anything` Doesn't match: `/Fish.asp` `/catfish` `/?id=fish` `/desert/fish`
`/fish/`	Matches anything in the `/fish/` folder. Matches: `/fish/` `/fish/?id=anything` `/fish/salmon.htm` Doesn't match: `/fish` `/fish.html` `/animals/fish/` `/Fish/Salmon.asp`
`/*.php`	Matches any path that contains `.php`. Matches: `/index.php` `/filename.php` `/folder/filename.php` `/folder/filename.php?parameters` `/folder/any.php.file.html` `/filename.php/` Doesn't match: `/` (even if it maps to /index.php) `/windows.PHP`
`/*.php$`	Matches any path that ends with `.php`. Matches: `/filename.php` `/folder/filename.php` Doesn't match: `/filename.php?parameters` `/filename.php/` `/filename.php5` `/windows.PHP`
`/fish*.php`	Matches any path that contains `/fish` and `.php`, in that order. Matches: `/fish.php` `/fishheads/catfish.php?parameters` Doesn't match: `/Fish.PHP`

PS. Personally, I would add Disallow: /node, since:

The vast majority of sites use Pathauto, installs January 2025:
```
  Drupal core:	723,408
  Pathauto:	514,780
```
From https://www.drupal.org/project/usage
Getting paths such as /node/100 indexed instead of the human readable URL alias /my-alias is bad for SEO ...

... but that's for another issue :)

Comment #30

ressa

he/him

commented 26 January 2025 at 13:16

I created an issue about disallowing /node.

Comment #31

ressa

he/him

commented 26 January 2025 at 14:09

Issue summary:

View changes

The table outlining the rules that I wanted to add to the Issue Summary got lost; now it's actually added.

Comment #32

26 January 2025 at 14:09

Version:

11.x-dev

» main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.
