Disallow crawling paths under /node by default in robots.txt [#3502360]

Problem/Motivation

A large majority of sites (roughly 71%) use Pathauto, as shown by the January 2025 install counts:
```
Drupal core: 723,408
Pathauto:    514,780
```
Source: https://www.drupal.org/project/usage

Getting a path such as /node/100 indexed instead of the human-readable URL alias /my-alias is bad for SEO: you want the human-readable path indexed in the first place, not the /node/NID path, even if it gets redirected.

Therefore, it makes sense to disallow all paths under /node from getting crawled by default.

Some sites may have reasons to allow paths under /node to be crawled, but they are the minority, and they can edit robots.txt to allow this with the RobotsTxt module (https://www.drupal.org/project/robotstxt).

Steps to reproduce

Search for the site in a search engine and observe that paths such as /node/100 are indexed instead of the intended human-readable URL alias such as /my-alias, harming SEO.

Proposed resolution

Disallow all paths under /node from getting crawled by default.

Remaining tasks

Update the robots.txt file

Workaround

Until robots.txt gets updated, add this to composer.json:

"extra": {
    "drupal-scaffold": {
        "locations": {
            "web-root": "web/"
        },
        "file-mapping": {
            "[web-root]/robots.txt": {
                "append": "assets/my-robots-additions.txt"
            }
        }
    }
}

Add a file assets/my-robots-additions.txt (relative to the project root) with this content:

# Do not crawl any nodes
Disallow: /node

See Using Drupal's Composer Scaffold > Altering scaffold files.
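
To check what the appended rule would do, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` string is a hypothetical, trimmed stand-in for the scaffolded file (not Drupal's actual robots.txt), and `is_crawlable` is an illustrative helper name, not part of any Drupal tooling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a trimmed stand-in for the scaffolded file with the
# workaround's two lines appended. No blank line precedes the appended rule,
# because urllib.robotparser treats a blank line as the end of a rule group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
# Do not crawl any nodes
Disallow: /node
"""


def is_crawlable(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler ("*") may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    for path in ("/node/100", "/my-alias", "/nodejs-guide"):
        print(path, is_crawlable(path))
```

Under these assumptions, /node/100 is blocked and /my-alias stays crawlable. Because robots.txt rules are prefix matches, an alias that merely starts with /node (the made-up /nodejs-guide here) is blocked as well, which may be worth checking before applying the workaround.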

User interface changes

none

API changes

none

Data model changes

none

Release notes snippet

TBD

Issue fork drupal-3502360

3502360-disallow-crawling-paths (MR !11008)

Comments

Comment #1

26 January 2025 at 13:10

ressa created an issue. See original summary.

Comment #2

26 January 2025 at 13:12

ressa opened merge request !11008

Comment #3

ressa

he/him

commented 26 January 2025 at 13:13

Assigned:	ressa	» Unassigned
Category:	Task	» Feature request
Status:	Active	» Needs review
Parent issue:	#3167542: Remove trailing slashes from robots.txt	»
Related issues:		+#3167542: Remove trailing slashes from robots.txt

Comment #4

ressa

he/him

commented 26 January 2025 at 13:13

Category:	Feature request	» Task

Comment #5

ressa

he/him

commented 26 January 2025 at 13:14

Issue tags:	-Needs backport to D7, -Bug Smash Initiative

Comment #6

ressa

he/him

commented 26 January 2025 at 13:15

Comment #7

ressa

he/him

commented 26 January 2025 at 14:16

Issue summary:

View changes

Add Workaround in Issue Summary.

Comment #8

cilefen commented 26 January 2025 at 14:38

Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

Because using Redirect to maintain canonical URLs, that is, <link rel="canonical" />, works, I wonder whether it is ideal not to index /node paths by default.

These are just my first thoughts after a few minutes' consideration.

Comment #9

ressa

he/him

commented 26 January 2025 at 14:48

Issue summary:

View changes

Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

Thanks for weighing in. While that's totally true, it is not my focus here, though I can see that the description of the problem in the Issue Summary was a bit too vague.

The point is that in most cases, you want the human-readable path indexed in the first place, not the node/NID path -- even if it gets redirected.

I have updated the Issue Summary to make this point clearer.

Comment #10

cilefen commented 26 January 2025 at 15:21

Also note:

If you have multiple pages that have the same information, try setting up a redirect from non-preferred URLs to a URL that best represents that information. If you can't redirect, use the rel="canonical" link element instead. But again, don't worry too much about this; search engines can generally figure this out for you on their own most of the time.

https://developers.google.com/search/docs/fundamentals/seo-starter-guide...

Comment #11

ressa

he/him

commented 26 January 2025 at 15:58

Sure, and the Redirect module can take care of that, as far as I see, should a /node/100 path get exposed and indexed by mistake ...

Or did you have another point in sharing that passage?

Again, the aim of this MR is to get the correct alias indexed on the first shot, by blocking /node/100 from getting indexed in the first place.

Comment #12

cilefen commented 26 January 2025 at 16:26

In my experience, that article is correct: search engines respect Core's rel="canonical", with or without Redirect. I am trying to understand the downside of having a /node URL indexed to which the author later adds a path alias, which the search engines then accept.

Does the opposite ever happen?

Comment #13

ressa

he/him

commented 26 January 2025 at 17:25

I created this issue because /node/100 style paths got indexed, possibly before they had aliases. In my case, I didn't want that specific content type indexed.

But what would be the downside to doing this? I guess sometimes you do want node/100 paths indexed ... my assumption is just that, whenever you have Pathauto installed, 99% of the time you want to use human-readable paths, and block node/100 paths, but I could be wrong?

Comment #14

cilefen commented 26 January 2025 at 17:59

I don't know, really. It just seems a strong default not to index /node. It's a bit of a singular case but this very website would be largely un-indexed with that default. 🙁

Comment #15

cilefen commented 26 January 2025 at 18:01

Actually that's not completely true because of the /issues auto path.

Comment #16

ressa

he/him

commented 26 January 2025 at 18:10

I do appreciate getting the tires of the MR kicked, don't get me wrong!

I just think that in the majority of new installations, you do not want node/100 paths indexed. And if that's true, we should make the preferred behaviour the default, non?

Comment #17

smustgrave commented 3 February 2025 at 15:38

Sorry if I'm just repeating something. But I'm thinking of a novice site builder (mom and pop shop). If you don't have pathauto and don't manually set the alias then the URL will be node/123. Think we can all agree that's bad practice and standards. But idk how to vote for this one lol. Robots.txt should be part of core maybe?

Comment #18

ressa

he/him

commented 3 February 2025 at 23:20

Heh, there has been some debate, but I think the gist of it was condensed into the last sentence of comment #16 -- that the majority would benefit from this.

Also, this change would probably not cause any big problems, but rather a theoretical challenge, for a select few.

Comment #19

catch

he/him

English

commented 27 March 2025 at 16:52

Status:	Needs review	» Postponed (maintainer needs more info)

I created this issue because /node/100 style paths got indexed, possibly before they had aliases.

And if this MR went in, and you never added aliases, then they'd never get indexed at all. rel="canonical" should cover this even when a node goes from unaliased to aliased.

Drupal CMS ships with robotstxt module now so I feel like 'default behaviour' is covered there, but we can't just break search indexing in core because people usually install contrib modules.
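
The trade-off described here can be sketched with a deliberately simplified crawler model in Python. Everything in it is an assumption for illustration: `url_to_index`, `CanonicalFinder`, the two robots.txt strings and the example.com markup are hypothetical, and real search engines treat rel="canonical" as a hint rather than a rule.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.robotparser import RobotFileParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_canonical = (attr.get("rel") or "").lower() == "canonical"
        if tag == "link" and is_canonical and self.canonical is None:
            self.canonical = attr.get("href")


def url_to_index(path: str, html: str, robots_txt: str) -> Optional[str]:
    """Simplified model: obey robots.txt, then prefer the canonical URL."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch("*", path):
        # The page is never fetched, so its canonical link is never seen.
        return None
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or path


# Hypothetical inputs, not taken from a real site.
WITHOUT_RULE = "User-agent: *\nDisallow: /admin/\n"
WITH_RULE = WITHOUT_RULE + "Disallow: /node\n"
ALIASED = '<head><link rel="canonical" href="https://example.com/my-alias" /></head>'
UNALIASED = '<head><link rel="canonical" href="https://example.com/node/100" /></head>'

if __name__ == "__main__":
    print(url_to_index("/node/100", ALIASED, WITHOUT_RULE))
    print(url_to_index("/node/100", UNALIASED, WITHOUT_RULE))
    print(url_to_index("/node/100", UNALIASED, WITH_RULE))
```

In this model, without the extra rule an aliased node resolves to its alias via rel="canonical" and an unaliased node is still indexed under /node/100. With the rule, the unaliased node yields nothing to index, which is the case raised in this comment.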

Comment #20

cilefen commented 27 March 2025 at 16:57

This is a "closed (won't fix)" for me.

Comment #21

ressa

he/him

commented 27 March 2025 at 19:33

And if this MR went in, and you never added aliases, then they'd never get indexed at all.

True, but really, how often does this scenario happen -- a web site without aliases, yet with an urgent need for indexing? It seems to me highly theoretical.

Not having node/100 pages indexed would be fine in most cases, and probably preferred. Mine was a case of premature indexing, and I did not want pages with node/100 indexed. Also, how does Drupal CMS shipping with the robotstxt module affect this change? As I still see it, in the majority of cases, you do not want node/100 pages indexed -- as I wrote earlier:

And if that's true, we should make the preferred behaviour the default, non?

I still see this change as largely beneficial in the grand picture, for the majority of use cases. Conversely, if someone REALLY wants node/100 pages indexed, they can easily install the robotstxt module, and correct this.

Comment #22

smustgrave commented 11 September 2025 at 15:32

Wanted to bump 1 more time before closing.

Comment #23

11 September 2025 at 15:32

Version:	11.x-dev	» main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.
