Problem/Motivation

  1. Most sites use Pathauto (roughly 71% of reported Drupal core installs), as seen by installs in January 2025:

    Drupal core: 723,408
    Pathauto:    514,780

    From https://www.drupal.org/project/usage

  2. Getting a path such as /node/100 indexed instead of the human-readable URL alias /my-alias is bad for SEO: you want the human-readable path indexed in the first place, not the /node/NID path, even if the latter gets redirected.

Therefore, it makes sense to disallow all paths under /node from getting crawled by default.

There may be reasons why a site wants to allow paths under /node to get crawled, but such sites are in the minority, and they can edit robots.txt to allow this with the RobotsTxt module: https://www.drupal.org/project/robotstxt.

Steps to reproduce

See in search engines that paths such as /node/100 are getting indexed, instead of the intended human readable URL alias such as /my-alias, harming SEO.

Proposed resolution

Disallow all paths under /node from getting crawled by default.

Remaining tasks

Update the robots.txt file

Workaround

Until robots.txt gets updated, add this in composer.json:

"extra": {
    "drupal-scaffold": {
        "locations": {
            "web-root": "web/"
        },
        "file-mapping": {
            "[web-root]/robots.txt": {
                "append": "assets/my-robots-additions.txt"
            }
        }
    }
},

Add a file assets/my-robots-additions.txt (relative to the project root), with this in it:

# Do not crawl any nodes
Disallow: /node

See Using Drupal's Composer Scaffold > Altering scaffold files.
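
To sanity-check what the appended rule would do, here is a minimal sketch using Python's standard-library urllib.robotparser. The base robots.txt below is a short illustrative stand-in, not Drupal's actual scaffolded file, and the helper names build_parser and is_crawlable are hypothetical. Python's parser may also differ in details from how a given search engine interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative stand-in for a scaffolded robots.txt (assumption, not Drupal's real file).
BASE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

# Contents of assets/my-robots-additions.txt from the workaround above.
ADDITIONS = """\
# Do not crawl any nodes
Disallow: /node
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, path: str, agent: str = "*") -> bool:
    """Return True if the given user agent may fetch the path."""
    return parser.can_fetch(agent, path)


# Simulate the scaffold "append" by concatenating the two files.
parser = build_parser(BASE_ROBOTS + ADDITIONS)

for path in ("/node/100", "/my-alias", "/node-archive"):
    print(path, is_crawlable(parser, path))
```

Rules are matched by path prefix, so in this sketch /node/100 is blocked and /my-alias stays crawlable, but an alias that merely starts with /node (such as the made-up /node-archive) is blocked as well. Note also that Python's parser treats a blank line as the end of a rule group, which is why the sample strings contain none.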

User interface changes

none

API changes

none

Data model changes

none

Release notes snippet

TBD

Issue fork drupal-3502360

Comments

ressa created an issue. See original summary.

ressa’s picture

Assigned: ressa » Unassigned
Category: Task » Feature request
Status: Active » Needs review
Parent issue: #3167542: Remove trailing slashes from robots.txt »
Related issues: +#3167542: Remove trailing slashes from robots.txt
ressa’s picture

Category: Feature request » Task
ressa’s picture

Issue summary: View changes

Add Workaround in Issue Summary.

cilefen’s picture

Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

Because using Redirect to maintain canonical URLs, that is, <link rel="canonical" />, works, I wonder whether it is ideal not to index /node paths by default.

These are just my first thoughts after a few minutes' consideration.

ressa’s picture

Issue summary: View changes

> Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

Thanks for weighing in. And while that's totally true, that is not my focus here, but I can see that the description of the problem in the issue Summary was a bit too vague.

The point is, that in most cases, you want the human readable path indexed in the first place, not the node/NID path -- even if it gets redirected.

I have updated the Issue Summary to make this point clearer.

cilefen’s picture

Also note:

If you have multiple pages that have the same information, try setting up a redirect from non-preferred URLs to a URL that best represents that information. If you can't redirect, use the rel="canonical" link element instead. But again, don't worry too much about this; search engines can generally figure this out for you on their own most of the time.

https://developers.google.com/search/docs/fundamentals/seo-starter-guide...

ressa’s picture

Sure, and the Redirect module can take care of that, as far as I see, should a /node/100 path get exposed and indexed by mistake ...

Or do you have another point with sharing that sentence?

Again, the aim of this MR is to get the correct alias indexed on the first try, by blocking /node/100 from getting indexed in the first place.

cilefen’s picture

In my experience that article is correct: search engines respect Core's rel="canonical", with or without Redirect. I am trying to understand the downside of having a /node URL indexed to which the author later adds a path alias, which the search engines then accept.

Does the opposite ever happen?

ressa’s picture

I created this issue because /node/100 style paths got indexed, possibly before they had aliases. In my case, I didn't want that specific content type indexed.

But what would be the downside to doing this? I guess sometimes you do want node/100 paths indexed ... my assumption is just that, whenever you have Pathauto installed, 99% of the time you want to use human-readable paths and block node/100 paths, but I could be wrong?

cilefen’s picture

I don't know, really. It just seems a strong default not to index /node. It's a bit of a singular case but this very website would be largely un-indexed with that default. 🙁

cilefen’s picture

Actually that's not completely true because of the /issues auto path.

ressa’s picture

I do appreciate getting the tires of the MR kicked, don't get me wrong!

I just think that in the majority of new installations, you do not want node/100 paths indexed. And if that's true, we should make the preferred behaviour the default, non?

smustgrave’s picture

Sorry if I'm just repeating something. But I'm thinking of a novice site builder (mom and pop shop). If you don't have pathauto and don't manually set the alias then the URL will be node/123. Think we can all agree that's bad practice and standards. But idk how to vote for this one lol. Robots.txt should be part of core maybe?

ressa’s picture

Heh, there has been some debate, but I think the gist of it was condensed into the last sentence of comment #16 -- that the majority would benefit from this.

Also, this change would probably not cause any big problems, but rather a theoretical challenge, for a select few.

catch’s picture

Status: Needs review » Postponed (maintainer needs more info)

> I created this issue because /node/100 style paths got indexed, possibly before they had aliases.

And if this MR went in, and you never added aliases, then they'd never get indexed at all. rel="canonical" should cover this even when a node goes from unaliased to aliased.

Drupal CMS ships with robotstxt module now so I feel like 'default behaviour' is covered there, but we can't just break search indexing in core because people usually install contrib modules.

cilefen’s picture

This is a "closed (won't fix)" for me.

ressa’s picture

> And if this MR went in, and you never added aliases, then they'd never get indexed at all.

True, but really, how often does this scenario happen -- a web site without aliases, yet with an urgent need for indexing? It seems to me highly theoretical.

Not having node/100 pages indexed would be fine in most cases, and probably preferred. In my case the pages were indexed prematurely, and I did not want node/100 pages indexed. Also, how does Drupal CMS shipping with the robotstxt module affect this change? As I still see it, in the majority of cases you do not want node/100 pages indexed. As I wrote earlier:

> And if that's true, we should make the preferred behaviour the default, non?

I still see this change as largely beneficial in the grand picture, for the majority of use cases. Conversely, if someone REALLY wants node/100 pages indexed, they can easily install the robotstxt module, and correct this.

smustgrave’s picture

Wanted to bump 1 more time before closing.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.