Problem/Motivation
- The vast majority of sites use Pathauto, as seen by installs in January 2025: Drupal core: 723,408; Pathauto: 514,780.
- Getting a path such as /node/100 indexed instead of the human readable URL alias /my-alias is bad for SEO, since you want the human readable path indexed in the first place, not the node/NID path -- even if it gets redirected.
Therefore, it makes sense to disallow all paths under /node from getting crawled by default.
There may be reasons why a site wants to allow paths under /node to get crawled, but such sites are in the minority, and they can edit robots.txt to allow this with the RobotsTxt module: https://www.drupal.org/project/robotstxt.
Steps to reproduce
See in search engines that paths such as /node/100 are getting indexed, instead of the intended human readable URL alias such as /my-alias, harming SEO.
Proposed resolution
Disallow all paths under /node from getting crawled by default.
Remaining tasks
Update the robots.txt file
Workaround
Until robots.txt gets updated, add this in composer.json:
"extra": {
    "drupal-scaffold": {
        "locations": {
            "web-root": "web/"
        },
        "file-mapping": {
            "[web-root]/robots.txt": {
                "append": "assets/my-robots-additions.txt"
            }
        }
    },
Add a file /assets/my-robots-additions.txt, with this in it:
# Do not crawl any nodes
Disallow: /node
See Using Drupal's Composer Scaffold > Altering scaffold files.
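To get a feel for what the appended rule does, here is a minimal sketch using Python's standard urllib.robotparser. It is not part of the proposed change. The `User-agent: *` line, the `ExampleBot` name, the `is_crawl_allowed` helper, and the `/nodes-overview` path are assumptions made for illustration; the sketch presumes the appended lines end up in the wildcard user-agent group of robots.txt. Real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed end result: the workaround's two lines, placed under the
# wildcard user-agent group (hypothetical, trimmed-down robots.txt).
ROBOTS_TXT = """\
User-agent: *
# Do not crawl any nodes
Disallow: /node
"""


def is_crawl_allowed(path, robots_txt=ROBOTS_TXT, user_agent="ExampleBot"):
    """Return True if the given path may be crawled under robots_txt.

    is_crawl_allowed and ExampleBot are hypothetical names for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # /node/100 is the unaliased path, /my-alias the human readable one.
    # /nodes-overview is a made-up alias that happens to start with /node.
    for path in ("/node/100", "/my-alias", "/nodes-overview"):
        print(path, is_crawl_allowed(path))
```

Note that the rule is a plain prefix match in this parser, so an alias that happens to begin with /node would be blocked as well. A stricter `Disallow: /node/` would avoid that, at the cost of leaving the bare /node path crawlable.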
User interface changes
none
API changes
none
Data model changes
none
Release notes snippet
TBD
Issue fork drupal-3502360
Comments
Comment #3
ressa
Comment #4
ressa
Comment #5
ressa
Comment #6
ressa
Comment #7
ressa
Add Workaround in Issue Summary.
Comment #8
cilefen commented
Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.
Because using Redirect to maintain canonical URLs, that is, <link rel="canonical" />, works, I wonder whether it is ideal not to index /node paths by default.
These are just my first thoughts after a few minutes' consideration.
Comment #9
ressa
Thanks for weighing in. And while that's totally true, that is not my focus here, but I can see that the description of the problem in the Issue Summary was a bit too vague.
The point is that, in most cases, you want the human readable path indexed in the first place, not the node/NID path -- even if it gets redirected.
I have updated the Issue Summary to make this point clearer.
Comment #10
cilefen commented
Also note:
https://developers.google.com/search/docs/fundamentals/seo-starter-guide...
Comment #11
ressa
Sure, and the Redirect module can take care of that, as far as I see, should a /node/100 path get exposed and indexed by mistake ...
Or do you have another point with sharing that sentence?
Again, the aim with this MR is to get the correct alias indexed in the first shot, by blocking /node/100 from getting indexed in the first place.
Comment #12
cilefen commented
In my experience that article is correct: search engines respect Core's rel="canonical", with or without Redirect. I am trying to understand the downside of having a /node URL indexed to which later the author adds a path alias, which the search engines then accept.
Does the opposite ever happen?
Comment #13
ressa
I created this issue because /node/100 style paths got indexed, possibly before they had aliases. In my case, I didn't want that specific content type indexed.
But what would be the downside to doing this? I guess sometimes you do want node/100 aliases indexed ... my assumption is just that, whenever you have Pathauto installed, 99% of the time you want to use human readable paths, and block node/100 aliases, but I could be wrong?
Comment #14
cilefen commented
I don't know, really. It just seems a strong default not to index /node. It's a bit of a singular case but this very website would be largely un-indexed with that default. 🙁
Comment #15
cilefen commented
Actually that's not completely true because of the /issues auto path.
Comment #16
ressa
I do appreciate getting the tires of the MR kicked, don't get me wrong!
I just think that in the majority of new installations, you do not want node/100 paths indexed. And if that's true, we should make the preferred behaviour the default, non?
Comment #17
smustgrave commented
Sorry if I'm just repeating something. But I'm thinking of a novice site builder (mom and pop shop). If you don't have pathauto and don't manually set the alias then the URL will be node/123. Think we can all agree that's bad practice and standards. But idk how to vote for this one lol. Robots.txt should be part of core maybe?
Comment #18
ressa
Heh, there has been some debate, but I think the gist of it was condensed into the last sentence of comment #16 -- that the majority would benefit from this.
Also, this change would probably not cause any big problems, but rather a theoretical challenge, for a select few.
Comment #19
catch
And if this MR went in, and you never added aliases, then they'd never get indexed at all. rel="canonical" should cover this even when a node goes from unaliased to aliased.
Drupal CMS ships with robotstxt module now so I feel like 'default behaviour' is covered there, but we can't just break search indexing in core because people usually install contrib modules.
Comment #20
cilefen commented
This is a "closed (won't fix)" for me.
Comment #21
ressa
True, but really, how often does this scenario happen -- a web site without aliases, yet with an urgent need for indexing? It seems to me highly theoretical.
Not having node/100 pages indexed would be fine in most cases, and probably preferred. In my case it was premature indexing, and I did not want pages with node/100 indexed. Also, how does Drupal CMS shipping with the robotstxt module affect this change? As I still see it, in the majority of cases, you do not want node/100 pages indexed -- as I wrote about this earlier:
I still see this change as largely beneficial in the grand picture, for the majority of use cases. Conversely, if someone REALLY wants node/100 pages indexed, they can easily install the robotstxt module, and correct this.
Comment #22
smustgrave commented
Wanted to bump 1 more time before closing.