I get a lot of Google requests like "/node/comment/reply/13". This URL is invalid (it should be "/comment/reply/13"), yet invoking it returns the main page and a status of 200: OK. This leads to duplicated content, which is penalized by Google and others.

I suggest to instead return a 404: Not Found, this is because the page requested was - well - not found, actually.

To achieve this, lines 2109p in node.module have to be changed from

>
drupal_set_title('');
return node_page_default();

to

drupal_not_found();
return;
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

gerd riesselmann’s picture

Silly me, the case of /node needs to be handled. Thats what the code should look like:

    case '':
      drupal_set_title('');
      return node_page_default();
    default:
      drupal_not_found();
      return;
owahab’s picture

I had this issue too.
I think the way drupal handles the second argument (node/$arg) is confusing for me, I tried fixing this in the node.module file but it's not an easy job, at least for me.
Dozens of dependancies I can't foresee.

gerd riesselmann’s picture

Status: Active » Needs review
FileSize
333 bytes
killes@www.drop.org’s picture

Version: 4.7.4 » 5.x-dev

looks good, moving to HEAD for more reviews.

m3avrck’s picture

Would like to see this backported to 4.7 as well--- once it gets committed.

Code looks good and makes sense to me.

ChrisKennedy’s picture

Version: 5.x-dev » 6.x-dev
Category: bug » feature
Priority: Critical » Normal
m3avrck’s picture

Version: 6.x-dev » 5.x-dev
Category: feature » bug
Status: Needs review » Needs work

This is a bug, if a page is not found, it should not redirect to another page claiming it still exists :-p

This should go into 5 and be backported to 4.7. However, this patch doesn't apply to HEAD because node_page() is gone now.

m3avrck’s picture

Version: 5.x-dev » 6.x-dev
Component: node system » menu system
Priority: Normal » Critical

IMO, this is indeed a critical bug. It would be trivial to create a spamming solution using this. Easy enough to ruin a sites reputation in a search engine too.

Bumping this to 6 though, this would not be a trivial patch. This would be great to incorporate into the thinking going behind the new menu system proposed for 6 too.

gerd riesselmann’s picture

I don't know about 5 and 6, but in the meantime: can somebody just apply the existing patch to the 4.7 branch? ;-)

moshe weitzman’s picture

i proposed a more generic solution for this at http://drupal.org/node/74347. i think this is a duplicate of that one, but i'll wait for someone else to change status.

gerd riesselmann’s picture

Version: 6.x-dev » 4.7.x-dev
Component: menu system » node system
Status: Needs work » Reviewed & tested by the community

Yes, your patch handles the same problem, but I doubt it will be incorporated in the foreseeable future, since it proposes a rather big change. At least it won't be incorporated into 4.7.

Since everybody commenting here actually

  • acknowledged the current behavior is a bug and
  • my patch does solve this for the 4.7 branch,

I'd like to ask again for someone to apply it to this branch - please.

moshe weitzman’s picture

Version: 4.7.x-dev » 6.x-dev

i think this is a pretty good fix. this has to be applied to HEAD before it can be backported. moving ...

gerd riesselmann’s picture

As m3avrck pointed out in #7, this patch cannot be applied to HEAD, since HEAD menu system works different in many ways. So what's the sense of porting this patch to HEAD, which means to completely rewrite it, just to port it back afterwards?

My point of view is that this patch solves a 4.7-specific critical bug in a 4.7-specific way. If there's are similar unsolved problem in HEAD or 5.0, so it may be. But this cannot be an argument to not fix the original bug!

gerd riesselmann’s picture

I've investigated 5.0 and the patch does not apply to this, either, since a URL like http://drupal.org/node/ispamdrupalorg gets routed to node_default_page() without further argument checking. URLs with patter node/id however are mapped to node_page_view().

In 4.7 node_page() does handle both cases and does internal argument checking - and that is where my patch comes in: It modifies the argument checking case statement.

Therefore, for 5 or HEAD a completely different solution needs to be found.

gerd riesselmann’s picture

FileSize
633 bytes

So, to end moving responsibilities around, here is a proposal to fix this for 5.0. In HEAD however, menu system has changed again and seems to be in the flow, so I second the patch proposed by moshe in #10.

I suggest somebody applies the 4.7-fix, and afterwards we discuss the 5.0 patch ;-). Agree?

m3avrck’s picture

I think this might be a better general fix:

http://drupal.org/node/74347

So this might be a duplicate of that as Moshe pointed out above.

moshe weitzman’s picture

Version: 6.x-dev » 5.x-dev

seems that i've made this too complicated. lets fix DRUPAL-5 with ted's patch and then 4.7 with gerd's patch. reassigning accordingly.

BioALIEN’s picture

I agree with Moshe, lets get this applied for 4.7.x and 5.x and then concentrate on D6 maybe using the new menu system. Drupal is widely used for its good SEO features. This patch addresses a major bug with regards to SEO. I've seen Google and Yahoo! bots send out test checks on some of our client sites in the form of google+hash.html so they do carry out 404 tests on websites.

+1 to getting this committed asap.

m3avrck’s picture

I agree with the said comments :-)

However, I didn't write a patch yet :-p

Moshe, can you clarify *which* comments have the correct patches to be committed? I'm sure one of the core committers would ask this question, so might as well ask it for them :-p

moshe weitzman’s picture

oops again. gerd's most recent patches are the ones to go with. one for 5.0 and one for 4.7

dww’s picture

Status: Reviewed & tested by the community » Needs work

gah, hate to do this, but http://drupal.org/files/issues/node404.5.0.patch from #15 contains drupalnot_found(). someone should re-roll that.

http://drupal.org/files/issues/node.module.404.4-7.patch is indeed RTBC.

oh, and +1 for fixing bugs in older versions, even if HEAD is in flux or bugs aren't present in newer versions. ;)

dww’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
881 bytes

here you go. ;)

to repeat, http://drupal.org/files/issues/node.module.404.4-7.patch from #3 is also RTBC for 4.7.x.

cheers,
-derek

dww’s picture

Status: Reviewed & tested by the community » Needs work

alas, whoops. previous patch (and therefore, my latest for 5.x) had syntax errors, too. hehe, guess i was a little hasty with that RTBC. ;)

dww’s picture

Status: Needs work » Needs review
FileSize
880 bytes

i couldn't get empty() to play nicely with arg(). this seemed like the simplest approach. another pair of eyes, please. ;)

m3avrck’s picture

Status: Needs review » Needs work

I'm not clear where this $validateargs comes from -- maybe a note about that in a comment? Should it be called $validate_args for clarity as well? Lowercase "true" should be "TRUE".

Wouldn't !isset(arg(1)) be faster than != as well?

dww’s picture

FYI: i was just making a quick pass through the RTBC queue, and came upon this. i don't personally care much, and i have no time to work on it, which is why it's not Assigned to dww. ;) someone who cares will have to pick up the torch here. just wanted to make sure i hadn't created any false expectations...

cheers,
-derek

gerd riesselmann’s picture

Version: 5.x-dev » 4.7.x-dev
Status: Needs work » Reviewed & tested by the community

Since everyone agrees that 4.7-patch is RTBC, I marked this bug accordingly. I suggest to set it to 5.x.dev and status "code needs review" after the 4.7-patch has been applied.

RobRoy’s picture

Version: 4.7.x-dev » 6.x-dev
Status: Reviewed & tested by the community » Needs review

Should go on HEAD first and then be backported. Marking back to review to make sure this applies cleanly to HEAD.

gerd riesselmann’s picture

Version: 6.x-dev » 4.7.x-dev
Status: Needs review » Reviewed & tested by the community

Aargh! Please do not start moving responsibilities around, again! There is a solution of a critical bug for 4.7 around for 4 months, and it doesn't get applied because everyones wants it to be ported to higher versions first. This is not good.

Rob, please read a thread before changing status: porting this to 5.0, or HEAD means complete rewriting, backporting therefore obviously means rewriting the rewritten - or applying the actual patch, that everyone already agrees on.

chx’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
852 bytes

Who is the everyone that agreed this is RTBC? The code is downright ugly. Why the attached patch is not enough?

BioALIEN’s picture

I didn't have enough time to test chx's approach but once again,

lets get this applied for 4.7.x and 5.x and then concentrate on D6 maybe using the new menu system.
gerd riesselmann’s picture

Status: Needs review » Reviewed & tested by the community

Oooookay. Let's get things straight.

  1. There is a patch for 4.7 given in #3. This patch is ready to be committed
  2. There are several patches available for 5.0, which need to be reviewed
  3. There is a patch by chx that I don't know what version it applies to (6.4?) - but it is sparkling and shiny. This patch needs to be reviewed, too

However, since menu system has changed from 4.7 to 5.0 patches to either version are completely different. This means that talking about "backporting" is downright nonsense, since "porting" in this special case means "rewriting".

Now, since 5.0 and HEAD still need to be discussed but 4.7 is ready, and unfortunately the bug reporting system doesn't allow us to express this situation properly, I proposed in #27 to mark this bug as RTBC for 4.7, wait till it gets committed, and afterwards mark it as 5.0, "code needs review".

Can we agree on this, please? I fear this bug else will never get fixed for 4.7 but endlessly discussed.

Or maybe two bug should be filed, one for 4.7 and one for 5.0?

BioALIEN’s picture

Someone please commit patch in #3 to 4.7.x branch. Then we'll move on to the 5.x discussions.

killes@www.drop.org’s picture

Version: 4.7.x-dev » 6.x-dev

gerd, that's not the workflow we follow. We follow the sequence HEAD, 5, 4.7 in order to avoid inconsistencies.

Dries’s picture

When I try to access 'http://localhost/drupal-cvs/node/comment/reply/13' as per the original bug report, I get an access denied message. Could it be that this is no longer an issue for CVS HEAD?

Update: I also get an SQL error, I'm afraid. Not sure if that's related to the 'access denied' but I'm guessing it might be. We'll want to fix the 'access denied' message in CVS HEAD before we can properly test this. :/

Wesley Tanaka’s picture

"Yes Please" (subscribing)

I've found a bunch of hits from googlebot of the form:

/node/?page=&

because the default 'node' page has a pager at the bottom......

Wesley Tanaka’s picture

Err, that was supposed to say:
/node/[some string]?page=[large number]&[maybe other stuff]

This bug appears to be a duplicate of http://drupal.org/node/82764

To clarify the 5.x status:

Wesley Tanaka’s picture

Sorry. http://drupal.org/files/issues/node-not-found.2.diff doesn't work on non-numeric node paths.

So the only patch which works for me on 5.1 is http://drupal.org/files/issues/node404.5.0.patch_2.txt

(and to answer the question in #24, you could assign the return from arg() to a variable and test that variable with empty())

matt_paz’s picture

subscribing

Dries’s picture

This is a bit unwieldy. So, what is considered to be the best patch for CVS HEAD? :)

fractile81’s picture

http://drupal.org/files/issues/node404.5.0.patch_2.txt

This patch appears to work for me as well. It appears to handle this problem on my site correctly (5.1, had to apply by hand and move the changes to line 2372). Will keep testing it, of course.

kbahey’s picture

As per Dries in #35, I can't reproduce the original issue with today's HEAD.

If you type ?q=node/something, you get an access denied, plus the following errors:

    * warning: Invalid argument supplied for foreach() in /drupal/HEAD/drupal/modules/node/node.module on line 567.
    * notice: Undefined variable: cond in /drupal/HEAD/drupal/modules/node/node.module on line 571.
    * warning: implode() [function.implode]: Bad arguments. in /drupal/HEAD/drupal/modules/node/node.module on line 571.
    * user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 query: SELECT n.nid, n.vid, n.type, n.status, n.created, n.changed, n.comment, n.promote, n.sticky, r.timestamp AS revision_timestamp, r.title, r.body, r.teaser, r.log, r.format, u.uid, u.name, u.picture, u.data FROM node n INNER JOIN users u ON u.uid = n.uid INNER JOIN node_revisions r ON r.vid = n.vid WHERE in /drupal/HEAD/drupal/includes/database.mysqli.inc on line 151.

If you type a URL that references a non existent node such as q=comment/reply/123456, you get a clean access denied.

If you type an invalid URL such as q=node/comment/reply/123456, you get the access denied plus the above error.

So, what is the problem we are trying to solve? Doesn't exist anymore?

drumm’s picture

I went ahead and reviewed and committed #30 to Drupal 5 at chx's advice. In the future, I think multiple issues for different versions of Drupal with significantly different patches is okay.

chx’s picture

Version: 6.x-dev » 4.7.x-dev

Gerhard, #3 is OK and HEAD needs a totally different patch (menu.inc level, I am thinking allowing callback arguments to be FALSE not an array in cases like this).

Gerhard Killesreiter’s picture

Version: 4.7.x-dev » 6.x-dev
Status: Reviewed & tested by the community » Active

Applied #3 to 4.7.
moving back to HEAD

m3avrck’s picture

Where's the 6.x patch? Fixed in 4.7, 5.x but not 6.x? *doh*

xqus’s picture

As far as I can see this is not an issue in 6.x.

chx’s picture

Status: Active » Fixed

indeed. if node/dfghjk is not defined that any path beginning with that will come back with 404 thanks to node_load returning FALSE thus translate returning FALSE thus menu_get_item returning FALSE

Anonymous’s picture

Status: Fixed » Closed (fixed)
xsean’s picture

ver 6.x got this issue as when the link as below

valid url: http://www.example.com/node/4
invalid url: http//www.example.com/node/4/invalid

where the invalid show the same content as valid url. it show direct to 404:page not found and will cause duplicate content

i search the solution for whole day but still cannot find it...anyone can help here? thanks!

looney_toons’s picture

Status: Closed (fixed) » Active

I wonder why this issue is ignored.
On 6.17 I see the following:
example.com/"termalias"/"something-malformed" shows the content of
example.com/"termalias"
As a result we can have several urls pointing Google to the term's page if the url of some node of this term (which is example.com/"termalias"/"date"/"title") is changed and/or some other node gets deleted.
Why does Drupal in this case show the content of some url it has found while it should just redirect to this url?

looney_toons’s picture

I will keep bumping this issue till I find the solution.

dww’s picture

@looney_toons: Please don't do that. It's just annoying and adds noise. Only bump the issue if you have something useful to contribute to it. Thanks.

looney_toons’s picture

With boost module enabled this can also lead to wasting disk space if a lot of malformed URLs are accessed forcing boost to generate static files for these URLs.

Teddd’s picture

This is awkwardly from seo point it makes the frontpage looks as duplicated content
also the crawlers gets confused but keep coming back butt for no reason for pages that not exist anymore or never have been.
I know setup a disallow in robot.txt may help, but this not solves it.

grendzy’s picture

Status: Active » Closed (works as designed)

This is the way the menu router works; some modules actually rely on this behavior via the arg() function. Given that, it probably can't be changed until Drupal 8. Also note this isn't the only way to create duplicate URLs, you can add arbitrary query strings like node/1?foo and the query string will be ignored.

The real solution is to avoid outputting incorrect links in the first place.