This has been a major issue for me since I switched my whole site from Joomla to Drupal.
Any previously indexed URL that contains index.php returns a 200 status and serves the front page instead of returning 404 Page Not Found.
Let me give you an example (the links are purposely broken here):
h t t p :// w w w.shareyourexpertise.com/doesnotexist ---Return Page NOT Found
h t t p:// w w w .shareyourexpertise.com/index.php/doesnotexist ---Return 200 and front page
| Comment | File | Size | Author |
|---|---|---|---|
| #42 | proper_path_wildcards_7.x-432384-42.patch | 733 bytes | machee |
| #40 | proper_path_wildcards_8.x-432384-40.patch | 753 bytes | machee |
| #38 | proper_path_wildcards-432384-38.patch | 399 bytes | machee |
| #27 | menu.inc-404-D6.patch | 532 bytes | ngaur |
| #27 | 404.php.txt | 225 bytes | ngaur |
Comments
Comment #1
vm commented
Not the node system.
Comment #2
netentropy commented
I did not see which issue to post under, and when I checked similar Rewrite threads they were posted to node system.
Comment #3
vm commented
No worries. I just want to try and help make sure the right eyes see this.
Comment #4
netentropy commented
Thanks. I hope the right eyes see it, because this is causing me a world of trouble and I know nothing about Rewrite Rules.
Comment #5
damien tournoud commented
This is not a bug, as it is the standard Apache behavior.
I suggest you try to add the following in your .htaccess, after Drupal's standard rewrite rules:
Do you have some /index.php/xxxxx URLs you want to save (for SEO reasons, for example)? If so, you might also want to define the corresponding URL aliases inside your Drupal site.
Comment #6
netentropy commented
There are no URLs containing index.php that I want to save; I would like to redirect them all to a proper 404 page.
Does this go just after the Rewrite Rules, or somewhere else? I thought the "L" means to stop processing the rewrite rules. Do I need to remove it from the 4th line?
Thanks. Some Joomla pretty-URL mods got in the habit of including index.php in the URL string, and now I am stuck with tons of these URLs that do not exist anymore.
Comment #7
damien tournoud commented
If you just want to 404 those, I would suggest something a little bit stronger, like (untested):
Preferably before Drupal's own rewrite rules.
Comment #8
avpaderno
Comment #9
netentropy commented
But what I don't understand is how these Rewrite Rules deliver the headers that tell Google etc. that this is a 404 and not just a pointer to another page.
Comment #10
netentropy commented
If this is standard Apache behaviour, then how does WordPress get around the same issue?
Its clean URL function sends any URL that contains /index.php/notreallink to Page Not Found, unlike Drupal, which loads the front page.
Comment #11
netentropy commented
OK, I have now tested this with Joomla and WordPress. In both of these CMSs (and no, I do not want to switch), whenever a URL that does not exist and contains index.php is entered, a 404 Page Not Found is served.
In Drupal, however, a URL that does not exist and contains index.php loads the front page.
Comment #12
vm commented
Pushing to 6.x-dev, as it would have to be fixed there first.
Comment #13
vm commented
At this point it can be switched to a bug report as well.
Comment #14
avpaderno
The query string seems to be ignored; trying a URL like http://example.com/index.php?password=pass causes the front page to be shown.
Comment #15
ngaur commented
This doesn't look to me like a case of 'standard Apache behaviour', and it has the potential to cause more trouble than has been identified here so far. Consider this URL, which displays the same bug:
http:// drupal.org/LICENSE.txt/foobar
Apache is not just ignoring the bit of the path after the existing file. Rather, because the requested file does not exist, the Drupal-supplied clean-URL rewrite rule is triggered, which rewrites this to:
http:// drupal.org/index.php?LICENSE.txt/foobar
Drupal is then responsible for determining whether the URL is valid and what status code should be returned.
I was about to publish this comment without disabling the above URLs, and then I thought about the potential consequences. From the above URL, click on the 'Modules' link a few times and watch the URL get longer each time. This means that an infinite number of 'valid' URLs (i.e. HTTP status=200) can be spidered on the site, although they get long and most spiders are smart enough to give up. Depending on what URLs are on the front page of a site, though, the number of URLs of a given depth could be much higher, with a potential combinatorial explosion of directory paths.
E.g. if there are relative links to "about/contact", "products/specials" and "help/FAQ" on the front page, then a spider that got a link to http:// drupal.org/LICENSE.txt/foo would later reach URLs like http:// drupal.org/LICENSE.txt/about/help/help/about/products/help/FAQ amongst 3^6 other URLs of that length, and the site would still be returning status=200 for all of them.
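The growth described above can be reproduced with a short sketch. It assumes a hypothetical front page carrying the three relative links from the example, uses example.com as a stand-in host, and resolves links the way a browser would (Python's `urljoin`); it says nothing about how any particular crawler actually behaves.

```python
from urllib.parse import urljoin

# Assumption: the three relative links from the example appear on every page served.
RELATIVE_LINKS = ["about/contact", "products/specials", "help/FAQ"]


def crawl(start_url, depth):
    """Return the set of URLs reachable by following relative links `depth` times."""
    frontier = {start_url}
    for _ in range(depth):
        # Each link is resolved against the URL of the page it was found on.
        frontier = {urljoin(url, link) for url in frontier for link in RELATIVE_LINKS}
    return frontier


if __name__ == "__main__":
    for depth in range(1, 7):
        urls = crawl("http://example.com/LICENSE.txt/foo", depth)
        print(depth, len(urls))
```

The count triples at each step, so six steps give 3^6 = 729 distinct URLs, every one of which would be answered with status 200 if the site serves the front page for them.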
Comment #16
netentropy commented
Thanks for the support here. I happen to be an analytical chemist, and after testing this several times I thought it was a valid issue, not just standard Apache behaviour.
Comment #17
damien tournoud commented
The standard Apache behavior is to ignore the part after the filename of an executable file, as in:
The part after the executable file is called the path information in PHP lingo, and is available in:
So this is *not* a bug. If you don't want that, you can use mod_rewrite to redirect all the URLs with a path info to /. This report could be requalified as a feature request for Drupal 7, but for Drupal 6, it is not a bug.
Comment #18
andreiashu commented
If Damien is right, then let's make it a feature request for D7 at least.
Whether it is a Drupal bug or just a feature request, I think we should have this fixed (in D7, and if possible backported to D6).
If you read #11, it seems that Joomla and WordPress don't replicate this behaviour (they send a correct 404 response in this case).
Comment #19
netentropy commented
Yes, I agree. I think this really defines what a *bug* is. It might not be an Apache *bug*, but as far as a properly working website is concerned it is a *bug*.
Other top CMS scripts are built on PHP and Apache and report proper 404s.
I strongly believe this should be fixed for Drupal 6.x, as modules and sites are still catching up to Drupal 6.
As far as a "won't fix" status goes, I just do not see this as the right direction, since Google has made it known that they place great emphasis on proper error reporting.
Comment #20
ngaur commented
Re Damien's comment #17:
Apache's PATHINFO type behaviour is not the cause of this bug.
You say that Apache passes PATHINFO in the case of an executable file. It has nothing to do with whether there's an executable file involved. See my examples in #15, where I used a non-executable file (LICENSE.txt); without Drupal's mod_rewrite rule, Apache would correctly produce a 404 response. Apache's behaviour in this case is overridden, and is relevant only insofar as it provides an example of what should happen. Note also that 'executable' in Apache's view is not necessarily related to the flags in the file system, but rather relates to whether there is an Apache handler for the file.
As I've pointed out, these URLs are being passed to Drupal because Drupal's .htaccess file has a rule saying that's what should happen when no file is found. In the case of /LICENSE.txt/foo, Apache is not passing PATHINFO=/foo to /LICENSE.txt because the mod_rewrite rule tells it to pass that to Drupal's index.php instead, and at that point it becomes Drupal's responsibility to handle the request sensibly.
Having implemented pretty much the whole of Drupal as a handler for file-not-found situations, if Drupal fails to deliver file-not-found responses where that's important, then Drupal has a bug. When combined with relative links on the generated page (in Drupal's case it serves its front page) to paths with one or more '/' in them, it becomes a serious bug.
If Drupal was to reimplement the functionality you refer to in Apache, passing control for requests that Drupal cannot handle back to, then that would be a functionally good solution. It would mean that scripts that utilise the PATHINFO behaviour could function within Drupal's directory. Besides fixing the bug, this would be a win for anyone who wanted to run web apps other than Drupal on a site where Drupal runs the top level directory. Instead, Drupal seems to be producing extra disk accesses to check the file system for the presence of a file that might be supposed to handle the response, and then serving the front page instead.
If Drupal was to serve up a 404 error page directly, then that would be OK, and would mean that the number of disk reads could be reduced by not searching for a file that accounts for part of the requested path.
Comment #21
ngaur commented
Changing the title because this is not limited to index.php.
The following mod_rewrite rules are possible fixes in the case where you know the prefix that's causing problems. eg if you've got a spider reading through /index.php/foo type files this will help, but it won't change the more general case where any other existing file in drupal's file system causes the same thing. eg my earlier example of /LICENSE.txt/foo. I don't think this is a good fix, but it will help if someone's getting a lot of unwanted traffic from web crawlers.
Choose any one of the approaches below.
What you need is another rewrite rule, placed before the one that's already in your .htaccess file, that handles any URL path starting with "/index.php/". Here are some possible approaches:
This first one produces an ugly Apache error page. It doesn't use Drupal, so it doesn't load your server much.
It's a cacheable result too, which helps reduce repeat requests for the same URL.
This one rewrites the URL to something for which Drupal will then give a 404 error. Nicer display.
Drupal's 404 report isn't very useful here, though, as it logs the rewritten URL.
If you care about drupal recording what got rewritten you could use this one:
Have a think though about whether a 404 error is really what you want to send.
Maybe you'd be better off getting those hits to where they are wanted.
Search engine traffic can be a considerable asset.
You could redirect the browser to the front page with a 301 redirect. The browser still sees the same page,
but search engines will get the message that the old page is no good,
people won't bookmark the wrong page, and relative links will point to the right place.
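The trade-off between a 404 and a 301 can be summarised in a small sketch. This is not a translation of the rewrite rules above; it is a hypothetical Python function (the name `legacy_response` and the `policy` values are made up for illustration) showing what the client ends up seeing under each choice.

```python
LEGACY_PREFIX = "/index.php/"  # the prefix discussed in this comment


def legacy_response(path, policy="not_found"):
    """Decide how to answer a request path under a given policy.

    Returns (status, headers), or None when the path is not a legacy
    URL and normal handling should continue.
    """
    if not path.startswith(LEGACY_PREFIX):
        return None
    if policy == "redirect":
        # Permanent redirect to the front page: the visitor still lands on
        # the site, and search engines learn that the old URL is obsolete.
        return 301, {"Location": "/"}
    # Default: tell clients and crawlers the page does not exist.
    return 404, {}
```

Either way the old URL no longer answers with a 200, which is what all the variants have in common.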
Comment #22
netentropy commented
Thanks for the great list of options.
Do you place these before the current Drupal Rewrite Rules, or after them?
I am not sure what the final Drupal .htaccess should look like if, say, I used the last option.
Do I just paste it in there somewhere, or is there something else I need to do?
Comment #23
avpaderno
Would it not be better if Drupal came with a .htaccess file that already contains a way to resolve such cases? That way, it would not be necessary to reapply the changes to that file each time it is changed by a commit to the Drupal code repository.
Comment #24
avpaderno
Comment #25
damien tournoud commented
The change should go first in D7, and only then be backported. Can someone come up with a patch?
Comment #26
ngaur commented
In reply to #22, whichever rewrite rule you choose must go after "RewriteEngine on", and it's better to put it after the "RewriteBase" rule if you've enabled that. It must go before the RewriteCond / RewriteRule block that Drupal uses to enable clean URLs. I.e. in Drupal 6.10, which is what I'm using, it should go just before the comment that reads " # Rewrite URLs of the form 'x' to the form 'index.php?q=x'."
In reply to #23, htaccess rules are really not the right way for Drupal to fix the problem because it's not just index.php that has the problem, and writing rules for a large and unpredictable set of files isn't viable. The .htaccess approach might help someone with a problem relating to specific urls that are in fact being crawled by a search engine or such like.
Re #25, I don't have drupal 7 installed. I've attached a patch for 6.10 though, which should be easy enough to apply to 7 as it basically doesn't interact with anything else in drupal except drupal_not_found(). This turned out to be pretty easy to write because it turns out that the .htaccess directive "ErrorDocument 404 index.php" is part of the picture. It seems the 404 handler gets to set some stuff up before the RewriteRule modifies the path that gets to index.php. It bothers me that I don't fully understand how mod_rewrite and the ErrorDocument directive are interacting. Testing required, and insight solicited.
The same sort of issue (as seen externally, different code bugs) is widespread in Drupal. Many arbitrary and silly urls fail to generate errors, leading to potential spider pits. eg:
http://drupal.org/project/issues/statistics/drupal/foo/bar/foo/bar
http://drupal.org/aggregator/foo/bar/foo/bar/foo/bar
http://drupal.org/user/12356/foo/bar/foo/bar/foo/bar
http://drupal.org/comment/reply/432384/1525254/foo/bar/foo/bar
These bugs will need different code fixes, but the fact that these continue to function as before is a good sign in terms of my patch not screwing things up. It would be easy (5 minutes + testing) to write a program to find urls on a site which can be extended in this manner.
Here are some relative links that will trigger this situation. Unless Drupal does some sort of filtering here, this bit of user input will create an infinite set of linked pages. Given that such bugs do, and presumably will, exist, I looked around for people's fixes to the problems of relative links, and discovered that this is a very old discussion. #13148: Problems with using relative path names.
Comment #27
ngaur commented
I've attached a tidied patch.
menu.inc-404-D6.patch attached.
It's functionally the same as my last patch, but a tidier place to put it.
I'm still not sure quite how the apache processing works. While I doubt this patch breaks a typical apache/drupal install, I don't know how robust this is when using a different web server, or perhaps with a different configuration approach under apache.
While less consistent with drupal's usual approach, it might be more stable to have a separate top level file for 404 errors. I've attached such a file. To use it, leave everything else alone, except the "ErrorDocument 404" line in your .htaccess file:
Comment #28
mikeytown2 commented
I second the idea of having a PHP file for just 404s. I worked around this issue in the boost module so it correctly sends out a 404.
#345484: 404 hits to /files directory cached as homepage with broken form actions
Comment #29
sdaams commented
Hello all,
What is the status on this? I'm not that familiar with Drupal, but helping a friend's company sort out their new website and have now run into this. To be brutally frank, I was blown away that this could even be the case. This is a HUGE HUGE bug. In fact, this bug could be used by any competitor to penalize/de-rank/devalue any site running Drupal that hasn't plugged this hole, as you can link to infinite pages and just lean back and watch search engines give up trying to crawl infinite numbers of pages....
Currently using a url redirect module to plug some of the obvious links, but that does not fix the above issue, it just takes care of the ones we know for sure exist.
I tried setting 404.php as the ErrorDocument, but no luck.
Ideas?
Comment #30
damien tournoud commented
Just to clear the air: Drupal doesn't use any relative URLs, and you are *strongly* advised not to use any in your content. Because there are no such URLs, there is no possible "infinite set of pages" generated.
Comment #31
sdaams commented
http://www.af83.com/index.php?foosdfsd
http://www.af83.com/index.php?foosdfsd2
http://www.af83.com/index.php?foosdfsd3
Well, I think you get the picture. Anyone can generate infinite duplicates of your home page, essentially nailing any drupal site. These links should ALL ALWAYS give a 404.
Comment #32
damien tournoud commented
@sdaams, don't be stupid.
Does any of the following URLs return a 404?
http://www.microsoft.com/en/us/default.aspx?foosdfsd
http://www.microsoft.com/en/us/default.aspx?foosdfsd2
http://www.microsoft.com/en/us/default.aspx?foosdfsd3
Comment #33
sdaams commented
@Damien, just because other sites have the same problem doesn't mean it isn't a problem. A page that doesn't exist should ALWAYS return a 404, no exception. I'm not sure why that constitutes 'being stupid'. Okay, maybe my example isn't the best, but there are some good examples put forward by ngaur in http://drupal.org/node/432384#comment-1527438 that really shouldn't resolve.
It'd just be good to kick back the correct 404 header on these pages, rather than ignore the folks that are suggesting fixes like above. But hey, like I said, it's not my site. We'll just trust Google always does the right thing :)
Comment #34
mikeytown2 commented
@sdaams
You're talking about two different things here. With clean URLs enabled, issuing a 404 due to a bad query string in the URL is dumb.
http://www.google.com/webhp?fooooooooooooooooooooo
Set the canonical URL. If you're still crazy about this use case, create a module that sends out a 404 if there is a query string present in the URL.
Issuing a 404 because the directory exists but there is no index file is an issue. I address it in the boost module, it gets addressed in a stand alone module as well if I remember correctly. This is in a hook_init call of the boost module.
My in depth look at this issue: #345484: 404 hits to /files directory cached as homepage with broken form actions
Comment #35
damien tournoud commented
Drupal 7 returns 404 for /index.php now (see #711650: When index.php appears in the URL (or is automatically added by the server) users get a "page not found" message for a discussion on reverting this behavior). Reassigning back to 6.x.
Comment #36
Z2222 commented
When modifying a website in a way that changes the URLs (like converting to Drupal), the best solution is to redirect all the old URLs to the corresponding new URLs with 301 redirects. But if you're *not* going to redirect the URLs, you should be able to fix this problem with robots.txt like this:
(But don't disallow index.php by itself -- it needs the trailing slash. Double-check with Google's robots.txt checker to make sure you aren't blocking live pages.)
I wouldn't *redirect* the pages to a 404 page -- the URL shouldn't change when a page doesn't exist; the bad URL should just send a 404 header.
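The trailing-slash point can be checked with Python's standard `urllib.robotparser`. The rules below are only an assumption about what such a robots.txt might contain, and example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: block only paths *below* index.php, not index.php itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if a generic crawler may fetch `url` under ROBOTS_TXT."""
    return parser.can_fetch("*", url)
```

With these rules `allowed("http://example.com/index.php/old-joomla-page")` is false, while `http://example.com/index.php` and ordinary paths such as `http://example.com/node/1` stay crawlable. Keep in mind that robots.txt only asks crawlers to stay away; it does not change the status code the server returns.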
Comment #37
machee commented
So are the links listed in comment #26 above considered a separate issue? They still seem to be a problem: http://drupal.org/node/432384/why-no-404
#711650: When index.php appears in the URL (or is automatically added by the server) users get a "page not found" message only seems to address URLs containing "index.php".
I submitted #1482588: Prevents 404 pages from being served but after discussing it, seems like this is the real bug.
Comment #38
machee commented
Here's my fix for the URLs that trigger this without "index.php" in them. I'm not sure what side effects this may have. It looks like menu_get_ancestors is only called in one other function, and this should be a fix in that case too.
Specifically, this should fix paths like http://drupal.org/node/432834/badlink so that they properly serve a 404 page.
This is my first Drupal core patch, so do let me know if I screw that process up. I suppose I'll follow up with 7.x-dev and 6.x-dev patches.
Comment #40
machee commented
So does 8.x always fail tests right now, or did I cause that much of a problem? Trying a different approach.
Comment #42
machee commented
Apologies to the other 9 of you following this issue for my fumbling around. Here is a patch for 7.x using the same method as #40.
Comment #44
machee commented
So after figuring out how to run tests locally, I learned that Drupal relies on this behavior. Much like PHP functions and arguments, URLs can accept an arbitrary number of directories (treated as arguments).
I don't like this functionality. Query strings already do this and it's relatively well known that their contents can be arbitrary so it can be worked around in crawlers and such. I understand the benefit there and with PHP functions, but I don't think it should be extended to the path component of URLs. There is the "spider pit" potential, but honestly it just seems tacky to me.
http://drupal.org/node/432384/dont/you/think/this/should/404/or/at/least/redirect/or/something <-- HTTP 200. Valid page. Hey crawlers, index me!
If you're expecting a variable in your URL, specify it. It's a few extra characters of code and self documents that you're expecting that additional "directory" in the URL. If having arbitrary directories is that important of a feature, make it optional per router item, off by default, and don't use it in core.
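The behaviour described here, and the stricter alternative being argued for, can be modelled in a few lines. This is a deliberately simplified sketch and not Drupal's menu router: the route table, the callback names, the `%` wildcard handling, and the `strict` flag are all assumptions made for illustration.

```python
# Hypothetical route table: pattern -> callback name. "%" matches any one segment.
ROUTES = {
    "node/%": "node_page_view",
    "user/%": "user_view",
}


def match_route(path, routes, strict=False):
    """Match `path` against `routes`, longest prefix first.

    Lenient mode (the default) falls back to shorter ancestors of the path
    and hands the leftover segments to the callback as extra arguments.
    Strict mode only accepts a pattern that covers the whole path.
    Returns (callback, extra_segments), or None for "page not found".
    """
    parts = path.strip("/").split("/")
    shortest = len(parts) if strict else 1
    for length in range(len(parts), shortest - 1, -1):
        for pattern, callback in routes.items():
            pattern_parts = pattern.split("/")
            if len(pattern_parts) == length and all(
                p == "%" or p == s for p, s in zip(pattern_parts, parts)
            ):
                return callback, parts[length:]
    return None
```

In lenient mode `node/432384/dont/you/think` resolves to the node callback with three unused arguments, which is how a 200 comes back; in strict mode it yields `None`. The catch is the one noted above: any route that relies on undeclared trailing arguments would also stop matching, so each of those would need an explicit pattern.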
Comment #45
jenlampton
I agree, I think this is a very important flaw in the way Drupal's menu system works, and we should find a way to 404 pages that are truly nonexistent, even if the parent paths are valid.
Comment #46
cweiske commented
If someone is looking for an explanation of this "standard Apache behavior": it's the AcceptPathInfo directive.
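For readers unfamiliar with path info, the split works roughly as follows. This is a simplified model, not Apache's implementation: `split_path_info` and the `existing_files` set are hypothetical, and whether the trailing part is accepted or rejected in practice depends on the handler and on the AcceptPathInfo setting.

```python
def split_path_info(request_path, existing_files):
    """Split a request path into (file, path_info).

    Walks the path one segment at a time and stops at the first prefix
    that names an existing file; the remainder is the path info.
    Returns None when no prefix of the path names an existing file.
    """
    segments = request_path.strip("/").split("/")
    for i in range(1, len(segments) + 1):
        candidate = "/" + "/".join(segments[:i])
        if candidate in existing_files:
            rest = segments[i:]
            path_info = "/" + "/".join(rest) if rest else ""
            return candidate, path_info
    return None
```

For `/index.php/doesnotexist` with `/index.php` on disk, this gives the file `/index.php` and the path info `/doesnotexist`. With AcceptPathInfo Off, a request with non-empty path info is answered with 404; with On, it is handed to the file's handler.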
Comment #47
Exploratus commented
So is there no solution to this? Man, that makes the subpath alias / extended alias modules rather unusable. I was rather excited about a couple of solutions they were providing for ugly URLs, but I can't accept losing 404 pages for nonexistent content; that's a slippery slope. :(
Comment #48
Berliner-dupe commented
I need a solution too. I use the extended alias module as well, but now I get no 404, only a 200, for nonexistent paths.
Puuhh ...
Comment #49
jmuzz commented
I agree with #44, but based on what I see at https://drupal.org/core/release-cycle it might be a bit late to make a change like this to Drupal 8. In particular I am looking at:
"API changes that require extensive follow-up issues are strongly discouraged during this phase, as the goal is not to add more risk to the release timeline."
I think the best bet to see it happen is to push for it when development of the next major version begins.
Comment #50
catch
Is this still an issue with the new routing system? I have a feeling it's a duplicate of another issue.
Comment #51
awm commented
Is this still considered a bug, or does it work as designed? Another question is what should happen when a user visits a URL that does not exist but has a valid parent, for example node/1/something. At the moment, Drupal (AFAIK) just renders the node. For example:
https://www.drupal.org/node/432384/test/test Should this be a 404? If not, how can it be?
Comment #52
cparkner01 commented
I also think this needs to be addressed. How can we fix the issue in #51? It seems to be an easy way for a competing site to sabotage a Drupal site: they could just create links to nonexistent pages such as www.example.com/node/123/fakepage, which would show up as a duplicate of www.example.com/node/123. Does anyone have a solution for this?
Comment #53
alemadlei
I did this patch for D7. Hopefully it might help other people.
https://www.drupal.org/node/2603992#comment-10508496
Comment #54
cparkner01 commented
I have implemented #42 from above and it seems to work; however, it breaks some pages, such as the user password reset page and comment replies. So I have still been searching for a good solution, and I came across a posting on Stack Exchange which states the following:
I have tried several times to create this custom module, but being that I have never created a module before and have limited coding capabilities I keep failing. Has anyone come across this and used it successfully, or is anyone able to explain more clearly how to place this in a custom module? I would love a solution to this problem as it has plagued my site for months now and I'm sure many others.
Comment #55
dawehner
Catch is right, this is solved with the new routing system.
Comment #57
lolandese commented
For those looking for a solution for Drupal 7, there is a module called Force 404.
This module checks on node pages (aliased or not) whether the current URL is a "known" router item, using menu_get_item(). If not, it returns a 404. Which pages to exclude can be configured. Redirects are also excluded.
Comment #58
lolandese commented
Grammar correction.
Comment #59
alexander.nachev commented
Hello,
The module linked in #57 does not help if you have the same problem with a path from Views. In my case I have domain.com//taxonomy/term/14/node/29200/node/29355. drupal.org may return a 200 header for such paths, but at least it does not emit a new canonical URL like "domain.com//taxonomy/term/14/node/29200/node/29355", as happens in my case. How can I make these pages also return a 404, or at least make the canonical point at term/14's actual path, where it is supposed to be?