Because of apache's misbehavior following rewrite rules (that boost depends) are not being processed correctly;
RewriteCond %{DOCUMENT_ROOT}/cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s
RewriteRule .* cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]
When query string contains an url encoded parameter such as "a%2Fa" boost creates a filename as-is which is correct. On the other hand according to apache process's stack trace the "RewriteCondition" above matches and evaluates to true;
stat64("[path]/index.html_werf=sdf%20sdfg%20gdh.html", {st_mode=S_IFREG|0664, st_size=21654, ...}) = 0
It stats the file and finds, but somehow while apache evaluating the "RewriteRule" it asks "urldecoded" filename;
stat64("[path]/index.html_werf=sdf sdfg gdh.html", 0xbffcb3bc) = -1 ENOENT (No such file or directory)
and as you see it results to "no entity".
RewriteCondition => [path]/index.html_werf=sdf%20sdfg%20gdh.html [OK]
RewriteRule => [path]/index.html_werf=sdf sdfg gdh.html [FAIL]
To avoid this, I tried "NE" flag in rewriterule but it failed also. Because boost depends on apache's htaccess I think it is proper to workaround this in boost module rather than asking to apache (and probably they will tell "this is not a bug this is a feature" when you create an issue about this :) )
The patch is attached, hope this helps somebody. If any filename is encoded, it creates an extra un-encoded empty file, by this way both condition and rule evaluates to true.
May be related to:
- http://drupal.org/node/782274
- http://drupal.org/node/797136
Comment | File | Size | Author |
---|---|---|---|
#19 | boost-n1398578-19.patch | 739 bytes | DamienMcKenna |
Comments
Comment #2
bgm CreditAttribution: bgm commentedRelated issue: #1386166: PDOException with UTF-8 aliases / boost does not cache pages with UTF-8 aliases
@ mow: can you check if these issues are the same / test the patch in the comment #4?
Thanks
Comment #3
trante CreditAttribution: trante commentedsubscribe
Comment #4
bgm CreditAttribution: bgm commentedclosing as duplicate of #1386166: PDOException with UTF-8 aliases / boost does not cache pages with UTF-8 aliases
Comment #5
jamix CreditAttribution: jamix commentedI'm reopening this because it's only partly a duplicate of #1386166: PDOException with UTF-8 aliases / boost does not cache pages with UTF-8 aliases. The important distinction is that here, we're talking about special characters in the query string, and #4 from the abovementioned issue does not fix it.
Let's take this URL as an example:
http://www.example.com/?arrow=%3E
. When this page gets requested for the first time, Boost creates a file in the filesystem called_arrow=%3E.html
(I'm assuming default settings here). So far, so good.On a subsequent request, the following condition is tried in .htaccess:
RewriteCond %{DOCUMENT_ROOT}/cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s
%{QUERY_STRING}
doesn't get URL-decoded by Apache, so that equals to a search for_arrow=%3E.html
in the filesystem. This match is successful, of course.Following this condition is this rule:
RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]
What it does is issue an internal redirect to the following URL:
/cache/normal/www.example.com/_arrow=%3E.html
. And - here's the important part - this URL points to a file called_arrow=>.html
in the filesystem, not_arrow=%3E.html
as intended. The result is a cache miss. For a cache hit to occur, we'd have to rewrite the path as/cache/normal/www.example.com/_arrow=%253E.html
instead.This is not Apache misbehavior as suggested by OP but side effects of appending non-encoded query string to the URL path.
A simple solution to fix this would be to URL-encode
%{QUERY_STRING}
in .htaccess before appending it to the cache URL in the RewriteRule. I even think that simply search-replacing%
with%25
in%{QUERY_STRING}
would do. Unfortunately, I haven't been able to accomplish this via mod_rewrite.Comment #6
jamix CreditAttribution: jamix commentedJust found a solution that will work for Apache >= 2.2. In .htaccess, each RewriteRule that looks like this:
RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]
has to be rewritten as:
Thanks to the new
B
flag, we can insert%{QUERY_STRING}
URL-encoded that way. TheB
flag is available since Apache 2.2, though, so this solution is not versatile.Comment #7
Anonymous (not verified) CreditAttribution: Anonymous commentedThis whole issue needs discussion and has been re-assigned from a bug to a feature request. The first thing is that on windows systems in boot.module one cannot have certain characters in filenames (lines 587-592)
Since apache is making the request for a filename that cannot exist on Windows, this is not going to be possible.
then there is the issue as to whether boost should be caching query strings in the first place ? The inference is that the content is dynamic even for anonymous users. I'll gladly produce a patch if anyone can produce a convincing argument for the need to put what is effectively an opening tag, pipe or closing tag on the file system.
Comment #8
Anonymous (not verified) CreditAttribution: Anonymous commentedThere is a solution and I have some test code that will need to be tidied and re-written to the coding standards.
The main issue with this is that the OP was correct, the code
looks for a file with the query string encoded and say "yes it does exist" and so it sets a redirect, it should then pass straight through the cache, but instead boost picks up that it doesn't exist (because of the encoding) and so it passes it back to index.php and subsequently itself for regeneration, thereby setting a double redirect through index.php which then hits the db and still creates the file that should have been served. This file is never served but sometimes cleaned up by cron.
My code will serve the cached file and then exit. If the file is removed, boost works as normal. The normal caveats apply in that the page will not expire for anonymous users unless caches are cleared, it is edited and/ or picked up by cron and boost_expire.
Now the question that I would ask is whether it's correct for me to implement this patch, bearing in mind my comment above where we are serving a static page with a query_string to anonymous users, appears that the query_string should be altering the page instead of serving static html. If anyone has a good reason like a popular module that uses ajax and references the query string then it may be appropriate, or possible there is a query string that changes a theme or display, otherwise I would be more inclined to move this to closed (worked as designed)
Comment #9
bgm CreditAttribution: bgm commentedCall me stubborn, but I can't reproduce this bug (under Linux/Debian, Apache 2.2). I tried: "http://example.org/?é>èц" (and other variants). It wrote the file: _test=%C3%A9%20%C3%A8%3E%D1%86.html, and next requests to that page made me hit the cache. What am I missing?
There are valid use cases for caching URLs with query strings. For example, a view with a pager will have myview?page=2 for the second page of results. If people want to exclude pages from Boost (ex: very specific ajax requests that are not worth caching), they can use hook_boost_is_cacheable().
Comment #10
Anonymous (not verified) CreditAttribution: Anonymous commentedSensei, have you tried the simplest version? (I also bow to your example of the pager but does it need encoded characters?)
http://www.example.com/?arrow=>
my debugging code below for the patch notice the SERVER[REDIRECT_URL]. The page is never served from the cache whereas /var/www/cache/normal/ru.webserver/_arrow=%3E.html is generated, no boost headers or tags at the bottom (latest dev version).
I tried your version which wrote
/var/www/cache/normal/ru.webserver/_test=%C3%A9%3E%C3%A8%D1%86.html
with the same results. It can be tried out live at
http://www.howtogetdivorced.com/ (cached)
http://www.howtogetdivorced.com/?arrow=> (file created, not sent)
http://www.howtogetdivorced.com/?test=é>èц (lovely example, file created, not sent)
http://www.howtogetdivorced.com/?test=9 (cached)
(apache 2.2.22) and viewing the source and lack of boost tags. Tried with different browsers and Lynx as I don't like the way the address bar changes to human readable form. Notice the double redirect [REDIRECT_REDIRECT_boostpath] ?
Comment #11
jamix CreditAttribution: jamix commentedMy use case for special characters in the query string was Boost-caching of pjax requests (pjax module). The module passes CSS selectors in GET parameters, and those in my case contain special characters, for example:
.container > div
.Comment #12
Anonymous (not verified) CreditAttribution: Anonymous commentedBad Bad Bad idea being able to submit css, it screams out "XSS" attack me. I once wrote my entire cv in javascript and then posted it on a URL on a government website to prove a point, I also created fake items in a shop and took orders for non-existent items to demonstrate the dangers. But I will have a look. A little busy today though.
Philip.
Comment #13
jamix CreditAttribution: jamix commentedWell, that's how the pjax module works anyway. I appreciate your looking into it.
Comment #14
Anonymous (not verified) CreditAttribution: Anonymous commentedEven if this does not make it into boost, I do have the code, so you could be my test subject if you are willing?
Comment #15
jamix CreditAttribution: jamix commentedSure, I can take a look and see if the code helps.
Comment #15.0
jamix CreditAttribution: jamix commentedadded similar issues
Comment #16
blake.thompson CreditAttribution: blake.thompson commentedI have come across this issue with a date filter as a query parameter. Apache specifically doesn't like the %2F character in the Boost filename when the AllowEncodedSlashes directive is set to Off, which I guess it is by default. My solution currently is to not cache pages when the date filter has a value.
Comment #17
bendev CreditAttribution: bendev commentedthe patch helped in my case...
thanks!
Comment #18
gdebilde CreditAttribution: gdebilde commentedThe patch works for me too.
Thx
Comment #19
DamienMcKennaI cleaned up the patch to match Drupal's coding standards. I also left the original file to contain the full contents instead of dummy contents.
Comment #20
bendev CreditAttribution: bendev commentedpatch#19 is ok
Comment #22
bgm CreditAttribution: bgm commentedPatch committed. Thank you everyone!