Because of apache's misbehavior following rewrite rules (that boost depends) are not being processed correctly;

RewriteCond %{DOCUMENT_ROOT}/cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s
RewriteRule .* cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]  

When query string contains an url encoded parameter such as "a%2Fa" boost creates a filename as-is which is correct. On the other hand according to apache process's stack trace the "RewriteCondition" above matches and evaluates to true;
stat64("[path]/index.html_werf=sdf%20sdfg%20gdh.html", {st_mode=S_IFREG|0664, st_size=21654, ...}) = 0

It stats the file and finds, but somehow while apache evaluating the "RewriteRule" it asks "urldecoded" filename;
stat64("[path]/index.html_werf=sdf sdfg gdh.html", 0xbffcb3bc) = -1 ENOENT (No such file or directory)

and as you see it results to "no entity".

RewriteCondition => [path]/index.html_werf=sdf%20sdfg%20gdh.html [OK]
RewriteRule => [path]/index.html_werf=sdf sdfg gdh.html [FAIL]

To avoid this, I tried "NE" flag in rewriterule but it failed also. Because boost depends on apache's htaccess I think it is proper to workaround this in boost module rather than asking to apache (and probably they will tell "this is not a bug this is a feature" when you create an issue about this :) )

The patch is attached, hope this helps somebody. If any filename is encoded, it creates an extra un-encoded empty file, by this way both condition and rule evaluates to true.

May be related to:
- http://drupal.org/node/782274
- http://drupal.org/node/797136

CommentFileSizeAuthor
#19 boost-n1398578-19.patch739 bytesDamienMcKenna
boost.module.patch678 bytesmow
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

bgm’s picture

Issue tags: +i18n, +utf-8

Related issue: #1386166: PDOException with UTF-8 aliases / boost does not cache pages with UTF-8 aliases

@ mow: can you check if these issues are the same / test the patch in the comment #4?

Thanks

trante’s picture

subscribe

bgm’s picture

Status: Active » Closed (duplicate)
jamix’s picture

Status: Closed (duplicate) » Active

I'm reopening this because it's only partly a duplicate of #1386166: PDOException with UTF-8 aliases / boost does not cache pages with UTF-8 aliases. The important distinction is that here, we're talking about special characters in the query string, and #4 from the abovementioned issue does not fix it.

Let's take this URL as an example: http://www.example.com/?arrow=%3E. When this page gets requested for the first time, Boost creates a file in the filesystem called _arrow=%3E.html (I'm assuming default settings here). So far, so good.

On a subsequent request, the following condition is tried in .htaccess:

RewriteCond %{DOCUMENT_ROOT}/cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s

%{QUERY_STRING} doesn't get URL-decoded by Apache, so that equals to a search for _arrow=%3E.html in the filesystem. This match is successful, of course.

Following this condition is this rule:

RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]

What it does is issue an internal redirect to the following URL: /cache/normal/www.example.com/_arrow=%3E.html. And - here's the important part - this URL points to a file called _arrow=>.html in the filesystem, not _arrow=%3E.html as intended. The result is a cache miss. For a cache hit to occur, we'd have to rewrite the path as /cache/normal/www.example.com/_arrow=%253E.html instead.

This is not Apache misbehavior as suggested by OP but side effects of appending non-encoded query string to the URL path.

A simple solution to fix this would be to URL-encode %{QUERY_STRING} in .htaccess before appending it to the cache URL in the RewriteRule. I even think that simply search-replacing % with %25 in %{QUERY_STRING} would do. Unfortunately, I haven't been able to accomplish this via mod_rewrite.

jamix’s picture

Just found a solution that will work for Apache >= 2.2. In .htaccess, each RewriteRule that looks like this:

RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]

has to be rewritten as:

RewriteCond %{QUERY_STRING} (.*)
RewriteRule .* cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%1\.html [B,L,T=text/html]

Thanks to the new B flag, we can insert %{QUERY_STRING} URL-encoded that way. The B flag is available since Apache 2.2, though, so this solution is not versatile.

Anonymous’s picture

Assigned: Unassigned »
Category: bug » feature
Priority: Critical » Normal
Status: Active » Needs review

This whole issue needs discussion and has been re-assigned from a bug to a feature request. The first thing is that on windows systems in boot.module one cannot have certain characters in filenames (lines 587-592)

  // Check for reserved characters if on windows.
  // http://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
  // " * : < > |
  $chars = '"*:<>|';
  if (stristr(PHP_OS, 'WIN') && preg_match("/[" . $chars . "]/", $full)) {
    $parts['is_cacheable'] = FALSE;
    $parts['is_cacheable_reason'] = 'Reserved characters on MS Windows';
    return $parts;
  }

Since apache is making the request for a filename that cannot exist on Windows, this is not going to be possible.

then there is the issue as to whether boost should be caching query strings in the first place ? The inference is that the content is dynamic even for anonymous users. I'll gladly produce a patch if anyone can produce a convincing argument for the need to put what is effectively an opening tag, pipe or closing tag on the file system.

Anonymous’s picture

There is a solution and I have some test code that will need to be tidied and re-written to the coding standards.

The main issue with this is that the OP was correct, the code

RewriteCond %{DOCUMENT_ROOT}/cache/%{ENV:boostpath}/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}\.html -s

looks for a file with the query string encoded and say "yes it does exist" and so it sets a redirect, it should then pass straight through the cache, but instead boost picks up that it doesn't exist (because of the encoding) and so it passes it back to index.php and subsequently itself for regeneration, thereby setting a double redirect through index.php which then hits the db and still creates the file that should have been served. This file is never served but sometimes cleaned up by cron.

My code will serve the cached file and then exit. If the file is removed, boost works as normal. The normal caveats apply in that the page will not expire for anonymous users unless caches are cleared, it is edited and/ or picked up by cron and boost_expire.

Now the question that I would ask is whether it's correct for me to implement this patch, bearing in mind my comment above where we are serving a static page with a query_string to anonymous users, appears that the query_string should be altering the page instead of serving static html. If anyone has a good reason like a popular module that uses ajax and references the query string then it may be appropriate, or possible there is a query string that changes a theme or display, otherwise I would be more inclined to move this to closed (worked as designed)

bgm’s picture

Title: Apache has problem with urlencoded cached file names » Apache has problem with urlencoded query strings in cached file names
Status: Needs review » Needs work

Call me stubborn, but I can't reproduce this bug (under Linux/Debian, Apache 2.2). I tried: "http://example.org/?é>èц" (and other variants). It wrote the file: _test=%C3%A9%20%C3%A8%3E%D1%86.html, and next requests to that page made me hit the cache. What am I missing?

then there is the issue as to whether boost should be caching query strings in the first place ?

There are valid use cases for caching URLs with query strings. For example, a view with a pager will have myview?page=2 for the second page of results. If people want to exclude pages from Boost (ex: very specific ajax requests that are not worth caching), they can use hook_boost_is_cacheable().

Anonymous’s picture

Sensei, have you tried the simplest version? (I also bow to your example of the pager but does it need encoded characters?)

http://www.example.com/?arrow=>

my debugging code below for the patch notice the SERVER[REDIRECT_URL]. The page is never served from the cache whereas /var/www/cache/normal/ru.webserver/_arrow=%3E.html is generated, no boost headers or tags at the bottom (latest dev version).

/var/www/sites/all/modules/boost/boost.module : Wed, 10 Oct 12 07:49:55 +0100
 _boost
Array
(
)

 parts
Array
(
    [scheme] => http
    [host] => ru.webserver
    [path] => 
    [query] => arrow=%3E
    [full_path] => 
    [base_path] => /
    [query_array] => Array
        (
            [arrow] => >
        )

    [query_extra] => arrow=>
    [url_full] => ru.webserver/_arrow=%3E
    [url] => http://ru.webserver/?arrow=%3E
    [url_decoded] => /_arrow=>
)

SERVER
Array
(
    [REDIRECT_REDIRECT_boostpath] => normal
    [REDIRECT_REDIRECT_STATUS] => 200
    [REDIRECT_boostpath] => normal
    [REDIRECT_STATUS] => 200
    [boostpath] => normal
    [HTTP_HOST] => ru.webserver
    [HTTP_ACCEPT] => text/html, text/plain, text/css, text/sgml, */*;q=0.01
    [HTTP_ACCEPT_LANGUAGE] => en
    [HTTP_USER_AGENT] => Lynx/2.8.8dev.9 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.10.5
    [PATH] => /usr/local/bin:/usr/bin:/bin
    [SERVER_SIGNATURE] => <address>Apache/2.2.22 (Ubuntu) Server at ru.webserver Port 80</address>

    [SERVER_SOFTWARE] => Apache/2.2.22 (Ubuntu)
    [SERVER_NAME] => ru.webserver
    [SERVER_ADDR] => 127.0.1.1
    [SERVER_PORT] => 80
    [REMOTE_ADDR] => 127.0.0.1
    [DOCUMENT_ROOT] => /var/www
    [SERVER_ADMIN] => webmaster@localhost
    [SCRIPT_FILENAME] => /var/www/index.php
    [REMOTE_PORT] => 40223
    [REDIRECT_QUERY_STRING] => arrow=%3E
    [REDIRECT_URL] => /cache/normal/ru.webserver/_arrow=>.html
    [GATEWAY_INTERFACE] => CGI/1.1
    [SERVER_PROTOCOL] => HTTP/1.0
    [REQUEST_METHOD] => GET
    [QUERY_STRING] => arrow=%3E
    [REQUEST_URI] => /?arrow=%3E
    [SCRIPT_NAME] => /index.php
    [PHP_SELF] => /index.php
    [REQUEST_TIME] => 1349851795
    [HTTP_REFERER] => 
)

I tried your version which wrote

/var/www/cache/normal/ru.webserver/_test=%C3%A9%3E%C3%A8%D1%86.html

with the same results. It can be tried out live at

http://www.howtogetdivorced.com/ (cached)
http://www.howtogetdivorced.com/?arrow=> (file created, not sent)
http://www.howtogetdivorced.com/?test=é>èц (lovely example, file created, not sent)
http://www.howtogetdivorced.com/?test=9 (cached)

(apache 2.2.22) and viewing the source and lack of boost tags. Tried with different browsers and Lynx as I don't like the way the address bar changes to human readable form. Notice the double redirect [REDIRECT_REDIRECT_boostpath] ?

jamix’s picture

My use case for special characters in the query string was Boost-caching of pjax requests (pjax module). The module passes CSS selectors in GET parameters, and those in my case contain special characters, for example: .container > div.

Anonymous’s picture

Bad Bad Bad idea being able to submit css, it screams out "XSS" attack me. I once wrote my entire cv in javascript and then posted it on a URL on a government website to prove a point, I also created fake items in a shop and took orders for non-existent items to demonstrate the dangers. But I will have a look. A little busy today though.
Philip.

jamix’s picture

Well, that's how the pjax module works anyway. I appreciate your looking into it.

Anonymous’s picture

Even if this does not make it into boost, I do have the code, so you could be my test subject if you are willing?

jamix’s picture

Sure, I can take a look and see if the code helps.

jamix’s picture

Issue summary: View changes

added similar issues

blake.thompson’s picture

I have come across this issue with a date filter as a query parameter. Apache specifically doesn't like the %2F character in the Boost filename when the AllowEncodedSlashes directive is set to Off, which I guess it is by default. My solution currently is to not cache pages when the date filter has a value.

bendev’s picture

the patch helped in my case...
thanks!

gdebilde’s picture

The patch works for me too.
Thx

DamienMcKenna’s picture

Status: Needs work » Needs review
FileSize
739 bytes

I cleaned up the patch to match Drupal's coding standards. I also left the original file to contain the full contents instead of dummy contents.

bendev’s picture

Status: Needs review » Reviewed & tested by the community

patch#19 is ok

  • bgm committed f98cbf4 on 7.x-1.x authored by DamienMcKenna
    Issue #1398578 by mow, DamienMcKenna, jamix, Philip_Clarke, bendev:...
bgm’s picture

Status: Reviewed & tested by the community » Fixed

Patch committed. Thank you everyone!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.