Pages with non-Latin characters in URL are not being read.

Running on a local machine, port 8082.
Acquia Drupal stack:

- Apache 2.2.22, port 8082
- MySQL 5.1.66, port 33066
- PHP 5.3.18

The caches are created in cache/normal/my-local-drupal-folder-name

This is how pages with non-Latin URLs are being cached:
[Screenshot: non-Latin URL cached pages (cached page.JPG)]

Regular encoding-decoding of a Persian phrase:

%D8%B3%D9%84%D8%A7%D9%85-%D8%AF%D9%88%D8%B3%D8%AA-%D9%85%D9%86

After decoding it looks like:

سلام-دوست-من
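
The round trip above can be reproduced with PHP's built-in rawurlencode() and rawurldecode(). This is only an illustration of percent-encoding UTF-8 bytes, not the code Boost itself uses, and it assumes the source file is saved as UTF-8.

```php
<?php
// Percent-encode a UTF-8 Persian phrase and decode it again.
// Assumption: this source file is saved as UTF-8.
$phrase  = 'سلام-دوست-من';
$encoded = rawurlencode($phrase);   // each UTF-8 byte becomes %XX; '-' is left as-is
$decoded = rawurldecode($encoded);  // reverses the escaping byte for byte

echo $encoded, "\n";
echo $decoded, "\n";
```

If the decoded string differs from the original at this step, the problem is in the encoding itself; if it matches, the problem is more likely further down, where the decoded name is written to disk.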

So, please explain how the URIs are translated into filenames, what the problem could be in the case I've explained above, and how to fix it.

Files:
Comment | File | Size | Author
#14 | windows DB.JPG | 64.45 KB | garamani
#14 | Linux DB.JPG | 73.23 KB | garamani
- | cached page.JPG | 49.8 KB | garamani

Comments

garamani’s picture

Any idea, instruction or solution?

Philip_Clarke’s picture

It will take some time to debug, for two reasons really.

  1. There are certainly multilingual websites in Arabic, Russian and French with non-ASCII character sets that work
  2. I'm on holiday in a forest :-)

Obviously I need to understand what is happening. Recently there was a German website that also had a character set problem, and I suspect there is a configuration difference in either PHP or the filesystem as to what is available, so it is going to be difficult to reproduce.

garamani’s picture

The second reason definitely makes sense! :D
Have a nice holiday Philip, and meanwhile, if it's possible, please tell me which part of the Boost module generates the encoded file name:
.htaccess file in drupal root?
.htaccess located in cache/normal/my-local-drupal-folder-name
or
some functions in boost.module?

I want to investigate the source of the problem further with my limited knowledge.

Thank you and Have Fun.

Philip_Clarke’s picture

It's going to be somewhere inside boost.module. I would start by looking for

<?php
 
function boost_transform_url
?>

and examining the value of request_uri() (a Drupal core function) by saving it to a file. I would also compare it to the $_SERVER variables.
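
The suggestion above, saving the URI to a file and comparing it with the server variables, could be sketched roughly as follows. The helper name debug_log_request_uri() and the log path in the usage comment are hypothetical and not part of Boost or Drupal core; only request_uri() and $_SERVER come from the thread.

```php
<?php
// Hypothetical debugging helper (not part of Boost or Drupal core).
// Appends the URI Drupal reports and a few raw server values to a log file
// so they can be compared side by side.
function debug_log_request_uri($uri, array $server, $logfile) {
  $keys  = array('REQUEST_URI', 'QUERY_STRING', 'REDIRECT_URL');
  $lines = array('request_uri(): ' . $uri);
  foreach ($keys as $key) {
    $value   = isset($server[$key]) ? $server[$key] : '(not set)';
    $lines[] = $key . ': ' . $value;
  }
  // FILE_APPEND keeps earlier requests; LOCK_EX avoids interleaved writes.
  return file_put_contents($logfile, implode("\n", $lines) . "\n\n", FILE_APPEND | LOCK_EX);
}

// Inside Drupal this might be called as (the log path is an assumption):
// debug_log_request_uri(request_uri(), $_SERVER, '/tmp/boost-uri-debug.log');
```

Passing the values in as arguments keeps the helper usable outside a Drupal bootstrap, which makes it easier to try on its own.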

garamani’s picture

Dear Philip,

After hours of studying the boost.module code and following the clue that you gave me, I found out:

It's too complicated for me to find the problem.

BUT

I think you want me to make sure that the request_uri() function is passing proper values.
After putting the following code in all pages, it prints out the correctly decoded URL!

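<?php
// request_uri() is a Drupal core function; it returns the percent-encoded request URI.
$t1 = request_uri();
print $t1;
echo "<br />";
// URL-decode, then turn any non-standard %uXXXX escapes into HTML numeric entities.
$t2 = preg_replace("/%u([0-9a-f]{3,4})/i", "&#x\\1;", urldecode($t1));
// ENT_NOQUOTES (0) matches the behaviour of the null flag used originally.
print html_entity_decode($t2, ENT_NOQUOTES, 'UTF-8');
?>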

For this page: apolo:8082/چگونه-لبخند-بزنیم؟

t1:
/%DA%86%DA%AF%D9%88%D9%86%D9%87-%D9%84%D8%A8%D8%AE%D9%86%D8%AF-%D8%A8%D8%B2%D9%86%DB%8C%D9%85%D8%9F

t2:
/چگونه-لبخند-بزنیم؟

So it seems the request_uri() result from the server is fine.

Philip_Clarke’s picture

How much control do you have over the server? Can you add the following to the virtual host file (Apache configuration)?

RewriteLog "/where/you/want/the/logs/rewrite.log"
RewriteLogLevel 3

That way we can see what Apache is looking for. Also, what is the filename that gets written out? Your t2 code looks better at decoding URLs than the current Boost urldecode. I personally would like to see a filename like چگونه-لبخند-بزنیم؟_.html in the filesystem, but we need to see from the logs what Apache is looking for (or wait till I get home to my servers).
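
To make the target concrete, here is a rough sketch of how a request URI could be mapped to a filename of that shape. It is NOT Boost's actual code: the function name sketch_cache_filename() is hypothetical, and the layout (decoded path, then "_", then the query string, then ".html") is an assumption based only on the example filename and cache directory mentioned in this thread.

```php
<?php
// Hypothetical sketch, NOT Boost's actual implementation: map a request URI
// to the kind of cache path discussed in this thread.
function sketch_cache_filename($request_uri, $cache_dir) {
  // Split off the query string at the first ASCII '?'.
  // (The Persian question mark is encoded as %D8%9F, so it is not affected.)
  $parts = explode('?', $request_uri, 2);
  // Decode %XX escapes so the filename holds the real UTF-8 characters.
  $path  = rawurldecode(ltrim($parts[0], '/'));
  $query = isset($parts[1]) ? $parts[1] : '';
  return $cache_dir . '/' . $path . '_' . $query . '.html';
}

// The cache directory name below is an assumption for illustration.
echo sketch_cache_filename(
  '/%DA%86%DA%AF%D9%88%D9%86%D9%87-%D9%84%D8%A8%D8%AE%D9%86%D8%AF-%D8%A8%D8%B2%D9%86%DB%8C%D9%85%D8%9F',
  'cache/normal/apolo'
), "\n";
```

Whatever name PHP writes has to match, byte for byte, the name Apache looks up in its rewrite rules, which is why the rewrite log is needed to compare the two.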

garamani’s picture

My website is under development and was installed with the Acquia Drupal stack on Windows 7, so I have full access to change configurations. What steps should I follow?
BTW, I don't want to bother you during your holiday; I can wait.

Philip_Clarke’s picture

Do the above and see whether, in the rewrite logs, Apache is looking for a Persian decoded URL or whether the URL is encoded with % signs. Having read through the urldecode documentation, there appear to be different interpretations of URLs by different browsers.

garamani’s picture

The URL is not encoded.
Here are 40 lines of rewrite.log:

127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '(^|/)\.' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/چگونه-لبخند-بزنیم؟ -> چگونه-لبخند-بزنیم؟
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'چگونه-لبخند-بزنیم؟'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (2) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] rewrite 'چگونه-لبخند-بزنیم؟' -> 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] add per-dir prefix: index.php -> C:/Users/Garamani/Downloads/Compressed/apolo/index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (2) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip document_root prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> /index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23ac1a8/initial] (1) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] internal redirect with /index.php [INTERNAL REDIRECT]
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '(^|/)\.' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '.*' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/index.php -> index.php
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] applying pattern '^' to uri 'index.php'
127.0.0.1 - - [31/Aug/2014:00:25:53 +041800] [apolo/sid#433178][rid#23b8fa8/initial/redir#1] (1) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] pass through C:/Users/Garamani/Downloads/Compressed/apolo/index.php
127.0.0.1 - - [31/Aug/2014:00:26:04 +041800] [apolo/sid#433178][rid#23b2d80/initial] (3) [perdir C:/Users/Garamani/Downloads/Compressed/apolo/] strip per-dir prefix: C:/Users/Garamani/Downloads/Compressed/apolo/sites/all/modules/jquery_update/replace/jquery/1.7/jquery.min.js -> sites/all/modules/jquery_update/replace/jquery/1.7/jquery.min.js
Philip_Clarke’s picture

Now the main question is: can Windows create the kind of filename that Apache is looking for? And is the site eventually going to be deployed on Linux/Unix?

garamani’s picture

The website is going to be deployed on Linux (VPS: CentOS 6).
Are you saying that the problem might get solved on Linux?

Well, your comment gave me the idea to test how Drupal writes files with Persian file names!
I tried to upload an image with a Persian filename (تست_میکنیم.jpg) into an article.

This is the file name after saving in Drupal files folder: تست_میکنیم.jpg!!
but when I right click on that image in the article to save on disk, the file name is تست_میکنیم.jpg

I don't know whether the test above proves anything or not!

Philip_Clarke’s picture

I think the first thing I will attempt next week is to save the filenames out using your example characters rather than the urldecode function; this may also deal with an issue on a German website. I honestly do not know if it will solve the problem on Windows, whether there is a problem on Windows at all, or whether it is just the browser displaying strange characters. Unfortunately I do not own a Windows machine. I am 100% sure that in the past, on Linux-based machines, we have had Russian, Arabic and French websites all working with correct filenames and no problems with translations. But PHP changes over the years and the URL decoding is looking quite old, so I believe it's time for an update.
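
One way to separate a filesystem problem from a browser display problem is to write a file with a Persian name and read the directory listing back from PHP. The sketch below is a hypothetical diagnostic (the function name is made up for illustration) and assumes a writable temp directory. As far as I know, PHP versions before 7.1 on Windows pass filenames through the ANSI code page, so a UTF-8 name may not come back byte for byte there.

```php
<?php
// Hypothetical diagnostic: does a UTF-8 filename survive a write/list round trip?
function filename_round_trips($dir, $name) {
  $path = $dir . DIRECTORY_SEPARATOR . $name;
  if (file_put_contents($path, 'test') === false) {
    return false;  // the file could not even be created under this name
  }
  // Compare the bytes the filesystem reports with the bytes we asked for.
  $found = in_array($name, scandir($dir), true);
  // Remove the test file again (silently, in case the name was altered).
  @unlink($path);
  return $found;
}

// Use a fresh scratch directory so the listing only contains our test file.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'boost_name_test_' . uniqid();
mkdir($dir);
var_dump(filename_round_trips($dir, 'تست_میکنیم.txt'));
@rmdir($dir);
```

If this prints false on one machine and true on another with the same code, the difference is in how the platform stores filenames rather than in the URL decoding.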

garamani’s picture

Yes, first of all we need to find the source of the problem, and trying a Persian filename on another system and OS would be helpful. I'll wait till next week to see the results of your tests.

garamani’s picture


Well, after uploading my website to a Linux server (CentOS 6.5, Apache, MySQL), the cached URLs are decoded correctly and served perfectly.

Another thing that has also changed is the MySQL storage engine.

Linux MySQL DB:
[Screenshot: Linux DB tables (Linux DB.JPG)]

Windows MySQL DB:
[Screenshot: Windows MySQL DB (windows DB.JPG)]

Philip_Clarke’s picture

There must be a difference between how Windows and Linux handle character sets. It does at least explain why in "real life" we've had working sites. I am so sorry that I have neglected this; it quite frankly slipped my mind when I got back. I do believe the URI translation needs a bit of work to bring it to a more modern, consistent output.

Thanks.

dkynast’s picture

After reading through all of this, I'm fairly sure that this issue is related to the issue "QUERY_STRING containing non-ASCII characters leads to 404" - https://www.drupal.org/node/2313715

Different language, same problem.

Philip_Clarke’s picture

Hopefully this will be solved when I rewrite the code, although the major difference in this issue was that the Linux server handled the character sets and the Windows one did not, whereas yours is a Linux-only setup.

Philip_Clarke’s picture

Status: Active » Closed (cannot reproduce)