I'm currently working on a website running on a domain containing an umlaut (ä, ü, ï for example). When I use the following BB code, it is not parsed; the raw code is simply displayed on the page: [url]http://exämple.com/[/url] (domain name obfuscated for privacy reasons). As soon as I remove the character containing the umlaut, the link works like a charm.

Comments

naudefj’s picture

Title: [url] tags break with urls containing non ASCII characters » [url] tags break: domains with non ASCII characters like an umlaut

I do understand the problem, but cannot figure out how to solve it.

Any suggestions?

naudefj’s picture

Status: Active » Postponed

Issue postponed until someone submits a patch.

gilcot’s picture

Non-pure-ASCII characters should/must be encoded; see l()
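A minimal sketch of what such encoding could look like for the path part of a URL, assuming UTF-8 input. The helper name `sketch_encode_non_ascii` is hypothetical and is not taken from Drupal or the bbcode module. It only percent-encodes non-ASCII bytes; a host name such as exämple.com would need IDNA/Punycode conversion instead, which is not shown here.

```php
<?php
// Hypothetical helper, not part of Drupal or the bbcode module.
// Percent-encodes every non-ASCII byte of a UTF-8 string and leaves all
// ASCII characters (including "/", ":" and "?") untouched.
function sketch_encode_non_ascii($url) {
  // No "u" modifier on purpose: the pattern is matched byte by byte, so each
  // byte >= 0x80 of a multi-byte UTF-8 sequence is encoded on its own.
  return preg_replace_callback(
    '/[^\x00-\x7F]/',
    function ($match) {
      return rawurlencode($match[0]);
    },
    $url
  );
}
```

For example, `sketch_encode_non_ascii('http://example.com/blåbær.png')` should give `http://example.com/bl%C3%A5b%C3%A6r.png`.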

yngens’s picture

same issue here.

gilcot, non-ASCII characters do not necessarily have to be encoded.

I believe non-ASCII characters should be included in the following lines in 'bbcode-filter.inc', but I don't know how. I need to get Cyrillic characters to work with BBcode.


if (variable_get("bbcode_make_links_$format", 1)) {

    // pad with a space so we can match things at the start of the 1st line
    $ret = ' ' . $body;
    // padding to already filtered links
    $ret = preg_replace('#(<a.+>)(.+</a>)#i', "$1\x07$2", $ret);

    // matches an "xxx://yyyy" URL at the start of a line, or after a space.
    // xxxx can only be alpha characters.
    // yyyy is anything up to the first space, newline, comma, double quote or <
    $ret = preg_replace('#(?<=^|[\t\r\n >\(\[\]\|])([a-z]+?://[\w\-]+\.([\w\-]+\.)*\w+(:[0-9]+)?(/[^ "\'\(\n\r\t<\)\[\]\|]*)?)((?<![,\.])|(?!\s))#i', '<a href="\$

    // matches a "www|ftp.xxxx.yyyy[/zzzz]" kinda lazy URL thing
    // Must contain at least 2 dots. xxxx contains either alphanum, or "-"
    // zzzz is optional.. will contain everything up to the first space, newline,
    // comma, double quote or <.
     $ret = preg_replace('#([\t\r\n >\(\[\|])(www|ftp)\.(([\w\-]+\.)*[\w]+(:[0-9]+)?(/[^ \"\'\(\n\r\t<\)\[\]\|]*)?)#i', '\1<a href="http://\2.\3">\2.\3</a>', $re$

    // matches an email@domain type address at the start of a line, or after a space.
    // Note: Only the following chars are valid: alphanumerics, "-", "_" and ".".
    if (variable_get("bbcode_encode_mailto_$format", 1))
      $ret = preg_replace_callback("#([\t\r\n ])([a-z0-9\-_.]+?)@([\w\-]+\.([\w\-\.]+\.)*[\w]+)#i", '_bbcode_encode_mailto', $ret);
    else
      $ret = preg_replace('#([\t\r\n ])([a-z0-9\-_.]+?)@([\w\-]+\.([\w\-\.]+\.)*[\w]+)#i', '\\1<a href="mailto:\\2@\\3">\\2@\\3</a>', $ret);

    // Remove our padding
    $ret = str_replace("\x07", '', $ret);
    $body = substr($ret, 1);
  }
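
One possible direction, as a sketch only: a simplified variant of the "xxx://yyyy" pattern above in which `\w` is replaced by the Unicode properties `\p{L}` (letters) and `\p{N}` (numbers), with the `u` modifier added. This assumes UTF-8 input and a PCRE build with Unicode support. The function name `sketch_make_links` is hypothetical, the odd trailing lookaround of the original pattern is dropped, and the text is assumed to be HTML-escaped already, as it is at this point in the filter.

```php
<?php
// Hypothetical helper, not the code from bbcode-filter.inc.
function sketch_make_links($text) {
  // One host-name character: any Unicode letter or number, "_" or "-".
  $host_char = '[\p{L}\p{N}_\-]';

  // scheme://host.tld[:port][/path], at the start of the text or after
  // whitespace or one of > ( [ ] |. The "u" modifier makes PCRE treat both
  // the pattern and the subject as UTF-8.
  $pattern = '#(?<=^|[\t\r\n >(\[\]|])'
    . '([a-z]+://' . $host_char . '+(?:\.' . $host_char . '+)+'
    . '(?::[0-9]+)?'
    . '(?:/[^ "\'(\n\r\t<)\[\]|]*)?)#iu';

  $result = preg_replace($pattern, '<a href="$1">$1</a>', $text);

  // preg_replace() returns NULL on error (for example invalid UTF-8 input);
  // fall back to the unmodified text in that case.
  return $result === NULL ? $text : $result;
}
```

With this, `sketch_make_links('see http://exämple.com/ now')` should wrap the whole URL, umlaut included, in a link. The same substitution would have to be repeated in the www/ftp and e-mail patterns.
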
lars skjærlund’s picture

Priority: Normal » Major

I'm bitten by this one, too.

But it's not just a small bug, I'm afraid: It seems to be a serious localization problem. In my case, I'm unable to use filenames with Danish characters - which, of course, should be perfectly legal in a modern world.

I've tried playing with the code a bit, but was quickly struck by a showstopper: Most of the bbcode-filter is regular expressions, and we really need Unicode to support more than mere ASCII characters. Unicode _is_ supported by the PHP preg_ functions - that is, in principle, and if and only if the underlying PCRE library is compiled with Unicode support. On my Linux distro, it isn't.
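
Whether a given PHP installation has this support can be probed at runtime. The following is a minimal sketch, not code from the bbcode module; the function name `sketch_pcre_has_unicode` is hypothetical.

```php
<?php
// Hypothetical helper: reports whether preg_* can be used with the "u"
// modifier and with Unicode properties on this installation.
function sketch_pcre_has_unicode() {
  // If PCRE lacks UTF-8 support, compiling a pattern with the "u" modifier
  // fails: preg_match() returns FALSE and raises a warning, which the "@"
  // operator suppresses here.
  $utf8 = @preg_match('/^.$/u', 'ä') === 1;

  // Unicode property support (\p{L} and friends) is a separate build option,
  // so it is probed separately.
  $properties = @preg_match('/^\p{L}$/u', 'ä') === 1;

  return $utf8 && $properties;
}
```

A module could call such a check once and fall back to the ASCII-only patterns when it returns FALSE.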

Of course I could upgrade my Linux - but in that case I'll get PHP 5.3 as well, and that's another huge Drupal issue, as we all know...

So the bbcode module needs to be rewritten to support non-ASCII characters. For starters, that means every occurrence of \w should be replaced by something else, and there are a lot of them. For those of you not so familiar with regular expressions: without Unicode mode, \w matches "word" characters, meaning the letters a..z and A..Z, the digits 0..9 and the underscore. Nothing more than that, so all of us non-English-speaking people are left behind.
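
To illustrate the difference, here is a small sketch that compares plain `\w` with a Unicode-aware character class. It assumes UTF-8 input, the default "C" locale, and PCRE built with Unicode support; the function name `sketch_is_word` is hypothetical.

```php
<?php
// Hypothetical helper: TRUE if $text consists only of "word" characters.
// With $unicode = FALSE the plain \w class is used, with $unicode = TRUE a
// class built from Unicode properties plus the "u" modifier.
function sketch_is_word($text, $unicode) {
  $pattern = $unicode
    // \p{L} = any letter, \p{N} = any number, in any script.
    ? '/^[\p{L}\p{N}_]+$/u'
    // Byte-oriented matching: in the "C" locale only a-z, A-Z, 0-9 and "_".
    : '/^\w+$/';
  return preg_match($pattern, $text) === 1;
}
```

Under those assumptions, `sketch_is_word('exämple', FALSE)` is FALSE because the two bytes of "ä" are not word characters, while `sketch_is_word('exämple', TRUE)` is TRUE.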

And next it should be decided if the module should continue to use the preg_ function family as this requires Unicode support in the underlying PCRE library which, unfortunately, seems not to be the norm.

Until that happens, I'm afraid the bbcode module is for English-speaking people only.

BTW - if you want to know why some of us need Unicode support, look no further than to my name!

Regards,
Lars