Hello,

I have a problem with the "Convert URLs into links" filter in Drupal 7:

I have this filter activated for the "Full HTML" text format. If I create a new page and enter a URL with special characters (for example ä, ü, ö), the resulting clickable URL is formatted wrongly.

Example:
I create a new page with the following text:

www.test.de
www.daniel-müller.de

The source code of the resulting page looks like this:

<a href="http://www.test.de">www.test.de</a>
<a href="http://www.daniel-m">www.daniel-m</a>üller.de

Problem: If there is a special character (like the 'ü' in the second URL), the clickable URL is cut off just before the special character.

Is this a general problem with Drupal 7, or something in my server or Drupal configuration?

Files:
Comment | File | Size | Author
#19 | filter-urlfilter-i18n-1657886-19.patch | 7.94 KB | adam7
  FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch filter-urlfilter-i18n-1657886-19.patch. See the log in the details link for more information.
#14 | filter-urlfilter-i18n-1657886-13.patch | 7.97 KB | Hanno
  PASSED: [[SimpleTest]]: [MySQL] 57,465 pass(es).
#10 | filter-urlfilter-i18n-1657886.patch | 4.04 KB | Hanno
  PASSED: [[SimpleTest]]: [MySQL] 56,733 pass(es).
#7 | filter-autolink-i18naddresses-1657886.patch | 3.31 KB | Hanno
  FAILED: [[SimpleTest]]: [MySQL] 56,625 pass(es), 3 fail(s), and 0 exception(s).
#7 | links_with_patch.png | 93.17 KB | Hanno
#5 | ssp_temp_capture.png | 33.21 KB | Hanno

Comments

Jaypan’s picture

Version:7.14» 8.x-dev
Priority:Normal» Major

I'm seeing this same issue with Arabic URLs. For example, with the URL http://www.site.com/users/عمر the URL parsing ends at the last forward slash, before the Arabic text. As you can see, this is a user path, so I need the username to be parsed as well. The problem lies in _filter_url(): the character classes in its regex allow only ASCII letters, digits, and a defined set of punctuation.

Jaypan’s picture

I've hacked _filter_url() to work with Arabic characters, but I don't think the method I used is ideal. I added the Arabic character match \p{Arabic} to each of the character classes, and added the 'u' modifier to the end of each of the patterns. It's properly parsing the links now, but I don't like having to hack core, and I'm sure there is a more dynamic way to do this. Maybe create a setting for the URL converter that lets one select whether, and which, UTF-8 languages to include in URL searches. I'm open to other ideas as well.
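
The workaround described above can be sketched as follows. This is a Python illustration, not the modified _filter_url() code: Python's `re` module has no `\p{Arabic}`, so the code point range U+0600 to U+06FF stands in for it, and Python 3 string patterns are already Unicode-aware, so there is no 'u' modifier to add. The ASCII character classes are modeled on the ones in Drupal 7's filter, and the helper name `build_url_pattern` is hypothetical.

```python
import re

# ASCII characters allowed in the trailing part of a URL (simplified).
ASCII_TRAIL = r"a-zA-Z0-9:%_+*~#&\[\]=/;"
# Stand-in for PCRE's \p{Arabic}: the basic Arabic block, U+0600 to U+06FF.
ARABIC_BLOCK = r"\u0600-\u06FF"


def build_url_pattern(extra_chars=""):
    """Build an http(s) URL pattern; extra_chars widens the trail classes."""
    domain = r"(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b"
    # The trail may contain sentence punctuation but must not end with it.
    trail_body = "[" + ASCII_TRAIL + extra_chars + r"?!.,-]*"
    trail_end = "[" + ASCII_TRAIL + extra_chars + "-]"
    return re.compile(
        r"https?://" + domain + "/?(?:" + trail_body + trail_end + ")?"
    )


url = "http://www.site.com/users/عمر"

# ASCII-only classes: the match stops right after "users/".
ascii_match = build_url_pattern().search(url).group(0)
# With the Arabic range added, the username is part of the match.
arabic_match = build_url_pattern(ARABIC_BLOCK).search(url).group(0)
```

Adding one script at a time scales poorly, which is why a more general character class (for example, all Unicode letters) may be preferable to a per-language list.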

russo79’s picture

Other characters, such as "(" and ")", seem to be cut off in the same way:

E.g. if the user types this URL:
http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-(Gnome15-plugin).html

it gets converted to the following HTML:

<a href="http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-">http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboa...</a>(Gnome15-plugin).html
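
One way to keep a parenthesised path segment inside the link is to allow balanced "(...)" pairs explicitly while still refusing to end the link on sentence punctuation. A minimal Python sketch under simplified assumptions: the character classes and the names `PATH_CHAR`, `PATH_END`, `BALANCED` and `find_url` are made up for illustration and are not Drupal's code.

```python
import re

# Characters allowed anywhere in a path (simplified assumption).
PATH_CHAR = r"[A-Za-z0-9!*';:=+,.$/%#_~&@\[\]-]"
# A path may not end in sentence punctuation such as "." or ",".
PATH_END = r"[A-Za-z0-9=_#/+-]"
# One "(...)" group with at least one path character inside.
BALANCED = r"\(" + PATH_CHAR + r"+\)"

PATH = PATH_CHAR + "*(?:" + BALANCED + PATH_CHAR + "*)*" + PATH_END
URL = re.compile(r"https?://[A-Za-z0-9.-]+(?:/(?:" + PATH + ")?)?")


def find_url(text):
    """Return the first URL found in text, or None."""
    match = URL.search(text)
    return match.group(0) if match else None


teamspeak = (
    "http://addons.teamspeak.com/directory/plugins/hardware/"
    "Logitech-G-Keyboards-with-Linux-(Gnome15-plugin).html"
)
# The URL is followed by a full stop that should not become part of the link.
found = find_url("See " + teamspeak + ".")
```

Here `found` equals the full address including "(Gnome15-plugin).html", while the sentence-ending full stop is left out of the link.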

Hanno’s picture

Title:Filter "Convert URLs into links" doesn't support multilingual web addresses» "Convert URLs into links" - filter & special characters like ä ü ö
Category:bug» support
Status:Needs review» Active
Issue tags:-needs backport to D7

Twitter does a very good job of converting all kinds of URLs into shortened links. Their code can be found here:
https://github.com/twitter/twitter-text-js/blob/master/twitter-text.js
A PHP version based on that code is available here: https://github.com/stephenbeckett/TwitterURLMatchPHP
See http://www.stevebeckett.com/twitter-url-match-for-php/

Hanno’s picture

Title:"Convert URLs into links" - filter & special characters like ä ü ö» Filter "Convert URLs into links" doesn't support multilingual web addresses
Category:support» bug
Status | File | Size
new | ssp_temp_capture.png | 33.21 KB

Nowadays, these are valid web addresses:
http://президент.рф/ (president of the Russian federation)
http://موقع.وزارة-الأتصالات.مصر/ (Ministry of Communication of Egypt)
And earlier mentioned:
http://www.daniel-müller.de/
None of them currently work in Drupal 8, as shown in the screenshot below, and URLs with mixed characters break.
[Screenshot: URLs not turning into links]

catch’s picture

Issue tags:+needs backport to D7

Here's the relevant code that excludes those characters:

  // Prepare domain name pattern.
  // The ICANN seems to be on track towards accepting more diverse top level
  // domains, so this pattern has been "future-proofed" to allow for TLDs
  // of length 2-64.
  $domain = '(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b';
  $ip = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}';
  $auth = '[a-zA-Z0-9:%_+*~#?&=.,/;-]+@';
  $trail = '[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,-]*[a-zA-Z0-9:%_+*~#&\[\]=/;-]';
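
To see how these patterns produce the truncation reported at the top of the thread, here is a small Python sketch. The `DOMAIN` and `TRAIL` strings are copied from the snippet above; the way they are combined in `WWW_URL` and the helper `linked_part` are simplifying assumptions, not the exact code in _filter_url().

```python
import re

# Patterns copied from the snippet above.
DOMAIN = r"(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b"
TRAIL = r"[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,-]*[a-zA-Z0-9:%_+*~#&\[\]=/;-]"

# Assumed simplification of how the filter combines them for "www." addresses.
WWW_URL = re.compile(r"www\." + DOMAIN + r"/?(?:" + TRAIL + r")?")


def linked_part(text):
    """Return the part of text that would become the link, or None."""
    match = WWW_URL.search(text)
    return match.group(0) if match else None


plain = linked_part("www.test.de")            # the whole address matches
umlaut = linked_part("www.daniel-müller.de")  # the match ends before "ü"
```

Because "ü" is in neither character class, the match can only extend as far as "www.daniel-m", which is the cut-off seen in the generated HTML.
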

Hanno’s picture

Status:Needs review» Active
Status | File | Size
new | links_with_patch.png | 93.17 KB
new | filter-autolink-i18naddresses-1657886.patch | 3.31 KB
  FAILED: [[SimpleTest]]: [MySQL] 56,625 pass(es), 3 fail(s), and 0 exception(s).

Created a patch based on the Twitter-js regex mentioned above for the trail, and on the character regex Symfony uses for validating links. It fixes the handling of:
- umlauts
- domain names in utf-8 characters
- () in url path #1843260: URL Links with brackets are not processed correctly by input filter with "Web page addresses...turn into links automatically."
- @ in url path #2005986: URL Filter fails if there is an @ in the link
- ! in url path #1480992: URLs containing a '!' separator are not formatted as links

Based on the Twitter script, we could even improve the filter to detect links that don't start with 'www' or 'http', but that would be a feature request. Let's first fix the current bug.

[Screenshot: links after patch]

Hanno’s picture

Status:Active» Needs review

review needed

Status:Active» Needs work

The last submitted patch, filter-autolink-i18naddresses-1657886.patch, failed testing.

Hanno’s picture

Status:Needs work» Needs review
Status | File | Size
new | filter-urlfilter-i18n-1657886.patch | 4.04 KB
  PASSED: [[SimpleTest]]: [MySQL] 56,733 pass(es).

Punctuation handling removed from the code, as it's now covered by the URL regex.

Jaypan’s picture

Thank you for your work on this. Much appreciated.

Gábor Hojtsy’s picture

Title:"Convert URLs into links" - filter & special characters like ä ü ö» Filter "Convert URLs into links" doesn't support multilingual web addresses
Category:support» bug
Status:Active» Needs review
Issue tags:+needs backport to D7, +D8MI, +language-base

Tagging.

andypost’s picture

Issue tags:+Needs tests

Actually, we need tests for that. Also, what about the copyright of the code?

+++ b/core/modules/filter/filter.module
@@ -1086,34 +1086,49 @@ function _filter_url($text, $filter) {
+  ¶
...
+  ¶
...
+  ¶
...
+  $valid_url_path = '(?:(?:'.$valid_url_path_characters . '*(?:'.$valid_url_balanced_parens .$valid_url_path_characters . '*)*'. $valid_url_ending_characters . ')|(?:@' . $valid_url_path_characters . '+\/))'; ¶

trailing white-space

+++ b/core/modules/filter/filter.module
@@ -1086,34 +1086,49 @@ function _filter_url($text, $filter) {
+  $trail = '('.$valid_url_path.'*)?(\\?'.$valid_url_query_chars .'*'.$valid_url_query_ending_chars.')?';

tab char?

Hanno’s picture

Status | File | Size
new | filter-urlfilter-i18n-1657886-13.patch | 7.97 KB
  PASSED: [[SimpleTest]]: [MySQL] 57,465 pass(es).

Thanks for your review. Here is a new patch with tests included for unicode and special characters.
Copyright: The regex is partly based on twitter.js. Is it possible with this license (Apache License v2) to include the whole function to improve the URL detection in Drupal?

andypost’s picture

Status:Needs review» Reviewed & tested by the community

Awesome!

@Hanno, please file another issue about the library; I suppose Core\Component is a good place for it.

Hanno’s picture

Issue tags:-Needs tests

Thanks. Created a new issue for further improvements #2019229: Turn web addresses starting with neither the http-protocol nor the www-prefix into links
Hope this patch gets committed.

alexpott’s picture

Status:Reviewed & tested by the community» Fixed
Issue tags:+Needs tests

Committed a335ae6 and pushed to 8.x. Thanks!

alexpott’s picture

Version:8.x-dev» 7.x-dev
Status:Fixed» Patch (to be ported)

adam7’s picture

Status:Patch (to be ported)» Needs review
Status | File | Size
new | filter-urlfilter-i18n-1657886-19.patch | 7.94 KB
  FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch filter-urlfilter-i18n-1657886-19.patch. See the log in the details link for more information.

Copy of filter-urlfilter-i18n-1657886-13.patch for 7.x.

Jaypan’s picture

Issue summary:View changes
Status:Needs review» Reviewed & tested by the community

I have been using the patch from #19 for months now without issue. I say it's good.

adam7’s picture

Same here:)

Status:Reviewed & tested by the community» Needs work

The last submitted patch, 19: filter-urlfilter-i18n-1657886-19.patch, failed testing.

chx’s picture

This patch uses \p, and requiring --enable-unicode-properties for your PCRE is not a decision to be made lightly. In my opinion, even the D8 version needs to use the constants PREG_CLASS_NUMBERS and similar. For Drupal 7, changing the requirements this drastically is an absolute no-go. There is a script at https://drupal.org/comment/2430064#comment-2430064 that converts properties to something every PCRE understands.
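
The idea of spelling a Unicode property out as explicit code point ranges can be sketched like this. It is an independent Python illustration, not the script linked above: it assumes the ranges are derived from the general category data in Python's `unicodedata` module, covers only the Basic Multilingual Plane by default, and uses hypothetical helper names (`category_ranges`, `as_char_class`).

```python
import re
import unicodedata


def category_ranges(prefix, limit=0x10000):
    """Collect (first, last) code point ranges whose Unicode general
    category starts with prefix, e.g. "L" for letters."""
    ranges, start = [], None
    for cp in range(limit):
        hit = unicodedata.category(chr(cp)).startswith(prefix)
        if hit and start is None:
            start = cp
        elif not hit and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def as_char_class(ranges, fmt="\\u%04X"):
    """Render ranges as the inside of a regex character class."""
    return "".join(
        fmt % a if a == b else (fmt % a) + "-" + (fmt % b) for a, b in ranges
    )


# Rough equivalent of \p{L}, spelled out so that no property support is needed.
LETTERS = as_char_class(category_ranges("L"))
letters_only = re.compile("[" + LETTERS + "]+")

# The same ranges in PCRE's \x{...} notation, for use in a PHP pattern.
PCRE_LETTERS = as_char_class(category_ranges("L"), fmt="\\x{%04X}")
```

The generated class is long, but it only needs to be produced once and can then be stored as a constant, in the same spirit as the PREG_CLASS_* constants mentioned above.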

effulgentsia’s picture

Just my 2 cents, but I don't believe this should be backported to D7. The filter says "Convert URLs ...", and according to http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, IRIs are not URIs, and therefore, not URLs. If contrib wants to add a filter for IRIs, that's contrib's business, and then it can choose whether to allow IRIs into href values, or whether to percent encode, or punycode, etc.

I think we may want to reopen this as a D8 issue to see if the approach we picked of setting non-URL IRIs into href values is desirable.

chx’s picture

Version:7.x-dev» 8.x-dev
Status:Needs work» Active

I am bumping this back to D8 then.

effulgentsia’s picture

Ok, first question: are IRIs allowed as href values in HTML5? http://stackoverflow.com/questions/14074731/are-iris-valid-as-html-attri... says yes, but following the links in the answer, I don't see where that's confirmed. Have the specs changed since that answer?

Second: what are people's thoughts on unicode PCRE for D8, per #23?

effulgentsia’s picture

I don't believe this should be backported to D7. The filter says "Convert URLs ...", and according to http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, IRIs are not URIs, and therefore, not URLs.

Looks like http://url.spec.whatwg.org/ is broadening what the term URL means. So, if we end up with a solution that satisfies D7 PCRE requirements and HTML4, then I'd be ok with that being backported to D7. Or, a D8 solution for D8 requirements and a D7 solution for D7 requirements would be ok too.

Jaypan’s picture

There are two ways of looking at this:

1) A purely programmatic point of view. This is where we only look at what a URL is from a programming point of view, and if a Drupal link doesn't fit that, then too bad, we need to keep it pure.
2) A user point of view. This is where we look at the fact that users are inserting links to other pages within textareas in Drupal, and expecting them to be turned into usable links.

The user doesn't really give a damn about the definition of a URL. Many, if not most, users will not even know what a URL is. All they care about is that they are typing a link to somewhere on the site, and the link either works or doesn't.

From that point of view, I think it's incorrect to be approaching this problem from the perspective of the first point. Who is a website for, the developers, or the users? Having links not work is horrible from a UI perspective, as most users won't know how to write their own tags for URLs. If a developer thinks it's not a proper URL, well that's too bad for the developer, seeing as the URL works in the browser.

effulgentsia’s picture

Yep, I agree with #28 (I changed my mind from #24), but only to the extent that we don't violate internet standards to achieve it. Fortunately, we don't have to. We can keep the name URL in the filter name without that being incorrect, thanks to http://url.spec.whatwg.org/, and so long as the implementation complies with the standards of the respective doctype (HTML5 for D8, HTML4 for D7) as well as server requirements regarding PCRE, then I agree with fixing this in both D7 and D8.

Mixologic’s picture

Unicode characters are allowed in domain names and URLs. Browsers convert them to punycode, because DNS is still ASCII-only, so a punycode link shows up as Unicode in the address bar. Take http://яндекс.рф: if you mouse over that domain, you'll see that it points to http://xn--d1acpjx3f.xn--p1ai/, which is also known as http://yandex.ru.

So they should definitely get converted automatically into links, as they are definitely URLs.
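
The conversion described here can be reproduced with Python's standard library. This is a minimal sketch, assuming the built-in "idna" codec (which implements the older IDNA 2003 rules) for the host and percent-encoded UTF-8 for the path; the function name `to_ascii_url` is hypothetical, and ports, credentials and non-ASCII query strings are deliberately ignored.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(iri):
    """Convert an IRI to an ASCII-only URL (sketch: ignores ports,
    credentials and non-ASCII query strings)."""
    parts = urlsplit(iri)
    # Host: each label becomes punycode via Python's built-in IDNA codec.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path: non-ASCII characters become percent-encoded UTF-8 bytes.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


yandex = to_ascii_url("http://яндекс.рф/")
user_page = to_ascii_url("http://www.site.com/users/عمر")
```

With this, `yandex` is "http://xn--d1acpjx3f.xn--p1ai/", matching the address mentioned above, and `user_page` is "http://www.site.com/users/%D8%B9%D9%85%D8%B1". A filter could put either form into the href while still showing the original text to the user.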

Hanno’s picture

Version:8.0.x-dev» 7.x-dev
Priority:Major» Normal

Not sure why this issue is still open for Drupal 8, as the patch is in and we all agree that an IRI should be handled as a URL, i.e. converted to a link. We can try to backport this to Drupal 7.
chx mentioned that for Drupal 7 we need a fallback for PCRE builds without Unicode property support. If we need such a mechanism for D8 as well, we could open another issue. I'll move this issue to Drupal 7, where it is still a problem.