Hello,

I have a problem with the "Convert URLs into links" filter in Drupal 7:

I have this filter activated for the "Full HTML" text format. If I create a new page and enter a URL with special characters (for example ä, ü, ö), the resulting clickable URL is formatted wrongly.

Example:
I create a new page with the following text:

www.test.de
www.daniel-müller.de

The source code of the resulting page looks like this:

<a href="http://www.test.de">www.test.de</a>
<a href="http://www.daniel-m">www.daniel-m</a>üller.de

Problem: If there is a special character (like the 'ü' in the second URL), the clickable URL is cut off just before the special character.

Is this a general problem with Drupal 7 or something in my server / Drupal configuration?

Current Status

This has been committed to Drupal 8. For Drupal 7, the patch in #39 seems to work for people but unless/until we provide a fallback as mentioned in #23, this issue is blocked from being included in Drupal 7 core.

Comments

Jaypan’s picture

Version: 7.14 » 8.x-dev
Priority: Normal » Major

I'm seeing this same issue with Arabic URLs. For example, if I have the following URL: http://www.site.com/users/عمر the URL parsing will end at the last forward slash, before the Arabic text. The problem is, as you can see, this is a user path, so I need it to also parse the username. The problem lies in _filter_url(). The character classes in the regex only allow ASCII letters, digits, and a defined set of punctuation.

Jaypan’s picture

I've hacked _filter_url() to work with Arabic characters, but I don't think the method I used is ideal. I added the Arabic character match \p{Arabic} to each of the character classes, and added my own 'u' modifier to the end of each of the patterns. It's properly parsing the links now, but I don't like having to hack core, and I'm sure there is a more dynamic way to do this. Maybe create a setting for the URL converter that lets one select whether, and which, UTF-8 languages to include in URL searches. I'm open to other ideas as well.
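
The hack described above can be sketched roughly as follows. This is only an illustration: the $domain and $trail fragments are the ones from Drupal 7's _filter_url() quoted by catch further down this thread, but the way they are combined here is a simplified assumption, and the variable names are made up for the example.

```php
<?php
// Fragments of Drupal 7's _filter_url(), as quoted later in this thread.
$domain_ar = '(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b';
$trail_plain = '[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,-]*[a-zA-Z0-9:%_+*~#&\[\]=/;-]';

// The same trail with \p{Arabic} added to both character classes.
$trail_arabic = '[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,\p{Arabic}-]*[a-zA-Z0-9:%_+*~#&\[\]=/;\p{Arabic}-]';

$url = 'http://www.site.com/users/عمر';

// ASSUMPTION: simplified combination of the fragments, not core's full pattern.
// Without the change, the match stops before the Arabic user name.
preg_match('`https?://(?:' . $domain_ar . ')/?(?:' . $trail_plain . ')?`', $url, $before);

// With \p{Arabic} and the 'u' modifier (pattern and subject treated as UTF-8),
// the whole URL is matched. This needs PCRE built with Unicode properties.
preg_match('`https?://(?:' . $domain_ar . ')/?(?:' . $trail_arabic . ')?`u', $url, $after);

print $before[0] . "\n";
print $after[0] . "\n";
```

Adding one script at a time does not scale, which is why the later patches in this thread switch to the broader \p{L}, \p{M} and \p{N} classes.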

russo79’s picture

There are other characters, such as "( )", which seem to be filtered out, giving the same result:

e.g. if the user types this url
http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-(Gnome15-plugin).html

it gets converted to the following html:

<a href="http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-">http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboa...</a>(Gnome15-plugin).html

Hanno’s picture

Title: Filter "Convert URLs into links" doesn't support multilingual web addresses » "Convert URLs into links" - filter & special characters like ä ü ö
Category: bug » support
Status: Needs review » Active
Issue tags: -Needs backport to D7

Twitter does a very good job of converting all kinds of URLs to shortened links. Their code can be found here:
https://github.com/twitter/twitter-text-js/blob/master/twitter-text.js
A PHP version based on that code is created here: https://github.com/stephenbeckett/TwitterURLMatchPHP
See http://www.stevebeckett.com/twitter-url-match-for-php/

Hanno’s picture

Title: "Convert URLs into links" - filter & special characters like ä ü ö » Filter "Convert URLs into links" doesn't support multilingual web addresses
Category: support » bug
FileSize
33.21 KB

Nowadays, these are valid web addresses:
http://президент.рф/ (president of the Russian federation)
http://موقع.وزارة-الأتصالات.مصر/ (Ministry of Communication of Egypt)
And earlier mentioned:
http://www.daniel-müller.de/
None of them currently work in Drupal 8, as shown in the screenshot below, and URLs with mixed characters break.
Screenshot of URLs not turning into links

catch’s picture

Issue tags: +Needs backport to D7

Here's the relevant code that excludes those characters:

  // Prepare domain name pattern.
  // The ICANN seems to be on track towards accepting more diverse top level
  // domains, so this pattern has been "future-proofed" to allow for TLDs
  // of length 2-64.
  $domain = '(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b';
  $ip = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}';
  $auth = '[a-zA-Z0-9:%_+*~#?&=.,/;-]+@';
  $trail = '[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,-]*[a-zA-Z0-9:%_+*~#&\[\]=/;-]';
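
To see how these character classes produce the truncation reported above, here is a small sketch. The $domain and $trail fragments are copied from the code above; how they are joined into one pattern is a simplified assumption (core's real patterns also handle authentication, IP addresses and surrounding punctuation), and the helper name is made up.

```php
<?php
// Fragments as quoted above from Drupal 7's _filter_url().
$domain = '(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b';
$trail = '[a-zA-Z0-9:%_+*~#&\[\]=/;?!\.,-]*[a-zA-Z0-9:%_+*~#&\[\]=/;-]';

// ASSUMPTION: a simplified way of combining the fragments for illustration.
$old_pattern = '`(?:https?://|www\.)(?:' . $domain . ')/?(?:' . $trail . ')?`';

/**
 * Hypothetical helper: returns the first match of $pattern in $text, or NULL.
 */
function example_old_first_match($pattern, $text) {
  return preg_match($pattern, $text, $m) ? $m[0] : NULL;
}

// The umlaut is outside every character class, so the match ends before it.
$umlaut = example_old_first_match($old_pattern, 'www.daniel-müller.de');
// Parentheses are not in $trail either, so the match ends at "Linux-".
$parens = example_old_first_match($old_pattern, 'http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-(Gnome15-plugin).html');

print $umlaut . "\n";  // www.daniel-m
print $parens . "\n";  // http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-
```

Both results correspond to the href values reported earlier in this issue.
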
Hanno’s picture

Status: Needs review » Active
FileSize
93.17 KB
3.31 KB

Created a patch based on the twitter-text-js regex mentioned above for the trail, and on the character regex Symfony uses for validating links. It fixes the handling of:
- umlauts
- domain names in utf-8 characters
- () in url path #1843260: URL Links with brackets are not processed correctly by input filter with "Web page addresses...turn into links automatically."
- @ in url path #2005986: URL Filter fails if there is an @ in the link
- ! in url path #1480992: URLs containing a '!' separator are not formatted as links

Based on the Twitter script, we could even improve the filter to detect links that don't start with 'www' or 'http', but that could be a feature request. Let's fix the current bug first.

links after patch

Hanno’s picture

Status: Active » Needs review

review needed

Status: Active » Needs work

The last submitted patch, filter-autolink-i18naddresses-1657886.patch, failed testing.

Hanno’s picture

Status: Needs work » Needs review
FileSize
4.04 KB

Punctuation handling removed from the code, as it's now handled by the URL regex.

Jaypan’s picture

Thank you for your work on this. Much appreciated.

Gábor Hojtsy’s picture

Title: "Convert URLs into links" - filter & special characters like ä ü ö » Filter "Convert URLs into links" doesn't support multilingual web addresses
Category: support » bug
Status: Active » Needs review
Issue tags: +Needs backport to D7, +D8MI, +language-base

Tagging.

andypost’s picture

Issue tags: +Needs tests

Actually, we need tests for that. Also, what about copyright for the code?

+++ b/core/modules/filter/filter.module
@@ -1086,34 +1086,49 @@ function _filter_url($text, $filter) {
+  ¶
...
+  ¶
...
+  ¶
...
+  $valid_url_path = '(?:(?:'.$valid_url_path_characters . '*(?:'.$valid_url_balanced_parens .$valid_url_path_characters . '*)*'. $valid_url_ending_characters . ')|(?:@' . $valid_url_path_characters . '+\/))'; ¶

trailing white-space

+++ b/core/modules/filter/filter.module
@@ -1086,34 +1086,49 @@ function _filter_url($text, $filter) {
+  $trail = 	'('.$valid_url_path.'*)?(\\?'.$valid_url_query_chars .'*'.$valid_url_query_ending_chars.')?';

tab char?

Hanno’s picture

Thanks for your review. Here is a new patch with tests included for unicode and special characters.
Copyright: The regex is partly based on twitter.js. Is it possible with this license (Apache License v2) to include the whole function to improve the URL detection in Drupal?

andypost’s picture

Status: Needs review » Reviewed & tested by the community

Awesome!

@Hanno, please file another issue about the library; I suppose Core\Component is a good place for it.

Hanno’s picture

Issue tags: -Needs tests

Thanks. Created a new issue for further improvements #2019229: Turn web addresses starting with neither the http-protocol nor the www-prefix into links
Hope this patch gets committed.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed
Issue tags: +Needs tests

Committed a335ae6 and pushed to 8.x. Thanks!

alexpott’s picture

Version: 8.x-dev » 7.x-dev
Status: Fixed » Patch (to be ported)
adam7’s picture

Status: Patch (to be ported) » Needs review
FileSize
7.94 KB

Copy of filter-urlfilter-i18n-1657886-13.patch for 7.x.

Jaypan’s picture

Issue summary: View changes
Status: Needs review » Reviewed & tested by the community

I have been using the patch from #19 for months now without issue. I say it's good.

adam7’s picture

Same here:)

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 19: filter-urlfilter-i18n-1657886-19.patch, failed testing.

chx’s picture

This patch uses \p, and requiring --enable-unicode-properties for your PCRE is not a decision to be made lightly. In my opinion even the D8 version needs to use the constants PREG_CLASS_NUMBERS and similar. For Drupal 7, changing the requirements this drastically is an absolute no-go. A script to convert properties into something every PCRE understands can be found here: https://drupal.org/comment/2430064#comment-2430064
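
One way to check at runtime whether the installed PCRE understands \p{...} is sketched below. The helper name is hypothetical (it is not a Drupal API), and the approach relies only on preg_match() returning FALSE when a pattern fails to compile.

```php
<?php
/**
 * Hypothetical helper, not part of Drupal: reports whether \p{..} is usable.
 *
 * @return bool
 *   TRUE if a pattern using a Unicode property compiles and matches.
 */
function example_pcre_supports_unicode_properties() {
  // preg_match() returns FALSE if the pattern cannot be compiled; the @
  // suppresses the warning PHP emits in that case. "\xC3\xBC" is "ü" in UTF-8.
  return @preg_match('/^\p{L}$/u', "\xC3\xBC") === 1;
}

print example_pcre_supports_unicode_properties() ? "supported\n" : "not supported\n";
```

A check like this could decide between the \p-based patterns and a fallback built from explicit character ranges.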

effulgentsia’s picture

Just my 2 cents, but I don't believe this should be backported to D7. The filter says "Convert URLs ...", and according to http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, IRIs are not URIs, and therefore, not URLs. If contrib wants to add a filter for IRIs, that's contrib's business, and then it can choose whether to allow IRIs into href values, or whether to percent encode, or punycode, etc.

I think we may want to reopen this as a D8 issue to see if the approach we picked of setting non-URL IRIs into href values is desirable.

chx’s picture

Version: 7.x-dev » 8.x-dev
Status: Needs work » Active

I am bumping this back to D8 then.

effulgentsia’s picture

Ok, first question: are IRIs allowed as href values in HTML5? http://stackoverflow.com/questions/14074731/are-iris-valid-as-html-attri... says yes, but following the links in the answer, I don't see where that's confirmed. Have the specs changed since that answer?

Second: what are people's thoughts on unicode PCRE for D8, per #23?

effulgentsia’s picture

"I don't believe this should be backported to D7. The filter says "Convert URLs ...", and according to http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, IRIs are not URIs, and therefore, not URLs."

Looks like http://url.spec.whatwg.org/ is broadening what the term URL means. So, if we end up with a solution that satisfies D7 PCRE requirements and HTML4, then I'd be ok with that being backported to D7. Or, a D8 solution for D8 requirements and a D7 solution for D7 requirements would be ok too.

Jaypan’s picture

There are two ways of looking at this:

1) A purely programmatic point of view. This is where we only look at what a URL is from a programming point of view, and if a Drupal link doesn't fit that, then too bad, we need to keep it pure.
2) A user point of view. This is where we look at the fact that users are inserting links to other pages within textareas in Drupal, and expecting them to be turned into usable links.

The user doesn't really give a damn about the definition of a URL. Many, if not most, users will not even know what a URL is. All they care about is that they are typing a link to somewhere on the site, and the link either works or doesn't.

From that point of view, I think it's incorrect to be approaching this problem from the perspective of the first point. Who is a website for, the developers, or the users? Having links not work is horrible from a UI perspective, as most users won't know how to write their own tags for URLs. If a developer thinks it's not a proper URL, well that's too bad for the developer, seeing as the URL works in the browser.

effulgentsia’s picture

Yep, I agree with #28 (I changed my mind from #24), but only to the extent that we don't violate internet standards to achieve it. Fortunately, we don't have to. We can keep the name URL in the filter name without that being incorrect, thanks to http://url.spec.whatwg.org/, and so long as the implementation complies with the standards of the respective doctype (HTML5 for D8, HTML4 for D7) as well as server requirements regarding PCRE, then I agree with fixing this in both D7 and D8.

Mixologic’s picture

Unicode characters are allowed in domain names/urls. They get converted by browsers into punycode, because DNS is still ascii. So a punycode link will show up as unicode in the address bar. http://яндекс.рф if you mouse over that domain, you'll see that it points to http://xn--d1acpjx3f.xn--p1ai/ . Which is aka http://yandex.ru
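
The Unicode-to-punycode conversion mentioned above can be reproduced in PHP with idn_to_ascii(). This is only a sketch: the function comes from the intl extension, which may not be installed, so the call is guarded, and the variable names are made up.

```php
<?php
// Host name from the comment above.
$unicode_host = 'яндекс.рф';
$ascii_host = NULL;

// idn_to_ascii() is provided by PHP's intl extension; skip if unavailable.
if (function_exists('idn_to_ascii')) {
  // UTS #46 processing is requested explicitly because the default variant
  // differs between PHP versions.
  $ascii_host = idn_to_ascii($unicode_host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
}

print ($ascii_host !== NULL ? $ascii_host : 'intl extension not available') . "\n";
```

When the extension is present, the result should be the xn-- form that the browser shows on mouse-over.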

so they should definitely get converted automatically into links, as they are definitely urls.

Hanno’s picture

Version: 8.0.x-dev » 7.x-dev
Priority: Major » Normal

Not sure why this issue is still open for Drupal 8, as the patch is in and we all agree that an IRI should be handled as a URL: converted to a link. We can try to backport this to Drupal 7.
chx mentioned that for Drupal 7 we need a fallback if PCRE is built without Unicode support. If we need such a mechanism for D8 as well, we could open another issue. I will change this issue to Drupal 7, as it is still a problem there.

Berdir’s picture

It looks like this caused a regression, at least it is the only relevant issue that I can find that touches this code: #2557021: Url Filter does not correctly recognise URL's with uppercase query arguments. Any help welcome.

Volker23’s picture

If this won't be backported to D7, is there a contrib module for this to work around? Or should I just apply the patch?

  • alexpott committed 41edb24 on 8.3.x
    Issue #1657886 by Hanno: Filter 'Convert URLs into links; doesn't...

  • alexpott committed 41edb24 on 8.4.x
    Issue #1657886 by Hanno: Filter 'Convert URLs into links; doesn't...

idimopoulos’s picture

Hi guys,
I don't know if it's appropriate to comment on a ticket from so long ago, but I wanted to record some details here so that the community can verify them.
The current implementation of the regular expression matching the domain name is

$domain = '(?:[\p{L}\p{M}\p{N}._+-]+\.)?[\p{L}\p{M}]{2,64}\b';

This means that domain names like http://example.com work, but domain names like http://example.com1 do not.
_filter_url() does not accept a top-level domain containing a number.

I did some research and these are the data I found:
This answer in StackOverflow (https://stackoverflow.com/questions/7411255/is-it-possible-to-have-one-s...) is a very good start as it states that:

  • A domain name/host/etc can be up to 24 characters including alphanumeric characters, minus character (-) and dots. Note that the length limitation is an old constraint. It has since been changed so that the TLD alone can be up to 64 characters and hostnames up to 255 characters, if I am not mistaken.
  • The domain name must not start with a '-' character or end with a '.'
  • The top level domain (last part of the domain) can contain numbers but must include at least one alphabetic character.

The information derives from the RFC specifications attached to that thread.

So here are the contradictions.
There is a vast discussion on what should and should not be supported. The above regex that the filter module uses matches all current TLDs on the web. A list of the TLDs that exist in the world can be found at https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains.
If we follow this list, it's safe to have the

[\p{L}\p{M}]{2,64}\b

at the end of the expression, which matches the valid non-numeric characters in the TLD. This restricts us, though, by not allowing single-character TLDs or TLDs with numbers in them. The mismatch arises because the RFC allows single-letter TLDs and TLDs that include numbers, but there are currently none in the world.

That leaves the filter module lacking support for hostnames. A good example is a site running on an intranet where the hostname/domain name is http://sandwiches247 or http://sandwitches.net1. Neither of these is recognized as a valid domain name, because both sandwiches247 and net1 are treated as TLDs, which are not allowed to contain numbers according to the regexp above.

Since this ticket was the one where I was able to track the actual changes and discussion, I wanted to know if there is a decision/discussion/security advice/anything related to why numbers are not recognized in top-level domain names. Are we excluding local or intranet hostnames?
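
The behaviour described in this comment can be checked against the domain fragment in isolation. In this sketch the $domain pattern is the one quoted above; anchoring it with ^ and $ is an assumption made for the test (core embeds the fragment in a longer pattern), and the helper name is made up.

```php
<?php
// Domain pattern quoted above from the Drupal 8 filter module.
$domain_d8 = '(?:[\p{L}\p{M}\p{N}._+-]+\.)?[\p{L}\p{M}]{2,64}\b';

/**
 * Hypothetical helper: TRUE if $host as a whole matches the domain fragment.
 */
function example_is_domain_match($domain_pattern, $host) {
  // ASSUMPTION: anchored so that only the domain fragment is tested.
  return preg_match('`^(?:' . $domain_pattern . ')$`u', $host) === 1;
}

$hosts = array('example.com', 'example.com1', 'sandwiches247', 'sandwitches.net1');
foreach ($hosts as $host) {
  print $host . ': ' . (example_is_domain_match($domain_d8, $host) ? 'match' : 'no match') . "\n";
}
```

Only example.com matches: the last label may contain only letters and marks, and the \b that follows it cannot sit between a letter and a digit.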

loopduplicate’s picture

Status: Active » Needs review
FileSize
10.25 KB

Here's an updated patch for D7. It is a backport of what's in D8 core. I know this isn't acceptable, see #23; however, it should be an improvement over #19, which doesn't apply to core anymore and doesn't include updates that have been made since, like the one mentioned in #32.

albapb’s picture

Tested the patch in #39 and it works for me.

loopduplicate’s picture

Issue summary: View changes
Status: Needs review » Needs work

Just so people are clear, there is a chance that this will be backported to 7.x; the patch in #39 seems to work for people but we'll need to write a fallback as mentioned in #23. I've updated the issue description with this status. I've changed the issue status to "Needs work" as well.

enriquelacoma’s picture

Here is an update of the D7 patch. This patch implements a fallback for when PCRE is not compiled with Unicode support. In that case, \p{L}, \p{M} and \p{N} are replaced by the hexadecimal values from these constants: PREG_CLASS_LETTERS, PREG_CLASS_NUMBERS, PREG_CLASS_CJK and PREG_CLASS_COMBINED_MARKS.
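
The idea behind such a fallback can be sketched as below. This is not the code from the patch: the EXAMPLE_* constants are small, hypothetical stand-ins for the much longer range lists in the PREG_CLASS_* constants, and the helper name is made up.

```php
<?php
// Hypothetical stand-ins for the PREG_CLASS_* constants. The real constants
// list far more ranges; these cover only a few Latin letters and marks.
define('EXAMPLE_CLASS_LETTERS', 'A-Za-z\x{c0}-\x{24f}');
define('EXAMPLE_CLASS_MARKS', '\x{300}-\x{36f}');
define('EXAMPLE_CLASS_NUMBERS', '0-9');

/**
 * Hypothetical helper: swaps \p{L}, \p{M}, \p{N} for explicit ranges.
 *
 * Only valid where the properties appear inside a character class.
 */
function example_expand_unicode_properties($pattern) {
  return strtr($pattern, array(
    '\p{L}' => EXAMPLE_CLASS_LETTERS,
    '\p{M}' => EXAMPLE_CLASS_MARKS,
    '\p{N}' => EXAMPLE_CLASS_NUMBERS,
  ));
}

// Domain pattern from Drupal 8, rewritten without Unicode properties.
$domain_unicode = '(?:[\p{L}\p{M}\p{N}._+-]+\.)?[\p{L}\p{M}]{2,64}\b';
$domain_fallback = example_expand_unicode_properties($domain_unicode);

// The 'u' modifier is still needed so that \x{..} refers to code points.
preg_match('`www\.(?:' . $domain_fallback . ')`u', 'www.daniel-müller.de', $m);
print $m[0] . "\n";
```

The expanded pattern still needs UTF-8 mode, but no longer depends on PCRE having been built with Unicode property support.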

jyraya’s picture

Hello,

I fixed a small issue in the patch: it used the fallback mechanism when Unicode support was enabled and the PCRE Unicode properties otherwise.

I tested it successfully with the following content, both with PCRE Unicode support and with the fallback:

<p>http://www.google.be</p>

<p><a href="../test.htm">Dummy link</a></p>

<p>test@hotmail.com</p>

<p><a href="mailto:test@hotmail.com">Dummy mailto</a> opkzeop&amp; http://www.research.lancs.ac.uk/portal/en/publications/-(270245b3-6bd6-483e-ba38-fdf9ca6719f3).html</p>

<p>www.daniel-müller.de http://www.site.com/users/عمر www.site2.com/users/عمر</p>

<p>http://президент.рф/</p>

<p>http://موقع.وزارة-الأتصالات.مصر/</p>

<p>http://موقع.وزارة-الأتصالات.مصر <a href="http://موقع.وزارة-الأتصالات.مصر/">Arab URL</a> http://見.香港/</p>

<p><a href="http://見.香港/">Chinese URL</a> http://টিসেলুন-সম্পর্কিত.শব্দ <a href="http://টিসেলুন-সম্পর্কিত.শব্দ">Hindi URL</a></p>

<p>https://hi.wikipedia.org/wiki/बंगाल_की_खाड़ी <a href="https://hi.wikipedia.org/wiki/बंगाल_की_खाड़ी">Hindi URL</a></p>

<p>https://βετία/Ελβετία http://ל.רבו-הטיפול.שראל oazjpofjpoafapfapfap</p>

<p>https://גוגל.איבערט-ייטשער http://การ-แปลภา.ษา</p>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus volutpat, erat a consectetur rhoncus, mauris metus viverra ante, nec suscipit quam nibh vel ipsum. Interdum et malesuada fames ac ante ipsum primis in faucibus. Nam et erat elementum, tristique massa non, dignissim felis. Curabitur fermentum ante id lorem finibus iaculis. Quisque malesuada lectus eu libero ornare, quis interdum . In et rutrum magna. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean et tortor et nisl fermentum euismod. Sed viverra congue urna, interdum tincidunt neque. Maecenas lobortis diam nec odio tempus semper. Nam blandit iaculis venenatis. http://јазик_превеува.тест, Fusce ultricies ultricies lobortis. Curabitur vitae volutpat purus. Nulla facilisi: https://api.drupal.org/api/drupal/modules!taxonomy!taxonomy.module/function/taxonomy_node_insert/7.x</p>

<p>Quisque sollicitudin, neque a suscipit scelerisque, erat augue consequat lacus, sed lacinia erat ipsum non dolor. sgzergzggzg Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Morbi accumsan dignissim ante quis commodo. Pellentesque dapibus magna eget lorem gravida, id varius felis lacinia. Praesent sed ligula mi. http://点心和烤鸭.w3.mag.keio.ac.jp, proin libero eros, fermentum sed sem ut, hendrerit luctus dolor. Donec eu condimentum turpis, sed gravida sem. Donec tincidunt magna in sagittis molestie. Nulla non blandit velit, sed ornare justo. Aliquam sed dolor sem. In sodales quis eros sit amet bibendum. Mauris nec mi vitae arcu sagittis porttitor. Sed quis commodo nisi.</p>

<p>Etiam vehicula congue urna quis placerat: http://ヒキワリ.ナットウ.ニホン.</p>

Status: Needs review » Needs work

The last submitted patch, 43: filter-urlfilter-i18n-1657886-41.patch, failed testing. View results

Delphine Lepers’s picture

I tested the patch #43 (by the way, the name is incorrect).
It works well for me; however, I find it sad that for the mailto protocol, the subject and body are not included in the href.

test@hotmail.com?subject=mysubject&body=body
Transforms to
<p><a href="mailto:test@hotmail.com">test@hotmail.com</a>?subject=mysubject&amp;body=body</p>

jyraya’s picture

Status: Needs work » Needs review
Delphine Lepers’s picture

Status: Needs review » Reviewed & tested by the community

I tested and it works perfectly fine, marking as reviewed.

poker10’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: -Needs tests

Thanks for working on this. Is there any reason why PREG_CLASS_NUMBERS and PREG_CLASS_CJK were moved from search.module to unicode.inc, but PREG_CLASS_COMBINED_MARKS was defined in filter.module? If we are changing the location, it is a bit distracting to keep them in separate places.

@@ -1794,5 +1814,34 @@ function _filter_html_escape_tips($filter, $format, $long = FALSE) {
+ * @param (string) $pattern
+ *   Regular expression to match.
+ * @param (string) $callback
+ *   Function to call after the regular expression match.
+ * @param (string) $text
+ *   Text to check.

Let's keep the standard @param string, without brackets.

@@ -1794,5 +1814,34 @@ function _filter_html_escape_tips($filter, $format, $long = FALSE) {
+}
+/**
  * @} End of "Standard filters".
  */

There is a missing newline.

@@ -1558,7 +1578,7 @@ function _filter_url($text, $filter) {
-    // Revert back to the original comment contents
+    // Revert to the original comment contents

If we are updating comments (not sure if this change was needed), please use them correctly and add a full stop.

@@ -83,6 +83,63 @@ define('PREG_CLASS_UNICODE_WORD_BOUNDARY',
+/**
+ * Matches all 'N' Unicode character classes (numbers)
+ */

Full stop is also missing here.

----------

As these are all minor changes, I am not opening a separate issue for D7 now, provided we can correct this in the next patch iteration. If not, it would be best to close this and create a separate issue for D7, as per the backport policy. Thanks!