Hello,
I have a problem with the "Convert URLs into links" filter in drupal 7:
I have this filter activated for the "Full HTML" text format. If I create a new page and enter a URL with special characters (for example ä, ü, ö), the resulting clickable URL is formatted wrongly.
Example:
I create a new page with the following text:
www.test.de
www.daniel-müller.de
The sourcecode of the resulting page looks like this:
<a href="http://www.test.de">www.test.de</a>
<a href="http://www.daniel-m">www.daniel-m</a>üller.de
Problem: If there is a special character (like the 'ü' in the second URL), the clickable URL is cut off just before the special character.
Is this a general problem with drupal 7 or something in my server / drupal - configuration?
Current Status
This has been committed to Drupal 8. For Drupal 7, the patch in #39 seems to work for people but unless/until we provide a fallback as mentioned in #23, this issue is blocked from being included in Drupal 7 core.
Comment | File | Size | Author |
---|---|---|---|
#43 | filter-urlfilter-i18n-1657886-41.patch | 18.62 KB | jyraya |
#42 | filter-urlfilter-i18n-1657886-40.patch | 18.63 KB | enriquelacoma |
#39 | filter-urlfilter-i18n-1657886-39.patch | 10.25 KB | loopduplicate |
#19 | filter-urlfilter-i18n-1657886-19.patch | 7.94 KB | adam7 |
#14 | filter-urlfilter-i18n-1657886-13.patch | 7.97 KB | Hanno |
Comments
Comment #1
Jaypan CreditAttribution: Jaypan commentedI'm seeing this same issue with arabic URLs. For example if I have the following URL: http://www.site.com/users/عمر the URL parsing will end at the last forward slash, before the arabic text. The problem is, as you can see, this is a user path, so I need it to also parse the username. The problem lies in _filter_url(). The character class for the regex doesn't allow for non-alphabetic letters other than a defined set of punctuation.
Comment #2
Jaypan CreditAttribution: Jaypan commentedI've hacked _filter_url() to work with Arabic characters, but the method I used is not ideal I don't think. I added the Arabic character match \p{Arabic} to each of the character classes, and added my own 'u' modifier to the end of each of the patterns. It's properly parsing the links now, but I don't like having to hack core, and I'm sure there is a more dynamic way to do this. Maybe create a settings for the URL converter that allows for one to select if and which UTF8 languages they would like to be able to include in their URL searches. I'm open to other ideas as well.
Comment #3
russo79 CreditAttribution: russo79 commentedThere are other characters such as "( )" who seem to be filtered giving the same result:
e.g. if the user types this url
http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-(Gnome15-plugin).html
it gets converted to the following html:
<a href="http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboards-with-Linux-">http://addons.teamspeak.com/directory/plugins/hardware/Logitech-G-Keyboa...</a>(Gnome15-plugin).html
Comment #4
Hanno CreditAttribution: Hanno commentedTwitter does a very good job to convert all kind of URL's to shortener links. Their code can be found here:
https://github.com/twitter/twitter-text-js/blob/master/twitter-text.js
A PHP version based on that code is created here: https://github.com/stephenbeckett/TwitterURLMatchPHP
See http://www.stevebeckett.com/twitter-url-match-for-php/
Comment #5
Hanno CreditAttribution: Hanno commentedNowadays, these are valid web adresses:
http://президент.рф/ (president of the Russian federation)
http://موقع.وزارة-الأتصالات.مصر/ (Ministry of Communication of Egypt)
And earlier mentioned:
http://www.daniel-müller.de/
None of them work currently in Drupal 8 as shown in the screenshot below, and mixed characters breaks.
Comment #6
catchHere's the relevant code that excludes those characters:
Comment #7
Hanno CreditAttribution: Hanno commentedCreated a patch based on the mentioned Regex in Twitter-js for the trail, and the regex for characters Symfony is using for validating links. It fixes bugs to correct:
- umlauts
- domain names in utf-8 characters
- () in url path #1843260: URL Links with brackets are not processed correctly by input filter with "Web page addresses...turn into links automatically."
- @ in url path #2005986: URL Filter fails if there is an @ in the link
- ! in url path #1480992: URLs containing a '!' separator are not formatted as links
We could based on the Twitter script even improve the filter to detect links that don't start with 'www' or 'http', but that could be an feature request. First fix this the current bug.
Comment #8
Hanno CreditAttribution: Hanno commentedreview needed
Comment #10
Hanno CreditAttribution: Hanno commentedpunctuations removed from code as its now handled with the url-regex.
Comment #11
Jaypan CreditAttribution: Jaypan commentedThank you for your work on this. Much appreciated.
Comment #12
Gábor HojtsyTagging.
Comment #13
andypostActually we need tests for that, also what's about copyright for code?
trailing white-space
tab char?
Comment #14
Hanno CreditAttribution: Hanno commentedThanks for your review. Here is a new patch with tests included for unicode and special characters.
Copyright: The regex is partly based on twitter.js. Is it possible with this license (Apache License v2) to include the whole function to improve the URL detection in Drupal?
Comment #15
andypostAwesome!
@Hanno please file another issue about library, suppose
Core\Component
is a good placeComment #16
Hanno CreditAttribution: Hanno commentedThanks. Created a new issue for further improvements #2019229: Turn web addresses starting with neither the http-protocol nor the www-prefix into links
Hope this patch gets commited.
Comment #17
alexpottCommitted a335ae6 and pushed to 8.x. Thanks!
Comment #18
alexpottComment #19
adam7 CreditAttribution: adam7 commentedCopy of filter-urlfilter-i18n-1657886-13.patch for 7.x.
Comment #20
Jaypan CreditAttribution: Jaypan commentedI have been using the patch from #19 for months now without issue. I say it's good.
Comment #21
adam7 CreditAttribution: adam7 commentedSame here:)
Comment #23
chx CreditAttribution: chx commentedThis patch uses \p and requiring --enable-unicode-properties for your PCRE is not a decision to be made lightly. In my opinion even the D8 version needs to use the constants PREG_CLASS_NUMBERS and similar. For Drupal 7 changing the requirements this drastically is an absolute no go. Please find a script https://drupal.org/comment/2430064#comment-2430064 here to convert properties to something every PCRE understands.
Comment #24
effulgentsia CreditAttribution: effulgentsia commentedJust my 2 cents, but I don't believe this should be backported to D7. The filter says "Convert URLs ...", and according to http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier, IRIs are not URIs, and therefore, not URLs. If contrib wants to add a filter for IRIs, that's contrib's business, and then it can choose whether to allow IRIs into
href
values, or whether to percent encode, or punycode, etc.I think we may want to reopen this as a D8 issue to see if the approach we picked of setting non-URL IRIs into href values is desirable.
Comment #25
chx CreditAttribution: chx commentedI am bumping this back to D8 then.
Comment #26
effulgentsia CreditAttribution: effulgentsia commentedOk, first question: are IRIs allowed as href values in HTML5? http://stackoverflow.com/questions/14074731/are-iris-valid-as-html-attri... says yes, but following the links in the answer, I don't see where that's confirmed. Have the specs changed since that answer?
Second: what are people's thoughts on unicode PCRE for D8, per #23?
Comment #27
effulgentsia CreditAttribution: effulgentsia commentedLooks like http://url.spec.whatwg.org/ is broadening what the term URL means. So, if we end up with a solution that satisfies D7 PCRE requirements and HTML4, then I'd be ok with that being backported to D7. Or, a D8 solution for D8 requirements and a D7 solution for D7 requirements would be ok too.
Comment #28
Jaypan CreditAttribution: Jaypan commentedThere are two ways of looking at this:
1) A purely programmatical point of view. This is where we only look at what a URL is from a programming point of view, and if a Drupal link doesn't fit that, then too bad, we need to keep it pure.
2) A user point of view. This is where we look at the fact that users are inserting links to other pages within textareas in Drupal, and expecting them to be turned into usable links.
The user doesn't really give a damn about the definition of a URL. Many, if not most, users will not even know what a URL is. All they care about is that they are typing a link to somewhere on the site, and the link either works or doesn't.
From that point of view, I think it's incorrect to be approaching this problem from the perspective of the first point. Who is a website for, the developers, or the users? Having links not work is horrible from a UI perspective, as most users won't know how to write their own tags for URLs. If a developer thinks it's not a proper URL, well that's too bad for the developer, seeing as the URL works in the browser.
Comment #29
effulgentsia CreditAttribution: effulgentsia commentedYep, I agree with #28 (I changed my mind from #24), but only to the extent that we don't violate internet standards to achieve it. Fortunately, we don't have to. We can keep the name URL in the filter name without that being incorrect, thanks to http://url.spec.whatwg.org/, and so long as the implementation complies with the standards of the respective doctype (HTML5 for D8, HTML4 for D7) as well as server requirements regarding PCRE, then I agree with fixing this in both D7 and D8.
Comment #30
MixologicUnicode characters are allowed in domain names/urls. They get converted by browsers into punycode, because DNS is still ascii. So a punycode link will show up as unicode in the address bar. http://яндекс.рф if you mouse over that domain, you'll see that it points to http://xn--d1acpjx3f.xn--p1ai/ . Which is aka http://yandex.ru
so they should definitely get converted automatically into links, as they are definitely urls.
Comment #31
Hanno CreditAttribution: Hanno commentedNot sure why this issue is still open for Drupal 8 as the patch is in and we all agree on that a IRI should be handled as a URL: converted to a link. We can try to backport this to Drupal 7.
chx mentioned that for Drupal 7 we need a fall back if PCRE is disabled for unicode support. If we need such a mechanism for D8 as well, we could open another issue. Will change this issue to Drupal 7 as there it is still an issue.
Comment #32
BerdirIt looks like this caused a regression, at least it is the only relevant issue that I can find that touches this code: #2557021: Url Filter does not correctly recognise URL's with uppercase query arguments. Any help welcome.
Comment #33
Volker23 CreditAttribution: Volker23 as a volunteer commentedIf this won't be backported to D7, is there a contrib module for this to work around? Or should I just apply the patch?
Comment #38
idimopoulos CreditAttribution: idimopoulos as a volunteer commentedHi guys,
I don't know if it's appropriate to comment on the ticket from so long ago but I wanted to put some details here just to get them recorded and be verified by the community.
The current implementation of the regular expression matching the domain name is
This means that domain names like: http://example.com work but domain names like http://example.com1 do not work.
_filter_url does not accept Top Level Domain with a number.
I did some research and these are the data I found:
This answer in StackOverflow (https://stackoverflow.com/questions/7411255/is-it-possible-to-have-one-s...) is a very good start as it states that:
The information derive from the RFC specifications attached to that thread.
So here are the contradictions.
There is a vast discussion on what should be and what should not be supported. The above regex that filter module is using matches all current TLD on the web. A list of the TLDs that exist in the world can be found at https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains.
If we follow this list, it's safe to have the
at the end of the expression that match non numeric valid characters in the TLD. This restricts us though by not allowing single character TLDs and domain names with numbers in it. This comes because RFC allows the single letter and domains that include numbers but there are currently none in the world.
That leaves the filter module with lacking support on hostnames. A good example is if the site is running in an intranet where the hostname/domain-name is http://sandwiches247 or http://sandwitches.net1. None of these are recognized as a valid domain names as it is considered that both sandwiches247 and net1 are TLDs and are not allowed to have numbers according to the regexp above.
Since this ticket was the one that I was able to track actual changes and discussion there, I wanted to know if there is a decision/discussion/security advice/anything related to why are numbers are not recognized in the top level domain names. Are we excluding local or intranet hostnames?
Comment #39
loopduplicateHere's an updated patch for D7. It is a back port from what's in D8 core. I know this isn't acceptable, see #23, however, it should be an improvement over #19, which doesn't apply to core anymore and doesn't include updates that have been made since, like the one mentioned in #32.
Comment #40
albapb CreditAttribution: albapb commentedTested the patch in #39 and it works for me.
Comment #41
loopduplicateJust so people are clear, there is a chance that this will be backported to 7.x; the patch in #39 seems to work for people but we'll need to write a fallback as mentioned in #23. I've updated the issue description with this status. I've changed the issue status to "Needs work" as well.
Comment #42
enriquelacoma CreditAttribution: enriquelacoma commentedHere is and update for patch D7. This patch implements a fallback when pcre is not compiled with unicode support. In that case, \p{L}, \p{M} and \p{N} are replaced by the hexadecimal values from this constants PREG_CLASS_LETTERS, PREG_CLASS_NUMBERS, PREG_CLASS_CJK and PREG_CLASS_COMBINED_MARKS
Comment #43
jyraya CreditAttribution: jyraya commentedHello,
I fixed a small issue in the patch that uses the fallback mechanism when the unicode support is enabled and PCRE in the other case.
I tested it successfully with such a content, and with the PCRE and after the fallback:
Comment #45
Delphine Lepers CreditAttribution: Delphine Lepers for European Commission and European Union Institutions, Agencies and Bodies commentedI tested the patch #43 (by the way, the name is incorrect).
It works well for me, however I find it sad that on mailto protocol, the subject and body are not included into the href.
test@hotmail.com?subject=mysubject&body=body
Transforms to
<p><a href="mailto:test@hotmail.com">test@hotmail.com</a>?subject=mysubject&body=body</p>
Comment #46
jyraya CreditAttribution: jyraya commentedComment #47
Delphine Lepers CreditAttribution: Delphine Lepers for European Commission and European Union Institutions, Agencies and Bodies commentedI tested and it works perfectly fine, marking as reviewed.
Comment #48
poker10 CreditAttribution: poker10 at ActivIT s.r.o. commentedThanks for working on this. Is there any reason, why the
PREG_CLASS_NUMBERS
andPREG_CLASS_CJK
were moved fromsearch.module
tounicode.inc
, but thePREG_CLASS_COMBINED_MARKS
was defined infilter.module
? If we are changing the location, it is a bit distracting to keep them on separate places.Let's keep the standard
@param string
, without brackets.There is a missing newline.
If we are updating comments (not sure if this change was needed), please use them correctly and add a full stop.
Full stop is also missing here.
----------
As these all are a minor changes, I am not opening a separate issue for D7 now, if we can correct this in the next patch iteration. If not, it would be the best to close this and create a separate issue for D7, as per backport policy. Thanks!