Not all web addresses turn into links. Currently, only web addresses starting with the protocol ('http(s)://') or the 'www' prefix are turned into links. Use of the www prefix is declining, and this rule excludes valid web addresses like:

facebook.com
drupal.org
bit.ly/QtQET
data.worldbank.org
en.wikipedia.org/wiki/Uniform_resource_locator

Proposed solution

We now have a regex matching on http and www, but we could include a third option: match on the known top-level domains (generic and country-code).

Two improvements we could make:

1. We could introduce matching on a valid top-level domain. We would then have to include the 220 top-level domains (org|com|uk|ly etc.), and that would fix the addresses above.

2. Matching the new private TLDs (*.anything) will probably be harder if we don't want false positives.
If such a TLD address has a trailing path, we could probably match it safely:
example.example/page/1
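
The two options above can be sketched roughly as follows. This is only an illustration in Python, not the actual filter code: the names SAMPLE_TLDS, BARE_URL and find_bare_urls are made up for the example, and the TLD list is a tiny sample standing in for the full list. Addresses with a protocol are deliberately skipped, on the assumption that the existing http/www matching handles them.

```python
import re

# Assumption: a tiny sample for illustration; a real list would hold all
# generic and country-code TLDs.
SAMPLE_TLDS = ("com", "org", "uk", "ly", "nl")

# Do not start inside a word, an e-mail address, or a URL that has a protocol.
START = r"(?<![\w@./-])"
# One host label: letters, digits and inner hyphens.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

# Option 1: host ending in a known TLD, optionally followed by a path.
KNOWN = rf"{START}(?:{LABEL}\.)+(?:{'|'.join(SAMPLE_TLDS)})(?![\w-])(?:/[^\s<>]*)?"
# Option 2: unknown TLD, accepted only when a path follows the host.
UNKNOWN = rf"{START}(?:{LABEL}\.)+[a-z]{{2,}}/[^\s<>]+"

BARE_URL = re.compile(f"{KNOWN}|{UNKNOWN}", re.IGNORECASE)


def find_bare_urls(text):
    """Return protocol-less web addresses found in text, in order."""
    return [match.group(0) for match in BARE_URL.finditer(text)]


if __name__ == "__main__":
    print(find_bare_urls("Try facebook.com, bit.ly/QtQET or example.example/page/1"))
```

With this sketch, a name such as file.txt is not matched, because txt is not in the sample list and no path follows it. Trailing punctuation after a path would still need separate handling.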

Resources

Twitter does a very good job of converting all kinds of URLs, in all kinds of situations, into shortened links.
The code can be found here:
https://github.com/twitter/twitter-text-js/blob/master/twitter-text.js
A PHP version based on that code is available here: https://github.com/stephenbeckett/TwitterURLMatchPHP
See http://www.stevebeckett.com/twitter-url-match-for-php/

Comments

Pancho

Title: turn web addresses not starting with the http-protocol or the www-prefix into links » Turn web addresses starting with neither the http-protocol nor the www-prefix into links
Category: feature » task

Very nice!

We should ask the author, Steve Beckett, if he is willing to dual-license his twitterURLMatch class as Apache2 + GPL3, because Apache2 isn't compatible with our GPL2.
If yes, we should IMHO create a reasonable number of tests, and if it proves to work as advertised, especially not producing false positives, then we should request the class be added to our Packaging whitelist, so we can use it. We certainly don't want to reinvent the wheel.

Pancho

Version: 8.x-dev » 9.x-dev

Hmm, no. Sadly, Steve Beckett can't relicense his class, as it is a derived work of the Twitter library, which is also Apache2-licensed.
We had the same problem with Bootstrap, which Twitter might possibly relicense under MIT, but hasn't done so yet.

We should still keep asking Twitter for a relicensing of twitter-text-js.
For now, however, all we can do is create a contrib module and hotlink the class somewhere on the net or require manual download. As soon as the library has been relicensed, we can include it in our contrib module.

While I would have loved having this in D8 core, I believe we should go that pragmatic route rather than writing our own code that we have to maintain ourselves. I don't really believe it would be ready in time, and even then it might not be accepted into core.
Therefore tentatively moving this issue to the D9 queue... :(

Hanno

Version: 9.x-dev » 8.x-dev

Well, it was my first thought as well to include it as a library, as a case of Proudly Found Elsewhere. On second thought, it is not a good idea. Twitter's regex is complex because it's written in JavaScript, and Unicode support in JavaScript regexes is very limited. As it doesn't handle Unicode, Twitter has to include each and every character in the regex. We can just use '\p{L}' to match all letters, like Symfony does. So we can handle this logic without custom tweaking.
Meanwhile, I worked on a patch that handles the detection of web addresses without a protocol. It works great, and I did some code cleanup as well.
So let's try for Drupal 8?
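
To illustrate the point about matching all letters with a single Unicode-aware class instead of listing every character, here is a rough sketch in Python. Note that Python's standard re module does not support '\p{L}' (PCRE, which PHP uses, does), so the sketch swaps in the approximation [^\W\d_], which also lets through a few numeric symbols. The names LETTER, UNICODE_HOST and find_unicode_hosts are made up for the example, and the pattern does no TLD validation at all.

```python
import re

# Stand-in for \p{L}: any word character that is not a digit or underscore.
# This is an approximation, not an exact equivalent of the Unicode property.
LETTER = r"[^\W\d_]"

# One host label: starts and ends with a letter or digit, hyphens allowed inside.
UNICODE_LABEL = rf"(?:{LETTER}|\d)(?:(?:{LETTER}|[\d-])*(?:{LETTER}|\d))?"

# Labels separated by dots, ending in a letters-only label of two or more characters.
UNICODE_HOST = re.compile(
    rf"(?<![\w@./-])(?:{UNICODE_LABEL}\.)+{LETTER}{{2,}}(?![\w-])"
)


def find_unicode_hosts(text):
    """Return host names in text, including ones with non-ASCII letters."""
    return [match.group(0) for match in UNICODE_HOST.finditer(text)]


if __name__ == "__main__":
    print(find_unicode_hosts("Besuche münchen.de oder пример.org"))
```

An ASCII-only class such as [a-z] would reject both host names in the usage line, which is why Twitter's JavaScript code has to spell out the extra characters by hand.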

Pancho

Sure, show me what you have!
If it's indeed much simpler than the Twitter code, it seems like the better solution anyway.

Wim Leers

Version: 8.x-dev » 9.x-dev
Status: Active » Postponed

This would be nice to have. But how are we ever going to have a full list of trustworthy URLs?

This is the kind of thing that must first be developed and matured in Drupal contrib. Until then, moving to 9.x and postponing.

Hanno

@Wim well, we can simply match on the top-level domains with a regex like ([^\s.]+\.)+(com|org|gov|...|be|nl|de|...)(/\S*)?, as mentioned in point 1, to catch all the valid URLs. But with the new TLDs coming up, we will miss the new fancy ones. Twitter-text is trying to keep up with the new ones.
But it's indeed not something for 8.x at this point.