This issue has been reported privately to the security team but it was decided to handle this as public security improvement since no direct vulnerability is involved. This issue was reported by mvonfrie.
Problem/Motivation
valid_url() report http://- as valid URL:
$valid = valid_url('http://-', true);
Urls look like this after the scheme (protocol) according to RFC 3986, section 3.2:
authority = [ userinfo "@" ] host [ ":" port ]
host = IP-literal / IPv4address / reg-name
reg-name should be "conform to the DNS syntax"
Important quote from below research:
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen.
UrlHelper::isValid() should reject a hostname that starts, ends, or is only a hyphen.
Proposed resolution
Improve the regex in UrlHelper::isValid() (valid_url() in Drupal 7).
Remaining tasks
Write a patch.
Review
Commit
User interface changes
none.
API changes
none.
Original comments:
#1 benjy commented April 2, 2015 at 6:35am
Component: » Code
This just seems like a bug to me at this point and could easily be handled in public.
Comment #2 Pere Orga commented April 2, 2015 at 3:43pm
I have tried a few browsers and all think URL http://- is valid.
http://-: http://i.imgur.com/zjaPlzv.png
http://localhost: http://i.imgur.com/UzXV4Xe.png
Comment #3 benjy commented April 3, 2015 at 3:57am
From: https://www.ietf.org/rfc/rfc3986.txt
Scheme names consist of a sequence of characters beginning with a
letter and followed by any combination of letters, digits, plus
("+"), period ("."), or hyphen ("-"). Although schemes are case-
insensitive, the canonical form is lowercase and documents that
specify schemes must do so with lowercase letters. An implementation
should accept uppercase letters as equivalent to lowercase in scheme
names (e.g., allow "HTTP" as well as "http") for the sake of
robustness but should only produce lowercase scheme names for
consistency.
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
So the example with the hyphen is valid because technically it could be part of the scheme?
Comment #4 benjy commented April 3, 2015 at 4:09am
OK, more reading, my first comment wasn't the reason.
So as this issue reports, the format after the scheme is: host = IP-literal / IPv4address / reg-name
And reg-name is made up of: reg-name = *( unreserved / pct-encoded / sub-delims )
And unreserved is: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Which explains the valid url when starting with a hyphen.
Comment #5 mvonfrie commented April 4, 2015 at 4:07pm
Yes, but exactly before the definition of
reg-name = *( unreserved / pct-encoded / sub-delims )
the RFC says
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.
This is not covered by the formal syntax specification. That means every host name can contain hyphens, but neither start nor end with them or just be a hyphen:
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen.
https://tools.ietf.org/html/rfc1035, 2.3.1
So in my understanding http://- is not a valid url as - is not a valid hostname.
Comment #6 Pere Orga commented April 4, 2015 at 7:36pm
In any case I also think this could be discussed in public. It doesn't look like any harm could be done.
Comment #7 benjy commented April 6, 2015 at 6:11am
+1 for public from me.
| Comment | File | Size | Author |
|---|---|---|---|
| #22 | Tested--code.jpg | 192.48 KB | ranjith_kumar_k_u |
| #22 | Before--patch.jpg | 52.65 KB | ranjith_kumar_k_u |
| #22 | After--patch.jpg | 52.74 KB | ranjith_kumar_k_u |
| #17 | drupal-2474191-17.patch | 1.74 KB | mrinalini9 |
| #14 | drupal-2474191-14.patch | 1.74 KB | gapple |
Issue fork drupal-2474191
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #1
ayesh commentedWe definitely have a problem!
Domain names cannot start or end with hyphens, but:
filter_var works as it should:
It all probably comes down to
*as the only correct regex to validate URLs with today's and future TLDs (http://වෙබ්.පාර්ලිමේන්තුව.ලංකා/ anyone?).Comment #8
jibranvalid_urldoesn't exist in D8 anymore so moving the issue to D7.Comment #9
jibranah!
Comment #10
avpadernoComment #11
avpadernoThere isn't any
Url::isValid()method, in Drupal 8.8.x.Comment #12
avpadernoComment #13
gappleHere's a test-only patch, to cover leading and trailing hyphens in domain names.
Comment #14
gapple- Separate out the IPv4 test - it's now slightly more strict by requiring 4 number segments, but still allows invalid addresses such as 000.299.300.999
- Require that each domain segment is at least one alphanumeric character, and hyphens are only valid prior to an alphanumeric character or percent-encoded character.
Comment #17
mrinalini9 commentedRerolled patch to 9.1.x.
Comment #19
quietone commentedComment #20
quietone commentedRetesting with 9.2.x
If that passed, then this is ready for review.
Comment #21
quietone commentedComment #22
ranjith_kumar_k_u commentedI have tested the last patch on 9.2 dev with few examples

Tested code
Before patch

After patch

Comment #23
gregglesThe tests in #22 seem to indicate this is working properly. Also updating the issue summary to make it easier to find what the "right" behavior is.
Comment #24
dawehnerI wonder whether it would be worth testing some combinations of minuses as well, like -`example-.com` or so, this way more cases are covered
Comment #25
ayesh commentedJust a heads-up - upstream PHP has some changes, that parse_url function returns different results on latest PHP builds. I will update with more concrete examples soon.
Comment #28
vikashsoni commentedApplied #17 patch in drupal-9.3.x-dev looking good and applied successfully
Thanks for the patch
Comment #29
benjifisherThanks to everyone who has worked on this. Regular expressions are so much fun, aren't they?
I reviewed the issue summary, the comments, the code, and the test. I did not read the RFCs referenced in the issue summary. I would like to suggest several changes.
The most complicated problem is that the current code has a single regular expression that is supposed to match either a domain name or an IPv4 address:
(?:[a-z0-9\-\.]|%[0-9a-f]{2})+. That single regular expression is not strict enough, and I think it makes sense to have separate expressions for the two options (as @gapple pointed out in #14).The problem is that, when you do that, you are making the match for IPv4 addresses more strict. That is a good thing to do, but it is outside the scope of this issue.
We should change either the patch or the scope so that they match. I suggest we change the scope. That means updating the issue title and the description. When you do that, please be as clear as you can in the "Proposed resolution" section.
In #25, @Ayesh pointed out that there are changes to the
parse_url()function in PHP 8. I reviewed the PHP docs for parse_url(). I think the only changes are whether "empty" query strings and fragments are represented byNULLor''. That should not affect this issue.I would like to point out that
testValidAbsolute()already includesex-ample.comas one of its test cases.The following suggestions mostly refer to the main part of the patch:
[a-z]...[a-z0-9]. Except that "letter" should allow a URL-encoded character, so make that(?:[a-z]|%[0-9a-f]{2})...(?:[a-z0-9]|%[0-9a-f]{2})-?(?:[a-z0-9]|%[0-9a-f]{2})requires each hyphen to be followed by an alphanumeric character or URL-encoded non-ASCII character. This is stricter and more complicated than the correct regular expression,[a-z0-9\-]|%[0-9a-f]{2}.