This issue has been reported privately to the security team but it was decided to handle this as public security improvement since no direct vulnerability is involved. This issue was reported by mvonfrie.
Problem/Motivation
valid_url() report http://-
as valid URL:
$valid = valid_url('http://-', true);
Urls look like this after the scheme (protocol) according to RFC 3986, section 3.2:
authority = [ userinfo "@" ] host [ ":" port ]
host = IP-literal / IPv4address / reg-name
reg-name should be "conform to the DNS syntax"
Important quote from below research:
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen.
UrlHelper::isValid() should reject a hostname that starts, ends, or is only a hyphen.
Proposed resolution
Improve the regex in UrlHelper::isValid() (valid_url() in Drupal 7).
Remaining tasks
Write a patch.
Review
Commit
User interface changes
none.
API changes
none.
Original comments:
#1 benjy commented April 2, 2015 at 6:35am Component: » Code This just seems like a bug to me at this point and could easily be handled in public. Comment #2 Pere Orga commented April 2, 2015 at 3:43pm I have tried a few browsers and all think URL http://- is valid. http://-: http://i.imgur.com/zjaPlzv.png http://localhost: http://i.imgur.com/UzXV4Xe.png Comment #3 benjy commented April 3, 2015 at 3:57am From: https://www.ietf.org/rfc/rfc3986.txt Scheme names consist of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus ("+"), period ("."), or hyphen ("-"). Although schemes are case- insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. An implementation should accept uppercase letters as equivalent to lowercase in scheme names (e.g., allow "HTTP" as well as "http") for the sake of robustness but should only produce lowercase scheme names for consistency. scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) So the example with the hyphen is valid because technically it could be part of the scheme? Comment #4 benjy commented April 3, 2015 at 4:09am OK, more reading, my first comment wasn't the reason. So as this issue reports, the format after the scheme is: host = IP-literal / IPv4address / reg-name And reg-name is made up of: reg-name = *( unreserved / pct-encoded / sub-delims ) And unreserved is: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" Which explains the valid url when starting with a hyphen. Comment #5 mvonfrie commented April 4, 2015 at 4:07pm Yes, but exactly before the definition of reg-name = *( unreserved / pct-encoded / sub-delims ) the RFC says Such a name consists of a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters. This is not covered by the formal syntax specification. That means every host name can contain hyphens, but neither start nor end with them or just be a hyphen: The labels must follow the rules for ARPANET host names. They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen. https://tools.ietf.org/html/rfc1035, 2.3.1 So in my understanding http://- is not a valid url as - is not a valid hostname. Comment #6 Pere Orga commented April 4, 2015 at 7:36pm In any case I also think this could be discussed in public. It doesn't look like any harm could be done. Comment #7 benjy commented April 6, 2015 at 6:11am +1 for public from me.
Comment | File | Size | Author |
---|---|---|---|
#22 | Tested--code.jpg | 192.48 KB | ranjith_kumar_k_u |
#22 | Before--patch.jpg | 52.65 KB | ranjith_kumar_k_u |
#22 | After--patch.jpg | 52.74 KB | ranjith_kumar_k_u |
#17 | drupal-2474191-17.patch | 1.74 KB | mrinalini9 |
#14 | drupal-2474191-14.patch | 1.74 KB | gapple |
Comments
Comment #1
Ayesh CreditAttribution: Ayesh commentedWe definitely have a problem!
Domain names cannot start or end with hyphens, but:
filter_var works as it should:
It all probably comes down to
*
as the only correct regex to validate URLs with today's and future TLDs (http://වෙබ්.පාර්ලිමේන්තුව.ලංකා/ anyone?).Comment #8
jibranvalid_url
doesn't exist in D8 anymore so moving the issue to D7.Comment #9
jibranah!
Comment #10
apadernoComment #11
apadernoThere isn't any
Url::isValid()
method, in Drupal 8.8.x.Comment #12
apadernoComment #13
gappleHere's a test-only patch, to cover leading and trailing hyphens in domain names.
Comment #14
gapple- Separate out the IPv4 test - it's now slightly more strict by requiring 4 number segments, but still allows invalid addresses such as 000.299.300.999
- Require that each domain segment is at least one alphanumeric character, and hyphens are only valid prior to an alphanumeric character or percent-encoded character.
Comment #17
mrinalini9 CreditAttribution: mrinalini9 commentedRerolled patch to 9.1.x.
Comment #19
quietone CreditAttribution: quietone as a volunteer commentedComment #20
quietone CreditAttribution: quietone as a volunteer commentedRetesting with 9.2.x
If that passed, then this is ready for review.
Comment #21
quietone CreditAttribution: quietone as a volunteer commentedComment #22
ranjith_kumar_k_u CreditAttribution: ranjith_kumar_k_u at Zyxware Technologies commentedI have tested the last patch on 9.2 dev with few examples
Tested code
Before patch
After patch
Comment #23
gregglesThe tests in #22 seem to indicate this is working properly. Also updating the issue summary to make it easier to find what the "right" behavior is.
Comment #24
dawehnerI wonder whether it would be worth testing some combinations of minuses as well, like -`example-.com` or so, this way more cases are covered
Comment #25
Ayesh CreditAttribution: Ayesh commentedJust a heads-up - upstream PHP has some changes, that parse_url function returns different results on latest PHP builds. I will update with more concrete examples soon.
Comment #28
vikashsoni CreditAttribution: vikashsoni as a volunteer and at Zyxware Technologies commentedApplied #17 patch in drupal-9.3.x-dev looking good and applied successfully
Thanks for the patch
Comment #29
benjifisherThanks to everyone who has worked on this. Regular expressions are so much fun, aren't they?
I reviewed the issue summary, the comments, the code, and the test. I did not read the RFCs referenced in the issue summary. I would like to suggest several changes.
The most complicated problem is that the current code has a single regular expression that is supposed to match either a domain name or an IPv4 address:
(?:[a-z0-9\-\.]|%[0-9a-f]{2})+
. That single regular expression is not strict enough, and I think it makes sense to have separate expressions for the two options (as @gapple pointed out in #14).The problem is that, when you do that, you are making the match for IPv4 addresses more strict. That is a good thing to do, but it is outside the scope of this issue.
We should change either the patch or the scope so that they match. I suggest we change the scope. That means updating the issue title and the description. When you do that, please be as clear as you can in the "Proposed resolution" section.
In #25, @Ayesh pointed out that there are changes to the
parse_url()
function in PHP 8. I reviewed the PHP docs for parse_url(). I think the only changes are whether "empty" query strings and fragments are represented byNULL
or''
. That should not affect this issue.I would like to point out that
testValidAbsolute()
already includesex-ample.com
as one of its test cases.The following suggestions mostly refer to the main part of the patch:
[a-z]...[a-z0-9]
. Except that "letter" should allow a URL-encoded character, so make that(?:[a-z]|%[0-9a-f]{2})...(?:[a-z0-9]|%[0-9a-f]{2})
-?(?:[a-z0-9]|%[0-9a-f]{2})
requires each hyphen to be followed by an alphanumeric character or URL-encoded non-ASCII character. This is stricter and more complicated than the correct regular expression,[a-z0-9\-]|%[0-9a-f]{2}
.