This was brought to my attention by my colleague robbyahn this week.
<a href="http://drupal.org/project/modules/?f[0]=drupal_core%3A103&f[1]=bs_project_sandbox%3A0">D7 modules</a>
is not valid in HTML5 and cannot be made so with D8, D7 or any Drupal. (without theme hacks)
This is a core issue with the drupal_http_build_query() function
Under the url() function.
Simply:
The "[" and "]" characters are (now) illegal in HTML5 URLs!
After some digging, that started with a "Huh?" and proceeded to my calling both the W3C 'experimental' validator and http://html5.validator.nu/ broken ... I find that HTML5 does not use URLs or URIs, but IRIs instead. This seems to have come in around the time of supporting international UTF-8 international characters in web addresses last decade.
This minimal HTML5 Document hard, red-flag fails when pasted into the W3C validator set as HTML5.
<!DOCTYPE html><html>
<head><title>example</title></head>
<body>
<a href="http://drupal.org/project/modules/?f[0]=drupal_core%3A103&f[1]=bs_project_sandbox%3A0">Drupal 7 modules</a>
</body>
</html>
>Validation Output:
1 Error
-
Line 4,
Column 100:
Bad value http://drupal.org/project/modules/?f[0]=drupal_core%3A103&f[1]=bs_projec... for attribute href on element a: Illegal character in query component.…s/?f[0]=drupal_core%3A103&amp;f[1]=bs_project_sandbox%3A0"<strong title="Position where error was detected.">></strong>Drupal 7 modules</a>
- Syntax of IRI reference:
- Any URL. For example:
/hello
,#canvas
, orhttp://example.org/
. Characters should be represented in NFC and spaces should be escaped as%20
.
A valid version is
<!DOCTYPE html><html>
<head><title>example</title></head>
<body>
<a href="http://drupal.org/project/modules/?f%5B0%5D=drupal_core%3A103&f%5B1%5D=bs_project_sandbox%3A0">Drupal 7 modules</a>
</body>
</html>
Which has "[" and "]" replaced with the URL-encoded "%5B" and "%5D" respectively.
In my limited testing so far doing so is browser-friendly (though suck-ier to read) and the Apache/PHP request handler does what it should when decoding it. Its functionally OK.
I am aware that nested GET arrays with [] arguments in the query strings has been a cool, lazy feature of PHP since forever. But if we (our clients) want a clean validation report and also want to use interesting arguments in URLs, this can be fixed.
Patch details below...
Comment | File | Size | Author |
---|---|---|---|
#46 | interdiff-1827854-42-46.txt | 8.58 KB | jamesrward |
#46 | html5-changes-1827854-46.patch | 10.44 KB | jamesrward |
#42 | html5-changes-1827854-42.patch | 1.6 KB | jamesrward |
#40 | rawurlencode_square_brackets-1827854-40-D7.patch | 1.57 KB | anrikun |
#32 | html5-changes-1827854-32.patch | 1.67 KB | legolasbo |
Comments
Comment #1
dman CreditAttribution: dman commentedOur problem is the line in drupal_http_build_query()
which goes:
So.
Either we fix that up to become
or (I would prefer) we let the system sort this (and any similar issues) out with URL-encoding the key, always, by changing
to
Comment #2
dman CreditAttribution: dman commentedSuggested strawman patch attached.
Input welcome. Is the really a huge problem? I've not found much noise about it in the PHP-HTML5 discssions online or in any existing issue search on html5
I am aware that if applied (to Drupals core url() function) it may trigger an amount of regression sparks.
It's not *killing* anything in the real world so far, but a W3C green-light is an important thing (to us)
Comment #3
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is never a problem in practice, but we should fix it anyway.
You patch leads to
$key
being encoded twice, so needs work for that. We probably also need to fix the tests.Comment #4
dman CreditAttribution: dman commentedAh yeah, I see. Because drupal_http_build_query recurses upon itself it can double up the encoding.
So I guess the first place would be the place to do it. I didn't study it too hard when just trying to see if it was worth trying to fix at all..
Nice that there is a test already there for this!
Testbot, see if this will do better...
In my test I'm using urldecode() instead of a literal string, as I don't care how it arrives at the result, as long as the result ends up correct. OTOH, it doesn't prove that this fix does anything at all either, just that it's fully compatible with previous behavior. Hm.
Comment #5
Damien Tournoud CreditAttribution: Damien Tournoud commentedYou can move the rawurlencode() outside the ternary operation.
No, please check the result explicitly. Modified like this, this test would pass with our without your patch :)
Comment #6
superspring CreditAttribution: superspring commentedSame patch as above, reworked with Damien's changes.
Comment #7
coderintherye CreditAttribution: coderintherye commentedThe test result says this passed, but the message here seems to be stuck at test request sent. What's up testbot?
Comment #8
dman CreditAttribution: dman commentedtry bumping to requeue I guess. It's well old now...
Comment #9
dman CreditAttribution: dman commentedand testbot ahoy?
Comment #10
waverate CreditAttribution: waverate commentedHas this been reviewed/committed in D8 so that it can be backported to D7?
Comment #10.0
waverate CreditAttribution: waverate commentedwording improvement
Comment #11
joseph.olstadHi Waverate, I applied the changes shown in the patch from comment #6 to D7 and they work without modification. So the code works on D7.
Any chance this can get committed into D7 and D8 core (and D6 if someone can test it?) ? We applied this fix and were able to pass w3c validation on our facet search page.
Anyone else using this patch?
Thanks,
Comment #12
joseph.olstadrenamed patch to follow naming convention (description_description-NODEID-COMMENTNUMBER.patch ) thus w3c_url_validation_html5_changes-1827854-12.patch
Comment #14
joseph.olstadNot sure why #12 failed (it is a copy of #6),
So use #6 instead, it passed.
Comment #15
dcam CreditAttribution: dcam commentedClosed the D7 backport issue, #2289867: [D7] W3C HTML5 Validation error with PHP array-based query-strings built with url() containing [ ] characters, as a duplicate.
I kind of doubt this is backportable, but I'll tag it and the committers can make the call.
#6 needs a reroll.
Comment #16
superspring CreditAttribution: superspring commentedRerolling... Looks like the majority of this patch has already been done.
The tests in
- core/modules/system/lib/Drupal/system/Tests/Common/UrlTest.php:testDrupalGetQueryParameters
may be sufficient now. Someone else can look at these if they would like some more.
Comment #18
joseph.olstadThe patch 5 works on D7. I've got over 300 contrib modules running for quite some time using this patch and no side-effects yet seen. The client web browser handles decoding and nothing really changes from the drupal side of things except for properly encoding the two square brackets before generating a link.
The simpleTest failure seen above is actually a problem with UrlHelperTest.php .
Line 40 of UrlHelperTest.php should be changed..
It's now:
Should be changed to:
%5B is the [
%5D is the ]
If you look at line 37, 38 , 39 they have the encoding for special characters using % but line 40 does not have the encoding.
So, the code is right, the test case UrlHelperTest.php needs to be updated for the new patch to pass testing .
Click here to view the source code for the test case
Thanks
Comment #19
joseph.olstadSee comments #5, #6 and #18
Comment #20
gdaw CreditAttribution: gdaw commentedWe may be able to test this in more than one D7 environment, but as Joseph reported we've seen no problems.
Comment #21
star-szrThis doesn't reroll but does need further work.
Comment #22
joseph.olstadWe've been using this w3c validation fix to core for months now and it works.
Obviously the simple test needs to be adjusted, but the actual patch from superspring in comment #5 is good.
Anyone know how to adjust/update the simplytest? Obviously it's checking for the square bracket [ character instead of the uriencoded characters
for instance "[" is uri encoded as "%5B"
and "]" is uri encoded as "%5D"
the patch does this correctly but the simply test keeps checking for the non-encoded characters [ and ].
Please adjust the simplytest accordingly and then the patch will pass testing. If we can get this included in core then Drupal core will become HTML5 w3c compliant.
Comment #23
rooby CreditAttribution: rooby commentedComment #24
bceyssensRerolled
Comment #25
dawehnerSo for D8 we have the following comment:
Comment #26
joseph.olstadWe're still using the D7 version of this patch.
Comment #27
cdonner CreditAttribution: cdonner commentedThis is all good, until you paste such a URL as a target into the Redirect module. At that point the % gets encoded again, resulting in %255B and %255D, respectively. Filter query broken. You have to put the [] back into the URL before using it anywhere else in Drupal. Things like this make me want to throw Drupal out and write my site from scratch.
Comment #28
joseph.olstadIn response to dawehner and using "http_build_query() directly" (instead of rawurlencode?) , the official system requirements page says that 5.4.5 or higher is required for D8. So that means that the D8 patch could now use http_build_query()
"Drupal 8: PHP 5.4.5 or higher"
Comment #29
joseph.olstadcdonner in comment #27 you mentioned the redirect module. I installed and tested the "redirect" module (version 7.x-1.0-rc1) in combination with this core patch and the redirect worked correctly for a link that contained square brackets and there was no problem dealing with the patch that url encodes the square brackets. The redirect of url encoded square brackets still works correctly after patching. The client browser interprets the square brackets correctly. Our production D7 site has been using the D7 version of this patch since it was created and we have over 400 modules and features and not one of them has had a problem with this.
Comment #30
martijn@ibuildings.nl CreditAttribution: martijn@ibuildings.nl at Ibuildings commentedJust applied and reviewed the patch in #24. Changes seem sensible to me and I haven't found any breakage.
Re-uploaded the patch to make it run on the new test infrastructure.
Comment #31
legolasboMy colleague mdeletter has reviewed this issue but is being hindered by the fact that is not a "confirmed user" yet. His comment got lost in translation which is why I am posting his comment for him.
Please credit mdeletter on commit for reviewing the patch.
Comment #32
legolasboAlso D.O bugged out with his patch upload, which is why I am re-uploaded the patch again.
In case it wasn't already clear, patches #30 and #32 are identical to patch #24 and only re-uploaded to make sure they are running in the new testing infrastructure.
Comment #33
alexpottCan someone confirm that we have test coverage of making a request using square brackets and it still working as expected after this change?
Comment #34
legolasbo@alexpott,
Why does the way we generate a URL have anything to do with how a request is handled?
Comment #35
alexpott@legolasbo so I might not have asked the question probably - but what we need is confirmation that PHP still handles the query string as expected.
Comment #36
legolasbo@alexpott, So if I understand you correctly, you want to make sure that a request to
example.com/blog?tags%5B3%5D=3
andexample.com/blog?tags[3]=3
results in the same page without errors?We have been unable to locate any automated tests that verify this in D8. However, since the request is (AFAIK) handled by Symfony, they should have test coverage for this (assumption). I have manually tested the case mentioned above and it is handled as expected.
Comment #40
anrikun CreditAttribution: anrikun commentedTesting the same patch as #32 for D7
Comment #41
jamesrward CreditAttribution: jamesrward as a volunteer commentedRe #36 this is tested by symfony in HttpFoundation/Tests/RequestTest.php line 648:
2.8/src/Symfony/Component/HttpFoundation/Tests/RequestTest.php
Linked to 2.8 as that is what D8 core is using.
This is confirming that the un-encoded query string gets encoded properly. Encoded query strings are handled as-is. I tried hacking in the example in #36 and it passed the same test.
Would love to see this move towards RTBC for D8 here so we can get the D7 backport in.
Comment #42
jamesrward CreditAttribution: jamesrward as a volunteer commentedRerolled for 8.3.x
Comment #44
joseph.olstad@jamesrward , looks like 4 new (failing) tests also need the conversion from
[]
to%5B%5D
Comment #45
joseph.olstad@alexpott after this change nothing breaks , this is because the webbrowser automatically converts the square brackets , what is broken is w3c compliance (as silly as it is in this case IMHO ) when drupal without the patch renders links with [] , then when the w3c validator runs, it sees the square brackets and flags it as non html5 compliant. The patch does not prevent square brackets from being used (thats actually handled by web browsers when you click on a link). What the patch actually does is make Drupal link rendering be html5 compliant.
been using the D7 patch with hundreds of contrib modules and custom modules , no problem. Did not have to rewrite any code in my contrib or custom modules.
Comment #46
jamesrward CreditAttribution: jamesrward as a volunteer commentedHere's a patch and an interdiff with the BigPipe tests fixed. Basically these tests expected to put square bracket arguments in and get square bracket arguments out. Our patch urlencodes those brackets so the input and output no-longer match. Per #45 I don't think we are actually breaking BigPipe with the patch in #42, just the tests. To me this is a good indicator that we need to get this fix in soon. The longer core stays like this, the more places we are going to have to hunt down and replace non-W3C compliant tests.
Comment #47
jamesrward CreditAttribution: jamesrward as a volunteer commentedComment #48
jamesrward CreditAttribution: jamesrward as a volunteer commentedI think the only thing left to decide here is if this goes on 8.3.x as a bug-fix or waits for 8.4. Paging @alexpott
Comment #49
jamesrward CreditAttribution: jamesrward as a volunteer commentedComment #50
joseph.olstad@jamesrward, yes you are correct, just the tests needed adjusting.
Comment #51
alexpottGiven http_build_query does this:
Our own equivalent will now do the same.
Let's get this done. I considered whether or not we need a CR for this and I don't think so. The return of \Drupal\Component\Utility\UrlHelper::buildQuery() is still a valid URL rawencoded string - only more so.
I guess it is technically possible that someone somewhere is doing preg_match on the output but that'd be wrong... you'd do:
Comment #52
alexpottI tested core's menu form for @cdonner's comment #27.
I installed standard and did an advanced search after logging in as user 1. This gave me the beautiful url of
search/node?keys=search%20term&f%5B0%5D=type%3Aarticle&f%5B1%5D=language%3Aen&advanced-form=1
and I used that as a menu link. All works great.Comment #53
alexpottCommitted 02d56af and pushed to 8.4.x. Thanks!
I will discuss backporting this to 8.3.x with the other committers.
I credited @Damien Tournoud for his review that identified an issue with the original patch. Thanks to everyone else for their comments.
Perhaps someone can open a follow up to convert \Drupal\Component\Utility\UrlHelper::buildQuery() to just called http_build_query() - it will be a minor performance improvement.
Comment #55
alexpottDiscussed with @catch we agreed to backport to 8.3.x
Comment #57
alexpottThe followup to use http_build_query() already exists - there are problems with it see #2322059: Keep the UrlHelper::buildQuery method and remove @todo
Comment #58
joseph.olstadD7 patch is ready to go, been using it on several sites for over a year without issues.
Comment #59
joseph.olstadComment #60
xjm@joseph.olstad, can you create a separate issue for D7? Our current process is to create separate issues for D7 if needed and just reference the D8 one in the related issues. (Also, none of the file attachments on this issue are D7 patches.)
Comment #61
jamesrward CreditAttribution: jamesrward as a volunteer commented@xjm backport issue re-opened at #2289867: [D7] W3C HTML5 Validation error with PHP array-based query-strings built with url() containing [ ] characters
#40 appears to be latest D7 ready so I will drop it into that issue for testing.
Comment #62
xjmThanks @jamesrward!
Comment #64
andrewmacpherson CreditAttribution: andrewmacpherson as a volunteer and at Annertech commentedthis is the only issue tagged with "w3c validator", moving to the other tag for posterity.
Comment #65
joseph.olstadYears later, regressions found in facets module carrying over , still affecting probably both 7x and 8x/9x facets.
#2988736: Encoded characters are in the url
and
#3226121: Missing facet caused by inconsistent encoded url that confuses facets module, facets uses spaces but search_api_fulltext submits + symbols