When query string parameters are passed to url() via $options['query'], the output rendered by url() is invalid because ampersands (to separate different query string values) are not properly escaped.
From the XHTML 1.0 spec:
C.12. Using Ampersands in Attribute Values (and Elsewhere)
In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
I ran into this issue when I tried to validate Drupal's output with the devel menu enabled: Devel adds a 'destination'-parameter to most of the urls in its menu. This caused all pages on which the menu was active to be invalid XHTML. An example of an href generated by url() before this patch is:
/index.php?q=devel/cache/clear&destination=user%2F1
The correct href should be:
/index.php?q=devel/cache/clear&destination=user%2F1
Note: (just so this doesn't get moved to the devel queue) the devel menu is just an example. IMO this is a core bug).
| Comment | File | Size | Author |
|---|---|---|---|
| #5 | 399488_common_l_invalid_xhtml.patch | 2.47 KB | mr.baileys |
| common_url_invalid_xhtml.patch | 2.61 KB | mr.baileys |
Comments
Comment #1
damien tournoud commentedurl() just returns... a URL. It's l() that needs fixing.
Comment #2
damien tournoud commentedOk, url() calls filter_xss_bad_protocol() that calls check_plain(). Conclusion: there is no core bug here. It's the responsibility of the calling code to properly call check_plain() when outputing the URL returned by url() in an HTML context.
Comment #3
mr.baileysI can follow your reasoning when it comes to url(). As you point out, url just returns a url, and can be used outside of an HTML context.
However (and sorry if I'm being dense), I'm not convinced that l() is not currently broken. The function declaration states that the return value is
and the return statement is:
return '<a href="' . url($path, $options) . '"' . drupal_attributes($options['attributes']) . '>' . ($options['html'] ? $text : check_plain($text)) . '</a>';So l() is claiming to return HTML, but the HTML it returns is not always valid. Are you saying any invocation of l() should be wrapped in check_plain by calling code? And doesn't this potentially cause already encoded ampersands to be encoded again though (& becoming &amp;)?
But isn't filter_xss_bad_protocol() only called for absolute URLs? The bug/issue outlined here occurs with internal Drupal links, so I think the call to filter_xss_bad_protocol is skipped. Not sure this is relevant though, as we're now focusing on l().
Comment #4
damien tournoud commented@mr.baileys: True, we removed from url() in D7 the check_url() that added no security (because the result from url() is already sanitized) in a previous issue. We should have transformed the check_url() into check_plain() instead. Could you provide a patch that does that?
Comment #5
mr.baileysComment #6
mr.baileysStrange, Drupal ate my comment... Trying again:
Found the change you are referring too: #354812: filter_xss_bad_protocol is called hundreds of times on some pages.
Patch attached adds the check_plain where filter_xss_bad_protocol used to be. I ran the l() tests and they still pass after this change. Checked the resulting document with the W3C validator and it is now considered valid XHTML.
Thanks for steering me in the right direction!
Comment #7
dries commentedAh, another reason to have an XHTML verifier in SimpleTest. Maybe we could start with a simple XML validator using some of the PHP5 functionality.
Comment #8
mr.baileysI strongly agree. I patched my version of simpletest to include some rudimentiary DTD validation through DOMDocument. It's slow as molasses (because DTDs are fetched from the W3C site, not good) and throws some extra warnings, so it's not ready for prime time...
I'll open an issue so we can get a discussion going.
EDIT: added #402356: XHTML validator needed in SimpleTest.
Comment #9
damien tournoud commentedConfirmed that this fix validation errors in #402254: How do we do on XHTML validation?. Green lighted!
Comment #10
damien tournoud commentedAdding proper tag.
Comment #11
webchickOk, cool. Damien spent some time on IRC walking me through the rationale for this patch, so committed this to HEAD. My preference would be to commit this with tests, but once #402254: How do we do on XHTML validation? is in, that'll fix that problem everywhere. :)
Thanks, mr. baileys!
Comment #13
rfayThis introduced a regression:
The '&' between query elements now gets inappropriately escaped.
Same problem in Drupal 6, for historical reasons.
Making this active so the original participants can suggest the right way to fix this (in D6 and D7).
Comment #15
chx commentedPlease read http://acko.net/blog/safe-string-theory-for-the-web and understand that this patch mixed up HTML attribute text with HTML inline text -- check_plain is used to escape the latter not the former.
Comment #16
mr.baileys@rfay:
I don't think it is inappropriate. See the original post and XHTML 1.0 spec referenced there. Verbatim from the specs:
@chx:
The l() docs says "Formats an internal or external URL link as an HTML anchor tag"…; In order for that HTML anchor string to be valid, the &s need to be escaped.
I agree that this might get us into trouble if down the chain drupal_attributes() is called using output generated by l(), leading to double-escaping, but that's a separate issue IMO.
Comment #17
rfayNo, sorry, it's definitely wrong. I don't have an objection to dealing with it in a separate issue, but it's definitely wrong, and it was introduced here, and that's why I wanted to talk about the issue here.
For example:
Essentially, this means that the l() function does not handle multiple query args at all, and produces output that *looks* like it's handling them, in the browser. But it's just an escaped imitation of the real thing. The actual output is illegal in every way, as we end up with a single query arg, which has an '=' in it... but most of all, it doesn't do what it was intended to do.
Comment #18
mr.baileysNo, sorry, I still think it's correct ;)
When I run the following piece of code through the W3C validator, I get all greens, and the links are handled correctly:
When I run the following code through the validator (only difference is the use of & instead of & in the url), the validator reports errors.
Just as described in the specification.
I'm unable to reproduce that:
When I embed the following link in an XHTML document:
click on it, and do a var_dump($_GET) on the receiving end, I get:
(the query parameters are correctly split)
Same goes for Google: embed the following link in a valid XHTML doc:
and google will give you a list of PDF files for Drupal.
So while I'm happy to admit I could be wrong, the specification, XHTML validation result, automated XHTML testing (see Damien's comment in #9) and some manual testing all seem to indicate that the current behavior is correct.
Comment #19
heine commentedmr.baileys is correct. The ampersand in the query must be encoded as
&when in HTML.Comment #20
rfayOK, I'll buy it. Thanks for your patience. I agree that the original spec description is very clear. I'm sorry I didn't read the original posting carefully: I thought that the behavior I was seeing was an *artifact* of the change, not the fundamental reason for the change.
Perhaps my server-side handling of the link is at fault, and it needs to un-escape the parameters before processing.