Since a rewrite in October 2010, html_entity_decode()
doesn't decode invalid numeric entities anymore in PHP 5.4, as a consequence:
html_entity_decode('', ENT_QUOTES, 'UTF-8');
Returns ''
on PHP 5.4 and a single character chr(14)
on PHP 5.3.
Is this a good thing? Should we ask PHP core to reconsider?
One effect of this is that one of our filter_xss()
test fails, because something like '<img src="  javascript:alert(0)">'
is not escaped anymore.
Comment | File | Size | Author |
---|---|---|---|
#28 | decode_invalid_numeric_entities-1210798-28.patch | 7.45 KB | andribas |
#13 | sec.html_.txt | 29.74 KB | klausi |
Comments
Comment #1
Damien Tournoud CreditAttribution: Damien Tournoud commentedTag
Comment #2
pwolanin CreditAttribution: pwolanin commentedIs there a PHP issue describing why they made this change?
Comment #3
Damien Tournoud CreditAttribution: Damien Tournoud commented@pwolanin: The only thing I could find is the commit message linked in the OP.
Comment #4
chx CreditAttribution: chx commentedAs discussed on IRC, if PHP core fails us then we need to fall back to the old userspace implementation.
Comment #5
sunNot sure I follow:
Comment #6
sunAside from #5, is it possible that there are undocumented flags we could try? (especially ENT_DISALLOWED and ENT_SUBSTITUTE might be worth a try)
Comment #7
Damien Tournoud CreditAttribution: Damien Tournoud commentedIndeed, PHP versions < 5.4 return
chr(14)
in that case (it's not a character printable by my terminal). That'sU+000E
, and it is indeed a valid UTF-8 control character. So there seems to be something wrong with PHP 5.4.If I understand correctly, the default value is HTML 4.01 mode, which is described as allowing "all the numerical entities that represent a Unicode code point (< U+10FFFFFF)", so I am a little bit confused why this is happening.
I'm opening a separate feature request to enable
ENT_SUBSTITUTE
by default incheck_plain()
.Comment #8
Damien Tournoud CreditAttribution: Damien Tournoud commented#1211866: Enable ENT_SUBSTITUTE flag in Html::escape for the related feature request.
Comment #9
linora CreditAttribution: linora commentedComment #10
DamienMcKennaAdding an extra tag to help catalog it.
Comment #11
DamienMcKennaFor completeness sake, PHP 5.2.13 says:
Comment #12
Crell CreditAttribution: Crell commentedHas anyone filed this upstream with PHP internals? Sounds like something that should be addressed there, either with a bugfix or recommended workarounds from the people who know it best...
Comment #12.0
Crell CreditAttribution: Crell commentedMinor addition.
Comment #12.1
Damien Tournoud CreditAttribution: Damien Tournoud commentedUpdate summary.
Comment #13
klausiThis issue was unpublished for a couple of days because it was investigated by the Drupal security team. No exploitable vulnerability has been found, so we can continue here in public.
Attached is the thread of the security.drupal.org discussion (rename to .html to view in your browser).
Comment #14
dcrocks CreditAttribution: dcrocks commentedThe character '' is a control character and not valid in a html document in any case. Why are we concerned about these?
Comment #15
Wim LeersThis is still a @todo in
XssTest
today.Comment #16
catchThis came up in #2193023: EditorXssFilter/StandardTest::dataset #25 fails on php 5.4 which added another @todo.
Comment #17
pwolanin CreditAttribution: pwolanin commentedrelated #2387055: possible Xss on php 5.4+
Comment #18
pwolanin CreditAttribution: pwolanin commentedre-reading the thread - it seems there may be attacks against IE6, but I 'm not seeing that even that works after Drupal filters the img tag.
However, it's a little unclear if there is a vector e.g. if the input format allows IMG tags?
Comment #19
andribas CreditAttribution: andribas commentedTo achieve old behavior of PHP 5.3 we could manually decode all numeric entities with control characters
They cannot be twice decoded since control character could not be part of html entity, so it is safe to decode valid and invalid numeric entities (as it work in php 5.3)
But you should consider http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
8.1.4 http://www.w3.org/html/wg/drafts/html/master/syntax.html#text
and http://www.w3.org/html/wg/drafts/html/master/syntax.html#preprocessing-t...
This is bug report from php team https://bugs.php.net/bug.php?id=52860
and commit https://github.com/php/php-src/commit/91727cb844ecfe55f6dcd2f13ffeec3962...
to understand new behavior.
Also you should consider to use fast algorithm when parsing untrusted html input.
Drupal core sanitize function relies on kses algorithm
if we could use FSM instead of all preg_replace, preg_replace_callback this would make great performance impact on drupal core.
At this time input parsed many times, but it could pe parsed consequently.
Good approach is gumbo-parser from google, here it's tokenizer https://github.com/google/gumbo-parser/blob/master/src/tokenizer.c
and comment about it's work https://github.com/google/gumbo-parser/issues/53#issuecomment-43395597
May be it could be ported in simplified version.
Comment #20
chx CreditAttribution: chx commentedhttp://php.net/manual/en/mbstring.installation.php
Relying on an mb function alas won't work. (It's not in the required extensions list in
system_requirements
) either.Comment #21
andribas CreditAttribution: andribas commentedThis would do the same (adopted from php source) - decode all not valid numeric entities.
Since code of that entity is not valid in document - it cannot be part of another entity, hence it will be decoded only once - both for php 5.3 and 5.4
if old behavior of php 5.3 is wanted
Comment #22
andribas CreditAttribution: andribas commentedfixed typo (a-z => a-f)
Comment #23
chx CreditAttribution: chx commentedI will review this later but let's see what the bot has to say.
Comment #24
chx CreditAttribution: chx commentedAlso, thanks you SO MUCH for the awesome work on this, I really appreciate doing this. See https://twitter.com/chx/status/542001339564163072 :)
Comment #26
andribas CreditAttribution: andribas commentedi crossed my fingers)
please review flag ENT_HTML5
by default flags set to ENT_COMPAT | ENT_HTML401
and these constants were added in 5.4.0 - may be we need only one case for switch
http://php.net/manual/en/function.html-entity-decode.php
Comment #27
andribas CreditAttribution: andribas commentedchanged flag to default ENT_HTML401. if ENT_HTML5 needed, it also should be set in
return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
Comment #28
andribas CreditAttribution: andribas commentedSorry for wrong code, provided test.
It checks that numeric entity is decoded in supplied method only if it would be not decoded native.
and if it would be decoded native, it will not be decoded in
decodeInvalidNumericEntities()
also added boundary to
$code<0x110000
, Largest code point for UTF-8 is 0x10FFFFto check correct decoding - pack function, mbstring needed, i don't know how to check without it.
on localhost passes.
Comment #29
andribas CreditAttribution: andribas commentedComment #30
andribas CreditAttribution: andribas commented@chx please, review.
if this bug will be covered by another issue https://www.drupal.org/node/1333730
mention, please.
Comment #31
twistor CreditAttribution: twistor commented@andribas, I will do some checking in Masterminds/HTML5 to see how they handle decoding.
For decoding Utf-8 php, see https://github.com/twistor/mbphp/blob/master/src/Encoder/Utf8.php
Comment #32
andribas CreditAttribution: andribas commented@twistor thank you!
i have got decoding function and check against valid codepoint from here
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...
because it is exactly what native function do when decoding, see
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...
Comment #33
twistor CreditAttribution: twistor commentedAhh, I see you're only using mbstring() for the test.
Comment #34
cashwilliams CreditAttribution: cashwilliams commentedSince Drupal 8 now requires PHP 5.5.9 (https://www.drupal.org/requirements), is this issue now irrelevant?
Comment #35
klausiNo, still relevant on PHP 5.5: https://3v4l.org/njqTN
Comment #39
David_Rothstein CreditAttribution: David_Rothstein as a volunteer commentedComment #50
SpokjeSince we're now launching drupal 10, which requires PHP 8.1 and the behaviour has not changed since the original PHP 5.4 change (https://3v4l.org/njqTN), I think we can safely assume by now, that this is the intended behaviour now.
Also seeing that this issue hasn't got any attention in the last 6+ years, I'm going to be bold and close this issue as a won't fix.