Since a rewrite in October 2010, html_entity_decode() doesn't decode invalid numeric entities anymore in PHP 5.4, as a consequence:

<?php
html_entity_decode('&#14;', ENT_QUOTES, 'UTF-8');
?>

Returns '&#14;' on PHP 5.4 and a single character chr(14) on PHP 5.3.

Is this a good thing? Should we ask PHP core to reconsider?

One effect of this is that one of our filter_xss() test fails, because something like '<img src=" &#14;  javascript:alert(0)">' is not escaped anymore.

Files: 
CommentFileSizeAuthor
#28 decode_invalid_numeric_entities-1210798-28.patch7.45 KBandribas
FAILED: [[SimpleTest]]: [PHP 5.5 MySQL] Unable to apply patch decode_invalid_numeric_entities-1210798-28.patch. Unable to apply patch. See the log in the details link for more information. View
#13 sec.html_.txt29.74 KBklausi

Comments

Damien Tournoud’s picture

Issue tags:+PHP 5.4

Tag

pwolanin’s picture

Is there a PHP issue describing why they made this change?

Damien Tournoud’s picture

@pwolanin: The only thing I could find is the commit message linked in the OP.

chx’s picture

As discussed on IRC, if PHP core fails us then we need to fall back to the old userspace implementation.

sun’s picture

Not sure I follow:

> php -r "var_dump(PHP_VERSION, html_entity_decode('&#14;', ENT_QUOTES, 'UTF-8'));"
string(5) "5.3.5"
string(1) "♫"
sun’s picture

Aside from #5, is it possible that there are undocumented flags we could try? (especially ENT_DISALLOWED and ENT_SUBSTITUTE might be worth a try)

Damien Tournoud’s picture

Indeed, PHP versions < 5.4 return chr(14) in that case (it's not a character printable by my terminal). That's U+000E, and it is indeed a valid UTF-8 control character. So there seems to be something wrong with PHP 5.4.

If I understand correctly, the default value is HTML 4.01 mode, which is described as allowing "all the numerical entities that represent a Unicode code point (< U+10FFFFFF)", so I am a little bit confused why this is happening.

I'm opening a separate feature request to enable ENT_SUBSTITUTE by default in check_plain().

Damien Tournoud’s picture

linora’s picture

DamienMcKenna’s picture

Issue tags:+PHP 5.4

Adding an extra tag to help catalog it.

DamienMcKenna’s picture

For completeness sake, PHP 5.2.13 says:

string(6) "5.2.13"
string(1) ""
Crell’s picture

Has anyone filed this upstream with PHP internals? Sounds like something that should be addressed there, either with a bugfix or recommended workarounds from the people who know it best...

Crell’s picture

Issue summary:View changes

Minor addition.

Damien Tournoud’s picture

Issue summary:View changes

Update summary.

klausi’s picture

StatusFileSize
new29.74 KB

This issue was unpublished for a couple of days because it was investigated by the Drupal security team. No exploitable vulnerability has been found, so we can continue here in public.

Attached is the thread of the security.drupal.org discussion (rename to .html to view in your browser).

dcrocks’s picture

The character '' is a control character and not valid in a html document in any case. Why are we concerned about these?

Wim Leers’s picture

This is still a @todo in XssTest today.

catch’s picture

Issue summary:View changes
pwolanin’s picture

Category:Task» Bug report
Priority:Normal» Critical
pwolanin’s picture

Category:Bug report» Task
Priority:Critical» Normal

re-reading the thread - it seems there may be attacks against IE6, but I 'm not seeing that even that works after Drupal filters the img tag.

However, it's a little unclear if there is a vector e.g. if the input format allows IMG tags?

andribas’s picture

To achieve old behavior of PHP 5.3 we could manually decode all numeric entities with control characters
They cannot be twice decoded since control character could not be part of html entity, so it is safe to decode valid and invalid numeric entities (as it work in php 5.3)
But you should consider http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
8.1.4 http://www.w3.org/html/wg/drafts/html/master/syntax.html#text
and http://www.w3.org/html/wg/drafts/html/master/syntax.html#preprocessing-t...

This is bug report from php team https://bugs.php.net/bug.php?id=52860
and commit https://github.com/php/php-src/commit/91727cb844ecfe55f6dcd2f13ffeec3962...
to understand new behavior.

Also you should consider to use fast algorithm when parsing untrusted html input.
Drupal core sanitize function relies on kses algorithm

The kses code is based on an HTML filter that Ulf wrote on his own back in 2002 for the open-source project Gnuheter ( http://savannah.nongnu.org/projects/ gnuheter ). Gnuheter is a fork from PHP-Nuke. The HTML filter has been improved a lot since then.
To stop people from having sleepless nights, we feel the urgent need to state that kses doesn't have anything to do with the KDE project, despite having a name that starts with a K.
In case someone was wondering, Ulf is available for kses-related consulting.
Finally, the name kses comes from the terms XSS and access. It's also a recursive acronym (every open-source project should have one!) for "kses strips evil scripts".

if we could use FSM instead of all preg_replace, preg_replace_callback this would make great performance impact on drupal core.
At this time input parsed many times, but it could pe parsed consequently.
Good approach is gumbo-parser from google, here it's tokenizer https://github.com/google/gumbo-parser/blob/master/src/tokenizer.c
and comment about it's work https://github.com/google/gumbo-parser/issues/53#issuecomment-43395597
May be it could be ported in simplified version.

chx’s picture

http://php.net/manual/en/mbstring.installation.php

mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module

Relying on an mb function alas won't work. (It's not in the required extensions list in system_requirements) either.

andribas’s picture

StatusFileSize
new6.01 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Setup environment: Test cancelled by admin prior to completion. View

This would do the same (adopted from php source) - decode all not valid numeric entities.
Since code of that entity is not valid in document - it cannot be part of another entity, hence it will be decoded only once - both for php 5.3 and 5.4
if old behavior of php 5.3 is wanted

andribas’s picture

StatusFileSize
new6.01 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 82,111 pass(es). View

fixed typo (a-z => a-f)

php core/scripts/run-tests.sh --color --concurrency 4 --class "Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest"

Drupal test run
---------------

Tests to be run:
  - Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest

Test run started:
  Monday, December 8, 2014 - 17:11

Test summary
------------

Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest        192 passes                                     

Test run duration: 0 sec
chx’s picture

Status:Active» Needs review
Issue tags:-boost_cache boost php, -#arguments, -#D7AX

I will review this later but let's see what the bot has to say.

chx’s picture

Also, thanks you SO MUCH for the awesome work on this, I really appreciate doing this. See https://twitter.com/chx/status/542001339564163072 :)

The last submitted patch, 21: drupal-decode_numeric_entities-1210798-21.patch, failed testing.

andribas’s picture

i crossed my fingers)
please review flag ENT_HTML5
by default flags set to ENT_COMPAT | ENT_HTML401
and these constants were added in 5.4.0 - may be we need only one case for switch
http://php.net/manual/en/function.html-entity-decode.php

andribas’s picture

StatusFileSize
new6.18 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 82,163 pass(es). View

changed flag to default ENT_HTML401. if ENT_HTML5 needed, it also should be set in return html_entity_decode($text, ENT_QUOTES, 'UTF-8');

andribas’s picture

StatusFileSize
new7.45 KB
FAILED: [[SimpleTest]]: [PHP 5.5 MySQL] Unable to apply patch decode_invalid_numeric_entities-1210798-28.patch. Unable to apply patch. See the log in the details link for more information. View
new2.3 KB

Sorry for wrong code, provided test.
It checks that numeric entity is decoded in supplied method only if it would be not decoded native.
and if it would be decoded native, it will not be decoded in decodeInvalidNumericEntities()
also added boundary to $code<0x110000, Largest code point for UTF-8 is 0x10FFFF
to check correct decoding - pack function, mbstring needed, i don't know how to check without it.
on localhost passes.

andribas’s picture

andribas’s picture

@chx please, review.
if this bug will be covered by another issue https://www.drupal.org/node/1333730
mention, please.

twistor’s picture

@andribas, I will do some checking in Masterminds/HTML5 to see how they handle decoding.

For decoding Utf-8 php, see https://github.com/twistor/mbphp/blob/master/src/Encoder/Utf8.php

andribas’s picture

@twistor thank you!
i have got decoding function and check against valid codepoint from here
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...

because it is exactly what native function do when decoding, see
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...

twistor’s picture

Ahh, I see you're only using mbstring() for the test.

cashwilliams’s picture

Since Drupal 8 now requires PHP 5.5.9 (https://www.drupal.org/requirements), is this issue now irrelevant?

klausi’s picture

Title:In PHP 5.4, html_entity_decode() doesn't decode invalid numeric entities» In PHP 5.4+, html_entity_decode() doesn't decode invalid numeric entities

No, still relevant on PHP 5.5: https://3v4l.org/njqTN

Status:Needs review» Needs work

The last submitted patch, 28: decode_invalid_numeric_entities-1210798-28.patch, failed testing.