Since a rewrite in October 2010, html_entity_decode() doesn't decode invalid numeric entities anymore in PHP 5.4, as a consequence:

html_entity_decode('', ENT_QUOTES, 'UTF-8');

Returns '' on PHP 5.4 and a single character chr(14) on PHP 5.3.

Is this a good thing? Should we ask PHP core to reconsider?

One effect of this is that one of our filter_xss() test fails, because something like '<img src=" &#14; javascript:alert(0)">' is not escaped anymore.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Damien Tournoud’s picture

Issue tags: +PHP 5.4

Tag

pwolanin’s picture

Is there a PHP issue describing why they made this change?

Damien Tournoud’s picture

@pwolanin: The only thing I could find is the commit message linked in the OP.

chx’s picture

As discussed on IRC, if PHP core fails us then we need to fall back to the old userspace implementation.

sun’s picture

Not sure I follow:

> php -r "var_dump(PHP_VERSION, html_entity_decode('&#14;', ENT_QUOTES, 'UTF-8'));"
string(5) "5.3.5"
string(1) "♫"
sun’s picture

Aside from #5, is it possible that there are undocumented flags we could try? (especially ENT_DISALLOWED and ENT_SUBSTITUTE might be worth a try)

Damien Tournoud’s picture

Indeed, PHP versions < 5.4 return chr(14) in that case (it's not a character printable by my terminal). That's U+000E, and it is indeed a valid UTF-8 control character. So there seems to be something wrong with PHP 5.4.

If I understand correctly, the default value is HTML 4.01 mode, which is described as allowing "all the numerical entities that represent a Unicode code point (< U+10FFFFFF)", so I am a little bit confused why this is happening.

I'm opening a separate feature request to enable ENT_SUBSTITUTE by default in check_plain().

Damien Tournoud’s picture

linora’s picture

Issue tags: -PHP 5.4 +boost_cache boost php, +#arguments, +#D7AX

DamienMcKenna’s picture

Issue tags: +PHP 5.4

Adding an extra tag to help catalog it.

DamienMcKenna’s picture

For completeness sake, PHP 5.2.13 says:

string(6) "5.2.13"
string(1) ""
Crell’s picture

Has anyone filed this upstream with PHP internals? Sounds like something that should be addressed there, either with a bugfix or recommended workarounds from the people who know it best...

Crell’s picture

Issue summary: View changes

Minor addition.

Damien Tournoud’s picture

Issue summary: View changes

Update summary.

klausi’s picture

FileSize
29.74 KB

This issue was unpublished for a couple of days because it was investigated by the Drupal security team. No exploitable vulnerability has been found, so we can continue here in public.

Attached is the thread of the security.drupal.org discussion (rename to .html to view in your browser).

dcrocks’s picture

The character '' is a control character and not valid in a html document in any case. Why are we concerned about these?

Wim Leers’s picture

This is still a @todo in XssTest today.

catch’s picture

Issue summary: View changes
pwolanin’s picture

Category: Task » Bug report
Priority: Normal » Critical
pwolanin’s picture

Category: Bug report » Task
Priority: Critical » Normal

re-reading the thread - it seems there may be attacks against IE6, but I 'm not seeing that even that works after Drupal filters the img tag.

However, it's a little unclear if there is a vector e.g. if the input format allows IMG tags?

andribas’s picture

To achieve old behavior of PHP 5.3 we could manually decode all numeric entities with control characters
They cannot be twice decoded since control character could not be part of html entity, so it is safe to decode valid and invalid numeric entities (as it work in php 5.3)
But you should consider http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
8.1.4 http://www.w3.org/html/wg/drafts/html/master/syntax.html#text
and http://www.w3.org/html/wg/drafts/html/master/syntax.html#preprocessing-t...

This is bug report from php team https://bugs.php.net/bug.php?id=52860
and commit https://github.com/php/php-src/commit/91727cb844ecfe55f6dcd2f13ffeec3962...
to understand new behavior.

Also you should consider to use fast algorithm when parsing untrusted html input.
Drupal core sanitize function relies on kses algorithm

The kses code is based on an HTML filter that Ulf wrote on his own back in 2002 for the open-source project Gnuheter ( http://savannah.nongnu.org/projects/ gnuheter ). Gnuheter is a fork from PHP-Nuke. The HTML filter has been improved a lot since then.
To stop people from having sleepless nights, we feel the urgent need to state that kses doesn't have anything to do with the KDE project, despite having a name that starts with a K.
In case someone was wondering, Ulf is available for kses-related consulting.
Finally, the name kses comes from the terms XSS and access. It's also a recursive acronym (every open-source project should have one!) for "kses strips evil scripts".

if we could use FSM instead of all preg_replace, preg_replace_callback this would make great performance impact on drupal core.
At this time input parsed many times, but it could pe parsed consequently.
Good approach is gumbo-parser from google, here it's tokenizer https://github.com/google/gumbo-parser/blob/master/src/tokenizer.c
and comment about it's work https://github.com/google/gumbo-parser/issues/53#issuecomment-43395597
May be it could be ported in simplified version.

chx’s picture

http://php.net/manual/en/mbstring.installation.php

mbstring is a non-default extension. This means it is not enabled by default. You must explicitly enable the module

Relying on an mb function alas won't work. (It's not in the required extensions list in system_requirements) either.

andribas’s picture

FileSize
6.01 KB

This would do the same (adopted from php source) - decode all not valid numeric entities.
Since code of that entity is not valid in document - it cannot be part of another entity, hence it will be decoded only once - both for php 5.3 and 5.4
if old behavior of php 5.3 is wanted

andribas’s picture

fixed typo (a-z => a-f)

php core/scripts/run-tests.sh --color --concurrency 4 --class "Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest"

Drupal test run
---------------

Tests to be run:
  - Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest

Test run started:
  Monday, December 8, 2014 - 17:11

Test summary
------------

Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest        192 passes                                      

Test run duration: 0 sec
chx’s picture

Status: Active » Needs review
Issue tags: -boost_cache boost php, -#arguments, -#D7AX

I will review this later but let's see what the bot has to say.

chx’s picture

Also, thanks you SO MUCH for the awesome work on this, I really appreciate doing this. See https://twitter.com/chx/status/542001339564163072 :)

The last submitted patch, 21: drupal-decode_numeric_entities-1210798-21.patch, failed testing.

andribas’s picture

i crossed my fingers)
please review flag ENT_HTML5
by default flags set to ENT_COMPAT | ENT_HTML401
and these constants were added in 5.4.0 - may be we need only one case for switch
http://php.net/manual/en/function.html-entity-decode.php

andribas’s picture

changed flag to default ENT_HTML401. if ENT_HTML5 needed, it also should be set in return html_entity_decode($text, ENT_QUOTES, 'UTF-8');

andribas’s picture

Sorry for wrong code, provided test.
It checks that numeric entity is decoded in supplied method only if it would be not decoded native.
and if it would be decoded native, it will not be decoded in decodeInvalidNumericEntities()
also added boundary to $code<0x110000, Largest code point for UTF-8 is 0x10FFFF
to check correct decoding - pack function, mbstring needed, i don't know how to check without it.
on localhost passes.

andribas’s picture

andribas’s picture

@chx please, review.
if this bug will be covered by another issue https://www.drupal.org/node/1333730
mention, please.

twistor’s picture

@andribas, I will do some checking in Masterminds/HTML5 to see how they handle decoding.

For decoding Utf-8 php, see https://github.com/twistor/mbphp/blob/master/src/Encoder/Utf8.php

andribas’s picture

@twistor thank you!
i have got decoding function and check against valid codepoint from here
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...

because it is exactly what native function do when decoding, see
https://github.com/php/php-src/blob/91727cb844ecfe55f6dcd2f13ffeec3962aa...

twistor’s picture

Ahh, I see you're only using mbstring() for the test.

cashwilliams’s picture

Since Drupal 8 now requires PHP 5.5.9 (https://www.drupal.org/requirements), is this issue now irrelevant?

klausi’s picture

Title: In PHP 5.4, html_entity_decode() doesn't decode invalid numeric entities » In PHP 5.4+, html_entity_decode() doesn't decode invalid numeric entities

No, still relevant on PHP 5.5: https://3v4l.org/njqTN

Status: Needs review » Needs work

The last submitted patch, 28: decode_invalid_numeric_entities-1210798-28.patch, failed testing.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

David_Rothstein’s picture

Issue tags: +Needs backport to D7

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Spokje’s picture

Status: Needs work » Closed (won't fix)

Since we're now launching drupal 10, which requires PHP 8.1 and the behaviour has not changed since the original PHP 5.4 change (https://3v4l.org/njqTN), I think we can safely assume by now, that this is the intended behaviour now.

Also seeing that this issue hasn't got any attention in the last 6+ years, I'm going to be bold and close this issue as a won't fix.