http://www.w3.org/TR/REC-html40/sgml/entities.html lists the HTML 4 entities.

drupal_html_to_text() uses decode_entities() ("Decode all HTML entities"), which in turn uses the PHP get_html_translation_table() function, which supports only part of the HTML 4 entities (see http://ch2.php.net/manual/en/function.html-entity-decode.php#78665).

There's a conscious effort in decode_entities() to add ', but others like € and — remain undefined and are thus not translated into the proper characters.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Anonymous’s picture

Component: other » base system
Status: Active » Needs review
FileSize
6.79 KB

Patch to decode all HTML 4 entities. Tested minimally.

salvis’s picture

Status: Needs review » Needs work

Thanks!

Translating well-known and supported entities like < into numeric entities for feeding into html_entity_decode() is a bit of overkill.

convert_entity() pollutes the namespace.

The code doesn't conform to the Drupal coding conventions (return on the same line as if), and the comment is probably inappropriate for inclusion into core. I don't know in what form attributions should be made.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
8.29 KB

Cleaned up the code a bit. If there's specific entities that you want removed, let me know. I'm of the opinion that the full HTML4 entity set should be supported within Drupal, even if PHP by default does not. Why bother including a subset of the standard, rather than supporting the full standard? Just makes sense to me to support the full standard.

salvis’s picture

No, I didn't mean supporting only a subset, but to avoid duplicating those that are already in get_html_translation_table(). html_entity_decode() will check for those again anyway, so translating them into numeric entities first is unnecessary.

Anonymous’s picture

Good point. I used the following test code to find out what entities were supported in get_html_translation_table():

<?php
$array = get_html_translation_table(HTML_ENTITIES);

foreach($array as $key=>$value) {
  echo "Key: " . $key . "; Value: $value Numeric: " . ord($key) . "<br />\n";
}
?>

I then removed each of these entities from my table, so as to avoid duplication.

The resulting patch is attached.

Anonymous’s picture

Okay, I tested this more fully, it is now working properly for entities that are in the default lookup table, as well as the new lookup table that I've created.

Steven’s picture

The last patch calls html_entity_decode() twice (once inside the preg_replace_callback and once outside). By definition, it cannot be a correct HTML to text conversion anymore.

A quick test also shows that it does not decode simple input strings like &amp; and &#x20AC; correctly. The use of ISO-8859-1 is also incorrect.

The lookup table can also be shrunk significantly by reducing each entry to an integer rather than a string.

Finally, the proper way to address this should be to patch decode_entities(), rather than adding a special case in mail.inc.

salvis’s picture

Status: Needs review » Needs work

(based on Steven's review)

grendzy’s picture

subscribing

grendzy’s picture

FileSize
14.71 KB
16.13 KB

Here's a patch for both Drupal 6 & 7. The code was adapted the TCPDF library, as found in Joomla. The file I copied from was authored by Nicola Asuni and licensed under the LGPL.

Basically, this is a complete replacement for decode_entities(), which relies on PHP's get_html_translation_table(). Regrettably, get_html_translation_table() is very incomplete.

The backport to D6 is slightly different in that it doesn't remove _decode_entities(), so that we don't risk breaking things for anyone who calls this function in a contrib module.

catch’s picture

Title: drupal_html_to_text() doesn't support all entities » decode_entities() doesn't support all entities
Version: 6.x-dev » 7.x-dev
Status: Needs work » Needs review

Could we not do get_html_translation_table() then += the missing entities?

jhedstrom’s picture

Status: Needs review » Needs work

This patch will be a lot easier to test once #276453: Tests needed: unicode.inc is committed. As is, I applied that test patch, ran the tests, then applied this patch, and ran the tests. I got 4 fails after the patch, so it needs some tweaks:

Make sure the decoded entity of &amp;#34; is &#34;. (Got ")
Make sure the decoded entity of &quot; is ". (Got &quot;)
Make sure the decoded entity of &amp;#39; is &#39;. (Got ')
Make sure the decoded entity of &#34; is &#34;. (Got ")
grendzy’s picture

FileSize
13.71 KB

Here are updated patches that are less invasive -- it's just the table, and no other code from TCPDF is incorporated. The D7 patch passes the unit tests from #276453.

catch - it would be feasible to only add elements missing from get_html_translation_table(). However, my own feeling is that PHP's implementation is so bad that we shouldn't rely on it even partially. Having a complete table might also make it easier to update in the future, if new entities are defined. (There are a lot of comments about this misfeature on php.net.)

While it would make the patch smaller, it would still be a very long list -- there are 100 entities in PHP's table, and 252 in the patch.

If cluttering up unicode.inc with this table is a concern, I would rather move this function into a separate file.

I'm also just posting the D7 patch -- the same patch will apply to D6 with a small offset.

grendzy’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

grendzy’s picture

Status: Needs work » Needs review
Damien Tournoud’s picture

FileSize
0 bytes

Here is a proper patch for this, that depends on #352359: [core] Fixes for drupal_strpos() for the testing part.

Compared to #13:

- the entity table is preprocessed (there is really no need to call chr() twice for every entity...) and moved to a separate file, that lighten unicode.inc which is on the critical path (about 100k memory saved)
- we stop using in_array() and array_diff() which are slow, and just invert $exclude, because this is the only thing that we need
- extended tests from #352359: [core] Fixes for drupal_strpos()

Damien Tournoud’s picture

And the good patch.

Dries’s picture

I think the "100k memory saved" is a bit dubious. I'd prefer a patch that leaves everything in one file, and then we can worry about the memory savings later.

Damien Tournoud’s picture

FileSize
8.06 KB

unicode.inc *is* on the critical path, so we need to make sure it is parsed as efficiently as possible, and that it doesn't take too much memory.

Here is a test:


** Test with full table in file.
Memory usage at startup: 53532
Memory usage after inclusion of the file: 91008

** Test without full table in file.
Memory usage at startup: 53536
Memory usage after inclusion of the file: 56164
?>

We save 35k by not having the table in unicode.inc, plus the other optimizations that we have in that issue (do not inefficiently duplicate the table by calling array_diff, do not convert ISO8859-1 to UTF-8... etc.), we will reach the hundreds I talked about.

grendzy’s picture

We save 35k by not having the table in unicode.inc

Isn't this only true though if decode_entities() is never called? It seams a near certainty that it will be loaded on most every page view, since for example url() will trigger the table to be loaded.

That said, I think it's fine either way to split the table into a separate file, or not. I'll just be glad to stop getting &rsquo; in my email all the time. ;)

I tested #18, and it works fine. I also tried a few other entities not covered by the test (like &mdash;) and those worked too.

Damien Tournoud’s picture

It seams a near certainty that it will be loaded on most every page view, since for example url() will trigger the table to be loaded.

Are you sure? Why is it so?

grendzy’s picture

No, you're right. I was thinking url() will call filter_xss_bad_protocol(), which leads to decode_entities(). But as you pointed out, filter_xss_bad_protocol() is called with $decode = FALSE.

In that case, Damien has a pretty good reason to put the entities in a separate file.

Dries’s picture

Status: Needs review » Fixed

Thanks for the additional experiments, Damien. That helped. I've committed this patch to CVS. Thanks.

salvis’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Patch (to be ported)

Thanks!

grendzy’s picture

Status: Patch (to be ported) » Needs review
FileSize
8.44 KB

Backport:

To test, use modules/simpletest/tests/unicode.test from Drupal 7 with SimpleTest 2.5. You'll get one failure, but this is unrelated to decode_entities, rather it's a bug in drupal_substr().

grendzy’s picture

Issue tags: +backport, +unicode

darn. Missed the boat on D6.9 release. ; )

dmitrig01’s picture

Status: Needs review » Reviewed & tested by the community

Tested, read, looks good.

Gábor Hojtsy’s picture

Did people test this on D6? It would be vital to ensure no backwards compatibility problems are introduced.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 212130-D6-decode-entities-support-all-entities.patch, failed testing.

grendzy’s picture

Status: Needs work » Reviewed & tested by the community

bad bot! This is a D6 patch.

salvis’s picture

Rerolled but otherwise unchanged from #26.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, decode-entities-support-all-entities.212130.32.patch, failed testing.

salvis’s picture

I don't understand why #32 doesn't apply...

grendzy’s picture

Status: Needs work » Reviewed & tested by the community

#26 still applies, this is just the test bot failing. #961172: All D6 Core patches are failing

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Ok, reviewed the patch once again, looked at the D7 patch, and it all looked in line. Committed, thanks.

Damien Tournoud’s picture

And the fun thing is that in the meantime, we found a better solution for D7 (which relies on PHP >= 5.0, so doesn't apply to D6). #878408: Replace decode_entities() with built-in html_entity_decode()

Status: Fixed » Closed (fixed)
Issue tags: -backport, -unicode

Automatically closed -- issue fixed for 2 weeks with no activity.