Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
http://www.w3.org/TR/REC-html40/sgml/entities.html lists the HTML 4 entities.
drupal_html_to_text() uses decode_entities() ("Decode all HTML entities"), which in turn uses the PHP get_html_translation_table() function, which supports only part of the HTML 4 entities (see http://ch2.php.net/manual/en/function.html-entity-decode.php#78665).
There's a conscious effort in decode_entities() to add ', but others like € and — remain undefined and are thus not translated into the proper characters.
Comments
Comment #1
Anonymous (not verified) CreditAttribution: Anonymous commentedPatch to decode all HTML 4 entities. Tested minimally.
Comment #2
salvisThanks!
Translating well-known and supported entities like < into numeric entities for feeding into html_entity_decode() is a bit of overkill.
convert_entity() pollutes the namespace.
The code doesn't conform to the Drupal coding conventions (return on the same line as if), and the comment is probably inappropriate for inclusion into core. I don't know in what form attributions should be made.
Comment #3
Anonymous (not verified) CreditAttribution: Anonymous commentedCleaned up the code a bit. If there's specific entities that you want removed, let me know. I'm of the opinion that the full HTML4 entity set should be supported within Drupal, even if PHP by default does not. Why bother including a subset of the standard, rather than supporting the full standard? Just makes sense to me to support the full standard.
Comment #4
salvisNo, I didn't mean supporting only a subset, but to avoid duplicating those that are already in get_html_translation_table(). html_entity_decode() will check for those again anyway, so translating them into numeric entities first is unnecessary.
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commentedGood point. I used the following test code to find out what entities were supported in get_html_translation_table():
I then removed each of these entities from my table, so as to avoid duplication.
The resulting patch is attached.
Comment #6
Anonymous (not verified) CreditAttribution: Anonymous commentedOkay, I tested this more fully, it is now working properly for entities that are in the default lookup table, as well as the new lookup table that I've created.
Comment #7
Steven CreditAttribution: Steven commentedThe last patch calls html_entity_decode() twice (once inside the preg_replace_callback and once outside). By definition, it cannot be a correct HTML to text conversion anymore.
A quick test also shows that it does not decode simple input strings like
&
and€
correctly. The use of ISO-8859-1 is also incorrect.The lookup table can also be shrunk significantly by reducing each entry to an integer rather than a string.
Finally, the proper way to address this should be to patch decode_entities(), rather than adding a special case in mail.inc.
Comment #8
salvis(based on Steven's review)
Comment #9
grendzy CreditAttribution: grendzy commentedsubscribing
Comment #10
grendzy CreditAttribution: grendzy commentedHere's a patch for both Drupal 6 & 7. The code was adapted the TCPDF library, as found in Joomla. The file I copied from was authored by Nicola Asuni and licensed under the LGPL.
Basically, this is a complete replacement for decode_entities(), which relies on PHP's get_html_translation_table(). Regrettably, get_html_translation_table() is very incomplete.
The backport to D6 is slightly different in that it doesn't remove _decode_entities(), so that we don't risk breaking things for anyone who calls this function in a contrib module.
Comment #11
catchCould we not do get_html_translation_table() then += the missing entities?
Comment #12
jhedstromThis patch will be a lot easier to test once #276453: Tests needed: unicode.inc is committed. As is, I applied that test patch, ran the tests, then applied this patch, and ran the tests. I got 4 fails after the patch, so it needs some tweaks:
Comment #13
grendzy CreditAttribution: grendzy commentedHere are updated patches that are less invasive -- it's just the table, and no other code from TCPDF is incorporated. The D7 patch passes the unit tests from #276453.
catch - it would be feasible to only add elements missing from get_html_translation_table(). However, my own feeling is that PHP's implementation is so bad that we shouldn't rely on it even partially. Having a complete table might also make it easier to update in the future, if new entities are defined. (There are a lot of comments about this misfeature on php.net.)
While it would make the patch smaller, it would still be a very long list -- there are 100 entities in PHP's table, and 252 in the patch.
If cluttering up unicode.inc with this table is a concern, I would rather move this function into a separate file.
I'm also just posting the D7 patch -- the same patch will apply to D6 with a small offset.
Comment #14
grendzy CreditAttribution: grendzy commentedComment #16
grendzy CreditAttribution: grendzy commentedperhaps a victim of #335122: Test clean HEAD after every commit
Comment #17
Damien Tournoud CreditAttribution: Damien Tournoud commentedHere is a proper patch for this, that depends on #352359: [core] Fixes for drupal_strpos() for the testing part.
Compared to #13:
- the entity table is preprocessed (there is really no need to call chr() twice for every entity...) and moved to a separate file, that lighten unicode.inc which is on the critical path (about 100k memory saved)
- we stop using in_array() and array_diff() which are slow, and just invert $exclude, because this is the only thing that we need
- extended tests from #352359: [core] Fixes for drupal_strpos()
Comment #18
Damien Tournoud CreditAttribution: Damien Tournoud commentedAnd the good patch.
Comment #19
Dries CreditAttribution: Dries commentedI think the "100k memory saved" is a bit dubious. I'd prefer a patch that leaves everything in one file, and then we can worry about the memory savings later.
Comment #20
Damien Tournoud CreditAttribution: Damien Tournoud commentedunicode.inc *is* on the critical path, so we need to make sure it is parsed as efficiently as possible, and that it doesn't take too much memory.
Here is a test:
** Test with full table in file.
Memory usage at startup: 53532
Memory usage after inclusion of the file: 91008
** Test without full table in file.
Memory usage at startup: 53536
Memory usage after inclusion of the file: 56164
?>
We save 35k by not having the table in unicode.inc, plus the other optimizations that we have in that issue (do not inefficiently duplicate the table by calling array_diff, do not convert ISO8859-1 to UTF-8... etc.), we will reach the hundreds I talked about.
Comment #21
grendzy CreditAttribution: grendzy commentedWe save 35k by not having the table in unicode.inc
Isn't this only true though if decode_entities() is never called? It seams a near certainty that it will be loaded on most every page view, since for example url() will trigger the table to be loaded.
That said, I think it's fine either way to split the table into a separate file, or not. I'll just be glad to stop getting ’ in my email all the time. ;)
I tested #18, and it works fine. I also tried a few other entities not covered by the test (like —) and those worked too.
Comment #22
Damien Tournoud CreditAttribution: Damien Tournoud commentedAre you sure? Why is it so?
Comment #23
grendzy CreditAttribution: grendzy commentedNo, you're right. I was thinking url() will call filter_xss_bad_protocol(), which leads to decode_entities(). But as you pointed out, filter_xss_bad_protocol() is called with $decode = FALSE.
In that case, Damien has a pretty good reason to put the entities in a separate file.
Comment #24
Dries CreditAttribution: Dries commentedThanks for the additional experiments, Damien. That helped. I've committed this patch to CVS. Thanks.
Comment #25
salvisThanks!
Comment #26
grendzy CreditAttribution: grendzy commentedBackport:
To test, use modules/simpletest/tests/unicode.test from Drupal 7 with SimpleTest 2.5. You'll get one failure, but this is unrelated to decode_entities, rather it's a bug in drupal_substr().
Comment #27
grendzy CreditAttribution: grendzy commenteddarn. Missed the boat on D6.9 release. ; )
Comment #28
dmitrig01 CreditAttribution: dmitrig01 commentedTested, read, looks good.
Comment #29
Gábor HojtsyDid people test this on D6? It would be vital to ensure no backwards compatibility problems are introduced.
Comment #31
grendzy CreditAttribution: grendzy commentedbad bot! This is a D6 patch.
Comment #32
salvisRerolled but otherwise unchanged from #26.
Comment #34
salvisI don't understand why #32 doesn't apply...
Comment #35
grendzy CreditAttribution: grendzy commented#26 still applies, this is just the test bot failing. #961172: All D6 Core patches are failing
Comment #36
Gábor HojtsyOk, reviewed the patch once again, looked at the D7 patch, and it all looked in line. Committed, thanks.
Comment #37
Damien Tournoud CreditAttribution: Damien Tournoud commentedAnd the fun thing is that in the meantime, we found a better solution for D7 (which relies on PHP >= 5.0, so doesn't apply to D6). #878408: Replace decode_entities() with built-in html_entity_decode()