When publishing Instant Articles, funky accented characters were appearing in place of non-breaking spaces and em dashes. The cause turned out to be how the HTML was loaded into the DOMDocument.
See the following link for the solution and more information.
http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly
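As a rough illustration of the kind of fix discussed in the linked answer (not the contents of the attached patch), the sketch below converts every non-ASCII character to a numeric entity before calling loadHTML(). The sample markup and variable names are made up for the example, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample markup: e-acute, a non-breaking space and an em dash,
// all of which are multibyte sequences in UTF-8.
$html = "<p>Caf\u{00E9}\u{00A0}\u{2014} ok</p>";
$expected = "Caf\u{00E9}\u{00A0}\u{2014} ok";

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

// Plain load: with no charset hint the parser has to guess the encoding.
// Whether the text survives depends on the libxml version on the server.
$naive = new DOMDocument();
$naive->loadHTML($html);
$naive_text = $naive->getElementsByTagName('p')->item(0)->textContent;

// Entity-first load: after conversion the input is pure ASCII, so the
// parser's charset guess no longer matters.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF]; // every code point above ASCII
$safe = new DOMDocument();
$safe->loadHTML(mb_encode_numericentity($html, $map, 'UTF-8'));
$safe_text = $safe->getElementsByTagName('p')->item(0)->textContent;

// Report whether each approach round-trips the original text.
var_dump($naive_text === $expected);
var_dump($safe_text === $expected);
```

On servers where the first check fails and the second passes, the mangling is happening at load time rather than later in the pipeline.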
Comment | File | Size | Author |
---|---|---|---|
#11 | fb_instant_articles-UTF8_encoding-2711165-11-D7.patch | 961 bytes | sadiefp |
#10 | fb_instant_articles-2711165-10-UTF8_mistranslation.patch | 877 bytes | Les Lim |
 | loadhtmlutf8.patch | 902 bytes | scottfalkingham |
Comments
Comment #2
scottrigby
Hi @scottfalkingham,
While we use d.o for issue tracking, we are using GitHub PRs rather than a patch workflow (see the README or project page for info).
For this patch, in D7 we should use drupal_convert_to_utf8(). In D8, the equivalent is Unicode::convertToUtf8 (see #2361711: Remove usage of drupal_convert_to_utf8()).
Also, this seems to depend on server configuration. So far I have seen it happen in only one environment (a Vagrant instance I've tested with); in others it doesn't happen. It would be good to clarify this. If you can give some examples of what happens, with details on the module(s) you have enabled, that may help us identify the root cause.
Thanks!
Comment #3
scottrigby
@scottfalkingham OK, I tested this and realized I had misread your solution. You're converting from UTF-8 to HTML entities, not the other way around (which is what drupal_convert_to_utf8 is for, as the name implies). So your solution works, but it's better to use the Drupal 7 core function decode_entities() (the Drupal 8 equivalent is \Drupal\Component\Utility\Html::decodeEntities()).
Comment #4
scottrigby
Comment #5
scottrigby
OK, I made a GitHub PR for this: PR #53 ready for review.
Comment #6
anbarasan.r (at NBCUniversal)
Comment #8
scottrigby
Thanks @scottfalkingham & @anbarasan.r. Merged into 7.x-1.x.
Comment #10
Les Lim
decode_entities() doesn't fix the issue in our case. We're seeing multibyte UTF-8 characters like em-dashes being transformed into ISO-8859-1 gibberish on our production servers.
I believe this is an issue with DOMDocument::saveXML(), based on this StackExchange answer: http://stackoverflow.com/a/20675396
We've had to go back to the original commenter's strategy of converting all multibyte UTF-8 characters into HTML entities, which makes the encoding issue moot. I'm not sure you would want to decode the HTML entities in the first place; the documentation for that function notes that it could end up un-sanitizing output.
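To make the un-sanitizing concern concrete, here is a small sketch. It uses PHP's html_entity_decode() as a stand-in for decode_entities() (an assumption made so the example runs outside Drupal) and mb_encode_numericentity() as one possible way to do the entity conversion; the sample strings are hypothetical.

```php
<?php
// Markup that was escaped earlier so it would display as text, not execute.
$escaped = '&lt;script&gt;alert(1)&lt;/script&gt;';

// Decoding entities turns the escaped text back into live markup.
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

// Encoding only non-ASCII characters leaves existing escapes untouched
// and turns the em dash into a numeric entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF]; // every code point above ASCII
$encoded = mb_encode_numericentity("&lt;b&gt; \u{2014}", $map, 'UTF-8');

echo $decoded, "\n"; // <script>alert(1)</script>
echo $encoded, "\n"; // &lt;b&gt; &#8212;
```

The first result is why decoding can undo earlier escaping; the second shows that the entity-conversion direction does not touch characters that were already escaped.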
Comment #11
sadiefp
The previous pull request and patch did not fix the character encoding issue on our Acquia environments. We didn't have any issues with special characters on Tugboat instances, DrupalVM, or other local test environments, so the issue may be unique to Acquia's servers.
This is the patch we used to fix it. (I realize fb_instant_articles uses a PR workflow, but I'm not sure this issue requires a PR. This may have only continued to be an issue for our sites. Let me know if you'd like me to create a pull request.)
Comment #12
Bladedu
I confirm that patch #11 fixed the issue (for me) on Acquia servers.
Comment #13
sadiefp
PR created for #11: https://github.com/BurdaMagazinOrg/module-fb_instant_articles/pull/83
Comment #15
scottrigby
Comment #16
scottrigby
This is merged. Thanks @sadiefp, @Bladedu, and @Les Lim!
@Les Lim or anyone - if this continues to cause any problems, please let us know and we can continue to work on this. For now this has been verified and is part of the 7.x-2.x branch.