When publishing Instant Articles, garbled accented characters appeared in place of non-breaking spaces and em dashes. The cause turned out to be how the HTML was loaded into the DOMDocument.

See the following link for the solution and more information.
http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly
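
For readers who do not follow the link, the approach discussed in this issue is to turn every non-ASCII character into an HTML entity before the markup reaches DOMDocument::loadHTML(), so the parser never has to guess the byte encoding. Below is a minimal sketch in plain PHP, not the module's actual patch. The function name example_load_utf8_html() is hypothetical, and the sketch assumes PHP 7 or later with the mbstring and dom extensions. It uses mb_encode_numericentity() instead of mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') because the latter is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper (not part of fb_instant_articles): loads UTF-8 HTML
 * into a DOMDocument without relying on the parser's default encoding.
 */
function example_load_utf8_html(string $html): DOMDocument {
  // Encode every code point from U+0080 upward as a numeric entity,
  // leaving plain ASCII (including existing entities) untouched.
  $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
  $ascii_only = mb_encode_numericentity($html, $convmap, 'UTF-8');

  $doc = new DOMDocument();
  // Collect libxml warnings internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $doc->loadHTML($ascii_only);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $doc;
}

// Usage: a non-breaking space entity and a literal em dash (U+2014).
$doc = example_load_utf8_html("<p>before&nbsp;after \u{2014} done</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Because the string handed to the parser is pure ASCII, the result should not depend on which encoding a given server's libxml build assumes for undeclared input.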


Comments

scottfalkingham created an issue. See original summary.

scottrigby commented

Status: Needs review » Needs work

Hi @scottfalkingham,

While we use d.o for issue tracking, we are using GitHub PRs rather than a patch workflow (see the README or project page for info).

For this patch, in D7 we should use drupal_convert_to_utf8(). In D8, the equivalent is Unicode::convertToUtf8() (see #2361711: Remove usage of drupal_convert_to_utf8()).

Also, this seems to depend on server configuration. I have seen it happen in some environments (so far, only in a Vagrant instance I've tested with) but not in others. It would be good to clarify this. If you can give some examples of what happens, with details on the module(s) you have enabled, that may help us identify the root cause.

Thanks!

scottrigby commented

@scottfalkingham OK, I tested this and realized I had misread your solution. You're converting from UTF-8 to HTML entities, not converting to UTF-8 (which is what drupal_convert_to_utf8() is for, as the name implies). So your solution works, but it's better to use the Drupal 7 core function decode_entities() (the Drupal 8 equivalent is \Drupal\Component\Utility\Html::decodeEntities()).
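
The two directions are easy to mix up, so here is a small sketch of both in plain PHP. It assumes PHP 7 or later with mbstring. html_entity_decode() with ENT_QUOTES and 'UTF-8' is used as a stand-in for Drupal 7's decode_entities(), which to my understanding is a thin wrapper around that call; the variable names are for illustration only.

```php
<?php
// Direction 1: entity references become literal UTF-8 characters
// (the decode_entities() direction).
$decoded = html_entity_decode('a&mdash;b&nbsp;c', ENT_QUOTES, 'UTF-8');

// Direction 2: literal non-ASCII characters become numeric entities
// (the direction the original patch takes before calling loadHTML()).
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity("a\u{2014}b\u{00A0}c", $convmap, 'UTF-8');

var_dump($decoded === "a\u{2014}b\u{00A0}c"); // bool(true)
var_dump($encoded);                           // string(16) "a&#8212;b&#160;c"
```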

scottrigby commented

Assigned: Unassigned » scottrigby
Status: Needs work » Active

scottrigby commented

Assigned: scottrigby » Unassigned
Status: Active » Needs review

OK, I made a GitHub PR for this: PR #53 ready for review.

anbarasan.r commented

Status: Needs review » Reviewed & tested by the community

  • scottrigby committed 5791f61 on 7.x-1.x
    Issue #2711165: DOMDocument LoadHTML Bug
    
  • scottrigby committed e4869ca on 7.x-1.x
    Merge pull request #53 from scottrigby/2711165
    
    Issue #2711165 by...

scottrigby commented

Status: Reviewed & tested by the community » Fixed

Thanks @scottfalkingham & @anbarasan.r. Merged into 7.x-1.x.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Les Lim commented

Status: Closed (fixed) » Needs review
File size: 877 bytes

decode_entities() doesn't fix the issue in our case. On our production servers, multibyte UTF-8 characters such as em dashes are being turned into ISO-8859-1 gibberish.

I believe this is an issue with DOMDocument::saveXML(), based on this Stack Overflow answer: http://stackoverflow.com/a/20675396

We've had to go back to the original commenter's strategy of converting all multibyte UTF-8 characters into HTML entities, which makes the encoding issue moot. I'm not sure you would want to decode the HTML entities in the first place; the documentation for that function notes that it could end up un-sanitizing output.
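
To illustrate the un-sanitizing concern, here is a rough sketch in plain PHP (PHP 7 or later with mbstring assumed). It again uses html_entity_decode() as a stand-in for decode_entities(), and the sample string is made up. Decoding turns escaped markup back into live markup, while encoding only the non-ASCII characters leaves the existing escaping alone.

```php
<?php
// Already-sanitized text: the script tag is escaped, the e-acute is literal.
$sanitized = 'Tom &amp; Jerry &lt;script&gt;alert(1)&lt;/script&gt; caf' . "\u{00E9}";

// Decoding restores the angle brackets, so the script tag is live again.
$decoded = html_entity_decode($sanitized, ENT_QUOTES, 'UTF-8');

// Encoding touches only code points from U+0080 upward; the existing
// &amp; / &lt; / &gt; escapes pass through unchanged.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($sanitized, $convmap, 'UTF-8');

var_dump(strpos($decoded, '<script>') !== FALSE); // bool(true)
var_dump(strpos($encoded, '<script>') !== FALSE); // bool(false)
```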

sadiefp commented

Version: 7.x-1.x-dev » 7.x-2.x-dev
Status: Needs review » Patch (to be ported)
File size: 961 bytes

The previous pull request and patch did not fix the character encoding issue on our Acquia environments. We had no problems with special characters on Tugboat instances, DrupalVM, or other local test environments, so the issue may be unique to Acquia's servers.

This is the patch we used to fix it. (I realize fb_instant_articles uses a PR workflow, but I'm not sure this issue requires a PR, since it may only have remained a problem for our sites. Let me know if you'd like me to create a pull request.)

Bladedu commented

I confirm that patch #11 fixed the issue (for me) on Acquia servers.

sadiefp commented

Status: Patch (to be ported) » Needs review

  • scottrigby authored 2627b4b on 7.x-2.x
    Merge pull request #83 from sadiefp/7.x-2.x
    
    Issue #2711165: DOMDocument...
  • 4a68f9a committed on 7.x-2.x
    Issue #2711165 by sadiefp, Bladedu: Fix DOMDocument LoadHTML Bug.
    *...

scottrigby commented

Assigned: Unassigned » scottrigby

scottrigby commented

Assigned: scottrigby » Unassigned
Status: Needs review » Fixed

This is merged. Thanks @sadiefp, @Bladedu, and @Les Lim!

@Les Lim or anyone: if this continues to cause problems, please let us know and we can keep working on it. For now this has been verified and is part of the 7.x-2.x branch.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.