Upgrading from 10.1 to 10.2(.2) changed the behavior of the serialize function (Html::serialize() in Drupal/Component/Utility/Html.php).

With this new version, accented characters seem to be removed.
Example:
<a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>

returns:
<a href="https://www.mywebsite.com/services/identite">Identit</a>

Comment | File | Size | Author
#3 | 3416204-3.patch | 1.06 KB | Mistrae

Comments

Mistrae’s picture

I created a patch that reverts the latest change, in case anyone needs this fixed before a better solution is found.

longwave’s picture

Status: Active » Postponed (maintainer needs more info)

&eacute; is normalised to é but should not be stripped:

> \Drupal\Component\Utility\Html::normalize('<a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>');
= "<a href="https://www.mywebsite.com/services/identite">Identité</a>"

Can you provide an example similar to the above that fails?

Mistrae’s picture

@longwave, the problem is in serialize, not normalize.

Example:

DOMDocument with:

<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html>
    <html>
      <body>
        <a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>
      </body>
    </html>

Run Html::serialize and get:
<a href="https://www.mywebsite.com/services/identite">Identit</a>

longwave’s picture

Well, normalize() just calls load() then serialize(). Can you give a full code snippet that fails please?

Mistrae’s picture

Here is the full code that can recreate the error:

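// Load a fragment containing an accented character.
$html_dom = \Drupal\Component\Utility\Html::load(\Drupal\Core\Render\Markup::create('Identité'));
$body = $html_dom->getElementsByTagName('body');
$node = $body->item(0);
$child = $node->childNodes->item(0);
$text = $child->textContent;
// Encode the text as HTML entities ("Identit&eacute;").
$text = htmlentities($text, ENT_QUOTES, 'UTF-8');
// Wrap the encoded text in a new <a> element and swap it in.
$element = $html_dom->createElement('a', $text);
$node->replaceChild($element, $child);
// The serialized output is missing the accented character.
\Drupal\Component\Utility\Html::serialize($html_dom);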
longwave’s picture

> $dom = new DOMDocument(); $dom->loadHTML('<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html><html><body><a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a></body></html>');
= true

> \Drupal\Component\Utility\Html::serialize($dom);
= "<a href="https://www.mywebsite.com/services/identite">Identité</a>"
Mistrae’s picture

If I input the text directly, yes, it works. Maybe it's htmlentities that doesn't work with the new function.

longwave’s picture

$text = htmlentities($text, ENT_QUOTES, 'UTF-8');

This is the problem. If you remove this line, the issue goes away.
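
For anyone who needs the same result without the htmlentities() call, here is a minimal sketch using only PHP's built-in DOM classes (no Drupal code involved). It assumes the goal is simply to wrap UTF-8 text in a new <a> element; the href value is reused from the example above. The text is attached as a text node, so escaping is left to whatever serializes the document.

```php
<?php
// Plain PHP DOM only; no Drupal classes are needed for this sketch.
$dom = new DOMDocument('1.0', 'UTF-8');

// Create the element empty, then attach the raw UTF-8 text as a text node.
// No htmlentities() call: the text stays a single ordinary text node.
$link = $dom->createElement('a');
$link->setAttribute('href', 'https://www.mywebsite.com/services/identite');
$link->appendChild($dom->createTextNode('Identité'));
$dom->appendChild($link);

// Check that the only child of <a> is a text node holding the full string.
var_dump($link->firstChild->nodeType === XML_TEXT_NODE);
var_dump($link->textContent === 'Identité');
```

In the failing snippet, this would correspond to dropping the htmlentities() line and passing the unencoded text into the new element.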

longwave’s picture

This might be an upstream bug in \Masterminds\HTML5\Serializer\Traverser::node().

In this case, what has happened is that we have injected an entity reference directly into the DOM: $node->nodeType is XML_ENTITY_REF_NODE, but the switch statement does not handle this case.
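
To make the node shape described above easier to see, here is a small sketch with plain PHP DOM that builds it by hand. It is an illustration only: the entity reference is created explicitly with createEntityReference() rather than through createElement(), and the variable names are made up for the example.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$link = $dom->createElement('a');
$dom->appendChild($link);

// Build the shape by hand: a text node followed by an entity reference
// node, instead of one text node containing "Identité".
$link->appendChild($dom->createTextNode('Identit'));
$link->appendChild($dom->createEntityReference('eacute'));

// Collect the node type of each child. A serializer whose switch only
// covers elements, text, comments and so on would skip the second child.
$types = [];
foreach ($link->childNodes as $child) {
    $types[] = $child->nodeType;
}
var_dump($types === [XML_TEXT_NODE, XML_ENTITY_REF_NODE]);
```

If the traverser ignores the second child, only "Identit" is written out, which matches the reported output.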

Mistrae’s picture

OK, thanks. Just to be clear: does that mean that since 10.2 we cannot use htmlentities with serialize, and that this will be considered "won't fix"? Or should something be done here?

longwave’s picture

Title: Serialize function strips accents » [PP-upstream] Serialize function strips accents
Status: Postponed (maintainer needs more info) » Postponed

Thanks for reporting! I have reported this upstream at https://github.com/Masterminds/html5-php/issues/244 with a slightly modified example. Let's wait to see what the maintainer there has to say. If they decline to fix it, we can still override this in Drupal and serialize entity references correctly.
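
Until this is settled, one possible stopgap on the calling side is to convert entity reference nodes back into text nodes before serializing. The sketch below is only an assumption about how that could look: replace_entity_references() is a hypothetical helper, not part of Drupal or html5-php, and it relies on html_entity_decode() knowing the entity name.

```php
<?php
/**
 * Hypothetical helper (not part of Drupal or html5-php): replaces every
 * entity reference node under $root with an equivalent UTF-8 text node.
 */
function replace_entity_references(DOMNode $root): void {
    // Collect the references first; replacing nodes while walking the
    // tree would disturb the sibling links being followed.
    $refs = [];
    $stack = [$root];
    while ($stack) {
        $node = array_pop($stack);
        if ($node->nodeType === XML_ENTITY_REF_NODE) {
            $refs[] = $node;
            continue;
        }
        for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
            $stack[] = $child;
        }
    }
    foreach ($refs as $ref) {
        // nodeName holds the bare entity name, e.g. "eacute". Unknown
        // names are left as the literal "&name;" string.
        $text = html_entity_decode('&' . $ref->nodeName . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $ref->parentNode->replaceChild($ref->ownerDocument->createTextNode($text), $ref);
    }
}

// Usage: a document holding "Identit" followed by an &eacute; reference.
$dom = new DOMDocument('1.0', 'UTF-8');
$link = $dom->createElement('a');
$dom->appendChild($link);
$link->appendChild($dom->createTextNode('Identit'));
$link->appendChild($dom->createEntityReference('eacute'));

replace_entity_references($dom);
var_dump($link->textContent === 'Identité');
```

After the call, the document contains only ordinary text nodes, so it no longer depends on how the serializer treats XML_ENTITY_REF_NODE.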

Version: 10.2.x-dev » 11.x-dev


gaurav.kapoor’s picture

On one of our websites, we use Smart Trim and wrap the generated summary in a link to the respective node. Special characters such as the German umlauts 'ä, ö, ü' and 'ß' then do not show up in the generated trimmed text. The patch from #3 resolved the issue.