Problem/Motivation

Hey, After updating to Drupal 10.2.5, I've encountered an issue in one of my projects where our automated tests related to the HTML filter are failing. The problem arises with HTML content that includes a non-breaking space (` `) within a URL, which should not be part of the URL.

Steps to reproduce

In our content, we have a `p` element with a URL that includes a non-breaking space at the end:

    For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july 

Current output:

  For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july 

Expected output:

  For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july 

Screenshot

Drupal 10.2.5

Image 1

issue

Drupal 10.1.7

Image 2

no issue, and it looks good.
issue

it also happens on the situation below

<h2>
    Another Heading 2
</h2>
<p>
    &nbsp;
</p>
<h3>
    Another Heading 3
</h3>

Drupal 10.2.5

Image 3

10.2.x

Drupal 10.1.7

Image 4

no issue, and it looks good.
10.1.x

Temporary workaround:
I managed to resolve this issue temporarily by adding the following line in core/lib/Drupal/Component/Utility/Html::serialize just before the return statement:
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

I am looking for guidance or confirmation on whether this is an issue with the recent updates and if there is a planned fix for future releases. Any suggestions on alternative solutions or patches would also be greatly appreciated.

Thank you!

Issue fork drupal-3445910

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

yovince created an issue. See original summary.

yovince’s picture

cilefen’s picture

Component: other » base system
Status: Active » Needs work

From which Drupal version did you upgrade? Can you please identify the commit in Drupal core that changed behavior? We need that to better understand the situation. The release notes are helpful for that. If you are adept with Git, a git bisect operation will find the commit quickly.

yovince’s picture

hey @cliefen
thanks for your response.

We upgraded from Drupal 10.1.7 to 10.2.5. The issue can be found at: https://www.drupal.org/project/drupal/issues/2441811. The commit associated with this issue is 201ae2e35438b7d8f7c831ba8ac33bfc035bbb0a. and the merge request: https://git.drupalcode.org/project/drupal/-/merge_requests/4274/diffs

cilefen’s picture

Please update your examples above. Wrap them in the <code> tag because they all look the same.

yovince’s picture

Issue summary: View changes
yovince’s picture

Thanks, It should look better now

yovince’s picture

Issue summary: View changes
StatusFileSize
new68.97 KB
yovince’s picture

Issue summary: View changes
StatusFileSize
new67.8 KB
yovince’s picture

Issue summary: View changes
StatusFileSize
new40.02 KB
new53.7 KB
cilefen’s picture

Your patch does not apply. You should test on and patch the development branch.

yovince’s picture

Ah, it can be applied to the 10.2.x branch, which is the development branch, right? I think the issue is that when I uploaded the patch file on this page, the Test with options didn't include 10.2.x. So, the test program tried to apply it to 10.1.x, which caused it to fail?

yovince’s picture

cilefen’s picture

You should make a pull request instead.

mfb’s picture

So the URL filter, which can be found in the _filter_url() function in filter.module, was developed in the context of previous versions of Drupal, where HTML was serialized by calling saveXML(), thus outputting literal UTF-8 characters, since XML doesn't support most HTML entities. After HTML 5 support was added, HTML serialization results in HTML entities rather than literal UTF-8 characters.

I think the URL filter could be overhauled to support this use case? Parsing HTML with regex is generally frowned upon, but that's what this filter does. If HTML was parsed properly then text nodes could be modified, regardless of HTML entity encoding? Tweaking the serialization rules might be possible too, to partially restore previous behavior, but that seems really hacky.

yovince’s picture

Title: Issue with HTML `&nbsp;` not being correctly filtered out from URLs » Issue with HTML `&nbsp;` not being correctly filtered out
yovince’s picture

Issue summary: View changes
yovince’s picture

hey @mfb, thank you for your reply! The issue is not just with URLs not being correctly filtered, but also with certain situations not being appropriately filtered. Please take a look at Image 3 and Image 4.
I have changed the title of this ticket.

yovince’s picture

hey @cilefen

You should make a pull request instead.

thanks, I have created a PR.

xjm’s picture

Version: 10.2.x-dev » 11.x-dev

To address this issue, we should first create a merge request against 11.x. The easiest thing to do is probably to close the current MR and create a new one against 11.x. Thanks!

yovince’s picture

hi @xjm,
Thank you for your reply. I have created a merge request against the 11.x branch

arun.k’s picture

hi @yovince we can solve this issue by doing in this way

$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// Remove empty paragraphs, including those with non-breaking spaces.
$text = preg_replace('/

( |\s|
)*<\/p>/', '', $html);

return $text;

Please check this i hope it works.

mfb’s picture

As can be seen in the test failures, the approach in the merge request doesn't work; you cannot simply decode HTML entities when serializing HTML - the result would be both invalid and unsafe.

yovince changed the visibility of the branch 3445910-issue-with-html-non-breaking-space to hidden.

yovince’s picture

thanks @mfb for your comments.
I committed a MR that seems to have fixed my issue. I also added a patch to prevent problems if the pull request is closed or updated.

yovince’s picture

Issue summary: View changes
sea2709’s picture

I encounter this issue as well. I notice that in some cases, instead of putting a real space, the editor puts a &nbsp;

The patch #27 works on my project. I'm a little bit concerned about if this is the root cause of this issue. I think the issue is from the editor more than the filtering process.

trackleft2’s picture

Status: Needs work » Needs review

Looks like this needs to be back in Needs Review.

@sea2709, there may be an issue in the ckeditor5 issue queue about this. https://github.com/ckeditor/ckeditor5/issues?q=sort%3Aupdated-desc+is%3A...

smustgrave’s picture

Status: Needs review » Needs work
Issue tags: +Needs issue summary update, +Needs tests

So can issue summary be updated to use the standard template, specifically "Proposed solution". Also will need some test coverage please.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.