Problem/Motivation
Hey, After updating to Drupal 10.2.5, I've encountered an issue in one of my projects where our automated tests related to the HTML filter are failing. The problem arises with HTML content that includes a non-breaking space (` `) within a URL, which should not be part of the URL.
Steps to reproduce
In our content, we have a `p` element with a URL that includes a non-breaking space at the end:
For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july
Current output:
For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july
Expected output:
For details of the Last Post Service: https://www.shrine.org.au/last-post-service-31-july
Screenshot
Drupal 10.2.5
Image 1
Image 1

Drupal 10.1.7
Image 2
Image 2
no issue, and it looks good.

it also happens on the situation below
<h2>
Another Heading 2
</h2>
<p>
</p>
<h3>
Another Heading 3
</h3>
Drupal 10.2.5
Image 3
Image 3

Drupal 10.1.7
Image 4
Image 4
no issue, and it looks good.

Temporary workaround:
I managed to resolve this issue temporarily by adding the following line in core/lib/Drupal/Component/Utility/Html::serialize just before the return statement:
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
I am looking for guidance or confirmation on whether this is an issue with the recent updates and if there is a planned fix for future releases. Any suggestions on alternative solutions or patches would also be greatly appreciated.
Thank you!
| Comment | File | Size | Author |
|---|---|---|---|
| #27 | drupal_core_non-breaking_space_issue_3445910_27.patch | 1000 bytes | yovince |
Issue fork drupal-3445910
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
yovinceComment #3
cilefen commentedFrom which Drupal version did you upgrade? Can you please identify the commit in Drupal core that changed behavior? We need that to better understand the situation. The release notes are helpful for that. If you are adept with Git, a
git bisectoperation will find the commit quickly.Comment #4
yovincehey @cliefen
thanks for your response.
We upgraded from Drupal 10.1.7 to 10.2.5. The issue can be found at: https://www.drupal.org/project/drupal/issues/2441811. The commit associated with this issue is
201ae2e35438b7d8f7c831ba8ac33bfc035bbb0a. and the merge request: https://git.drupalcode.org/project/drupal/-/merge_requests/4274/diffsComment #5
cilefen commentedPlease update your examples above. Wrap them in the
<code>tag because they all look the same.Comment #6
yovinceComment #7
yovinceThanks, It should look better now
Comment #8
yovinceComment #9
yovinceComment #10
yovinceComment #11
cilefen commentedYour patch does not apply. You should test on and patch the development branch.
Comment #12
yovinceAh, it can be applied to the
10.2.xbranch, which is the development branch, right? I think the issue is that when I uploaded the patch file on this page, the Test with options didn't include10.2.x. So, the test program tried to apply it to10.1.x, which caused it to fail?Comment #13
yovinceComment #14
cilefen commentedYou should make a pull request instead.
Comment #15
mfbSo the URL filter, which can be found in the _filter_url() function in filter.module, was developed in the context of previous versions of Drupal, where HTML was serialized by calling saveXML(), thus outputting literal UTF-8 characters, since XML doesn't support most HTML entities. After HTML 5 support was added, HTML serialization results in HTML entities rather than literal UTF-8 characters.
I think the URL filter could be overhauled to support this use case? Parsing HTML with regex is generally frowned upon, but that's what this filter does. If HTML was parsed properly then text nodes could be modified, regardless of HTML entity encoding? Tweaking the serialization rules might be possible too, to partially restore previous behavior, but that seems really hacky.
Comment #16
yovinceComment #17
yovinceComment #18
yovincehey @mfb, thank you for your reply! The issue is not just with URLs not being correctly filtered, but also with certain situations not being appropriately filtered. Please take a look at Image 3 and Image 4.
I have changed the title of this ticket.
Comment #19
yovincehey @cilefen
thanks, I have created a PR.
Comment #20
xjmTo address this issue, we should first create a merge request against 11.x. The easiest thing to do is probably to close the current MR and create a new one against 11.x. Thanks!
Comment #22
yovincehi @xjm,
Thank you for your reply. I have created a merge request against the 11.x branch
Comment #23
arun.k commentedhi @yovince we can solve this issue by doing in this way
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
// Remove empty paragraphs, including those with non-breaking spaces.
$text = preg_replace('/
( |\s|
)*<\/p>/', '', $html);
return $text;
Please check this i hope it works.
Comment #24
mfbAs can be seen in the test failures, the approach in the merge request doesn't work; you cannot simply decode HTML entities when serializing HTML - the result would be both invalid and unsafe.
Comment #27
yovincethanks @mfb for your comments.
I committed a MR that seems to have fixed my issue. I also added a patch to prevent problems if the pull request is closed or updated.
Comment #28
yovinceComment #29
sea2709 commentedI encounter this issue as well. I notice that in some cases, instead of putting a real space, the editor puts a
The patch #27 works on my project. I'm a little bit concerned about if this is the root cause of this issue. I think the issue is from the editor more than the filtering process.
Comment #30
trackleft2Looks like this needs to be back in Needs Review.
@sea2709, there may be an issue in the ckeditor5 issue queue about this. https://github.com/ckeditor/ckeditor5/issues?q=sort%3Aupdated-desc+is%3A...
Comment #31
smustgrave commentedSo can issue summary be updated to use the standard template, specifically "Proposed solution". Also will need some test coverage please.