Some content authors use non-breaking spaces in title fields to control how text is displayed. These and other space characters should be treated as normal spaces; otherwise the result is ugly URLs and website users who don't understand
Here's a list of all "visual" spaces:
CHARACTER TABULATION, THREE-PER-EM SPACE, FOUR-PER-EM SPACE, SIX-PER-EM SPACE, FIGURE SPACE, PUNCTUATION SPACE, HAIR SPACE, NARROW NO-BREAK SPACE, MEDIUM MATHEMATICAL SPACE, IDEOGRAPHIC SPACE
ZERO WIDTH SPACE (U+200B) is not visually represented and should simply be removed, with no substitution.
I'm working around this by copying the HTML-rendered spaces into the "Strings to remove" settings field. It doesn't produce the desired effect of hyphenated strings, but it's better than seeing %C2%A0 in URLs.
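To illustrate where %C2%A0 comes from and why ZERO WIDTH SPACE needs separate handling, here is a minimal plain-PHP sketch. It is not Pathauto code; the sample titles and variable names are made up for the example.

```php
<?php
// NO-BREAK SPACE (U+00A0) is encoded in UTF-8 as the two bytes 0xC2 0xA0,
// so percent-encoding a title that contains it yields %C2%A0.
$title = "my\u{00A0}title";
$encoded = rawurlencode($title);
echo $encoded, "\n"; // my%C2%A0title

// ZERO WIDTH SPACE (U+200B) is not classified as whitespace,
// so \s does not match it even with the /u modifier.
$zwspIsWhitespace = preg_match('/\s/u', "\u{200B}");
echo $zwspIsWhitespace, "\n"; // 0

// One possible approach (an assumption, not the module's behaviour):
// strip it explicitly, without inserting a separator.
$clean = str_replace("\u{200B}", '', "my\u{200B}title");
echo $clean, "\n"; // mytitle
```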
Issue fork pathauto-2986375
Comments
Comment #4
anairamzap commented
Hello, I was struggling with non-breaking spaces ending up in path aliases and I found this (ancient!) ticket.
I've added an MR that should fix this behaviour by simply adding the unicode modifier to the preg_replace that replaces spaces with the separator. That should cover, AFAIK, all possible spaces, and will replace them with the configured separator.
I'm also changing the ticket to "Bug Report" as I think the replacement for spaces in URLs should cover all possible spaces, not only the "regular" space, especially since the string is user input (Title field) and, depending on how that input is entered, different "types" of spaces could end up there.
Setting as Needs Review so someone can take a look.
Thanks!
m
----
Reference of spaces: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
I've tested with:
- nbsp
- nnbsp
- numsp
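A minimal sketch of the idea behind the MR, using the three characters tested above. The function name replace_spaces() and the sample strings are hypothetical; the real change lives in Pathauto's alias cleaning code and is not reproduced here.

```php
<?php
/**
 * Hypothetical helper: replaces runs of whitespace with a separator.
 * With the /u modifier, PHP treats the subject as UTF-8 and \s also
 * matches Unicode space characters.
 */
function replace_spaces(string $string, string $separator = '-'): string {
  // preg_replace() returns NULL on error (e.g. invalid UTF-8);
  // fall back to the input in that case.
  return preg_replace('/\s+/u', $separator, $string) ?? $string;
}

$samples = [
  'nbsp'  => "my\u{00A0}title", // NO-BREAK SPACE
  'nnbsp' => "my\u{202F}title", // NARROW NO-BREAK SPACE
  'numsp' => "my\u{2007}title", // FIGURE SPACE
];

foreach ($samples as $name => $input) {
  echo $name, ': ', replace_spaces($input), "\n"; // each prints my-title
}
```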
Comment #5
berdir commented
Seems sensible. It should be easy to extend an existing test and add another case that has a few of these in it, checking the result?
Wondering how this relates to #3181986: Add support for unicode soft hyphens on punctuation logic. I think that's not in the list, so the behavior would not change?
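A quick standalone check of that assumption: SOFT HYPHEN (U+00AD) is a format character rather than a space, so \s with the /u modifier should leave it alone. This is a plain-PHP sketch with made-up sample strings, not code from either issue.

```php
<?php
// SOFT HYPHEN is not whitespace, NO-BREAK SPACE is.
$softHyphenMatches = preg_match('/\s/u', "\u{00AD}");
$nbspMatches = preg_match('/\s/u', "\u{00A0}");
echo $softHyphenMatches, ' ', $nbspMatches, "\n"; // 0 1

// Only the regular space is replaced; the soft hyphen stays in the string.
$alias = preg_replace('/\s+/u', '-', "soft\u{00AD}hyphen title");
echo rawurlencode($alias), "\n"; // soft%C2%ADhyphen-title
```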
Comment #6
anairamzap commented
Hi, thanks for the quick review.
I've just added checks for 6 types of visual spaces to a new testCleanStringSpaces() method in PathautoKernelTest(). I created a new method because I noticed that the "transliteration" config needs to be set to FALSE in order to get the behaviour described in this ticket.
Not sure if we should also add a test to testCleanString() that uses the default configuration (transliteration: TRUE). Let me know if that's needed and I can add them in there as well.
Regarding the ticket for soft hyphens: I'm not sure how it is related to this one. We are not changing the punctuation, just adding the unicode modifier to the preg_replace.
Cheers,
m
Comment #8
mably commented
Rebased MR.
Comment #9
mably commented
Summary
Non-breaking spaces and other Unicode whitespace characters in entity titles were URL-encoded (%C2%A0) instead of being replaced by the separator in aliases.
Problem
Content authors often use special space characters (nbsp, thin space, em space, etc.) in titles for display purposes. When Pathauto generated aliases, the regex /\s+/ in AliasCleaner::cleanString() only matched ASCII whitespace; it did not recognize Unicode whitespace like non-breaking spaces (U+00A0), narrow no-break spaces (U+202F), figure spaces (U+2007), etc. These characters were left as-is and then URL-encoded, producing ugly aliases like /my%C2%A0title instead of /my-title.
Fix
- Added the /u (unicode) modifier to the whitespace replacement regex, changing /\s+/ to /\s+/u. This makes \s match all Unicode whitespace characters, which are then correctly replaced by the configured separator.
- Added testCleanStringSpaces() covering six Unicode space types (nbsp, nnbsp, numsp, thinsp, emsp, and ensp), each verified to be replaced by the separator. Transliteration is disabled in the test to isolate the regex behavior.
Comment #10
mably commented