Some content authors use non-breaking spaces in title fields to control how text is displayed. These and other space characters should be treated as normal spaces; otherwise they produce ugly URLs and website users who don't understand

Here's a list of all "visual" spaces:

  •  
  •  
  •  
  •  
  • 	 CHARACTER TABULATION
  •   THREE-PER-EM SPACE
  •   FOUR-PER-EM SPACE
  •   SIX-PER-EM SPACE
  •   FIGURE SPACE
  •   PUNCTUATION SPACE
  •   HAIR SPACE
  •   NARROW NO-BREAK SPACE
  •   MEDIUM MATHEMATICAL SPACE
  •   IDEOGRAPHIC SPACE



ZERO WIDTH SPACE (U+200B) has no visual representation and should simply be removed, with no substitution.

I'm working around this by copying the HTML-rendered spaces into the "Strings to remove" settings field. It doesn't produce the desired effect of hyphenated strings, but it's better than seeing %C2%A0 in URLs.
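The requested behaviour (visual spaces become the separator, zero-width spaces are dropped) could be sketched roughly as below. This is only an illustration: `normalize_visual_spaces()` is a hypothetical helper, not a Pathauto function, and it assumes the spaces listed above all fall into the Unicode "space separator" category (`\p{Zs}`) or are a tab.

```php
<?php

/**
 * Hypothetical helper (not part of Pathauto): replaces runs of visual space
 * characters with a separator and removes zero-width spaces entirely.
 */
function normalize_visual_spaces(string $text, string $separator = '-'): string {
  // ZERO WIDTH SPACE (U+200B) has no width, so drop it without substitution.
  $text = str_replace("\u{200B}", '', $text);
  // \p{Zs} is the Unicode "space separator" category; \t covers tabulation.
  $result = preg_replace('/[\p{Zs}\t]+/u', $separator, $text);
  // preg_replace() returns NULL on error (for example invalid UTF-8 input);
  // fall back to the unmodified text in that case.
  return $result ?? $text;
}

// NO-BREAK SPACE, THIN SPACE and a ZERO WIDTH SPACE inside one title.
echo normalize_visual_spaces("my\u{00A0}long\u{2009}ti\u{200B}tle"), PHP_EOL;
// prints "my-long-title"
```

Unlike the "Strings to remove" workaround, a replacement like this keeps word boundaries visible in the alias.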

Issue fork pathauto-2986375

Comments

fatmarker created an issue. See original summary.

anairamzap made their first commit to this issue’s fork.

anairamzap commented:

Category: Feature request » Bug report
Status: Active » Needs review

Hello, I was struggling with non-breaking spaces ending up in path aliases and I found this (ancient!) ticket.

I've added an MR that should fix this behaviour by simply adding the unicode modifier to the preg_replace that replaces the spaces with the separator. AFAIK, that should cover all possible spaces and replace them with the configured separator.

I'm also changing the ticket to "Bug report", as I think the replacement of spaces in URLs should cover all possible spaces, not only the "regular" space, especially since the string is user input (the Title field) and, depending on how that input is entered, different "types" of spaces could end up there.

Setting as Needs Review so someone can take a look.

Thanks!
m

----

Reference of spaces: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

I've tested with:
- nbsp
- nnbsp
- numsp
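
For readers unfamiliar with the modifier, here is a minimal standalone sketch of the difference it makes. The variable names and the sample title are made up for illustration, and this is not the code from the MR. The result without the modifier can depend on the PHP version and locale, so only the `/u` result is stated as certain.

```php
<?php

// Sample title with a NO-BREAK SPACE (U+00A0) between the two words.
$title = "my\u{00A0}title";
$separator = '-';

// Without the u modifier the pattern works on bytes, so the two-byte UTF-8
// sequence for U+00A0 is normally not treated as whitespace.
$without = preg_replace('/\s+/', $separator, $title);

// With the u modifier the subject is treated as UTF-8 and \s also matches
// Unicode space characters.
$with = preg_replace('/\s+/u', $separator, $title);

var_dump($without === $title);
var_dump($with === 'my-title'); // bool(true)
```

One thing to keep in mind: with the `u` modifier, preg_replace() returns NULL if the subject is not valid UTF-8.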

berdir commented:

Seems sensible, should be easy to extend an existing test and add another case that has a few of these in it, checking the result?

Wondering how this relates to #3181986: Add support for unicode soft hyphens on punctuation logic. I think that's not in the list, so the behavior would not change?

anairamzap commented:

Hi, thanks for the quick review.
I've just added checks for six types of visual spaces in a new testCleanStringSpaces() method in PathautoKernelTest. I created a new method because I noticed that the "transliteration" config needs to be set to FALSE in order to get the behaviour described in this ticket.

Not sure if we should also add a test to testCleanString() that uses the default configuration (transliteration: TRUE).

Let me know if that's needed and I can add them in there as well.

Regarding the ticket for soft hyphens: I'm not sure how it is related to this one. We are not changing the punctuation handling, just adding the unicode modifier to the preg_replace.

Cheers,
m

mably made their first commit to this issue’s fork.

mably commented:

Rebased MR.

mably commented:

Assigned: Unassigned » berdir

Summary

Non-breaking spaces and other Unicode whitespace characters in entity titles were URL-encoded (%C2%A0) instead of being replaced by the separator in aliases.

Problem

Content authors often use special space characters (nbsp, thin space, em space, etc.) in titles for display purposes. When Pathauto generated aliases, the regex /\s+/ in AliasCleaner::cleanString() only matched ASCII whitespace — it did not recognize Unicode whitespace like non-breaking spaces (U+00A0), narrow no-break spaces (U+202F), figure spaces (U+2007), etc. These characters were left as-is and then URL-encoded, producing ugly aliases like /my%C2%A0title instead of /my-title.
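
Where the %C2%A0 comes from can be seen with a tiny standalone snippet. It uses PHP's rawurlencode() purely as a stand-in for percent-encoding; how Drupal actually encodes the alias is not shown in this issue, so treat that part as an assumption.

```php
<?php

// A title containing a NO-BREAK SPACE (U+00A0) that was left in the alias.
$alias = "my\u{00A0}title";

// U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0.
echo bin2hex("\u{00A0}"), PHP_EOL; // c2a0

// Percent-encoding escapes each of those bytes separately.
$encoded = rawurlencode($alias);
echo $encoded, PHP_EOL; // my%C2%A0title
```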

Fix

  • AliasCleaner.php: Added the /u (unicode) modifier to the whitespace replacement regex, changing /\s+/ to /\s+/u. This makes \s match all Unicode whitespace characters, which are then correctly replaced by the configured separator.
  • PathautoKernelTest.php: Added testCleanStringSpaces() covering six Unicode space types — nbsp, nnbsp, numsp, thinsp, emsp, and ensp — each verified to be replaced by the separator. Transliteration is disabled in the test to isolate the regex behavior.
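
The six cases can be approximated outside Drupal with the sketch below. It is not the actual testCleanStringSpaces() code: it calls preg_replace() directly instead of going through AliasCleaner, and the sample string and the hard-coded "-" separator are assumptions for illustration.

```php
<?php

// The six space types named above, keyed by their HTML entity names.
$spaces = [
  'nbsp'   => "\u{00A0}", // NO-BREAK SPACE
  'nnbsp'  => "\u{202F}", // NARROW NO-BREAK SPACE
  'numsp'  => "\u{2007}", // FIGURE SPACE
  'thinsp' => "\u{2009}", // THIN SPACE
  'emsp'   => "\u{2003}", // EM SPACE
  'ensp'   => "\u{2002}", // EN SPACE
];

$results = [];
foreach ($spaces as $name => $char) {
  // Same pattern as the fix: \s+ with the u modifier.
  $results[$name] = preg_replace('/\s+/u', '-', "test{$char}title");
}

foreach ($results as $name => $cleaned) {
  // Each line should end in "test-title".
  echo str_pad($name, 7), $cleaned, PHP_EOL;
}
```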

mably commented:

Category: Bug report » Feature request