Punctuation processed before replacing strings [#3311669]

Problem/Motivation

On French-language sites, product titles may contain punctuation inside a word, most notably apostrophes.
It seems that punctuation is processed before strings are replaced, so a string such as "pipe d'admission" becomes "pipe-dadmission".
It doesn't seem possible to reorder the processing so that string replacement happens before punctuation handling.

Steps to reproduce

Create products whose titles contain strings like "pipe d'admission", add "d" to the strings to remove, and run the automatic generation of URL aliases.

Issue fork pathauto-3311669

3311669-punctuation-processed-before changes, plain diff MR !156

Comments

Comment #1

25 September 2022 at 08:48

alexdezark created an issue. See original summary.

Comment #2

mably commented 7 February 2026 at 19:38

Version:	8.x-1.11	» 8.x-1.x-dev
Category:	Support request	» Feature request

Claude analysis of #3311669

The issue is caused by the processing order in AliasCleaner::cleanString(). Currently the steps are:

1. Remove HTML tags
2. Replace/remove punctuation
3. Transliterate
4. Reduce to letters/numbers
5. Remove ignored words

For "pipe d'admission" with apostrophe set to "Remove" and "d" in the ignored words list:

Step 2 removes the apostrophe → "pipe dadmission"
Step 5 can't find "d" as a standalone word anymore → result is "pipe-dadmission"

The expected result is "pipe-admission": the ignored word "d" should be removed before the apostrophe merges it with "admission".
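
The failure can be reproduced with a minimal sketch. This is a simplified Python model of the ordering described above, not Pathauto's actual PHP code: the function name clean_current_order is hypothetical, and only the apostrophe ("Remove") and the ignored-words step are modelled.

```python
import re

def clean_current_order(text, ignored_words, separator="-"):
    """Simplified model of the current order: punctuation first, ignored words last."""
    # Step 2 (simplified): the apostrophe is configured as "Remove".
    text = text.replace("'", "")
    # Step 5: remove ignored words, matched as whole words only.
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Join the remaining words with the separator.
    return separator.join(text.split())

print(clean_current_order("pipe d'admission", ["d"]))  # pipe-dadmission
```

Once the apostrophe is gone, "d" is the first letter of "dadmission" and no longer has a word boundary after it, so the ignored-words pattern cannot match.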

Proposed fix

Moving the ignored words removal before punctuation replacement would fix this specific case, since \b matches the word boundary between "d" and the apostrophe. However, changing the order globally could have unintended side effects for other languages and configurations.

A safer approach would be to split the punctuation step in two: first replace punctuation characters configured as "Remove" or "Replace by separator" with spaces (preserving word boundaries), then remove ignored words, then clean up separators. This way "d'admission" becomes "d admission" before the ignored words step, "d" is correctly matched and removed, and the final result is "pipe-admission".
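
A rough sketch of this split approach, again as a simplified Python model rather than Pathauto code. The function name clean_split_punctuation is hypothetical, and the sketch assumes the apostrophe is the only punctuation character set to "Remove" or "Replace by separator".

```python
import re

def clean_split_punctuation(text, ignored_words, separator="-"):
    """Three-phase model: punctuation to spaces, ignored words, separator cleanup."""
    # Phase 1: replace the configured punctuation with spaces so that
    # word boundaries are preserved for the next phase.
    text = re.sub(r"['\u2019]", " ", text)
    # Phase 2: remove ignored words, matched as whole words only.
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Phase 3: collapse the remaining whitespace into single separators.
    return separator.join(text.split())

print(clean_split_punctuation("pipe d'admission", ["d"]))  # pipe-admission
```

In this sketch a character set to "Remove" ends up behaving like a separator when the word before it is not ignored; how the real cleanup phase should treat that case is left open here.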

Comment #3

13 February 2026 at 19:30

mably opened merge request !156

Comment #4

mably commented 13 February 2026 at 19:38

Impact analysis of the current MR, which processes ignored words before punctuation.

The change moves ignored words removal before punctuation replacement in cleanString(). This works because \b (word boundary) already matches between a word character and a punctuation character, so the ignored words regex handles raw text correctly.

Romance languages (French, Italian, etc.) — the fix target

\bd\b correctly matches "d" in "d'admission" because the apostrophe is a non-word character, creating a word boundary. Same for l', c', s', n', j' in French and l', dell' in Italian.
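
The boundary behaviour can be checked in isolation. The snippet below uses Python's re module as a stand-in; it assumes the ignored-words pattern is built as an alternation wrapped in \b, which is how the comment describes it. The helper name remove_ignored is hypothetical.

```python
import re

def remove_ignored(text, ignored_words):
    """Remove ignored words from raw text, matched as whole words only."""
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

# The apostrophe is a non-word character, so \b matches right after "d".
print(remove_ignored("pipe d'admission", ["d"]))  # pipe 'admission
```

The leftover apostrophe and extra spaces are then handled by the later punctuation and separator steps.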

Potential concern: English contractions

If a user adds "it" to the ignored words list and the title is "it's complicated":

Before: apostrophe removed → "its complicated" → \bit\b doesn't match "its" → "its-complicated"
After: \bit\b matches "it" in "it's" (apostrophe is a word boundary) → "'s complicated" → "s-complicated"

However this is unlikely to be a real problem because the default ignored words (a, an, as, at, before, but, by, for, from, is, in, into, like, of, off, on, onto, per, since, than, the, this, that, to, up, via, with) are not typical English contraction prefixes. You don't see "the'll" or "from's".
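
The two orderings can be compared side by side. This is a simplified Python model, not the module's code: alias_before and alias_after are hypothetical names, and only the apostrophe ("Remove") and ignored-words steps are modelled.

```python
import re

def _strip_ignored(text, ignored_words):
    """Remove ignored words, matched as whole words only."""
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

def alias_before(text, ignored_words):
    """Old order: remove the apostrophe, then the ignored words."""
    text = _strip_ignored(text.replace("'", ""), ignored_words)
    return "-".join(text.split())

def alias_after(text, ignored_words):
    """New order: remove the ignored words, then the apostrophe."""
    text = _strip_ignored(text, ignored_words).replace("'", "")
    return "-".join(text.split())

print(alias_before("it's complicated", ["it"]))  # its-complicated
print(alias_after("it's complicated", ["it"]))   # s-complicated
print(alias_after("pipe d'admission", ["d"]))    # pipe-admission
```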

Potential concern: transliteration ordering

Ignored words now run before transliteration, so they match against the original (possibly accented) text. Example: German "über die Brücke" with "die" as ignored → \bdie\b still matches fine because "die" is ASCII and surrounded by spaces. The result is the same.

The only edge case would be someone adding a transliterated form as an ignored word (e.g. "ueber" to catch transliterated "über"). This is a very unusual configuration — people type ignored words as they appear in their content.
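
A small sketch of that edge case, in Python. The transliteration table is a hypothetical two-entry stand-in chosen to match the "ueber" example; it is not the transliteration Drupal actually applies, and the function names are illustrative only.

```python
import re

# Hypothetical, minimal transliteration table, for illustration only.
TRANSLIT = str.maketrans({"ü": "ue", "Ü": "Ue"})

def _strip_ignored(text, ignored_words):
    """Remove ignored words, matched as whole words only."""
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

def ignored_then_translit(text, ignored_words):
    """New order: ignored words are matched against the original text."""
    text = _strip_ignored(text, ignored_words).translate(TRANSLIT)
    return "-".join(text.lower().split())

def translit_then_ignored(text, ignored_words):
    """Old order: ignored words are matched against transliterated text."""
    text = _strip_ignored(text.translate(TRANSLIT), ignored_words)
    return "-".join(text.lower().split())

title = "über die Brücke"
print(ignored_then_translit(title, ["die"]))    # ueber-bruecke
print(translit_then_ignored(title, ["die"]))    # ueber-bruecke
print(ignored_then_translit(title, ["ueber"]))  # ueber-die-bruecke
print(translit_then_ignored(title, ["ueber"]))  # die-bruecke
```

Only the last two lines differ: an ignored word written in its transliterated form no longer matches once the ignored-words step runs on the original text.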

Bottom line

The change is safe for realistic usage. \b handles punctuation boundaries correctly, which is the whole point of the fix. No default ignored word triggers the English contraction edge case.

Comment #5

mably commented 13 February 2026 at 20:48

Status:	Active	» Needs review

Comment #6

18 February 2026 at 07:54

anybody made their first commit to this issue’s fork.

Comment #7

anybody commented 18 February 2026 at 07:54

Status:	Needs review	» Reviewed & tested by the community

LGTM, tests are green and logically this also makes a lot of sense to me!
