Problem/Motivation
On French-language sites, product titles may contain punctuation inside a word, particularly apostrophes.
Punctuation appears to be processed before the "strings to remove" are replaced, so a string such as "pipe d'admission" becomes "pipe-dadmission".
It does not seem possible to reorder the processing so that the strings are removed before punctuation is handled.
Steps to reproduce
Create products whose titles contain strings like "pipe d'admission", add "d" to the strings to remove, and run the automatic generation of URL aliases.
Issue fork pathauto-3311669
Comments
Comment #2
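mably commented
Claude analysis of #3311669
The issue is caused by the processing order in AliasCleaner::cleanString(): punctuation is currently handled before ignored words are removed.
For "pipe d'admission" with the apostrophe set to "Remove" and "d" in the ignored words list, the apostrophe is removed first, so "d" is merged into "dadmission" and no longer matches as a separate word. The result is "pipe-dadmission".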
The expected result is "pipe-admission": the ignored word "d" should be removed before the apostrophe merges it with "admission".
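The failure described above can be reproduced with a minimal Python sketch. This is not Pathauto's PHP code: the helper names (`remove_punctuation`, `remove_ignored_words`, `to_alias`, `clean_punctuation_first`) are hypothetical, and the only assumptions are that the apostrophe is configured as "Remove" and that ignored words are matched with `\b` word boundaries.

```python
import re


def remove_punctuation(text):
    # Apostrophe configured as "Remove": delete it outright.
    return text.replace("'", "")


def remove_ignored_words(text, ignored):
    # Drop whole-word matches only, using \b word boundaries.
    if not ignored:
        return text
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ignored) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def to_alias(text, separator="-"):
    # Collapse runs of whitespace into a single separator.
    return separator.join(text.split())


def clean_punctuation_first(title, ignored):
    # Order reported in this issue: punctuation first, ignored words second.
    return to_alias(remove_ignored_words(remove_punctuation(title), ignored))


print(clean_punctuation_first("pipe d'admission", ["d"]))  # pipe-dadmission
```

Once the apostrophe is gone, "d" is just the first letter of "dadmission", so `\bd\b` has nothing to match.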
Proposed fix
Moving the ignored words removal before punctuation replacement would fix this specific case, since \b matches the word boundary between "d" and the apostrophe. However, changing the order globally could have unintended side effects for other languages and configurations.
A safer approach would be to split the punctuation step in two: first replace punctuation characters configured as "Remove" or "Replace by separator" with spaces (preserving word boundaries), then remove ignored words, then clean up separators. This way "d'admission" becomes "d admission" before the ignored words step, "d" is correctly matched and removed, and the final result is "pipe-admission".
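The split approach can be sketched in Python as three phases. This is an illustration only, not the merge request's implementation: `clean_two_phase` and the `PUNCTUATION` constant are hypothetical names, and the sketch assumes the apostrophe is the only configured punctuation character.

```python
import re

# Hypothetical configuration: only the apostrophe, set to "Remove".
PUNCTUATION = "'"


def clean_two_phase(title, ignored, separator="-"):
    # Phase 1: turn configured punctuation into spaces so that
    # word boundaries survive for the next step.
    text = re.sub("[" + re.escape(PUNCTUATION) + "]", " ", title)
    # Phase 2: remove ignored words as whole words.
    if ignored:
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in ignored) + r")\b"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Phase 3: collapse the remaining whitespace into the separator.
    return separator.join(text.split())


print(clean_two_phase("pipe d'admission", ["d"]))  # pipe-admission
```

One consequence of this particular sketch is that "don't stop" yields "don-t-stop" rather than "dont-stop". A real implementation would need to decide how characters set to "Remove" are finalized after the ignored words step.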
Comment #4
mably commented
Impact analysis of the current MR, which processes ignored words before punctuation.
The change moves ignored words removal before punctuation replacement in cleanString(). This works because \b (word boundary) already matches between a word character and a punctuation character, so the ignored words regex handles raw text correctly.
Romance languages (French, Italian, etc.): the fix target
\bd\b correctly matches "d" in "d'admission" because the apostrophe is a non-word character, creating a word boundary. Same for l', c', s', n', j' in French and l', dell' in Italian.
Potential concern: English contractions
If a user adds "it" to the ignored words list and the title is "it's complicated":
Before the change: the apostrophe is removed first, so \bit\b doesn't match "its" → "its-complicated"
After the change: \bit\b matches "it" in "it's" (the apostrophe creates a word boundary) → "'s complicated" → "s-complicated"
However, this is unlikely to be a real problem because the default ignored words (a, an, as, at, before, but, by, for, from, is, in, into, like, of, off, on, onto, per, since, than, the, this, that, to, up, via, with) are not typical English contraction prefixes. You don't see "the'll" or "from's".
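The contraction edge case can be checked by running both orderings side by side. The Python sketch below is illustrative only: the function names are hypothetical and it assumes the apostrophe is set to "Remove".

```python
import re


def strip_apostrophes(text):
    # Apostrophe configured as "Remove".
    return text.replace("'", "")


def remove_ignored_words(text, ignored):
    # Whole-word removal using \b word boundaries.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ignored) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def to_alias(text, separator="-"):
    return separator.join(text.split())


def punctuation_first(title, ignored):
    # Ordering before the change.
    return to_alias(remove_ignored_words(strip_apostrophes(title), ignored))


def ignored_words_first(title, ignored):
    # Ordering after the change.
    return to_alias(strip_apostrophes(remove_ignored_words(title, ignored)))


print(punctuation_first("it's complicated", ["it"]))    # its-complicated
print(ignored_words_first("it's complicated", ["it"]))  # s-complicated
```

With an ignored word that is not a contraction prefix, such as "the" in "the cat's toy", both orderings give "cats-toy".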
Potential concern: transliteration ordering
Ignored words now run before transliteration, so they match against the original (possibly accented) text. Example: German "über die Brücke" with "die" as ignored → \bdie\b still matches fine because "die" is ASCII and surrounded by spaces. The result is the same.
The only edge case would be someone adding a transliterated form as an ignored word (e.g. "ueber" to catch transliterated "über"). This is a very unusual configuration: people type ignored words as they appear in their content.
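The transliteration ordering can be illustrated with a small Python sketch. The `TRANSLIT` table is a hypothetical stand-in for a real transliteration service and covers only "ü"; the function names are likewise assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny transliteration table.
TRANSLIT = {"ü": "ue", "Ü": "Ue"}


def transliterate(text):
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


def remove_ignored_words(text, ignored):
    # Whole-word removal using \b word boundaries.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ignored) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def ignored_then_translit(title, ignored):
    # Ordering after the change: ignored words see the accented text.
    text = transliterate(remove_ignored_words(title, ignored))
    return "-".join(text.lower().split())


def translit_then_ignored(title, ignored):
    # Ordering before the change: ignored words see the ASCII text.
    text = remove_ignored_words(transliterate(title), ignored)
    return "-".join(text.lower().split())


title = "über die Brücke"
print(ignored_then_translit(title, ["die"]))    # ueber-bruecke
print(translit_then_ignored(title, ["die"]))    # ueber-bruecke
print(ignored_then_translit(title, ["ueber"]))  # ueber-die-bruecke
print(translit_then_ignored(title, ["ueber"]))  # die-bruecke
```

Only the transliterated ignored word "ueber" behaves differently between the two orderings, which matches the edge case described above.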
Bottom line
The change is safe for realistic usage. \b handles punctuation boundaries correctly, which is the whole point of the fix. No default ignored word triggers the English contraction edge case.
Comment #5
mably commented
Comment #7
anybody
LGTM, tests are green, and logically this also makes a lot of sense to me!