Problem/Motivation

On French-language sites, product titles may contain punctuation inside words, apostrophes in particular.
Punctuation appears to be processed before the configured strings are removed, so a title such as "pipe d'admission" becomes "pipe-dadmission".
There does not seem to be a way to reorder the processing so that string removal runs before punctuation handling.

Steps to reproduce

Create products whose titles contain strings such as "pipe d'admission", add "d" to the strings to remove, and run the automatic generation of URL aliases.

Comments

alexdezark created an issue. See original summary.

mably’s picture

Version: 8.x-1.11 » 8.x-1.x-dev
Category: Support request » Feature request

Claude analysis of #3311669

The issue is caused by the processing order in AliasCleaner::cleanString(). Currently the steps are:

  1. Remove HTML tags
  2. Replace/remove punctuation
  3. Transliterate
  4. Reduce to letters/numbers
  5. Remove ignored words

For "pipe d'admission" with apostrophe set to "Remove" and "d" in the ignored words list:

  • Step 2 removes the apostrophe → "pipe dadmission"
  • Step 5 can't find "d" as a standalone word anymore → result is "pipe-dadmission"

The expected result is "pipe-admission": the ignored word "d" should be removed before the apostrophe merges it with "admission".
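The failure can be reproduced outside Drupal with a minimal Python sketch of the current order. This is not the Pathauto PHP code: the function name `clean_current_order` is hypothetical, HTML stripping and transliteration are omitted, and the apostrophe is assumed to be set to "Remove".

```python
import re

def clean_current_order(text, ignored_words, separator="-"):
    """Simplified model of the current order: punctuation first, ignored words last."""
    # Step 2: apostrophe configured as "Remove".
    text = text.replace("'", "")
    # Step 4: anything that is not an ASCII letter or digit becomes the separator.
    text = re.sub(r"[^a-zA-Z0-9]+", separator, text)
    # Step 5: remove ignored words, matched as standalone words.
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse repeated separators and trim them from both ends.
    text = re.sub(re.escape(separator) + r"+", separator, text)
    return text.strip(separator)

print(clean_current_order("pipe d'admission", ["d"]))  # pipe-dadmission
```

By the time the ignored-words step runs, "d" is no longer a standalone word, so the pattern has nothing to match.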

Proposed fix

Moving the ignored words removal before punctuation replacement would fix this specific case, since \b matches the word boundary between "d" and the apostrophe. However, changing the order globally could have unintended side effects for other languages and configurations.

A safer approach would be to split the punctuation step in two: first replace punctuation characters configured as "Remove" or "Replace by separator" with spaces (preserving word boundaries), then remove ignored words, then clean up separators. This way "d'admission" becomes "d admission" before the ignored words step, "d" is correctly matched and removed, and the final result is "pipe-admission".
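A rough Python sketch of the split approach, under the same simplifying assumptions (no HTML stripping, no transliteration, only the apostrophe treated as punctuation). The function name `clean_split_punctuation` is hypothetical and is not part of Pathauto.

```python
import re

def clean_split_punctuation(text, ignored_words, separator="-"):
    """Simplified model: punctuation -> spaces, then ignored words, then separators."""
    # Phase 1: turn apostrophes into spaces so word boundaries are preserved.
    text = text.replace("'", " ")
    # Phase 2: remove ignored words, matched as standalone words.
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Phase 3: collapse every run of non-alphanumerics into one separator.
    text = re.sub(r"[^a-zA-Z0-9]+", separator, text)
    return text.strip(separator)

print(clean_split_punctuation("pipe d'admission", ["d"]))  # pipe-admission
```

One consequence of this simplified version is that an apostrophe set to "Remove" ends up acting as a separator ("it's" gives "it-s" rather than "its"), so a real implementation would presumably need to remember which spaces came from removed punctuation.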

mably’s picture

Impact analysis of the current MR, which processes ignored words before punctuation.

The change moves ignored words removal before punctuation replacement in cleanString(). This works because \b (word boundary) already matches between a word character and a punctuation character, so the ignored words regex handles raw text correctly.

Romance languages (French, Italian, etc.) — the fix target

\bd\b correctly matches "d" in "d'admission" because the apostrophe is a non-word character, creating a word boundary. Same for l', c', s', n', j' in French and l', dell' in Italian.
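The word-boundary behavior can be checked in isolation. The snippet below uses Python's `re` module rather than PHP's PCRE, which treats these ASCII cases the same way. The ignored word "dell" is an illustrative assumption.

```python
import re

# \bd\b: the letter "d" with a word boundary on both sides.
ignored_d = re.compile(r"\bd\b", re.IGNORECASE)

# The apostrophe is a non-word character, so "d" stands alone here.
print(ignored_d.sub("", "pipe d'admission"))  # pipe 'admission

# Once the apostrophe is gone, "d" is part of "dadmission": no match.
print(ignored_d.sub("", "pipe dadmission"))   # pipe dadmission

# Italian elision with a hypothetical ignored word "dell".
ignored_dell = re.compile(r"\bdell\b", re.IGNORECASE)
print(ignored_dell.sub("", "fonte dell'acqua"))  # fonte 'acqua
```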

Potential concern: English contractions

If a user adds "it" to the ignored words list and the title is "it's complicated":

  • Before: apostrophe removed → "its complicated" → \bit\b doesn't match "its" → "its-complicated"
  • After: \bit\b matches "it" in "it's" (apostrophe is a word boundary) → "'s complicated" → "s-complicated"

However this is unlikely to be a real problem because the default ignored words (a, an, as, at, before, but, by, for, from, is, in, into, like, of, off, on, onto, per, since, than, the, this, that, to, up, via, with) are not typical English contraction prefixes. You don't see "the'll" or "from's".
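The contraction case above can be compared side by side with a small Python sketch. The `clean` function and its `ignored_first` flag are hypothetical. They model only the relative order of the two steps, with the apostrophe set to "Remove", and are not the Pathauto implementation.

```python
import re

def clean(text, ignored_words, ignored_first, separator="-"):
    """Toy pipeline: ignored_first=True models the MR, False the old order."""
    pattern = r"\b(?:" + "|".join(map(re.escape, ignored_words)) + r")\b"
    if ignored_first:
        # MR order: ignored words are matched against the raw text.
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Apostrophe removed, then other non-alphanumerics become the separator.
    text = text.replace("'", "")
    text = re.sub(r"[^a-zA-Z0-9]+", separator, text)
    if not ignored_first:
        # Old order: ignored words are matched after punctuation handling.
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(re.escape(separator) + r"+", separator, text)
    return text.strip(separator)

title = "it's complicated"
print(clean(title, ["it"], ignored_first=False))  # its-complicated
print(clean(title, ["it"], ignored_first=True))   # s-complicated
```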

Potential concern: transliteration ordering

Ignored words now run before transliteration, so they match against the original (possibly accented) text. Example: German "über die Brücke" with "die" as ignored → \bdie\b still matches fine because "die" is ASCII and surrounded by spaces. The result is the same.

The only edge case would be someone adding a transliterated form as an ignored word (e.g. "ueber" to catch transliterated "über"). This is a very unusual configuration — people type ignored words as they appear in their content.
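The ordering effect can be illustrated with a toy transliteration table in Python. The mapping of "ü" to "ue" and the helper `remove_words` are assumptions for the example; the module's actual transliteration depends on the site's language settings.

```python
import re

# Hypothetical, minimal transliteration table for the example.
TRANSLIT = str.maketrans({"ü": "ue", "Ü": "Ue"})

def remove_words(text, words):
    """Remove the given words where they appear as standalone words."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

title = "über die Brücke"

# ASCII ignored word: the same result in either order.
before = remove_words(title, ["die"]).translate(TRANSLIT)
after = remove_words(title.translate(TRANSLIT), ["die"])
print(before == after)  # True

# Transliterated ignored word: only matches if transliteration runs first.
print(remove_words(title, ["ueber"]).translate(TRANSLIT))  # ueber die Bruecke
print(remove_words(title.translate(TRANSLIT), ["ueber"]))  # " die Bruecke" (leading space)
```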

Bottom line

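The change is safe for most realistic usage. \b handles punctuation boundaries correctly, which is the whole point of the fix. Of the default ignored words, only "that" and "this" commonly take a contraction suffix ("that's", "this'll"), so titles containing those forms are the main case worth checking.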

mably’s picture

Status: Active » Needs review

anybody made their first commit to this issue’s fork.

anybody’s picture

Status: Needs review » Reviewed & tested by the community

LGTM, tests are green and logically this also makes a lot of sense to me!