Motivation

A URL path alias should be something that you can easily type on a keyboard. There are many characters that we don't want users to enter in URL path aliases, because they are either hard to type or hard to see.

The Unicode General Punctuation block (http://graphemica.com/blocks/general-punctuation) contains many of these characters. With the exception of the hyphen, they do not overlap with existing processing, and the block includes problematic characters such as zero-width spaces and RTL markers, which are hard to identify visually but can fundamentally alter a string.

Removing these characters will not only benefit users, but may also prevent security issues (such as homograph attacks using invisible characters).
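To illustrate the problem, here is a minimal sketch in Python (not Pathauto's PHP, and not the attached patch). It assumes the block spans U+2000 to U+206F; the function name `strip_general_punctuation` is hypothetical. It shows two aliases that render the same but compare as different because one contains a zero-width space, and that stripping the block makes them equal again.

```python
import re

# Assumption: the General Punctuation block covers U+2000 through U+206F.
GENERAL_PUNCTUATION = re.compile("[\u2000-\u206f]")


def strip_general_punctuation(alias: str) -> str:
    """Remove every character that falls in the General Punctuation block."""
    return GENERAL_PUNCTUATION.sub("", alias)


clean = "my-page"
spoofed = "my\u200b-page"  # contains an invisible zero-width space (U+200B)

# The two strings look alike on screen but are different strings.
print(clean == spoofed)
# After stripping, the invisible character is gone and they match.
print(strip_general_punctuation(spoofed) == clean)
```

The ASCII hyphen-minus (U+002D) lies outside this range, so ordinary hyphenated aliases pass through this sketch unchanged.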

Implementation

I've implemented this in the same way as the alphanumeric filter, by adding an optional checkbox to the settings page. The setting has no effect on the generated path unless it is explicitly checked, so it should not cause any backwards-compatibility issues.
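The opt-in behaviour can be sketched roughly as follows. This is an illustrative Python model, not the module's PHP code: `AliasSettings`, its `strip_general_punctuation` field, and `clean_alias` are hypothetical names, and the U+2000 to U+206F range is an assumption about the block's extent.

```python
import re
from dataclasses import dataclass

# Assumption: the General Punctuation block covers U+2000 through U+206F.
GENERAL_PUNCTUATION = re.compile("[\u2000-\u206f]")


@dataclass
class AliasSettings:
    # Hypothetical setting; off by default so existing sites are unaffected.
    strip_general_punctuation: bool = False


def clean_alias(alias: str, settings: AliasSettings) -> str:
    """Apply the optional filter only when the setting is enabled."""
    if settings.strip_general_punctuation:
        alias = GENERAL_PUNCTUATION.sub("", alias)
    return alias


alias = "my\u200bpage"  # contains a zero-width space (U+200B)
unchanged = clean_alias(alias, AliasSettings())
stripped = clean_alias(alias, AliasSettings(strip_general_punctuation=True))
```

With the default settings the alias comes back untouched, which is the backwards-compatibility property described above; only the explicitly enabled setting changes the output.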

Suitability

The other potential home for this feature is the Transliteration module, but it seems more relevant to Pathauto, as it deals with characters that wouldn't normally be transliterated and that specifically need to be removed rather than transformed.

I'm aware that you could also strip these characters with the alphanumeric filter, but that may be overzealous for some use cases, and I don't believe the use cases for the two overlap.

Related issues

There are a number of issues that this would address:

Potential extensions

In the future this could also be extended to exclude other troublesome characters, like the initial set from the Latin-1 Supplement block (http://graphemica.com/blocks/latin-1-supplement). This would fix issues like #2308909: How to edit ® in pathauto? and #2682721: Is possible to add « and » to the Punctuation in the settings.


Comments

ben.kyriakou created an issue. See original summary.

ben.kyriakou’s picture

Patch attached. This contains the setting and an accompanying test extension. I've also added some code for generating UTF-8 characters that I lifted from Stack Overflow, as I don't believe there's anything in Drupal to help with this and I didn't want to hard-code a huge list of characters.
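For readers without the patch to hand, the idea of generating the characters from code points rather than hard-coding them can be sketched like this. It is an illustration in Python, not the code from the patch (which is PHP); the range U+2000 to U+206F and the names `BLOCK_START`, `BLOCK_END`, and `block_chars` are assumptions made for the example.

```python
# Assumption: the General Punctuation block covers U+2000 through U+206F.
BLOCK_START = 0x2000
BLOCK_END = 0x206F

# Build each character from its code point instead of listing them by hand.
block_chars = [chr(cp) for cp in range(BLOCK_START, BLOCK_END + 1)]

# Each character in this range encodes to three bytes in UTF-8.
first_bytes = block_chars[0].encode("utf-8")
last_bytes = block_chars[-1].encode("utf-8")

print(len(block_chars))
print(first_bytes.hex(), last_bytes.hex())
```

A generated list like this can then be used both by the filter and by its tests, so the two cannot drift apart.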

Status: Needs review » Needs work

The last submitted patch, 2: pathauto-strip_punctuation-2759949-1.patch, failed testing.
