By default, Pathauto partially includes HTML-encoded characters in the URL, such as &amp; for & or &#039; for '. Only partially, however, because Pathauto strips them of the prefix & and suffix ; and treats the remaining characters as a word. This seems odd to me.

Also, only directional quotation marks (‘ ’ “ ”) are affected by the 'remove/replace quotation marks' option, because standard quotation marks are converted to HTML characters upon entry.

As an example, the title Monkey "Borrows" Bloke's Burger & Eats It becomes the less obvious monkey-quot-borrows-quot-bloke-039-s-burger-amp-eats-it.

My own solution has been to insert the following line as the first action of pathauto_cleanstring() in pathauto.inc -- a regex which should catch all named and numbered HTML-encoded characters:
$string = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);

This produces: monkey-borrows-blokes-burger-eats-it.
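
To make the before/after concrete, here is a minimal standalone PHP sketch. The sketch_slug() helper is a hypothetical stand-in for the rest of pathauto_cleanstring() (it only lowercases and collapses non-alphanumerics into hyphens), and htmlspecialchars() is assumed to approximate the encoding applied on entry. Only the preg_replace() line comes from the post above.

```php
<?php
// Hypothetical stand-in for the rest of pathauto_cleanstring():
// lowercase, then turn every run of non-alphanumerics into one hyphen.
function sketch_slug(string $string): string {
  $string = strtolower($string);
  $string = preg_replace('/[^a-z0-9]+/', '-', $string);
  return trim($string, '-');
}

// The title as it looks once quotes and ampersands are HTML-encoded.
$encoded = htmlspecialchars('Monkey "Borrows" Bloke\'s Burger & Eats It', ENT_QUOTES, 'UTF-8');

// Without the extra line, the entity names leak into the alias.
$before = sketch_slug($encoded);

// With the proposed line run first, the entities are dropped entirely.
$after = sketch_slug(preg_replace('/&[a-zA-Z0-9#]+?;/', '', $encoded));

echo $before, "\n", $after, "\n";
```

The first line printed should match the "less obvious" alias quoted above, and the second the cleaned-up one.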

If implementing this is out of the question, my request is to include this as a togglable feature in the Pathauto admin section.

Comments

greggles’s picture

That seems like a fine idea to me to make it toggleable.

Any chance you can provide this as a patch?

meatbites’s picture

Version: 5.x-2.0-beta2 » 5.x-2.x-dev
Assigned: Unassigned » meatbites

Sure, I'll try my best. It'll be my first Drupal patch, so bear with me.

In a bid to allow special characters for various languages, how about I make three options? They'd be 'Allow all,' 'Only remove punctuation' (a set list including quot, apos, amp, lt, gt, 039, etc.), and 'Remove all' (anything that looks like &name; or &#123...;). I believe the second option will still let the i18n-ascii.txt parser do its job. I presume 'Allow all' should be the default, at least to begin with, so any bugs can be ironed out.
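
A rough sketch of how those three modes might look in PHP. The function name, the mode strings, and the exact punctuation list are assumptions made for illustration; none of them are Pathauto settings or taken from the attached patch.

```php
<?php
// Hypothetical helper illustrating the three proposed modes; the function
// name and mode strings are assumptions, not Pathauto settings.
function sketch_strip_entities(string $string, string $mode = 'allow_all'): string {
  switch ($mode) {
    case 'punctuation':
      // Fixed list of punctuation entities, following the proposal.
      return preg_replace('/&(?:quot|apos|amp|lt|gt|#039);/', '', $string);

    case 'remove_all':
      // Anything shaped like a named or numbered entity.
      return preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);

    default:
      // 'allow_all': leave the string untouched.
      return $string;
  }
}

$title = 'Caf&eacute; &amp; Bar';
$punctuation_only = sketch_strip_entities($title, 'punctuation');
$everything = sketch_strip_entities($title, 'remove_all');

echo $punctuation_only, "\n", $everything, "\n";
```

In 'punctuation' mode the &eacute; survives, so a later transliteration step could still turn it into a plain e; in 'remove_all' mode it is gone before that step ever sees it.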

Well, let's see how I go. :o)

meatbites’s picture

Status: Active » Needs review
FileSize
4.43 KB

Here she is. It all appears to be working perfectly at my end using the latest dev build.

willdashwood’s picture

I believe I've got the same problem, where pound characters are left in the path. This means something like "£39.99 per year or £3.99 per month" becomes "%C2%A339.99-year-or-%C2%A33.99-month".

I tried adding $string = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string); as suggested, but that didn't seem to work for me. I'm hopeless at regular expressions. Do you know what I can add or change to make it filter out pound signs, and dollar signs for that matter?
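
For what it's worth, the %C2%A3 in that alias is the URL-encoded UTF-8 byte sequence for a literal £, not an HTML entity, which would explain why a pattern that only matches &...; sequences leaves it alone. A minimal sketch, assuming the input is valid UTF-8 and that dropping every currency symbol is acceptable. It uses the Unicode "currency symbol" class \p{Sc}, which is a PCRE feature and not something Pathauto is known to ship.

```php
<?php
$title = "\u{00A3}39.99 per year or \u{00A3}3.99 per month";

// The entity regex finds nothing to remove: there is no &...; sequence here.
$unchanged = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $title);

// \p{Sc} matches any Unicode currency symbol (pound, dollar, euro, ...);
// the /u modifier makes PCRE treat the pattern and subject as UTF-8.
$stripped = preg_replace('/\p{Sc}/u', '', $title);

echo $stripped, "\n";
```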

Many thanks!

suzanne.aldrich’s picture

Any update or patch to restore pathauto to its original behavior with punctuation, or to allow a toggleable option to remove all punctuation?

I think that any new features introduced into pathauto should have been made toggleable in the first place. It wasn't nice to sneak this in without a warning; now I have to edit a bunch of aliases.

Pathauto is a cool module and I'm glad it's being so well maintained; any chance it will become part of core?

greggles’s picture

The new features are fundamental, not simple toggles. Token module is the source of the major problems we have and I can't make that a toggle. If the problem is so painful for you, please downgrade to 5.x-1.x

If you wish to see improved functionality/robustness of pathauto the solution is to write code yourself, make the issue queue easier to handle so that I can focus on code, or sponsor more time for someone to work on the code.

AFAIK, pathauto isn't destined for core.

I have a fix for this problem but it has to wait for private reviews which I hope to get by today or tomorrow.

suzanne.aldrich’s picture

Well for being a dev version, it seems pretty stable for me, and Tokens work great, too. This is saying a lot on my installation, since I use pgsql. So I think I'll wait around for the fix because it seems like you're close and my site is not that important (except to me, of course)

trueMarketing’s picture

Category: feature » bug

We're running Drupal 5.2 and installed the latest Pathauto stable release - pathauto-5.x-2.0-beta2.tar.

After having to install the Token module to get Pathauto working, we noticed that URLs are not being stripped of characters like - ?, ;, $, etc.

From the looks of this thread, we are going to have to downgrade to Drupal 5.1 and install pathauto again?

Bummer.

greggles’s picture

Status: Needs review » Closed (duplicate)

See http://drupal.org/node/143831 for more information about this specific issue and the fix.

It's taking longer to get the go-ahead than I had hoped...

agilpwc’s picture

FileSize
576 bytes

This patch fixes the non removal of punctuation for me.

You just need to move a '}' a few lines up in pathauto.inc in the latest dev. Specifically from line 104 up to line 100. Of course that patch is attached.

That way the // Preserve alphanumerics, everything else becomes a separator. // code gets executed if you are not using transliterate.

ceejayoz’s picture

Subscribing.

greggles’s picture

A year later and the module has changed quite a bit. If you have this problem, it's because you aren't using the raw tokens. I know it is confusing which ones to use, but that's the state of the token module right now: we have some suboptimal situations due to the many users of the module...

To counteract that I've added checking and advice within the pathauto admin/settings page, but this is largely just even more confusing.

szy’s picture

Version: 5.x-2.x-dev » 7.x-1.x-dev
Assigned: meatbites » Unassigned
Status: Closed (duplicate) » Active

@Greggles, thanks for great work, first.

Is this the correct behaviour for Pathauto?

CCK text field with "Filtered text" option and value <b>Simple title</b> of a&nbsp;node
is 'pathautoed' like this: 'bsimple-titleb-of-anbspnode', instead of 'simple-title-of-a-node'.

Is it the same thing you were talking about a year ago? Shouldn't it be stripped of HTML tags
and entities, as it is a 'filtered text' field?

Szy.

greggles’s picture

Status: Active » Postponed (maintainer needs more info)

@szy - what is your pattern for this alias?

szy’s picture

Both are wrong:

a/[nid]/[field_my_cck_field-raw] --> /a/123456/bsimple-titleb-of-anbspnode

a/[nid]/[field_my_cck_field-formatted] --> /a/123456/pbsimple-titleb-of-anbspnodep

Field value is <b>Simple title</b> of a&nbsp;node. Tag <b> has been added
to 'Filtered HTML' filter's allowed tags.

The second field (and the last one) of this node comes from CCK Redirection
- this node is an advertising text link (if it matters).

Drupal 6.8
Content Construction Kit (CCK) 6.x-2.x-dev (2008-dec.-09)
Pathauto 6.x-2.x-dev (2008-dec.-14)

Thanks,
Szy.

greggles’s picture

This is a somewhat complex issue. In general, Pathauto is not designed to work with fields that use formatted text (as you've found). I'm not sure about the best way to solve it.

Is it possible to make that text field not use HTML formatting?

I guess the question is, what do we *want* this to do? And is that something we can do universally and have it actually work properly in all cases.

szy’s picture

Status: Postponed (maintainer needs more info) » Closed (works as designed)

I can agree with you, 'somehow' ;), but on the other hand I still think that it would be
useful to have the possibility to filter HTML elements in an HTML-formatted field.

As I said, it is an advertising link. I want it to look like this...

Drupal
is a registered
trademark
of Dries Buytaert

... to be sure it doesn't look like this in e.g. narrow div:

Drupal is a
registered
trademark of
Dries Buytaert

As you see, I need HTML formatting here, so I will leave this field as it is and I'll add
another 'clear' text field just to have a clean URL. Things are going to be duplicated,
but I see I have no choice :]

Thanks,
Szy.

greggles’s picture

Category: bug » feature
Status: Closed (works as designed) » Postponed

I'm not sure it's a "no choice" situation but definitely a postponed feature request.

Dave Reid’s picture

Title: Optional HTML-encoded characters » Strip HTML tags from raw tokens

Should CCK be providing un-HTML tokens? Should we be running strip_tags?

Dave Reid’s picture

Version: 7.x-1.x-dev » 6.x-1.x-dev

DamienMcKenna’s picture

Bouncing the idea around here, an awkward consensus is that it kinda comes down to using a combination of:

  • allow the &, >, < and semicolon characters,
  • run html_entity_decode to convert all of the HTML entities to their true character value,
  • run strip_tags on the string to strip out any nested tags,
  • let transliteration happen last to clean up any entities that remain.
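
The steps in this list could be sketched roughly as follows in plain PHP. This is only an illustration under stated assumptions: the helper name is hypothetical, and the final lowercase-and-hyphenate step is a crude stand-in for Pathauto's real transliteration and separator handling.

```php
<?php
// Hypothetical helper: the name and the final slug step are assumptions.
function sketch_clean_token_value(string $value): string {
  // 1. Convert entities such as &amp; or &nbsp; to their real characters.
  $value = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
  // 2. Drop any HTML tags, including ones that were entity-encoded.
  $value = strip_tags($value);
  // 3. Stand-in for transliteration and separator handling: lowercase,
  //    then collapse every run of non-alphanumeric bytes into one hyphen.
  $value = preg_replace('/[^a-z0-9]+/', '-', strtolower($value));
  return trim($value, '-');
}

$alias_part = sketch_clean_token_value('<b>Simple title</b> of a&nbsp;node');
echo $alias_part, "\n";
```

Fed the field value from earlier in the thread, this should give simple-title-of-a-node, the alias szy was expecting.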

I haven't dug into the code yet to see if this is even feasible; it's kind of an awkward use case at the best of times.

Bartezz’s picture

subscribing

Daniel Wentsch’s picture

Subscribing.

Dave Reid’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Assigned: Unassigned » Dave Reid
Status: Postponed » Active

Just ran into this again at work, where node titles had tags in them. I think it makes sense to move forward with this for D7, especially since we can explicitly say that we want un-sanitized versions of tokens, so the raw tags should be there to remove.

Dave Reid’s picture

Status: Active » Needs review
FileSize
1.34 KB

Dave Reid’s picture

Confirmed all Pathauto tests passed locally, so this gets a green from the 'DaveReidAutomaticTestBot'.

greggles’s picture

Indeed this makes sense. Tokens for path aliases should not have html tags in them.

Dave Reid’s picture

Version: 7.x-1.x-dev » 6.x-2.x-dev
Status: Needs review » Patch (to be ported)

Committed #25 to Git: http://drupalcode.org/project/pathauto.git/commit/882d290

FYI I didn't even bother to make this a variable setting as it's not something anyone in their right mind would turn off, nor is there a valid use case.

Dave Reid’s picture

Version: 6.x-2.x-dev » 6.x-1.x-dev

Dave Reid’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Status: Patch (to be ported) » Fixed

szy’s picture

Great news, thanks!

Szy.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Dave Reid’s picture

Status: Closed (fixed) » Active

Reopening for a follow-up: this also needs to use decode_entities() to handle anything that was run through check_plain().
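
A small standalone illustration of why the order matters. Here htmlspecialchars() and html_entity_decode() are stand-ins for Drupal's check_plain() and decode_entities(), which are not available outside Drupal; the variable names are made up for the example, and this is not the committed patch.

```php
<?php
// What check_plain()-style escaping does to a title containing a tag.
$escaped = htmlspecialchars('<b>Title</b> & more', ENT_QUOTES, 'UTF-8');

// strip_tags() alone sees no real tags, so the escaped text is left as-is.
$tags_only = strip_tags($escaped);

// Decoding first restores the real characters, after which the tags go.
$decoded_then_stripped = strip_tags(html_entity_decode($escaped, ENT_QUOTES, 'UTF-8'));

echo $tags_only, "\n", $decoded_then_stripped, "\n";
```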

Dave Reid’s picture

Status: Active » Needs review
FileSize
1.03 KB

Dave Reid’s picture

Status: Needs review » Closed (fixed)