By default, Pathauto partially includes HTML-encoded characters in the URL, such as &amp; for & or &#039; for '. Only partially, however, because Pathauto strips them of the prefix & and suffix ; and treats the remaining characters as a word. This seems odd to me.
Also, only directional quotation marks (‘ ’ “ ”) are affected by the 'remove/replace quotation marks' option, because standard quotation marks are converted to HTML characters upon entry.
As an example, the title Monkey "Borrows" Bloke's Burger & Eats It becomes the less obvious monkey-quot-borrows-quot-bloke-039-s-burger-amp-eats-it.
My own solution has been to insert the following line as the first action of pathauto_cleanstring() in pathauto.inc -- a regex which should catch all named and numbered HTML-encoded characters:
$string = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);
This produces: monkey-borrows-blokes-burger-eats-it.
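The effect of that line can be reproduced outside Drupal with a minimal sketch. Everything here is an assumption made for illustration: sketch_slug() is a hypothetical, much simplified stand-in for pathauto_cleanstring(), and htmlspecialchars() is used to imitate the entity encoding the title has received by the time Pathauto sees it.

```php
<?php
// Hypothetical stand-in for Pathauto's cleaning step: lowercase the string and
// turn every run of non-alphanumeric characters into a single hyphen.
function sketch_slug(string $string): string {
  $string = strtolower($string);
  $string = preg_replace('/[^a-z0-9]+/', '-', $string);
  return trim($string, '-');
}

// The regex from the post: drop named and numbered HTML entities.
function sketch_strip_entities(string $string): string {
  return preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);
}

$title = 'Monkey "Borrows" Bloke\'s Burger & Eats It';
// Assumption: the title arrives entity-encoded (&quot;, &#039;, &amp;).
$encoded = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// Without the extra line, the entity names survive as words.
echo sketch_slug($encoded), "\n";
// → monkey-quot-borrows-quot-bloke-039-s-burger-amp-eats-it

// With the extra line run first, they are gone.
echo sketch_slug(sketch_strip_entities($encoded)), "\n";
// → monkey-borrows-blokes-burger-eats-it
```

Note that the regex deletes the entity outright, so an encoded letter such as &eacute; would be dropped along with the punctuation.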
If implementing this is out of the question, my request is to include this as a togglable feature in the Pathauto admin section.
Comment | File | Size | Author |
---|---|---|---|
#34 | 167786-fix-check-plain-token-values.patch | 1.03 KB | Dave Reid |
#25 | 167786-pathauto-strip-tags.patch | 1.34 KB | Dave Reid |
#10 | pathauto.inc__0.patch | 576 bytes | agilpwc |
#3 | pathauto_encoded_characters.patch | 4.43 KB | meatbites |
Comments
Comment #1
greggles: Making it toggleable seems like a fine idea to me.
Any chance you can provide this as a patch?
Comment #2
meatbites: Sure, I'll try my best. It'll be my first Drupal patch, so bear with me.
In a bid to allow special characters for various languages, how about I make three options? They'd be 'Allow all', 'Only remove punctuation' (a set list including quot, apos, amp, lt, gt, 039, etc.), and 'Remove all' (anything that looks like &name; or &#123...;). I believe the second option will still let the i18n-ascii.txt parser do its job. I presume 'Allow all' should be the default, at least to begin with, so any bugs can be ironed out.
Well, let's see how I go. :o)
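As a rough illustration of the three proposed options, here is a sketch under stated assumptions. The constant names and the function sketch_filter_entities() are hypothetical and are not taken from the attached patch; the punctuation list is only the one named in the comment.

```php
<?php
// Hypothetical mode names mirroring the three proposed options.
const ENTITIES_ALLOW_ALL = 0;
const ENTITIES_REMOVE_PUNCTUATION = 1;
const ENTITIES_REMOVE_ALL = 2;

function sketch_filter_entities(string $string, int $mode): string {
  switch ($mode) {
    case ENTITIES_REMOVE_PUNCTUATION:
      // Only the set list from the comment: quot, apos, amp, lt, gt and 039
      // (written here so that &#39; matches as well as &#039;).
      return preg_replace('/&(quot|apos|amp|lt|gt|#0*39);/', '', $string);

    case ENTITIES_REMOVE_ALL:
      // Anything that looks like a named or numbered entity.
      return preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);

    default:
      // 'Allow all': leave the string untouched.
      return $string;
  }
}

$sample = 'Caf&eacute; &amp; Bar';
echo sketch_filter_entities($sample, ENTITIES_REMOVE_PUNCTUATION), "\n";
echo sketch_filter_entities($sample, ENTITIES_REMOVE_ALL), "\n";
```

In the middle mode &eacute; is left in place, which is the property that would let a later transliteration step still see the accented letter.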
Comment #3
meatbites: Here she is. It all appears to be working perfectly at my end using the latest dev build.
Comment #4
willdashwood: I believe I've got the same problem, where pound characters are left in the path. This means something like "£39.99 per year or £3.99 per month" becomes "%C2%A339.99-year-or-%C2%A33.99-month".
I tried adding
$string = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $string);
as suggested, but that didn't seem to work for me. I'm hopeless at regular expressions; do you know what I can add or change to make it filter out pound signs, and dollar signs for that matter? Many thanks!
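One likely explanation, offered as an assumption rather than a confirmed diagnosis: %C2%A3 is the percent-encoded UTF-8 byte sequence of a literal pound sign, not an HTML entity such as &pound;, so the entity regex finds nothing to match. A minimal sketch of a separate filter follows. The function name sketch_strip_currency() is hypothetical, and \p{Sc} is the PCRE Unicode class for currency symbols, which needs the /u modifier.

```php
<?php
// "\xC2\xA3" is the UTF-8 byte sequence for the pound sign.
$title = "\xC2\xA339.99 per year or \xC2\xA33.99 per month";

// The entity regex leaves the title unchanged: it contains no &...; sequence.
$after_entity_regex = preg_replace('/&[a-zA-Z0-9#]+?;/', '', $title);

// Hypothetical helper: remove every Unicode currency symbol (pound, dollar, euro, ...).
function sketch_strip_currency(string $string): string {
  return preg_replace('/\p{Sc}/u', '', $string);
}

echo sketch_strip_currency($title), "\n";
// → 39.99 per year or 3.99 per month
echo sketch_strip_currency('$5 off'), "\n";
// → 5 off
```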
Comment #5
suzanne.aldrich: Is there any update or patch to restore Pathauto's original behavior with punctuation, or to add a toggleable option that removes all punctuation?
I think that any new features introduced into pathauto should have been made toggleable in the first place. It wasn't nice to sneak this in without a warning; now I have to edit a bunch of aliases.
Pathauto is a cool module and I'm glad it's being so well maintained; any chance it will become part of core?
Comment #6
greggles: The new features are fundamental, not simple toggles. Token module is the source of the major problems we have and I can't make that a toggle. If the problem is so painful for you, please downgrade to 5.x-1.x.
If you wish to see improved functionality/robustness of pathauto the solution is to write code yourself, make the issue queue easier to handle so that I can focus on code, or sponsor more time for someone to work on the code.
AFAIK, pathauto isn't destined for core.
I have a fix for this problem but it has to wait for private reviews which I hope to get by today or tomorrow.
Comment #7
suzanne.aldrich: Well, for a dev version it seems pretty stable to me, and Tokens work great, too. That is saying a lot on my installation, since I use pgsql. So I think I'll wait around for the fix, because it seems like you're close and my site is not that important (except to me, of course).
Comment #8
trueMarketing: We're running Drupal 5.2 and installed the latest Pathauto stable release - pathauto-5.x-2.0-beta2.tar.
After having to install the Token module to get Pathauto working, we noticed that URLs are not being stripped of characters like - ?, ;, $, etc.
From the looks of this thread it looks like we are going to have to downgrade to Drupal 5.1 and install pathauto again?
Bummer.
Comment #9
greggles: See http://drupal.org/node/143831 for more information about this specific issue and the fix.
It's taking longer to get the go-ahead than I had hoped...
Comment #10
agilpwc: This patch fixes the non-removal of punctuation for me.
You just need to move a '}' a few lines up in pathauto.inc in the latest dev, specifically from line 104 up to line 100. The patch is attached.
That way the "// Preserve alphanumerics, everything else becomes a separator." code gets executed if you are not using transliteration.
Comment #11
ceejayoz: Subscribing.
Comment #12
greggles: A year later and the module has changed quite a bit. If you have this problem, it's because you aren't using the raw tokens. I know it is confusing which ones to use, but that's the state of the Token module right now: we have some suboptimal situations due to the module's many users...
To counteract that I've added checking and advice within the pathauto admin/settings page, but this is largely just even more confusing.
Comment #13
szy: @Greggles, first of all, thanks for the great work.
Is this the correct behaviour of Pathauto?
A CCK text field with the "Filtered text" option and the value
<b>Simple title</b> of a&nbsp;node
is 'pathautoed' like this: 'bsimple-titleb-of-anbspnode', instead of 'simple-title-of-a-node'.
Is this the same thing you were talking about a year ago? Shouldn't the value be stripped of HTML tags and entities, as it is a 'filtered text' field?
Szy.
Szy.
Comment #14
greggles: @szy - what is your pattern for this alias?
Comment #15
szy: Both are wrong:
a/[nid]/[field_my_cck_field-raw] --> /a/123456/bsimple-titleb-of-anbspnode
a/[nid]/[field_my_cck_field-formatted] --> /a/123456/pbsimple-titleb-of-anbspnodep
Field value is <b>Simple title</b> of a&nbsp;node. The <b> tag has been added to the 'Filtered HTML' filter's allowed tags.
The second field (and the last one) of this node comes from CCK Redirection
- this node is an advertising text link (if it matters).
Drupal 6.8
Content Construction Kit (CCK) 6.x-2.x-dev (2008-dec.-09)
Pathauto 6.x-2.x-dev (2008-dec.-14)
Thanks,
Szy.
Comment #16
greggles: This is a somewhat complex issue. In general, Pathauto is not designed to work with fields that use formatted text (as you've found). I'm not sure about the best way to solve it.
Is it possible to make that text field not use HTML formatting?
I guess the question is, what do we *want* this to do? And is that something we can do universally and have it actually work properly in all cases?
Comment #17
szy: I can agree with you, 'somehow' ;), but on the other hand I still think it would be
useful to be able to filter HTML elements out of an HTML-formatted field.
As I said, it is an advertising link. I want it to look like this...
... to be sure it doesn't look like this in e.g. narrow div:
As you see I need HTML formatting here, so I will leave this field as it is and I'll add
another 'clear' text field just to have clear URL. Things are going to be duplicated,
but I see I have no choice :]
Thanks,
Szy.
Comment #18
greggles: I'm not sure it's a "no choice" situation, but it is definitely a postponed feature request.
Comment #19
Dave Reid: Should CCK be providing un-HTML tokens? Should we be running strip_tags?
Comment #20
Dave Reid
Comment #21
DamienMcKenna: Bouncing the idea around here, an awkward consensus is that it kinda comes down to using a combination of:
I haven't dug into the code yet to see if this is even feasible, it's kind of an awkward use case at the best of times.
Comment #22
Bartezz: Subscribing.
Comment #23
Daniel Wentsch: Subscribing.
Comment #24
Dave Reid: Just ran into this again at work, where node titles had tags in them. I think it makes sense to move forward with this for D7, especially since we can explicitly say that we want un-sanitized versions of tokens, so the raw tags should be there to remove.
Comment #25
Dave Reid
Comment #26
Dave Reid: Confirmed all Pathauto tests passed locally, so this gets a green from the 'DaveReidAutomaticTestBot'.
Comment #27
greggles: Indeed, this makes sense. Tokens for path aliases should not have HTML tags in them.
Comment #28
Dave Reid: Committed #25 to Git: http://drupalcode.org/project/pathauto.git/commit/882d290
FYI, I didn't even bother to make this a variable setting, as it's not something anyone in their right mind would turn off, nor is there a valid use case.
Comment #29
Dave Reid: Committed to 6.x-2.x as-is: http://drupalcode.org/project/pathauto.git/commit/3200bbc
Comment #30
Dave Reid: Committed to 6.x-1.x as-is: http://drupalcode.org/project/pathauto.git/commit/104e4fe
Comment #31
szy: Great news, thanks!
Szy.
Comment #33
Dave Reid: Reopening for a follow-up: this needs to add handling, using decode_entities(), for any values that were also run through check_plain().
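The combination discussed in this thread (strip tags, then decode entities before cleaning) can be sketched in plain PHP. This is not the committed patch: the function names are hypothetical, html_entity_decode() stands in for Drupal's decode_entities(), and the slug step is a simplified stand-in for Pathauto's own cleaning.

```php
<?php
// Hypothetical helper: remove markup first, then turn entities back into the
// characters they stand for, so they are no longer read as words.
function sketch_plain_token_value(string $value): string {
  $value = strip_tags($value);
  return html_entity_decode($value, ENT_QUOTES, 'UTF-8');
}

// Simplified stand-in for Pathauto's cleaning: lowercase, then collapse every
// run of bytes outside a-z and 0-9 into a single hyphen.
function sketch_alias_part(string $value): string {
  $value = strtolower(sketch_plain_token_value($value));
  return trim(preg_replace('/[^a-z0-9]+/', '-', $value), '-');
}

echo sketch_alias_part('<b>Simple title</b> of a&nbsp;node'), "\n";
// → simple-title-of-a-node
echo sketch_alias_part('&quot;Borrows&quot; &amp; Eats'), "\n";
// → borrows-eats
```

Decoding rather than deleting the entities means a non-breaking space or an ampersand is handled like the character it represents, here as a separator.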
Comment #34
Dave Reid
Comment #35
Dave Reid: Tested and committed #34 to all three branches.
http://drupalcode.org/project/pathauto.git/commit/c1904d0 (7.x-1.x)
http://drupalcode.org/project/pathauto.git/commit/8f4814f (6.x-2.x)
http://drupalcode.org/project/pathauto.git/commit/1c9630d (6.x-1.x)