Since PHP 4.4.0 and 5.1.0, there is a better way to create clean URLs. The function pathauto_cleanstring() could be as simple as this (assuming '-' is the separator, which is better for Google than the default underscore '_'):

<?php
function pathauto_cleanstring($string)
{
  $url = $string;
  // Substitute anything but letters, numbers and '_' with the separator.
  $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url);
  // Strip leading and trailing separators.
  $url = trim($url, "-");
  // TRANSLIT does the whole job of converting to ASCII.
  $url = iconv("utf-8", "us-ascii//TRANSLIT", $url);
  $url = strtolower($url);
  // Keep only letters, numbers, '_' and the separator.
  $url = preg_replace('~[^-a-z0-9_]+~', '', $url);
  return $url;
}
?>

The obvious advantage is that you can never forget a translation pair (for example, the 4.7 release is currently missing the š=>s conversion).

The script comes from Jakub Vrana and was originally published here: http://php.vrana.cz/vytvoreni-pratelskeho-url.php (the article is in Czech only).

Files:
Comment  File                        Size      Author
#8       pathauto_1.module           21.38 KB  greggles
#7       pa_iconv_cleanstring.patch  5.05 KB   greggles

Comments

greggles’s picture

Status: Active » Postponed

This seems great, but it will have to wait until those versions of PHP become standard, which they currently aren't: http://drupal.org/requirements

nicholasThompson’s picture

According to the docs, preg_replace has been present since PHP 3.0.9 and iconv since PHP 4.0.5. The other functions have been core since PHP 3.

greggles’s picture

@nicholasThompson - do you have a citation?

iconv was the item that didn't look fully supported to me.

nicholasThompson’s picture

iconv: http://uk.php.net/manual/en/function.iconv.php
preg_replace: http://uk.php.net/manual/en/function.preg-replace.php

Functions like iconv_strlen() are PHP 5 only, but iconv() itself is available in PHP 4.0.5 and above.
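
Since iconv is an extension that may not be compiled into every PHP build, a guard along these lines could keep the function from failing on hosts without it. This is only a sketch: example_to_ascii() is a made-up name, not something in pathauto, and returning the input unchanged on failure is an assumption about what the caller would want.

<?php
// Hypothetical helper, not part of pathauto.
function example_to_ascii($string) {
  if (function_exists('iconv')) {
    // @ suppresses the notice iconv() raises on characters it cannot convert.
    $converted = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $string);
    if ($converted !== false) {
      return $converted;
    }
  }
  // Assumed fallback: hand the string back untouched so a later step
  // (for example a translation table) can deal with it.
  return $string;
}
?>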

FiReaNG3L’s picture

In my experience, iconv //TRANSLIT isn't foolproof; some characters (some of the really odd ones) will just disappear instead of being replaced. I think a transliteration table like the one we've been using is safer (as long as it's complete).
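
For comparison, a table-based approach might look roughly like the sketch below. The function name and the tiny table are made up for illustration (they are not pathauto's actual table), and the file is assumed to be saved as UTF-8. Unmapped characters end up as a separator here rather than vanishing silently.

<?php
// Hypothetical sketch; the table is deliberately tiny and incomplete.
function example_table_cleanstring($string, $separator = '-') {
  $table = array(
    'š' => 's', 'Š' => 'S',
    'ä' => 'a', 'Ä' => 'A',
    'ü' => 'u', 'Ü' => 'U',
  );
  // strtr() with an array replaces each key with its value in one pass.
  $url = strtr($string, $table);
  $url = strtolower($url);
  // Anything that is still not a-z, 0-9 or '_' becomes a single separator.
  $url = preg_replace('~[^a-z0-9_]+~', $separator, $url);
  return trim($url, $separator);
}
?>

With this table, "Ä Ü ' - stuff --" comes out as a-u-stuff, but any character missing from the table is lost, which is the maintenance problem the original post describes.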

greggles’s picture

Fireangel - it seems like obscure problems in the iconv code are things we could file as bugs against the PHP project to have them fixed.

I'm quite interested in this solution because the other solutions are difficult to update and get right.

greggles’s picture

Status: Postponed » Needs review
Status  File                        Size
new     pa_iconv_cleanstring.patch  5.05 KB

Here's an (untested) patch that implements this in a slightly different order and with some extra items that pathauto needs to consider like user configuration for apostrophes and maxlength.
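
To make the two extra settings concrete, here is a rough sketch of how they could be applied. The function name, parameters and defaults are invented for illustration and do not come from the attached patch; the string is assumed to be plain ASCII already, so substr() cannot cut a multi-byte character in half.

<?php
// Hypothetical helper; names and defaults are assumptions, not the patch's code.
function example_apply_settings($url, $remove_apostrophes = true, $maxlength = 100, $separator = '-') {
  if ($remove_apostrophes) {
    // Drop apostrophes entirely so "don't" becomes "dont" rather than "don-t".
    $url = str_replace("'", '', $url);
  }
  // Enforce the maximum length, then make sure we do not end on a separator.
  $url = substr($url, 0, $maxlength);
  return trim($url, $separator);
}
?>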

I'm curious about the trim($url,$separator) because the current pathauto has always used a ctype_alnum and preg_replace function. The ctype_alnum isn't supported on all platforms (see http://drupal.org/node/20289).
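
One thing to keep in mind: trim() treats its second argument as a list of characters, so it only behaves as expected for a single-character separator. A regex-only alternative that needs neither trim() nor ctype_alnum() could be sketched like this (example_trim_separator() is a hypothetical name):

<?php
// Hypothetical helper, not part of pathauto.
function example_trim_separator($url, $separator = '-') {
  $quoted = preg_quote($separator, '~');
  // Collapse runs of the separator into a single one...
  $url = preg_replace('~(?:' . $quoted . ')+~', $separator, $url);
  // ...then strip it from the start and the end of the string.
  return preg_replace('~^(?:' . $quoted . ')+|(?:' . $quoted . ')+$~', '', $url);
}
?>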

Tests and critiques welcome (I'm about to do one myself).

greggles’s picture

Status  File               Size
new     pathauto_1.module  21.38 KB

Perhaps iconv is really bad or I'm just not doing something right in the patch, but when I used "Ä Ü ' - stuff --" as a test URL, the output was simply "stuff".

Also, the patch didn't apply for me - I'm not sure of the exact problem, but apologies in advance if it doesn't work for others. I've attached the entire module file as a workaround for now.
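
One possible explanation for the vanishing accented characters, offered as a guess rather than a diagnosis: as far as I know, with glibc's iconv the //TRANSLIT result depends on the current LC_CTYPE locale, and under the default "C" locale non-ASCII letters tend to come back as '?', which the later cleanup then strips. A sketch for testing that theory follows; the helper name is made up, and the locale name is an assumption that has to exist on the host.

<?php
// Hypothetical helper for experimenting; not part of the patch.
function example_translit_with_locale($string, $locale = 'en_US.UTF-8') {
  // Passing '0' only queries the current setting without changing it.
  $previous = setlocale(LC_CTYPE, '0');
  // If the locale is not installed, setlocale() returns false and changes nothing.
  setlocale(LC_CTYPE, $locale);
  $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $string);
  // Restore whatever was active before.
  setlocale(LC_CTYPE, $previous);
  // May be false if iconv() could not convert the input at all.
  return $ascii;
}
?>

Comparing the output for "Ä Ü" with and without a UTF-8 locale on the affected server would show whether the locale is the culprit.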

greggles’s picture

Another possible method is the one provided in the accents module starting around line 20 of the .module file:

http://drupal.org/project/accents
http://cvs.drupal.org/viewcvs/drupal/contributions/modules/accents/accen...

However, I'm not sure if that removes all of the characters that people want to remove...

greggles’s picture

Status: Needs review » Postponed

I'm postponing this because I couldn't get it to work and there are other techniques in this issue queue that will work better.

greggles’s picture

Status: Postponed » Closed (won't fix)

Given that http://drupal.org/node/61815 has been applied, I think we can mark this as won't fix. That feels like the best solution to me for now. If we need to revisit this idea, we can re-open this issue at that time.