Hi all!
After I got penalized by Google I found out the following:

I have a page:
exapmle.com/example = node/123

If I change this URL in my browser and submit any of these:
exapmle.com/ëxample
exapmle.com/exämple
exapmle.com/examplë

the same page opens again and again with all of those URLs!

Every variation with non-ASCII characters in the URL opens the same page, which leads to duplicate content.

http://drupal.org/project/issues/pathauto

has the following duplicate URLs with non-ASCII character variations:

http://drupal.org/project/issues/p%C3%A4th%C3%A4uto
http://drupal.org/project/issues/patha%C3%BCto
http://drupal.org/project/issues/pathaut%C3%B6
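
To make the three URLs above easier to read, here is a minimal Python sketch that decodes their last path segment. It assumes the escapes are UTF-8 (the `%C3%..` pairs suggest so); the variable names are only for illustration.

```python
from urllib.parse import unquote

# Last path segment of each of the three URLs above.
variants = ["p%C3%A4th%C3%A4uto", "patha%C3%BCto", "pathaut%C3%B6"]

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = [unquote(segment) for segment in variants]

for segment, text in zip(variants, decoded):
    print(segment, "->", text)
```

Each one decodes to "pathauto" with a single vowel (or two) swapped for its umlaut form, which matches the hand-typed variations at the top of this post.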

I have the Pathauto, Path Redirect and Global Redirect modules installed. All modules are up to date. Drupal 5.10.

Does anyone else have a similar problem (besides drupal.org and me :)?
Thanks!

Comments

odisei’s picture

Anyone please?

greggles’s picture

Project: Pathauto » Drupal core
Version: 5.x-2.3 » 7.x-dev
Component: I18n stuff » menu system
Priority: Critical » Normal

IMO, this is not a pathauto "problem" but a core path/menu system bug.

I confirmed that this duplicate content issue is still present in 6.x and so I assume also 7.x. It should ideally be fixed there first and then backported as appropriate.

odisei’s picture

Thanks man for your reply.
I really appreciate it very much.
After some further testing, I found out that this strange behavior occurs even on the Google Pack download page. Check it out at
http://pack.google.com/intl/en/pack_installer.html?nopers

and the variations

http://pack.google.com/intl/en/pack_installer.html?n%F6pers
http://pack.google.com/intl/en/pack_installer.html?nop%EBrs
http://pack.google.com/intl/en/pack_installer.html?noper%9A

BUT, when I try to change characters BEFORE the "?" in the URL, it throws a "page not found" error.
http://pack.google.com/intl/en/p%C3%A4ck_installer.html?nopers
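
One detail worth noting: the query-string variations use single-byte escapes (`%F6`, `%EB`, `%9A`), while the path variation uses a two-byte UTF-8 escape (`%C3%A4`). A small Python sketch to show the difference follows. Reading the single bytes as Windows-1252 is an assumption, and `try_decode` is just a helper made up for this example.

```python
from urllib.parse import unquote_to_bytes

def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

# Query strings from the three variations above.
queries = ["n%F6pers", "nop%EBrs", "noper%9A"]

results = {}
for query in queries:
    raw = unquote_to_bytes(query)
    results[query] = (try_decode(raw, "utf-8"), try_decode(raw, "cp1252"))
    print(query, results[query])
```

None of the three is valid UTF-8; under the Windows-1252 assumption they read as "nöpers", "nopërs" and "noperš". This only shows what the escapes contain. It says nothing about why the server returns the same page for them.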

Maybe it isn't a Drupal error after all?
Or maybe even Google switched to Drupal?!? :)))

R.Muilwijk’s picture

I have been looking into this problem and it is still valid for Drupal 7. How come this query (pulled from the query log):
SELECT * FROM menu_router WHERE path IN ('admin/appearance/üpdate', 'admin/appearance/%', 'admin/%/üpdate', 'admin/appearance', 'admin/%', 'admin') ORDER BY fit DESC LIMIT 0, 1

returns:
admin/appearance/update

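DamZ helped me out on IRC, letting me know about the case (and accent) insensitivity of the utf8_general_ci collation that was used on my install: it treats 'ü' and 'u' as equal, so the IN (...) list matches the unaccented row. What is a proper fix to get this right?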
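
To illustrate the collation behaviour outside the database, here is a rough Python sketch. The `fold` function is a hypothetical stand-in that strips accents and lowercases; it only approximates utf8_general_ci and is not MySQL's actual comparison rule.

```python
import unicodedata

def fold(text):
    """Rough stand-in for an accent- and case-insensitive collation (an approximation)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

requested = "admin/appearance/üpdate"   # path from the query above
stored = "admin/appearance/update"      # row that came back

exact_match = requested == stored               # byte-for-byte, like a binary collation
folded_match = fold(requested) == fold(stored)  # accent-insensitive, like a *_ci collation

print(exact_match)   # False
print(folded_match)  # True
```

An exact comparison keeps the two paths apart, which is what a binary collation such as utf8_bin would do. Whether changing the column collation is the right fix for core is a separate question that this sketch does not answer.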

R.Muilwijk’s picture

EDIT: Thought I found a solution but it was not correct, removed.

1kenthomas’s picture

An IRC user in .hu with nick 'bence__' dropped in #drupal-consultants and offered a vague bounty "to solve this problem" earlier today, then dropped offline.

The following is addressed to @bence__:

I want to be quite frank about two issues here:

First, the -consultants channel is populated by people who usually have many projects, and ignore most passersby because their time is valuable. We're glad to try to help people with work/projects find someone who can do the work-- for money-- but dropping by the channel, using people's time, and then disappearing, gives the impression that you do not respect people's time and expertise.

I'm sorry if you dropped offline due to a connection issue, or the like, but such behavior discourages qualified consultants from bothering to reply to people on the channel, which helps no one and has led to the channel becoming ineffective.

Second, by the end, it seemed like you expected to be able to offer a $50 bounty, to be paid "upon community review and acceptance" of a "solution." This is IMHO simply unrealistic and not likely to get a qualified consultant to produce a solution.

At this point, looking at the above, it is not clear whether you have a particular issue with encoding on your server, for instance, or are hitting this issue in D7. The solution, without further examination, is not clear.

For $150US in advance, I'd be willing to dedicate an hour or two to looking at your situation. Several other experienced people in the channel might or might not be motivated by that. Without that bounty, I doubt you will find anyone with experience willing to address this issue.

An additional payment, depending on the situation and what you need done, may be appropriate/necessary to solve your particular issue.

On that note, it is entirely unimpressive, and a warning sign, that you haven't bothered to document/describe the issue you're having, and instead came to #-consultants claiming you needed this issue fixed. If you're not capable of doing it yourself, then how do you *know* this is the issue you're experiencing, exactly?

You don't, and while I appreciate you came to the channel "not knowing how much this should cost," you have to take into account that while this may indeed be the issue documented above, you might have an entirely different issue, configuration or otherwise, that is not the above, or will not easily be solved.

Most consultants on the channel tend to steer clear of such requests, as far as I can tell from four years of observation, because they are an easy way to lose hours or days of work and not get paid anything.

I hope to work to put some better structures in place in-channel or elsewhere to make such situations "work out" in the future, but at this point I simply want to point out the dynamic and issues.

You have an issue with your site, evidently, that looks like the above, and may or may not be resolved by fixing the issue above. You need to hire someone to fix it. Since no one can know how long that will take, or what it would involve, without more research and time actually looking at it, no one who's not inherently interested in fixing the problem is going to do it for free.

Therefore, if you want it fixed, you need to hire a consultant, on reasonable and fair terms. One way to do that would be to purchase an hour or two of someone's time, to evaluate the issue and suggest further courses of action, if needed.

I'd expect that to cost $150US. This is certainly not the only possible way to proceed. Please feel free to come back in-channel with that in mind.

1kenthomas’s picture

Title: Duplicate URLs of non asc-ii characters » Paths with non-ASCII characters do not redirect, create "duplicate content" issues

Title change.

1kenthomas’s picture

Is there still an issue here? I will close within 3 months if no further comments.

beto_beto’s picture

hello

I am using D6 with i18n. When I try to copy a URL from my site,
for example: www.xxx.com/cat/[title-raw]

it appears like this:

http://www.xxxx.com/cat/%D8%A7%D9%83%D8%AA%D8%B4%D8%A7%D9 .... etc .

Any suggestions, please?