It was working fine before, but suddenly Google has started indexing
"www.mysite.com/comment/reply/4432" instead of "www.mysite.com/actual-path"
on my site.

My robots.txt clearly states:

User-agent: *
Crawl-delay: 10
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
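
As a sanity check, the rules above can be fed to Python's standard-library robots.txt parser to confirm they really do disallow the comment-reply URLs (a quick sketch; "www.mysite.com" is the placeholder host from above):

```python
from urllib import robotparser

# The rules from the robots.txt above.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The comment-reply URL is disallowed for all user agents...
print(rp.can_fetch("*", "http://www.mysite.com/comment/reply/4432"))  # False
# ...while ordinary pages remain crawlable.
print(rp.can_fetch("*", "http://www.mysite.com/actual-path"))  # True
```

So the rules themselves are correct; "Disallow" only forbids crawling, not indexing, which is the root of the problem described in the answers.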

But it is still being indexed. What should I do?


ar-jan’s picture

The robots.txt tells Google not to crawl those URLs. But the node pages are probably outputting links to those comment URLs, so Google will still notice that they exist and put them in the index despite not visiting them. Also see my answer to this here.

So basically, use a meta robots tag "noindex,follow" on those /comment/ paths instead of relying on robots.txt. You can do this with a bit of custom code as in the linked answer, or use the Metatag module with Context. Note that you'll have to allow the /comment/ URLs again in your robots.txt; otherwise the bot can't see your noindex tag.
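
For reference, the tag that needs to end up in the head of the /comment/ pages (whether it gets there via custom code or via the Metatag module) looks like this:

```html
<meta name="robots" content="noindex,follow">
```

"noindex" tells the bot to keep the page out of its index, while "follow" still lets it follow the links on the page.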

Yet another approach is to return an HTTP 410 status code to search bots only, as in this answer. Wikipedia on the 410 HTTP status:

Indicates that the resource requested is no longer available and will not be available again.[2] This should be used when a resource has been intentionally removed and the resource should be purged. Upon receiving a 410 status code, the client should not request the resource again in the future. Clients such as search engines should remove the resource from their indices. Most use cases do not require clients and search engines to purge the resource, and a "404 Not Found" may be used instead.

So technically this will work, but I don't know whether there are any SEO drawbacks to pretending the resource is removed when it isn't actually removed for normal visitors.
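
The bot-only 410 idea can be sketched as a small request-handling helper (a minimal sketch, not Drupal code; the bot token list and function name are illustrative assumptions, and a real site would hook this into its request pipeline):

```python
# Illustrative sketch: pick an HTTP status code from the request path and
# User-Agent. The bot tokens here are an assumption, not a complete list.
SEARCH_BOT_TOKENS = ("googlebot", "bingbot", "yandex")

def status_for(path: str, user_agent: str) -> int:
    """Return 410 Gone for comment-reply URLs requested by search bots;
    normal visitors still get the page (200)."""
    is_bot = any(token in user_agent.lower() for token in SEARCH_BOT_TOKENS)
    if is_bot and path.startswith("/comment/reply/"):
        return 410  # asks the bot to drop the URL from its index
    return 200

print(status_for("/comment/reply/4432", "Mozilla/5.0 (compatible; Googlebot/2.1)"))  # 410
print(status_for("/comment/reply/4432", "Mozilla/5.0 (Windows NT 10.0)"))            # 200
```

Matching on the User-Agent string is the simplest way to target bots, though it can be spoofed; that is part of why this approach feels riskier than the noindex tag.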

rajesh190888’s picture

The problem is solved.
I just put Disallow: /comment/* in robots.txt.