robots.txt: I have Disallow-ed /node/ and /comment/. Any problem with that?

My robots.txt looks like this:

User-agent: *
Disallow: /node/
Disallow: /comment/

Reasons:

  • "Disallow: /node/": Search engines were indexing the pages under 2 URIs: 1 the default URI (for e.g. http://thoughtfulchaos.com/node/501) and the aliased URI (for e.g. http://thoughtfulchaos.com/varun/blog/2005/08/28/uneventful-week) generated by path and pathauto modules. This could possibly lead to penalisation for serving the same content under multiple URIs. Also because the unaliased URI is generally shorter search results often showed the unaliased URI and not the aliased URI. And in case bots were crawling almost twice as much as they actually needed to.
  • "Disallow: /comment/": The 'reply' look of a comment looks like this http://thoughtfulchaos.com/comment/reply/559/232. Search bots crawled all such 'reply' links. This is not necessary since comments are part of the node pages anyways.

I am also thinking of Disallow-ing '/archive/' so that links from the 'Browse Archives' block are not crawled.

Can the SEO gurus out there tell me if I am doing anything wrong? Have I overlooked something? Is there a better way to do something like this?

Comments

varunvnair’s picture

A few minutes after I posted this question I added Disallow: /archive/ to my robots.txt file.

The reason is that the 'Browse Archives' block has a link that points to a page for the posts made on the same day of the previous month. This link _always_ exists. My blog came into existence somewhere in the middle of 2005. However, the 'Browse Archives' block has links pointing to dates even before that, i.e. older than the oldest node on the site. In principle, the block will generate an infinite chain of links.

This is bad for SEO because bots crawl thousands of pages that have essentially no content. I think this is a very serious issue with this module.

My Drupal-powered Blog: ThoughtfulChaos

varunvnair’s picture

This is proving to be a very interesting experience. I have Disallow-ed '/taxonomy/' since those pages too had aliased URIs.

I just had a look at the pages indexed by Google for my blog, and it is terribly, terribly polluted. It is full of URIs with '/node/', '/comment/' and '/taxonomy/'. I think path and pathauto should have a warning that alerts admins to this problem and recommends adding appropriate rules to robots.txt. This is best done when a site is new and has not been crawled yet.
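
For reference, putting together all the rules mentioned so far, the robots.txt now looks roughly like this:

User-agent: *
Disallow: /node/
Disallow: /comment/
Disallow: /taxonomy/
Disallow: /archive/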

My Drupal-powered Blog: ThoughtfulChaos

murph’s picture

I am using a multisite installation, and Google has indexed all the pages from one site under /book/print/... URLs as well as under the aliased URLs, despite my robots.txt having a disallow rule for book/print/.

Any ideas why this might have happened? Do I need to specify book/print/* in my robots.txt file?

I would also like to disallow node/; however, I have other sites using aggregator2 that rely on the unaliased URLs...

varunvnair’s picture

Could you point me to the sites? Having a look at the actual robots.txt will help in pinpointing the problem.

Also, pages indexed earlier will take some time to fade out of Google's index. So if you have introduced the rules forbidding crawling of /book/print/ only recently, wait for some time.

I think the correct syntax should be:

Disallow: /book/print/

It is recommended that a capital 'D' be used in 'Disallow', and there should be a leading slash before 'book'.

Wildcards such as * are not part of the robots exclusion standard. However, some search engines understand certain wildcards in a limited way. You will have to read up on the robots.txt standard and the search-engine-specific documentation for this.
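
So, as a minimal sketch (assuming you want the rule to apply to all bots), the record in your robots.txt would look like this:

User-agent: *
Disallow: /book/print/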

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My Drupal-powered Blog: ThoughtfulChaos

murph’s picture

The reason Google didn't bypass the book/print pages is that my robots.txt didn't have the critical line 'User-agent: *'! I can't believe I left that out, but never mind. I had just the disallow rules in the robots.txt file. Also, I was using the syntax specified in the Drupal manual ("Disallow: book/print") instead of, as you say, "Disallow: /book/print/".

Anyway, I have now implemented the robots.php/.htaccess setup suggested on http://drupal.org/node/22265, because I wanted certain sites (the ones using only aliases) to disallow node/ but not the sites using RSS...
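
(In case it helps anyone else: I won't reproduce the exact snippet from that handbook page here, but the general idea is a rewrite rule in .htaccess that serves robots.txt from a PHP script, so each site in a multisite setup can emit its own rules. A rough sketch, assuming a robots.php in the site root:)

# Serve robots.txt from a PHP script so each site can return its own rules
RewriteEngine on
RewriteRule ^robots\.txt$ robots.php [L]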

casperl’s picture

The link posted below concerns the excessive bandwidth generated by search engines traversing the duplicated URLs. The robots.txt solution in this post solves part of the problem.

http://drupal.org/node/45094

Also note that certain modules will cause search engines to traverse the entire site multiple times. The primary culprit is the SiteMenu module, which displays the entire taxonomy as a menu structure.

AFAIK sites may be penalised for spamming if more than two links to the same page exist on a single page. Thus a menu structure, plus primary/secondary links, plus a module such as SiteMenu or NiceMenus (NiceMenus is JavaScript-based and may be exempt) may get your site penalised by search engines.

Thus, the additions to robots.txt in this post are vital!

Casper Labuschagne
Where am I on the Drupal map on Frapper?

dandellion’s picture

Tracker pages should be disallowed too.
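
For a standard Drupal setup (assuming tracker is at its default, unaliased /tracker path) that would be something like:

User-agent: *
Disallow: /tracker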

wwwoliondorcom’s picture

Hello,

I would like to know if it is really useful to add 'Disallow: /node/' to robots.txt, since the sitemap submitted to Google only shows the link generated by Pathauto, and not the link with /node/?

Thanks.

Rakhi’s picture

Hi Friends,

Apart from the custom URL, every URL has many duplicate URLs, such as taxonomy, node and search URLs, since my website has a search feature. I have disallowed all node and taxonomy paths in the robots.txt file. My site uses Drupal 6.15, so please let me know whether I am doing this right or not. Also, what effect does the trailing slash have? Should I use it or not?

http://www.open-source-development.com/robots.txt

Regards,
R