robots.txt: I have Disallow-ed /node/ and /comment/. Any problem with that?

My robots.txt looks like this:

User-agent: *
Disallow: /node/
Disallow: /comment/

Reasons:

  • "Disallow: /node/": Search engines were indexing the pages under 2 URIs: 1 the default URI (for e.g. http://thoughtfulchaos.com/node/501) and the aliased URI (for e.g. http://thoughtfulchaos.com/varun/blog/2005/08/28/uneventful-week) generated by path and pathauto modules. This could possibly lead to penalisation for serving the same content under multiple URIs. Also because the unaliased URI is generally shorter search results often showed the unaliased URI and not the aliased URI. And in case bots were crawling almost twice as much as they actually needed to.
  • "Disallow: /comment/": The 'reply' look of a comment looks like this http://thoughtfulchaos.com/comment/reply/559/232. Search bots crawled all such 'reply' links. This is not necessary since comments are part of the node pages anyways.

I am also thinking of Disallow-ing '/archive/' so that links from the 'Browse Archives' block are not crawled.

Can the SEO gurus out there tell me if I am doing anything wrong? Have I overlooked something? Is there a better way to do something like this?

Comments

varunvnair’s picture

A few minutes after I posted this question I added Disallow: /archive/ to my robots.txt file.

The reason is that the 'Browse Archives' block has a link that points to a page for the posts made on the same day of the previous month. This link _always_ exists. My blog came into existence somewhere in the middle of 2005. However, the 'Browse Archives' block has links pointing to dates even before that, i.e. older than the oldest node on the site. In principle, the block will generate an infinite chain of links.

This is bad for SEO because bots crawl thousands of pages that have essentially no content. I think this is a very serious issue with this module.

My Drupal-powered Blog: ThoughtfulChaos

varunvnair’s picture

This is proving to be a very interesting experience. I have Disallow-ed '/taxonomy/' since those pages too had aliased URIs.

I just had a look at the pages indexed by Google for my blog, and it is terribly, terribly polluted. It is full of URIs with '/node/', '/comment/' and '/taxonomy/'. I think path and pathauto should have a warning that alerts admins to this problem and recommends adding appropriate rules to robots.txt. This is best done when a site is new and has not been crawled yet.
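
For reference, putting together all the rules mentioned so far, the robots.txt now looks roughly like this:

User-agent: *
Disallow: /node/
Disallow: /comment/
Disallow: /taxonomy/
Disallow: /archive/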

My Drupal-powered Blog: ThoughtfulChaos

murph’s picture

I am using a multisite installation, and Google has indexed all the pages from one site under /book/print/... URLs as well as under the aliased URLs, despite my robots.txt having a disallow rule for book/print/.

Any ideas why this might have happened? Do I need to specify book/print/* in my robots.txt file?

I would also like to disallow node/; however, I have other sites using aggregator2 that rely on the unaliased URLs...

varunvnair’s picture

Could you point me to the sites? Having a look at the actual robots.txt will help in pinpointing the problem.

Also, pages indexed earlier will take some time to fade out of Google's index. So if you have introduced the rules forbidding crawling of /book/print/ only recently, wait for some time.

I think the correct syntax should be:

Disallow: /book/print/

It is recommended that a capital 'D' be used in 'Disallow', and there should be a leading slash before 'book'.

Wildcards such as * are not part of the robots exclusion standard. However, some search engines understand certain wildcards in a limited way. You will have to read up on the robots.txt standard and the search-engine-specific documentation for this.
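
So, as a minimal sketch (assuming you want the rule to apply to all bots), the record in your robots.txt would look like this:

User-agent: *
Disallow: /book/print/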

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My Drupal-powered Blog: ThoughtfulChaos

murph’s picture

The reason Google didn't bypass the book/print pages is that my robots.txt didn't have the critical line 'User-agent: *'! I can't believe I left that out, but never mind. I had just the disallow rules in the robots.txt file. Also, I was using the syntax specified in the Drupal manual ("Disallow: book/print") instead of, as you say, "Disallow: /book/print/".

Anyway, I have now implemented the robots.php/.htaccess setup suggested on http://drupal.org/node/22265, because I wanted certain sites (the ones using only aliases) to disallow node/ but not the sites using RSS...
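
(In case it helps anyone else: I won't reproduce the exact snippet from that handbook page here, but the general idea is a rewrite rule in .htaccess that serves robots.txt from a PHP script, so each site in a multisite setup can emit its own rules. A rough sketch, assuming a robots.php in the site root:)

# Serve robots.txt from a PHP script so each site can return its own rules
RewriteEngine on
RewriteRule ^robots\.txt$ robots.php [L]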

casperl’s picture

The link posted below concerns the excessive bandwidth generated by search engines traversing the duplicated URLs. The robots.txt solution in this post solves part of the problem.

http://drupal.org/node/45094

Also note that certain modules will cause search engines to traverse the entire site multiple times. The primary culprit is the SiteMenu module, which displays the entire taxonomy as a menu structure.

AFAIK sites may be penalised for spamming if more than two links to the same page exist on a single page. Thus a menu structure, plus primary/secondary links, plus a module such as SiteMenu or NiceMenus (NiceMenus is JavaScript-based and may be exempt) may get your site penalised by search engines.

Thus, the additions to robots.txt in this post are vital!

Casper Labuschagne
Where am I on the Drupal map on Frapper?

dandellion’s picture

Tracker pages should be disallowed too.
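
For a standard Drupal setup (assuming tracker is at its default, unaliased /tracker path) that would be something like:

User-agent: *
Disallow: /tracker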

wwwoliondorcom’s picture

Hello,

I would like to know if it is really useful to add 'Disallow: /node/' to robots.txt, since the sitemap submitted to Google only shows the link generated by Pathauto, and not the link with /node/?

Thanks.

Rakhi’s picture

Hi Friends,

Apart from the custom URL, every URL has many duplicate URLs, such as taxonomy, node and search URLs, since my website has a search feature. I have disallowed all node and taxonomy paths in the robots.txt file. My site uses Drupal 6.15, so please let me know whether I am doing this right or not. Also, what effect does the trailing slash have? Should I use it or not?

http://www.open-source-development.com/robots.txt

Regards,
R