This code should be removed from robots.txt:

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/

This code should be added to these pages instead:
<meta name="robots" content="noindex">

There are two main reasons for doing this.

  • First, without a "noindex" meta tag, search engines can still list these pages in search results; they just won't crawl them.
  • Second, by allowing robots to follow the links on these pages, the PageRank of the linked pages (primary menu, blocks) will increase.
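
To make the first point concrete, here is a minimal Python sketch of how a crawler-side check for the tag might look. The names RobotsMetaParser and has_noindex are hypothetical and exist only for this illustration; real search engines use their own parsers. The point it shows is that the tag can only be seen once the page itself has been fetched.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The name attribute is compared case-insensitively.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The content attribute is a comma-separated list of directives.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the fetched HTML asks robots not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))
```

A page blocked in robots.txt never reaches this check, because a polite crawler does not download it in the first place.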

See: http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions

Comments

yonailo’s picture

subscribing

yonailo’s picture

subscribing

RobLoach’s picture

Interesting... Would this use both robots.txt and the meta tag, or just the meta tags?

Related: #495608: Move parts of robotstxt module into core.

j0nathan’s picture

Subscribing.

pillarsdotnet’s picture

userok’s picture

By allowing those links to be crawled, wouldn't that impact site performance?
E.g., for every instance of /comment/reply, the link needs to be crawled first in order to reach the 'noindex' meta tag.

I'm a bit hazy about indexing so I could be completely wrong.

Roger34’s picture

I am not sure if this post belongs on this page, but since it did not get a reply elsewhere (http://drupal.org/node/22265), I am posting here:

I use the default robots.txt in Drupal 6.22, but the performance overview in Google Webmaster Central shows that prohibited directories are also accessed. Do you suggest that I would be better off adding the paths in a meta tag? Google Webmaster Central lists the following under example page loading times (in seconds):
/admin/content/add 1.9
/node/add/story 2.3
/node/add/article 3.1
/rss.xml 0.6
/node/15008/edit 2.2
/admin/settings 0.9
/admin/reports/status 1.6
/admin/reports/status/run-cron 120.01

In addition to Disallow: /admin/, do I also need to specify, for example, /admin/reports/status/run-cron? I am sure no one wants the crawler to spend 120 seconds on cron.
I also do not want Google to crawl node/add/.

Would appreciate a reply.

ar-jan’s picture

I like this idea. Re #6: yes, I think so, the page would have to be crawled. So apart from any possible performance impact, for very large sites (thousands of pages) this would mean the crawler spending time on irrelevant /comment/reply pages, crawling time that should be spent on actual content. (Unless noindex pages are 'free' as far as crawling effort is concerned?)

@Roger: no, this is not the place to ask, this is an issue about changing the way Drupal core works. But: the default robots.txt prevents search engines from even crawling those pages. Everything under /admin/ is already disallowed by that line, so you shouldn't have to add that particular cron page. Better check that your robots.txt is accessible. For further questions head back to the forums or IRC.
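
As an illustration of the prefix matching described above, Python's standard urllib.robotparser can be used to check which paths a rule covers. This is only a sketch: the User-agent line and the helper name is_crawl_allowed are assumptions added for the example, and actual crawlers may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Two of the rules from the issue summary; the User-agent line is assumed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /node/add/
"""


def is_crawl_allowed(path: str, user_agent: str = "*") -> bool:
    """Return True if the rules above let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/admin/reports/status/run-cron", "/node/add/story", "/node/15008"):
    print(path, is_crawl_allowed(path))
```

Under these rules the cron path is reported as not fetchable, since it starts with /admin/, so no separate Disallow line is needed for it.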

philbar’s picture

Issue summary: View changes

Update for clarity

jhedstrom’s picture

Version: 8.0.x-dev » 8.1.x-dev
Issue summary: View changes

Moving to 8.1.

iantresman’s picture

Just a note that I think that some of the paths in the robots.txt file appear to be incorrect, and do not block the required pages. See "robots.txt paths incorrect"

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

chapf’s picture

Today, after checking the access logs of a Drupal-based website I administer, I was rather surprised to see Googlebot constantly crawling some of the pages that are listed in the stock robots.txt file.

So I found this issue and then read up on some of the documentation Google provides for its crawler, and they clearly state that robots.txt isn't any good at preventing a page from being indexed or crawled or listed in the search results. It is basically useless for the purpose I think many people still assume it fulfills! (Source: https://support.google.com/webmasters/answer/6062608?hl=en)

Now, I don't mean to be annoying, but is there any concrete plan to move forward on this problem, or should I have a look into contrib modules? That seems wrong to me, since this little file is core functionality and there are also other open issues for robots.txt listed above.

nodestroy’s picture

From my point of view, we should completely remove those URLs from robots.txt and replace them with an X-Robots-Tag implementation, if that's possible with compatibility in mind.

Listing a specific page in robots.txt is no guarantee that it won't be indexed. See an example below from drupal.org.

no2e’s picture

@nodestroy:

robots.txt is used to prevent crawling, not indexing. Your screenshots show exactly this. The page is indexed, but not crawled (which is why the search result snippet cites robots.txt as the reason no relevant snippet could be shown).

X-Robots-Tag (and the equivalent meta-robots) prevents indexing, not crawling. To allow bots to notice this, they have to crawl the page, of course.

In both cases, "prevent" is of course not meant in a technical sense. Both ways require that the bot and the search engine are polite.
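
A rough sketch of the header-based approach, assuming a hypothetical helper that a site would call while building each response. The prefix list reuses some of the clean-URL paths from the issue summary; robots_headers is not a Drupal API, just an illustration in Python.

```python
# Path prefixes taken from the issue summary; the list itself is illustrative.
NOINDEX_PREFIXES = (
    "/admin/",
    "/comment/reply/",
    "/node/add/",
    "/search/",
    "/user/register/",
    "/user/password/",
    "/user/login/",
    "/user/logout/",
)


def robots_headers(path: str) -> dict[str, str]:
    """Return extra response headers for a path (hypothetical helper)."""
    # str.startswith accepts a tuple and matches if any prefix applies.
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


print(robots_headers("/admin/config"))
print(robots_headers("/node/1"))
```

The header only takes effect if the path is not also disallowed in robots.txt, because a polite bot has to request the page to receive it.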

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.) For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.