After #1702486: Change behaviour of profile listings: new sort order and/or block from robots.txt has been addressed to some degree, I wonder if we should block further listing pages from robots as well.

These listing pages are all expensive to generate and do not provide much info by themselves.

All nodes are linked to from the tracker, which is generated more efficiently and is not as costly.

The obvious idea is to block all paged pages except the tracker.

We could even keep the first page of each paged listing in the index, i.e. block all pages that match /node?page= but not /node itself.
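
A rough sketch of what such a rule could look like in robots.txt, assuming Googlebot's wildcard extension (the * pattern is not part of the original robots.txt standard, so other crawlers may ignore it):

  User-agent: *
  # Block paginated listings such as /node?page=1 while /node itself stays crawlable.
  Disallow: /*?page=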

I am not an SEO expert and I am not sure if this change would have any consequences.

Please discuss.

Comments

killes@www.drop.org

I've had a look at the stats Google provides about the number of indexed pages on d.o.

Last August, there were about 5 million indexed pages, and the number was roughly constant. Then there was a jump to 9 million in early November, followed by a steady increase until about June, topping out at over 18 million indexed pages, and then a slight decrease to about 14 million.

This is obviously nuts; we don't really have that much content. If we add up user profiles and nodes, we have fewer than 2.5 million pages.

The bloated number is probably due to all these paged link lists. There are about 200k results alone for

site:drupal.org/taxonomy

Ben Finklea

The question is how deeply Googlebot would have to delve to find an arbitrary node. Being linked only from the profile pages may not be ideal (I'm not sure).

If any of those nodes are more than 5-6 levels deep from the home page (or they're not linked from anywhere, i.e. they're orphans), then it's best to either leave things as they are or create a sitemap so that Googlebot will find them.
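
If we did go the sitemap route, it is just an XML file listing canonical URLs; a minimal sketch, with an illustrative node URL:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://drupal.org/node/123456</loc>
    </url>
  </urlset>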

Then there is the fly-by-the-seat-of-your-pants option: look at how many pages are indexed and at the daily traffic, ban the bot, wait a week, and check again. Kieran would kill me for recommending this... but if things are as you describe, it shouldn't have any effect at all.

Also consider that there are other ways to handle performance issues. That's not my area, though.

killes@www.drop.org

Ben, _all_ nodes are linked through the tracker page: https://drupal.org/tracker?page=37704

The tracker would be the only paged listing that we would not disallow in robots.txt, to ensure that all nodes can still be found.
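
Sticking with the robots.txt approach, the tracker exception could be sketched roughly like this (untested; it relies on Google treating the longer, more specific Allow rule as taking precedence over the shorter Disallow rule):

  User-agent: *
  # Paginated tracker URLs stay crawlable; other ?page= URLs are blocked.
  Allow: /tracker?page=
  Disallow: /*?page=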

Another option would be to add a "noindex" header to the paged pages.
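
For reference, the noindex variant would mean emitting either an HTTP header or a meta tag on the paginated responses, for example:

  X-Robots-Tag: noindex, follow

or, in the page head:

  <meta name="robots" content="noindex, follow" />

Using "noindex, follow" would keep the listings out of the index while still letting the crawler follow their links, which a robots.txt disallow would not.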

My concern is not so much performance, but index bloat on Google's side, which they might think is something we did deliberately.

apaderno

Project: Drupal.org site moderators » Drupal.org infrastructure
Component: Site organization » Other
Issue summary: View changes
drumm

Status: Active » Closed (won't fix)

We haven’t been having problems with crawlers that pay attention to robots.txt lately.