On our production site with Apache Solr, we decided to disallow crawling of search/* altogether in robots.txt, because:

* spidering was putting a heavy load on our site
* Google results included many (many!) irrelevant (IMO) results for page after page of facet combinations

We also added an XML sitemap to make Google index us correctly.
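For illustration, the robots.txt change and sitemap described above might look roughly like this (the sitemap URL is a placeholder, not our actual production file):

```
# Block crawlers from search result pages and their facet combinations
User-agent: *
Disallow: /search/

# Point crawlers at the canonical sitemap instead
Sitemap: https://example.com/sitemap.xml
```

Note that robots.txt rules are prefix matches, so `Disallow: /search/` covers every facet-combination URL under that path without needing a wildcard.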

I think that we should think a bit about what to do about the very real possibility of spamming search engines through add/remove facet links.

This article provides some quick pointers and links to other sources: http://www.netconcepts.com/faceted-navigation-article/

Comments

pwolanin’s picture

Yes, I agree this is critical, but it's probably something we need to address as a core patch.

greggles’s picture

The core robots.txt disallows search by default.

Google's webmaster guidelines recommend this as well:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.

greggles’s picture

My comment wasn't clear on my recommended course of action: I'd say this is 'by design' or 'won't fix'.

janusman’s picture

Status: Active » Closed (won't fix)

I was unaware robots.txt disallowed /search/*.

Marking won't fix.

cpliakas’s picture

Related discussion posted against Facet API at #1370342: Implement a setting to add "rel=nofollow" to facet links. This gets more complex in D7, where searches can reside outside of the "search/*" path.
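For illustration, the setting proposed in #1370342 would render facet links roughly like this (the path, query string, and label here are hypothetical, not Facet API's actual output):

```html
<!-- Hypothetical facet link with rel="nofollow" so crawlers
     are discouraged from following every facet combination -->
<a href="/search/site/laptops?f[0]=brand" rel="nofollow">Brand (12)</a>
```

Unlike a robots.txt disallow, rel="nofollow" travels with the link itself, so it also covers searches living outside the "search/*" path in D7.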

cpliakas’s picture