I started getting 'disk full' warnings because my Drupal DB has grown enormously, mainly from amazing bloat in the tables cache_amazon_store and cache_amazon_store_searches. The watchdog dblog is lousy with lengthy warnings like the one below. There are so many of these that my DB grows by hundreds of MB. I've 'controlled' that (at least temporarily) by drastically reducing the cache times and renaming the URL (from amazon_store to something else, to lose the bots, at least for a while).
To me, it looks like it starts this way. The bot visits a single amazon item, like:
http://mywebsite.com/amazon_store/item/0060930349
Then, the bot follows the Author link to reach:
http://mywebsite.com/amazon_store?Author=Paul%20Johnson&SearchIndex=Books
and, by following the pager links, the bot gets to a URL like:
http://mywebsite.com/amazon_store?page=2&Author=Paul%20Johnson&SearchInd...
Looks to me like those dang bots end up indexing a large fraction of Amazon.com (with my site as a kind of proxy). Which is fine, except my DB can't handle all those entries.
But here's the real reason I'm reporting. Take a look at the dblog entry below and you'll see "page=982490" in the URL!! Such a large page number throws the error, but I can't fathom how the bot gets that far. I thought it might be the 'last' link in the pager, but so far I can't reproduce getting such a large page number.
Looking through other issues, I've seen discussions of caching problems and the possible use of the 'nofollow' tag to discourage bots. But I've not seen anyone report this sort of cache bloat, or the error associated with giant, mysterious page numbers. So I thought I'd report this, mainly to see if anyone else has seen something similar or has suggestions on how to avoid it.
Thanks for all the hard work! Any suggestions appreciated.
-- Cronin
Details
Type amazon
Date Friday, July 6, 2012 - 11:32
User Anonymous (not verified)
Location http://cvining.com/amazon_store?page=982490&author=Sarah%20Vowell
Referrer
Message There was an error accessing amazon. Message=Amazon error returned. Code=AWS.ParameterOutOfRange}, Message=The value you specified for ItemPage is invalid. Valid values must be between 1 and 10.
//, results=SimpleXMLElement Object ( [OperationRequest] => SimpleXMLElement Object ( [HTTPHeaders] => SimpleXMLElement Object ( [Header] => SimpleXMLElement Object ( [@attributes] => Array ( [Name] => UserAgent [Value] => Drupal (+http://drupal.org/) ) ) )
[Message snipped for brevity]
Severity notice
Hostname 66.249.71.219
Operations
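The log entry above says Amazon only accepts ItemPage values from 1 to 10, so one possible guard is to reject out-of-range page numbers before any request is sent or any cache row is written. The following is only a minimal sketch in Python, not the module's actual code: the function name `item_page_or_none` is hypothetical, and it assumes the `page` query parameter maps directly onto ItemPage (Drupal pagers are zero-based, so a real fix would need to account for that offset).

```python
# Range taken from the Amazon error message in the log entry above.
MIN_ITEM_PAGE = 1
MAX_ITEM_PAGE = 10


def item_page_or_none(raw_page):
    """Return raw_page as an int if it is a valid ItemPage, else None.

    Hypothetical helper: a caller could answer with a 404 when this
    returns None, without contacting Amazon or caching anything.
    """
    try:
        page = int(raw_page)
    except (TypeError, ValueError):
        # Missing or non-numeric page parameter.
        return None
    if MIN_ITEM_PAGE <= page <= MAX_ITEM_PAGE:
        return page
    return None
```

With a check like this, a request for page=982490 would be turned away up front instead of producing a lengthy watchdog entry.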
Comments
Comment #1
rfay
Don't all those links have rel="nofollow"? That should keep the googlebot out. Also, don't forget your robots.txt.
Comment #2
willvincent commented
They should all have rel="nofollow" on them. But yes, the best course of action might be to simply disallow bot access to that entire path in robots.txt.
At any rate, I don't see this as a bug with the module, necessarily.
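A robots.txt rule along the lines suggested in these comments can be sanity-checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the two-line robots.txt below is an assumed example for the default amazon_store path (a renamed path would need its own Disallow line), and `is_allowed` is a hypothetical helper name. It also only shows what a well-behaved crawler should do; it cannot force a bot to comply.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block every crawler from the store path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /amazon_store
"""


def is_allowed(url, user_agent="Googlebot"):
    """Return True if the rules in ROBOTS_TXT permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # URLs taken from the original report, plus one ordinary page.
    for url in (
        "http://mywebsite.com/amazon_store/item/0060930349",
        "http://mywebsite.com/amazon_store?Author=Paul%20Johnson&SearchIndex=Books",
        "http://mywebsite.com/node/1",
    ):
        print(url, is_allowed(url))
```

Because the rule is a path prefix, it covers the item pages, the author searches, and every pager link under that path in one line.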