What is the best way to keep Boost from trying to cache unpublished nodes? My site's Recent Log Entries are filled with Boost-related Access Denied messages -- so while Boost is not actually caching the nodes, it is swamping legitimate Access Denied messages in the logs.
The unpublished nodes are Feed items and so there are several hundred pending review at any moment. Unfortunately, I cannot seem to use either the logging or the Boost options to tune out the noise.
Thanks for any suggestions.


Comments

mikeytown2’s picture

Long story short: if the node has a URL alias, Boost will try to crawl it when you have "Crawl All URL's in the url_alias table" selected. It looks like I should do some sort of join between the url_alias table and the node table to check whether the node is published. I would want this to be yet another option, since I've used the crawler in the past to hit millions of nodes, and the extra overhead of the join would slow down an operation like that quite a bit.

Query needed

SELECT *
FROM url_alias
LEFT JOIN node ON Concat( 'node/', CAST( node.nid AS CHAR ) ) = url_alias.src
WHERE node.status = 1
OR node.status IS NULL

In case you're wondering, Boost will auto-remove entries from the cache table if it detects a 403. This occurs when a node that was published goes unpublished.
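The intent of the join above can be sketched with a small SQLite example. This is a hypothetical minimal schema (Drupal 6's real url_alias and node tables have more columns), just to show how the LEFT JOIN keeps published nodes and non-node aliases while dropping unpublished nodes:

```python
import sqlite3

# Minimal sketch of the published-node filter: the LEFT JOIN leaves
# node.status NULL for aliases that are not node paths, so those are
# kept along with published (status = 1) nodes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE url_alias (src TEXT, dst TEXT)")
cur.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, status INTEGER)")
cur.executemany("INSERT INTO url_alias VALUES (?, ?)", [
    ("node/1", "published-page"),
    ("node/2", "unpublished-page"),
    ("taxonomy/term/5", "some-term"),
])
cur.executemany("INSERT INTO node VALUES (?, ?)", [(1, 1), (2, 0)])
cur.execute("""
    SELECT ua.dst
    FROM url_alias AS ua
    LEFT JOIN node AS n ON 'node/' || CAST(n.nid AS TEXT) = ua.src
    WHERE n.status = 1 OR n.status IS NULL
    ORDER BY ua.dst
""")
crawlable = [row[0] for row in cur.fetchall()]
print(crawlable)  # ['published-page', 'some-term']
```

The unpublished node/2 alias is the only row filtered out; everything without a matching node row survives via the IS NULL branch.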

mikeytown2’s picture

Status: Active » Needs review
FileSize
1.17 KB

Let me know if this does it for you

mikeytown2’s picture

Using SUBSTRING might make it run faster... yes, it does. This runs at an acceptable speed: 1.2M rows in less than a second, where the previous query timed out after 300 seconds.

SELECT *
FROM url_alias
LEFT JOIN node ON node.nid = CAST( substring(url_alias.src, 6) AS UNSIGNED )
WHERE node.status = 1
OR node.status IS NULL

I can do this without the option :)
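For reference, a quick sketch of what the CAST(substring(src, 6) AS UNSIGNED) expression extracts from a path like node/555. The helper name here is made up for illustration:

```python
# SQL SUBSTRING is 1-indexed, so position 6 drops the five characters
# of the "node/" prefix; the cast then yields a plain integer that can
# hit the node table's primary key index (the eq_ref join below).
def alias_src_to_nid(src: str) -> int:
    # Python slices are 0-indexed, so [5:] matches SQL's substring(src, 6).
    return int(src[5:])

print(alias_src_to_nid("node/555"))  # 555
```

For non-node paths, MySQL's CAST of a non-numeric string simply yields 0, so the LEFT JOIN finds no matching nid and the n.status IS NULL branch keeps the alias; the Python int() here would raise instead, so this only illustrates the node-path case.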

mikeytown2’s picture

FileSize
1.16 KB

In case you're wondering, the original query was doing a join on the string 'node/555':

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE url_alias ALL NULL NULL NULL NULL 1284788  
1 SIMPLE node ALL NULL NULL NULL NULL 644862 Using where

The new query does a join on the integer 555:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE url_alias ALL NULL NULL NULL NULL 1284788  
1 SIMPLE node eq_ref PRIMARY PRIMARY 4 func 1 Using where

This, plus casting the url_alias src as an integer, made all the difference.

talatnat’s picture

Thank you, that was quick! Yes, I did have Boost set to crawl the url_alias table. The patch applied successfully, and I am testing on a local setup. I will let you know how things work out.

talatnat’s picture

Thank you, it seems to work for me -- I don't see Access Denied notices for unpublished nodes.

This relates more to my specific case, where the url_alias table is overly large. I think it would be a nice feature to have some fine-grained control over *what to crawl*, similar to the current "Statically cache specific pages" feature: a) cache every page except the listed pages; b) cache only the listed pages; c) cache pages for which the following PHP code returns TRUE (PHP-mode, experts only). For example, I have URL aliases for taxonomy/term/* but they are not linked to specific views, so they yield Page Not Found errors. I presumed that, like node/* with aliases, they would not be crawled, but they are.

Also, at the moment, I gather the Rewrite Rules are hardcoded to avoid crawling admin, user, etc. pages?

If I am not doing something wrong and the above is the currently expected behaviour, I can open a Feature Request for *what to crawl*.

Addition: could sitemap.xml from the xmlsitemap module be an alternative to the url_alias table? The sitemap has some controls on which content types to include, relevance, and so on.

mikeytown2’s picture

Status: Needs review » Reviewed & tested by the community

Crawl from sitemap.xml sounds like a good idea... put in a request for that.

What to crawl can sorta be set in a limited way with the boost configuration block. Setting push to FALSE is supposed to do it, but I think it would only work for things that are already in the cache. It was one of those "this might be a good idea" features that never really took off from my point of view (I never use it).

The what-to-crawl feature would be an extension of the boost configuration block, allowing presets for things like node type and view name, as well as a custom way to handle expiration, so nodes of a given type expire after X, etc. See #622820: Expiration Grid - road map for this module.

How large is your url_alias table? I have a test db with 1,284,788 entries in it; just curious whether you're running an even larger site.

talatnat’s picture

Actually, there are around 110,000 entries in the url_alias table, so it is much smaller. The issue is that many of the URL aliases are being maintained to keep old links and paths from breaking. So I am just looking for a more efficient way of crawling and then storing the content.

I'll post a Feature Request re: sitemap.xml and *what to crawl*. Thanks again.

mikeytown2’s picture

Status: Reviewed & tested by the community » Fixed

committed

brianmercer’s picture

I just started using the crawler and ran into this issue.

Patch at #4 works nicely for me also for unpublished nodes, thanks.

On a related note, I have pathauto making aliases for user/[username], but the "access user profiles" permission is off for anonymous users. So I get an "access denied" watchdog entry for each user/[uid], each user/[uid]/track, and each user/[uid]/track/feed.

mikeytown2’s picture

So I need to do a query like:

SELECT count(*) FROM permission WHERE rid = 1 AND perm LIKE '%access user profiles%'

If I get 1 back, then crawl user/*; if 0, do not crawl user/*. Sound right?
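A sketch of that check in SQLite, under the assumption of a minimal permission table (in Drupal 6, rid 1 is the anonymous role and perm is a comma-separated permission string):

```python
import sqlite3

# Sketch of the anonymous-role permission check above. rid = 1 is
# Drupal 6's anonymous role; perm is a comma-separated permission list,
# which is why the query uses LIKE rather than equality.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE permission (rid INTEGER, perm TEXT)")
cur.execute(
    "INSERT INTO permission VALUES "
    "(1, 'access content, access user profiles')"
)
cur.execute(
    "SELECT count(*) FROM permission "
    "WHERE rid = 1 AND perm LIKE '%access user profiles%'"
)
crawl_user_pages = cur.fetchone()[0] == 1
print(crawl_user_pages)  # True: anonymous users may view profiles
```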

brianmercer’s picture

Whatever you think is best. User pages aren't nodes yet so you can't run them through node_access, I assume.

mikeytown2’s picture

Status: Fixed » Active
gooddesignusa’s picture

I also noticed boost hitting my user pages. I do have path auto set up to make aliases for the user pages. I tried adding user/* inside the "Cache every page except the listed pages." textarea but no luck :(

I guess I could just shut off pathauto for user names if there isn't any other way around this atm.

Anonymous’s picture

Can the patch from #4 be used with the latest dev, or would it be wiser to wait until there is a patch or new dev that addresses both the unpublished nodes and the user profiles (as I have the same issue here)?

mikeytown2’s picture

#4 is already in the latest dev.

Anonymous’s picture

Odd; I am using the latest dev and it still crawls unpublished nodes...

mikeytown2’s picture

Status: Active » Needs review
FileSize
1.84 KB

Patch is for user/* pages.
@TfR75: do you have something else controlling node access? Are you using multiple languages?

mikeytown2’s picture

Status: Needs review » Active

Committed the above patch.

mikeytown2’s picture

Status: Active » Postponed (maintainer needs more info)
Anonymous’s picture

I have patched the latest dev with the supplied patch, and it seems neither user pages nor unpublished nodes are crawled any more. However, crawling now produces loads of PHP error messages (5-20 per minute of crawling; my site took almost 40 minutes to crawl), all saying:

Column 'rid' in where clause is ambiguous query: SELECT * FROM permission INNER JOIN role USING (rid) WHERE (name = 'anonymous user' OR rid = 1) AND perm LIKE '%access user profiles%' in /SERVERPATH/sites/all/modules/boost/boost.module on line 5571.

The location for those error messages is, for instance: http://www.exampledomain.com/boost-crawler?nocache=1&key=ff48b7be882ecc0...
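The error arises because both permission and role carry a rid column after the join, so a bare "rid" in the WHERE clause is ambiguous. One way to remove the ambiguity, sketched in SQLite with a hypothetical minimal schema, is to alias the tables and qualify every rid:

```python
import sqlite3

# Both permission and role have a rid column; aliasing the tables and
# qualifying the column resolves the "ambiguous column" error.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE permission (rid INTEGER, perm TEXT)")
cur.execute("CREATE TABLE role (rid INTEGER, name TEXT)")
cur.execute("INSERT INTO permission VALUES (1, 'access user profiles')")
cur.execute("INSERT INTO role VALUES (1, 'anonymous user')")
cur.execute("""
    SELECT count(*)
    FROM permission AS p
    INNER JOIN role AS r ON r.rid = p.rid
    WHERE (r.name = 'anonymous user' OR p.rid = 1)
      AND p.perm LIKE '%access user profiles%'
""")
matches = cur.fetchone()[0]
print(matches)  # 1
```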

Anonymous’s picture

Status: Postponed (maintainer needs more info) » Active

This may not be relevant, but I just wanted to add it. It seems that with the dev version my database is now a lot smaller: several MB have gone from the tables, with boost_cache_settings and boost_crawler being empty. With 6.x-1.18 beforehand I had several MB in the Boost tables. I assume this is the expected behaviour now; I just thought it best to mention, as I only noticed it after the cron run with the patched version.

mikeytown2’s picture

FileSize
982 bytes

Patch to fix the above query... it's been committed.

This is to be expected; I'm limiting what I keep track of in the Boost relationship table, which keeps MySQL happier on large sites.

Anonymous’s picture

Applied the patch, emptied the cache, and ran cron -- which resulted in the following error message:

Column 'language' in field list is ambiguous query: SELECT dst, language FROM url_alias AS ua LEFT JOIN node AS n ON n.nid = CAST(substring(ua.src, 6) AS UNSIGNED) WHERE (n.status = 1 OR n.status IS NULL) AND ua.src NOT LIKE 'user/%' LIMIT 0, 3333 in /www/htdocs/tfr/winerambler/sites/all/modules/boost/boost.module on line 5582.

As a result, the crawler stopped; no pages were cached at all.

mikeytown2’s picture

FileSize
1.28 KB

This has been committed; keep the bug reports coming.

Anonymous’s picture

With the latest nightly build, the crawler gets through my site without any error message at all. So as far as I am concerned this can be closed... Thanks again for the amazing speed of fixing issues!

mikeytown2’s picture

Status: Active » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.