What is the best way to keep Boost from trying to cache unpublished nodes? My site's Recent Log Entries are filled with Boost-related Access Denied messages -- so while Boost is not actually caching the nodes, it is swamping legitimate Access Denied messages in the logs.
The unpublished nodes are Feed items, so there are several hundred pending review at any moment. Unfortunately, I cannot seem to use either the logging options or the Boost options to tune out the noise.
Thanks for any suggestions.
Comment | File | Size | Author |
---|---|---|---|
#25 | boost-787464.patch | 1.28 KB | mikeytown2 |
#23 | boost-787464.patch | 982 bytes | mikeytown2 |
#18 | boost-787464.patch | 1.84 KB | mikeytown2 |
#4 | boost-787464.patch | 1.16 KB | mikeytown2 |
#2 | boost-787464.patch | 1.17 KB | mikeytown2 |
Comments
Comment #1
mikeytown2 CreditAttribution: mikeytown2 commented
Long story short: if the node has a URL alias, Boost will try to crawl it if you have "Crawl All URL's in the url_alias table" selected. It looks like I should do some sort of join between the url_alias table and the node table to check whether the node is published. I would want this to be yet another option, since I've used the crawler in the past to hit millions of nodes, and the extra overhead of the join would slow down an operation like that quite a bit.
Query needed
In case you're wondering, Boost will automatically remove entries from the cache table if it detects a 403. This occurs when a previously published node goes unpublished.
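The join described above can be sketched as follows. This is a minimal SQLite illustration of the idea only; the real module targets MySQL, and the table layout and data here are simplified assumptions.

```python
import sqlite3

# Simplified stand-ins for Drupal's url_alias and node tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url_alias (src TEXT, dst TEXT);
CREATE TABLE node (nid INTEGER, status INTEGER);
INSERT INTO url_alias VALUES ('node/1', 'published-page'), ('node/2', 'unpublished-page');
INSERT INTO node VALUES (1, 1), (2, 0);
""")

# Only crawl aliases whose node is published; the IS NULL branch keeps
# aliases that are not node paths at all.
rows = conn.execute("""
SELECT ua.dst
FROM url_alias AS ua
LEFT JOIN node AS n ON 'node/' || n.nid = ua.src
WHERE n.status = 1 OR n.status IS NULL
""").fetchall()
print(rows)  # [('published-page',)]
```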
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
Let me know if this does it for you.
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
Using SUBSTRING might make it run faster... yes it does. This runs at an acceptable speed: 1.2M rows in less than a second. The previous query timed out after 300 seconds.
I can do this without the option :)
Comment #4
mikeytown2 CreditAttribution: mikeytown2 commented
In case you're wondering, the original query was doing a join on
node/555
The new query does a join on
555
This, plus casting the url_alias value as an integer, made all the difference.
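The revised join can be sketched like this: instead of comparing the full 'node/555' string, the numeric nid is extracted and cast, so the database can drive the join through the integer key. SQLite is used here for illustration (substr and CAST AS INTEGER in place of MySQL's SUBSTRING and CAST AS UNSIGNED); the schema is a simplified assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url_alias (src TEXT, dst TEXT);
CREATE TABLE node (nid INTEGER PRIMARY KEY, status INTEGER);
INSERT INTO url_alias VALUES ('node/555', 'some-alias'), ('node/556', 'hidden-alias');
INSERT INTO node VALUES (555, 1), (556, 0);
""")

# substr(ua.src, 6) strips the leading 'node/' so the join is on the
# integer nid rather than on a path string.
rows = conn.execute("""
SELECT ua.dst
FROM url_alias AS ua
LEFT JOIN node AS n ON n.nid = CAST(substr(ua.src, 6) AS INTEGER)
WHERE n.status = 1 OR n.status IS NULL
""").fetchall()
print(rows)  # [('some-alias',)]
```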
Comment #5
talatnat CreditAttribution: talatnat commented
Thank you, that was quick! Yes, I did have Boost set to crawl the URL_alias table. The patch applied successfully, and I am testing on a local setup. I will let you know how things work out.
Comment #6
talatnat CreditAttribution: talatnat commented
Thank you, it seems to work for me -- I don't see Access Denied notices for unpublished nodes.
This is related more to my specific case, where the URL_alias table is overly large. I think it would be a nice feature to have some fine-grained control over *what to crawl*, similar to the current "Statically cache specific pages" feature: a) Cache every page except the listed pages; b) Cache only the listed pages; c) Cache pages for which the following PHP code returns TRUE (PHP-mode, experts only). For example, I have URL aliases for taxonomy/term/*, but they are not linked to specific views, so they yield Page Not Found errors. I presumed that, like node/* with aliases, they would not be crawled, but they are.
Also, at the moment, I gather the rewrite rules are hardcoded to avoid crawling admin, user, etc. pages?
If I am not doing something wrong and the above is the current expected behaviour, I can open a Feature Request for *what to crawl*.
Addition: Could an alternative to the URL_alias table be sitemap.xml from the xmlsitemap module? The latter has some controls on the types of content to include in the sitemap, relevance, and so on.
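As a rough illustration of the sitemap.xml idea, a crawler could pull its URL list from a Sitemap document rather than from the url_alias table. This hypothetical sketch just parses the standard Sitemap XML format; the sample document and URLs are invented.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# A tiny example sitemap document (invented URLs).
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/published-page</loc><priority>0.8</priority></url>
  <url><loc>http://example.com/another-page</loc><priority>0.5</priority></url>
</urlset>"""

root = ET.fromstring(sitemap)
# Collect every <loc> entry; these would become the crawler's work list.
urls = [u.findtext(SITEMAP_NS + "loc") for u in root.iter(SITEMAP_NS + "url")]
print(urls)  # ['http://example.com/published-page', 'http://example.com/another-page']
```

Because sitemap generation already filters by content type and excludes unpublished content, this would sidestep the published-node join entirely.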
Comment #7
mikeytown2 CreditAttribution: mikeytown2 commented
Crawl from sitemap.xml sounds like a good idea... put in a request for that.
What to crawl can sorta be set in a limited way with the boost configuration block. Setting push to FALSE is supposed to do it, but I think it would only work for things that are already in the cache. It was one of those "this might be a good idea" features that never really took off from my point of view (I never use it).
The what-to-crawl feature would be an extension of the boost configuration block, allowing presets per node type and view name, as well as a custom way to handle expiration, so nodes of a given type will expire after X, etc. #622820: Expiration Grid - road map for this module
How large is your URL_alias table? I have a test db that has 1,284,788 entries in it; just curious if you're running an even larger site.
Comment #8
talatnat CreditAttribution: talatnat commented
Actually, there are around 110,000 entries in the URL_alias table, so it is much smaller. The issue is that many of the URL aliases are being maintained to keep old links and paths from breaking. So I am just looking for a more efficient way of crawling and then storing the content.
I'll post a Feature Request re: sitemap.xml and *what to crawl*. Thanks again.
Comment #9
mikeytown2 CreditAttribution: mikeytown2 commented
Committed.
Comment #10
brianmercer CreditAttribution: brianmercer commented
I just started using the crawler and ran into this issue.
Patch at #4 works nicely for me also for unpublished nodes, thanks.
On a related note, I have pathauto making aliases for user/[username], but the "access user profiles" permission is off for anonymous users. So I get an "access denied" watchdog entry for each user/[uid], each user/[uid]/track, and each user/[uid]/track/feed.
Comment #11
mikeytown2 CreditAttribution: mikeytown2 commented
So I need to do a query that checks whether the anonymous role has the "access user profiles" permission. If I get 1 back, then crawl user/*; if 0, then do not crawl user/*. Sound right?
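That permission check could look something like the following. This is a sketch against simplified stand-ins for the Drupal 6 role and permission tables, run in SQLite for illustration; the exact schema and committed query may differ.

```python
import sqlite3

# Simplified Drupal 6 role/permission tables; rid 1 is the anonymous role.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE role (rid INTEGER, name TEXT);
CREATE TABLE permission (rid INTEGER, perm TEXT);
INSERT INTO role VALUES (1, 'anonymous user'), (2, 'authenticated user');
INSERT INTO permission VALUES (1, 'access content'),
                              (2, 'access content, access user profiles');
""")

# 1 means anonymous users can view profiles, so crawl user/* pages;
# 0 means skip them. Columns are qualified to avoid ambiguity, since
# 'rid' exists in both tables.
(count,) = conn.execute("""
SELECT COUNT(*)
FROM permission AS p
INNER JOIN role AS r ON r.rid = p.rid
WHERE r.rid = 1 AND p.perm LIKE '%access user profiles%'
""").fetchone()
print(count)  # 0 -> anonymous users cannot view profiles; skip user/*
```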
Comment #12
brianmercer CreditAttribution: brianmercer commented
Whatever you think is best. User pages aren't nodes, so you can't run them through node_access, I assume.
Comment #13
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #14
gooddesignusa CreditAttribution: gooddesignusa commented
I also noticed Boost hitting my user pages. I do have pathauto set up to make aliases for the user pages. I tried adding user/* inside the "Cache every page except the listed pages." textarea, but no luck :(
I guess I could just shut off pathauto for user names if there isn't any other way around this atm.
Comment #15
Anonymous (not verified) CreditAttribution: Anonymous commented
Can the patch from #4 be used with the latest dev, or would it be wiser to wait until there is a patch or new dev that addresses both the unpublished nodes and the user profiles (as I have the same issue here)?
Comment #16
mikeytown2 CreditAttribution: mikeytown2 commented
#4 is already in the latest dev.
Comment #17
Anonymous (not verified) CreditAttribution: Anonymous commented
Odd; I am using the latest dev and it still crawls unpublished nodes...
Comment #18
mikeytown2 CreditAttribution: mikeytown2 commented
The patch is for user/* pages.
@TfR75 do you have something else controlling node access? Are you using multiple languages?
Comment #19
mikeytown2 CreditAttribution: mikeytown2 commented
Committed the above patch.
Comment #20
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #21
Anonymous (not verified) CreditAttribution: Anonymous commented
I have patched the latest dev with the supplied patch, and it seems neither user pages nor unpublished nodes are crawled any more. However, crawling now produces loads of PHP error messages (5-20 per minute of crawling; my site took almost 40 minutes to crawl), all saying:
Column 'rid' in where clause is ambiguous query: SELECT * FROM permission INNER JOIN role USING (rid) WHERE (name = 'anonymous user' OR rid = 1) AND perm LIKE '%access user profiles%' in /SERVERPATH/sites/all/modules/boost/boost.module on line 5571.
Location for those error message is, for instance: http://www.exampledomain.com/boost-crawler?nocache=1&key=ff48b7be882ecc0...
Comment #22
Anonymous (not verified) CreditAttribution: Anonymous commented
This may not be relevant, but I just wanted to add it. It seems that with the dev version my database is now a lot smaller, as several MB have gone from the tables, with boost_cache_settings and boost_crawler being empty. With 6.x-1.18 beforehand, I had several MB in the Boost tables. I assume this is the expected behaviour now; I just thought it best to mention it, as I only noticed after the cron run with the patched version.
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
Patch to fix the above query... it's been committed.
This is to be expected; I'm limiting what I keep track of in the Boost relationship table, which keeps MySQL happier on large sites.
Comment #24
Anonymous (not verified) CreditAttribution: Anonymous commented
Applied the patch, emptied the cache, and ran cron, which resulted in the following error message:
Column 'language' in field list is ambiguous query: SELECT dst, language FROM url_alias AS ua LEFT JOIN node AS n ON n.nid = CAST(substring(ua.src, 6) AS UNSIGNED) WHERE (n.status = 1 OR n.status IS NULL) AND ua.src NOT LIKE 'user/%' LIMIT 0, 3333 in /www/htdocs/tfr/winerambler/sites/all/modules/boost/boost.module on line 5582.
As a result the crawler stopped, no pages were cached at all.
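For illustration only (this is a sketch of the fix pattern, not necessarily the committed change): both url_alias and node carry a language column, so qualifying the selected column as ua.language removes the ambiguity reported above. A SQLite version of the qualified query, with simplified tables:

```python
import sqlite3

# Both tables define a 'language' column, which is what made the
# unqualified SELECT ambiguous.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url_alias (src TEXT, dst TEXT, language TEXT);
CREATE TABLE node (nid INTEGER, status INTEGER, language TEXT);
INSERT INTO url_alias VALUES ('node/1', 'a-page', 'en');
INSERT INTO node VALUES (1, 1, 'en');
""")

# Same join as the error message, but with ua.language qualified.
rows = conn.execute("""
SELECT ua.dst, ua.language
FROM url_alias AS ua
LEFT JOIN node AS n ON n.nid = CAST(substr(ua.src, 6) AS INTEGER)
WHERE (n.status = 1 OR n.status IS NULL)
  AND ua.src NOT LIKE 'user/%'
LIMIT 3333
""").fetchall()
print(rows)  # [('a-page', 'en')]
```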
Comment #25
mikeytown2 CreditAttribution: mikeytown2 commented
This has been committed; keep the bug reports coming.
Comment #26
Anonymous (not verified) CreditAttribution: Anonymous commented
With the latest nightly build, the crawler gets through my site without any error messages at all. So as far as I am concerned, this can be closed... Thanks again for the amazing speed of fixing issues!
Comment #27
mikeytown2 CreditAttribution: mikeytown2 commented