What is the best way to keep Boost from trying to cache unpublished nodes? My site's Recent Log Entries are filled with Boost-related Access Denied messages -- so while Boost is not actually caching the nodes, it is swamping legitimate Access Denied messages in the logs.
The unpublished nodes are Feed items and so there are several hundred pending review at any moment. Unfortunately, I cannot seem to use either the logging or the Boost options to tune out the noise.
Thanks for any suggestions.


Comments

mikeytown2’s picture

Long story short: if the node has a URL alias, Boost will try to crawl it when you have "Crawl All URL's in the url_alias table" selected. It looks like I should do some sort of join between the url_alias table and the node table to check whether the node is published. I would want this to be yet another option, since I've used the crawler in the past to hit millions of nodes, and the extra overhead of the join would slow down an operation like that quite a bit.

Query needed

SELECT *
FROM url_alias
LEFT JOIN node ON Concat( 'node/', CAST( node.nid AS CHAR ) ) = url_alias.src
WHERE node.status = 1
OR node.status IS NULL

In case you're wondering, Boost will auto-remove entries from the cache table if it detects a 403. This occurs when a node that was published goes unpublished.
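The intent of the join above can be sketched with a small SQLite example. This is a hypothetical minimal schema (Drupal 6's real url_alias and node tables have more columns), just to show how the LEFT JOIN keeps published nodes and non-node aliases while dropping unpublished nodes:

```python
import sqlite3

# Minimal sketch of the published-node filter: the LEFT JOIN leaves
# node.status NULL for aliases that are not node paths, so those are
# kept along with published (status = 1) nodes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE url_alias (src TEXT, dst TEXT)")
cur.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, status INTEGER)")
cur.executemany("INSERT INTO url_alias VALUES (?, ?)", [
    ("node/1", "published-page"),
    ("node/2", "unpublished-page"),
    ("taxonomy/term/5", "some-term"),
])
cur.executemany("INSERT INTO node VALUES (?, ?)", [(1, 1), (2, 0)])
cur.execute("""
    SELECT ua.dst
    FROM url_alias AS ua
    LEFT JOIN node AS n ON 'node/' || CAST(n.nid AS TEXT) = ua.src
    WHERE n.status = 1 OR n.status IS NULL
    ORDER BY ua.dst
""")
crawlable = [row[0] for row in cur.fetchall()]
print(crawlable)  # ['published-page', 'some-term']
```

The unpublished node/2 alias is the only row filtered out; everything without a matching node row survives via the IS NULL branch.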

mikeytown2’s picture

Status: Active » Needs review
FileSize
1.17 KB

Let me know if this does it for you

mikeytown2’s picture

Using SUBSTRING might make it run faster... yes, it does. This runs at an acceptable speed: 1.2M rows in less than a second, where the previous query timed out after 300 seconds.

SELECT *
FROM url_alias
LEFT JOIN node ON node.nid = CAST( substring(url_alias.src, 6) AS UNSIGNED )
WHERE node.status = 1
OR node.status IS NULL

I can do this without the option :)
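For reference, a quick sketch of what the CAST(substring(src, 6) AS UNSIGNED) expression extracts from a path like node/555. The helper name here is made up for illustration:

```python
# SQL SUBSTRING is 1-indexed, so position 6 drops the five characters
# of the "node/" prefix; the cast then yields a plain integer that can
# hit the node table's primary key index (the eq_ref join below).
def alias_src_to_nid(src: str) -> int:
    # Python slices are 0-indexed, so [5:] matches SQL's substring(src, 6).
    return int(src[5:])

print(alias_src_to_nid("node/555"))  # 555
```

For non-node paths, MySQL's CAST of a non-numeric string simply yields 0, so the LEFT JOIN finds no matching nid and the n.status IS NULL branch keeps the alias; the Python int() here would raise instead, so this only illustrates the node-path case.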

mikeytown2’s picture

FileSize
1.16 KB

In case you're wondering, the original query was doing a join on the string 'node/555':

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE url_alias ALL NULL NULL NULL NULL 1284788  
1 SIMPLE node ALL NULL NULL NULL NULL 644862 Using where

The new query does a join on the integer 555:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE url_alias ALL NULL NULL NULL NULL 1284788  
1 SIMPLE node eq_ref PRIMARY PRIMARY 4 func 1 Using where

This, plus casting the url_alias src as an integer, made all the difference.

talatnat’s picture

Thank you, that was quick! Yes, I did have Boost set to crawl the url_alias table. The patch applied successfully, and I am testing on a local setup. I will let you know how things work out.

talatnat’s picture

Thank you, it seems to work for me -- I don't see Access Denied notices for unpublished nodes.

This relates more to my specific case, where the url_alias table is overly large. I think it would be a nice feature to have some fine-grained control over *what to crawl*, similar to the current "Statically cache specific pages" feature: a) cache every page except the listed pages; b) cache only the listed pages; c) cache pages for which the following PHP code returns TRUE (PHP-mode, experts only). For example, I have URL aliases for taxonomy/term/* but they are not linked to specific views, so they yield Page Not Found errors. I presumed that, like node/* with aliases, they would not be crawled, but they are.

Also, at the moment, I gather the Rewrite Rules are hardcoded to avoid crawling admin, user, etc. pages?

If I am not doing something wrong and the above is the currently expected behaviour, I can open a Feature Request for *what to crawl*.

Addition: could sitemap.xml from the xmlsitemap module be an alternative to the url_alias table? The sitemap has some controls on which content types to include, relevance, and so on.

mikeytown2’s picture

Status: Needs review » Reviewed & tested by the community

Crawl from sitemap.xml sounds like a good idea... put in a request for that.

What to crawl can sorta be set in a limited way with the boost configuration block. Setting push to FALSE is supposed to do it, but I think it would only work for things that are already in the cache. It was one of those "this might be a good idea" features that never really took off from my point of view (I never use it).

The what-to-crawl feature would be an extension of the boost configuration block, allowing presets for things like node type and view name, as well as a custom way to handle expiration, so nodes of a given type expire after X, etc. See #622820: Expiration Grid - road map for this module.

How large is your url_alias table? I have a test db with 1,284,788 entries in it; just curious whether you're running an even larger site.

talatnat’s picture

Actually, there are around 110,000 entries in the url_alias table, so it is much smaller. The issue is that many of the URL aliases are being maintained to keep old links and paths from breaking. So I am just looking for a more efficient way of crawling and then storing the content.

I'll post a Feature Request re: sitemap.xml and *what to crawl*. Thanks again.

mikeytown2’s picture

Status: Reviewed & tested by the community » Fixed

committed

brianmercer’s picture

I just started using the crawler and ran into this issue.

Patch at #4 works nicely for me also for unpublished nodes, thanks.

On a related note, I have pathauto making aliases for user/[username], but the "access user profiles" permission is off for anonymous users. So I get an "access denied" watchdog entry for each user/[uid], each user/[uid]/track, and each user/[uid]/track/feed.

mikeytown2’s picture

So I need to do a query like:

SELECT count(*) FROM permission WHERE rid = 1 AND perm LIKE '%access user profiles%'

If I get 1 back, then crawl user/*; if 0, do not crawl user/*. Sound right?
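A sketch of that check in SQLite, under the assumption of a minimal permission table (in Drupal 6, rid 1 is the anonymous role and perm is a comma-separated permission string):

```python
import sqlite3

# Sketch of the anonymous-role permission check above. rid = 1 is
# Drupal 6's anonymous role; perm is a comma-separated permission list,
# which is why the query uses LIKE rather than equality.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE permission (rid INTEGER, perm TEXT)")
cur.execute(
    "INSERT INTO permission VALUES "
    "(1, 'access content, access user profiles')"
)
cur.execute(
    "SELECT count(*) FROM permission "
    "WHERE rid = 1 AND perm LIKE '%access user profiles%'"
)
crawl_user_pages = cur.fetchone()[0] == 1
print(crawl_user_pages)  # True: anonymous users may view profiles
```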

brianmercer’s picture

Whatever you think is best. User pages aren't nodes yet so you can't run them through node_access, I assume.

mikeytown2’s picture

Status: Fixed » Active
gooddesignusa’s picture

I also noticed boost hitting my user pages. I do have path auto set up to make aliases for the user pages. I tried adding user/* inside the "Cache every page except the listed pages." textarea but no luck :(

I guess I could just shut off pathauto for user names if there isn't any other way around this atm.

Anonymous’s picture

Can the patch from #4 be used with the latest dev, or would it be wiser to wait until there is a patch or new dev that addresses both the unpublished nodes and the user profiles (as I have the same issue here)?

mikeytown2’s picture

#4 is already in the latest dev.

Anonymous’s picture

Odd; I am using the latest dev and it still crawls unpublished nodes...

mikeytown2’s picture

Status: Active » Needs review
FileSize
1.84 KB

Patch is for user/* pages.
@TfR75: do you have something else controlling node access? Are you using multiple languages?

mikeytown2’s picture

Status: Needs review » Active

Committed the above patch.

mikeytown2’s picture

Status: Active » Postponed (maintainer needs more info)
Anonymous’s picture

I have patched the latest dev with the supplied patch, and it seems neither user pages nor unpublished nodes are crawled any more. However, crawling now produces loads of PHP error messages (5-20 per minute of crawling; my site took almost 40 minutes to crawl), all saying:

Column 'rid' in where clause is ambiguous query: SELECT * FROM permission INNER JOIN role USING (rid) WHERE (name = 'anonymous user' OR rid = 1) AND perm LIKE '%access user profiles%' in /SERVERPATH/sites/all/modules/boost/boost.module on line 5571.

The location for those error messages is, for instance: http://www.exampledomain.com/boost-crawler?nocache=1&key=ff48b7be882ecc0...
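The error arises because both permission and role carry a rid column after the join, so a bare "rid" in the WHERE clause is ambiguous. One way to remove the ambiguity, sketched in SQLite with a hypothetical minimal schema, is to alias the tables and qualify every rid:

```python
import sqlite3

# Both permission and role have a rid column; aliasing the tables and
# qualifying the column resolves the "ambiguous column" error.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE permission (rid INTEGER, perm TEXT)")
cur.execute("CREATE TABLE role (rid INTEGER, name TEXT)")
cur.execute("INSERT INTO permission VALUES (1, 'access user profiles')")
cur.execute("INSERT INTO role VALUES (1, 'anonymous user')")
cur.execute("""
    SELECT count(*)
    FROM permission AS p
    INNER JOIN role AS r ON r.rid = p.rid
    WHERE (r.name = 'anonymous user' OR p.rid = 1)
      AND p.perm LIKE '%access user profiles%'
""")
matches = cur.fetchone()[0]
print(matches)  # 1
```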

Anonymous’s picture

Status: Postponed (maintainer needs more info) » Active

This may not be relevant, but I just wanted to add it. It seems that with the dev version my database is now a lot smaller: several MB have gone from the tables, with boost_cache_settings and boost_crawler being empty. With 6.x-1.18 beforehand I had several MB in the Boost tables. I assume this is the expected behaviour now; I just thought it best to mention, as I only noticed it after the cron run with the patched version.

mikeytown2’s picture

FileSize
982 bytes

Patch to fix the above query... it's been committed.

This is to be expected; I'm limiting what I keep track of in the Boost relationship table, which keeps MySQL happier on large sites.

Anonymous’s picture

Applied the patch, emptied the cache, and ran cron -- which resulted in the following error message:

Column 'language' in field list is ambiguous query: SELECT dst, language FROM url_alias AS ua LEFT JOIN node AS n ON n.nid = CAST(substring(ua.src, 6) AS UNSIGNED) WHERE (n.status = 1 OR n.status IS NULL) AND ua.src NOT LIKE 'user/%' LIMIT 0, 3333 in /www/htdocs/tfr/winerambler/sites/all/modules/boost/boost.module on line 5582.

As a result, the crawler stopped; no pages were cached at all.

mikeytown2’s picture

FileSize
1.28 KB

This has been committed; keep the bug reports coming.

Anonymous’s picture

With the latest nightly build, the crawler gets through my site without any error message at all. So as far as I am concerned this can be closed... Thanks again for the amazing speed of fixing issues!

mikeytown2’s picture

Status: Active » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.