This thread aims to combine these two issues:
#363077: Add spider to crawler - Cache entire site with new install.
#337391: Setting to grab url's from url_alias table.

Ideally the export module would be the way to go; the only problem is that the Batch API doesn't work from cron (see #229905: Batch API assumes client's support of meta refresh). So I need to combine the two issues above; that is the goal of this thread. The first step is to grab some URLs from the database on cron and cache them.


Comments

mikeytown2’s picture

Status: Active » Needs review
FileSize
3.56 KB

This does 5 URLs at a time.
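
A rough sketch of the idea in this patch, not the patch itself (function and query details are illustrative): on cron, pull a handful of aliases out of url_alias and request each one so Boost can cache the response.

  function boost_crawler_cron() {
    global $base_url;
    // Grab 5 aliases per cron run; the real patch tracks its position.
    $result = db_query_range("SELECT dst FROM {url_alias}", 0, 5);
    while ($row = db_fetch_object($result)) {
      // Request the page as an anonymous visitor would; only the HTML
      // document itself is fetched, never the images or scripts in it.
      drupal_http_request($base_url . '/' . $row->dst);
    }
  }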

mikeytown2’s picture

FileSize
6.02 KB

This does 5 at a time and calls itself for unlimited URL crawling.

mikeytown2’s picture

FileSize
10.33 KB

Fine-grained control over HTML, XML & JSON.

mikeytown2’s picture

Status: Needs review » Needs work

In order for this to be smart, I need a boost_crawler table. Once the crawler is done it will truncate the table, signaling to the worker thread(s) that it is done. This way I can grab URLs from boost_cache, url_alias & nodes, and I should be able to make it use multiple threads. I can also then crawl only pages that are not currently cached.

id - serial - PK
extension - varchar 8
url - varchar 255

index on ID & extension.
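
A sketch of that table in Drupal 6 Schema API form (only the new table is shown, and the index name is an assumption):

  function boost_schema() {
    $schema['boost_crawler'] = array(
      'fields' => array(
        'id' => array('type' => 'serial', 'not null' => TRUE),
        'extension' => array('type' => 'varchar', 'length' => 8, 'not null' => TRUE, 'default' => ''),
        'url' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      ),
      'primary key' => array('id'),
      'indexes' => array(
        // The "index on ID & extension" described above.
        'id_extension' => array('id', 'extension'),
      ),
    );
    return $schema;
  }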

capellic’s picture

Exciting to see progress on this!!!!

BTW, just wondering if my issue over at the Google Analytics module is cause for concern when pre-caching. Certainly this issue wouldn't be limited to GA; it would apply to any sort of stat-tracking module.

#538626: Exclude hits when a specific arg string is present

mikeytown2’s picture

@capellic
The crawler loads the HTML; it doesn't load the elements inside, like JavaScript or images, so you don't have to worry about 3rd-party stats. It will count in Drupal's stats though, since those count when the HTML is loaded.

mikeytown2’s picture

Just tested the above patch on a live server and it works as advertised, which means the core of the crawler code works in Drupal core. That's good: it's like the Batch API, but it doesn't need a browser to call itself. So my two-step PHP hack works; now I need to make the crawler smarter.
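
One common way to let a PHP process trigger itself without a browser is a fire-and-forget socket request back at the site; a sketch, assuming a hypothetical boost-crawler callback path:

  function boost_crawler_trigger_self() {
    global $base_url;
    $parts = parse_url($base_url . '/boost-crawler');
    $port = isset($parts['port']) ? $parts['port'] : 80;
    $fp = fsockopen($parts['host'], $port, $errno, $errstr, 5);
    if ($fp) {
      fwrite($fp, "GET {$parts['path']} HTTP/1.0\r\nHost: {$parts['host']}\r\nConnection: Close\r\n\r\n");
      // Close without reading the response; the new request crawls the
      // next batch while this one exits, so no meta refresh is needed.
      fclose($fp);
    }
  }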

Roadmap:

  • Copy the URLs from the boost_cache table into the boost_crawler table. Pass a URL variable to keep track of the progress for large sites; transfer 10,000 URLs at a time to the boost_crawler table via LIMIT, and set ORDER BY filename in the SQL (see the sketch after this list).
  • Allow for a stop signal to be given
  • - Ship RC1 -

  • Support multiple threads. This will be built with this in mind, but won't be enabled until it's been tested.
  • Use the url_alias table to populate the boost_cache; that will assume all content in there is an HTML doc, unless it ends in .xml or /feed.
  • Enable showing stats on the crawler
  • Allow admin to better fine-tune some parameters of the crawler
  • Have a real crawler, like what I have in #363077: Add spider to crawler - Cache entire site with new install.; auto-discovers content that is not in the boost_cache or url_alias table but is in an <a href=""></a> HTML tag.
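
A minimal sketch of the first roadmap item, assuming boost_cache has filename, url and extension columns (the column and variable names here are assumptions, not the shipped code): page through the table 10,000 rows at a time, ordered by filename, and remember the offset between cron runs.

  function boost_crawler_seed_from_cache() {
    $position = variable_get('boost_crawler_position', 0);
    $result = db_query_range("SELECT url, extension FROM {boost_cache} ORDER BY filename", $position, 10000);
    $count = 0;
    while ($row = db_fetch_object($result)) {
      db_query("INSERT INTO {boost_crawler} (extension, url) VALUES ('%s', '%s')", $row->extension, $row->url);
      $count++;
    }
    // Remember how far we got so the next run picks up where this one stopped.
    variable_set('boost_crawler_position', $position + $count);
  }
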
mikeytown2’s picture

Wow, that's some slow code... #3 took over twice as long to crawl a site as the crawler does. Some of it has to do with the sleep(1), but that's pretty awful. Doing a bootstrap to crawl every 10 pages is another source. I knew this would be a little slower, but I wasn't expecting this. For this to work well, there's a lot of work to do. I'll try refactoring the code (I made sure I wouldn't get a timeout), but if I can't get its performance to improve, I'll be shipping RC1 without the cron crawler.

mikeytown2’s picture

Status: Needs work » Needs review
FileSize
15.45 KB

Still need to add in a stop-crawler button & code. But this should be faster, and it defaults to using 2 threads. Loads 25 URLs per run.

mikeytown2’s picture

Status: Needs review » Active

Committed the above code with a stop button.

mikeytown2’s picture

Status: Active » Fixed
capellic’s picture

You're on fire! Just a question about item #1 in the roadmap.

Copy the URL's from the boost_cache table into the boost_crawler table.

Will this require me to use HTML file caching? Right now I am using my Pre-Cache module only to trigger Drupal core's database cache and do not have Boost HTML file caching turned on. Why not? Because the pages that aren't pre-cached aren't as popular so 1) they don't need to be cached and 2) I don't want the user to have the additional penalty of having to wait for the HTML file to be written to disk.

I realize that part of this is moot because the crawler removes concern #2, but I am still a bit concerned about having the crawler hit all nodes due to aforementioned performance issues - actually -- it's more of a preference to see the two be able to work independently from one another for flexibility's sake.

mikeytown2’s picture

Status: Fixed » Needs work
FileSize
1.54 KB

Right now the crawler is tied to Boost quite heavily, so it will only hit URLs that it can grab from the boost_cache table. If you enable Boost, hit the URLs you want, then disable the HTML file cache, the crawler will still grab those URLs in the future. Right now Boost could be split off into about 3 different modules, but it's much easier to develop for it when it's all in one codebase. Your request gives me another table to get URLs from, which is cache_page:
http://api.drupal.org/api/function/page_set_cache
It's a temp table, so this wouldn't be that useful.

Here's the first bug... the crawler finished having only done half the URLs; here's an attempted fix. It's amazing how this doesn't work the same on all systems... gotta find code that does.

mikeytown2’s picture

Status: Needs work » Needs review
FileSize
3.27 KB

Figured out what's up. Each thread needs to wait a random amount of time, otherwise things get messy.
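
A sketch of the fix described here (the function name is illustrative): each thread sleeps a random fraction of a second before claiming work, so threads started on the same cron run stop colliding over the same rows.

  function boost_crawler_stagger() {
    // Wait between 0.1 and 1 second, different for every thread.
    usleep(mt_rand(100000, 1000000));
  }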

mikeytown2’s picture

Status: Needs review » Fixed

Committed this.

giorgio79’s picture

Hey Mikey,

Nice stuff.

Would you know the crawl rate? Something like how many pages per second are crawled?
Could I request a feature to set the time between crawl requests, like 4-5 secs between each?

Cheers,
G

mikeytown2’s picture

Status: Fixed » Active

Being able to get the crawl rate would be semi-possible. It would be the average across all threads since the crawler started, and it would be quite inaccurate. Increased accuracy means a much slower crawler, since each thread would have to write to the database after each page was crawled. Having a "throttle" would also be possible using the usleep() function inside the crawler; but why wait 5 seconds between each request? Being able to set the number of threads is also needed.
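
A sketch of both ideas, assuming illustrative variable names: a usleep()-based throttle between requests, and a cheap average rate computed from totals instead of per-page database writes.

  function boost_crawler_fetch($url) {
    $response = drupal_http_request($url);
    // 0 = no throttle; e.g. 5000000 would wait 5 seconds per request.
    usleep(variable_get('boost_crawler_throttle', 0));
    return $response;
  }

  function boost_crawler_rate() {
    // Pages per second, averaged over all threads since the crawl began.
    $pages = variable_get('boost_crawler_pages_done', 0);
    $elapsed = time() - variable_get('boost_crawler_started', time());
    return $elapsed > 0 ? $pages / $elapsed : 0;
  }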

giorgio79’s picture

The 5 sec was just random; being able to set this, that is, the seconds to wait between requests, would be nice.

Just like the number of threads running.

I am happy with 1 thread with 2-5 secs between requests. :)

I have 10,000 nodes, and I am afraid of hammering the server if I turn the feature on as it is at the moment :)

Also not sure about the PHP timeout for this many nodes.
Is this done when cron runs? Or when the submit button is pressed to clear cache?

mikeytown2’s picture

The PHP/CPU timeout is taken care of; it shouldn't ever happen. The crawler starts right after cron runs.

g10tto’s picture

I have a site requiring user authentication for 95% of its content. Since Boost only caches pages viewed by anonymous users, is it possible to still use this crawler (using default Drupal and Views caching) with my site?

mikeytown2’s picture

@g10tto
The crawler hits pages as an anonymous user, so if that page returns a 403, it doesn't really help anyone. Drupal's cache (what you see on the performance page) is for anonymous users as well. At the bottom of the Boost project page are links to other performance/caching methods.

g10tto’s picture

@mikey Thanks for the tip.

Is there another known crawler program for Drupal, or a way to implement a 3rd-party crawler? I'm just curious, because it seems like the crawler functionality would especially come in handy on larger news-related sites that have an archive of nodes, and I would be surprised if there were no other implementations until now.

In addition, I have wondered for some time now why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

mikeytown2’s picture

#363077: Add spider to crawler - Cache entire site with new install.
Here's a crawler I made that's independent of Drupal. It's crawled a site with over 1,000,000 URLs; it works, but it's not easy to set up.

...why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

I'm assuming you're talking about it only caching what anonymous users see. Anonymous is easy; there's only 1 version of each page, and even with that, Boost is still a very complicated module.

mikeytown2’s picture

Title: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache » Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats
mikeytown2’s picture

Status: Active » Fixed
capellic’s picture

Nice! Since I've built my own Pre-Caching module to take care of the 5 to 10 top-level pages on my site, I don't have a use for this at this time. I have a couple of bigger projects coming up and I'll certainly be giving this feature a spin!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.