This thread aims to combine these two issues:
#363077: Add spider to crawler - Cache entire site with new install.
#337391: Setting to grab url's from url_alias table.

Ideally the export module would be the way to go; the only problem is that the Batch API doesn't work from cron (see #229905: Batch API assumes client's support of meta refresh). So I need to combine the two issues above; that is the goal of this thread. The first step is to grab some URLs from the database on cron and cache them.


Comments

mikeytown2’s picture

Status: Active » Needs review
FileSize
3.56 KB

This does 5 URLs at a time.
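
A rough sketch of the idea in this patch, not the patch itself (function and query details are illustrative): on cron, pull a handful of aliases out of url_alias and request each one so Boost can cache the response.

  function boost_crawler_cron() {
    global $base_url;
    // Grab 5 aliases per cron run; the real patch tracks its position.
    $result = db_query_range("SELECT dst FROM {url_alias}", 0, 5);
    while ($row = db_fetch_object($result)) {
      // Request the page as an anonymous visitor would; only the HTML
      // document itself is fetched, never the images or scripts in it.
      drupal_http_request($base_url . '/' . $row->dst);
    }
  }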

mikeytown2’s picture

FileSize
6.02 KB

This does 5 at a time and calls itself for unlimited URL crawling.

mikeytown2’s picture

FileSize
10.33 KB

Fine-grained control over HTML, XML & JSON.

mikeytown2’s picture

Status: Needs review » Needs work

In order for this to be smart, I need a boost_crawler table. Once the crawler is done it will truncate the table, signaling to the worker thread(s) that it is done. This way I can grab URLs from boost_cache, url_alias & nodes, and I should be able to make it use multiple threads. I can also then crawl only pages that are not currently cached.

id - serial - PK
extension - varchar 8
url - varchar 255

index on ID & extension.
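
A sketch of that table in Drupal 6 Schema API form (only the new table is shown, and the index name is an assumption):

  function boost_schema() {
    $schema['boost_crawler'] = array(
      'fields' => array(
        'id' => array('type' => 'serial', 'not null' => TRUE),
        'extension' => array('type' => 'varchar', 'length' => 8, 'not null' => TRUE, 'default' => ''),
        'url' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      ),
      'primary key' => array('id'),
      'indexes' => array(
        // The "index on ID & extension" described above.
        'id_extension' => array('id', 'extension'),
      ),
    );
    return $schema;
  }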

capellic’s picture

Exciting to see progress on this!!!!

BTW, just wondering if my issue over at the Google Analytics module is cause for concern when pre-caching. Certainly this issue wouldn't be limited to GA; it would apply to any sort of stat-tracking module.

#538626: Exclude hits when a specific arg string is present

mikeytown2’s picture

@capellic
The crawler loads the HTML; it doesn't load the elements inside, like JavaScript or images, so you don't have to worry about 3rd-party stats. It will count in Drupal's stats though, since those count when the HTML is loaded.

mikeytown2’s picture

Just tested the above patch on a live server and it works as advertised, which means the core of the crawler code works in Drupal core. That's good: it's like the Batch API, but it doesn't need a browser to call itself. So my two-step PHP hack works; now I need to make the crawler smarter.
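
One common way to let a PHP process trigger itself without a browser is a fire-and-forget socket request back at the site; a sketch, assuming a hypothetical boost-crawler callback path:

  function boost_crawler_trigger_self() {
    global $base_url;
    $parts = parse_url($base_url . '/boost-crawler');
    $port = isset($parts['port']) ? $parts['port'] : 80;
    $fp = fsockopen($parts['host'], $port, $errno, $errstr, 5);
    if ($fp) {
      fwrite($fp, "GET {$parts['path']} HTTP/1.0\r\nHost: {$parts['host']}\r\nConnection: Close\r\n\r\n");
      // Close without reading the response; the new request crawls the
      // next batch while this one exits, so no meta refresh is needed.
      fclose($fp);
    }
  }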

Roadmap:

  • Copy the URLs from the boost_cache table into the boost_crawler table. Pass a URL variable to keep track of the progress for large sites; transfer 10,000 URLs at a time to the boost_crawler table via LIMIT, and set ORDER BY filename in the SQL (see the sketch after this list).
  • Allow for a stop signal to be given
  • - Ship RC1 -

  • Support multiple threads. This will be built with this in mind, but won't be enabled until it's been tested.
  • Use the url_alias table to populate the boost_cache; that will assume all content in there is an HTML doc, unless it ends in .xml or /feed.
  • Enable showing stats on the crawler
  • Allow admin to better fine-tune some parameters of the crawler
  • Have a real crawler, like what I have in #363077: Add spider to crawler - Cache entire site with new install.; auto-discovers content that is not in the boost_cache or url_alias table but is in an <a href=""></a> HTML tag.
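
A minimal sketch of the first roadmap item, assuming boost_cache has filename, url and extension columns (the column and variable names here are assumptions, not the shipped code): page through the table 10,000 rows at a time, ordered by filename, and remember the offset between cron runs.

  function boost_crawler_seed_from_cache() {
    $position = variable_get('boost_crawler_position', 0);
    $result = db_query_range("SELECT url, extension FROM {boost_cache} ORDER BY filename", $position, 10000);
    $count = 0;
    while ($row = db_fetch_object($result)) {
      db_query("INSERT INTO {boost_crawler} (extension, url) VALUES ('%s', '%s')", $row->extension, $row->url);
      $count++;
    }
    // Remember how far we got so the next run picks up where this one stopped.
    variable_set('boost_crawler_position', $position + $count);
  }
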
mikeytown2’s picture

Wow, that's some slow code... #3 took over twice as long to crawl a site as the crawler does. Some of it has to do with the sleep(1), but that's pretty awful. Doing a bootstrap to crawl every 10 pages is another source. I knew this would be a little slower, but I wasn't expecting this. For this to work well, there's a lot of work to do. I'll try refactoring the code (I made sure I wouldn't get a timeout), but if I can't get its performance to improve, I'll be shipping RC1 without the cron crawler.

mikeytown2’s picture

Status: Needs work » Needs review
FileSize
15.45 KB

Still need to add in a stop-crawler button & code. But this should be faster, and it defaults to using 2 threads. Loads 25 URLs per run.

mikeytown2’s picture

Status: Needs review » Active

Committed the above code with a stop button.

mikeytown2’s picture

Status: Active » Fixed
capellic’s picture

You're on fire! Just a question about item #1 in the roadmap.

Copy the URL's from the boost_cache table into the boost_crawler table.

Will this require me to use HTML file caching? Right now I am using my Pre-Cache module only to trigger Drupal core's database cache and do not have Boost HTML file caching turned on. Why not? Because the pages that aren't pre-cached aren't as popular so 1) they don't need to be cached and 2) I don't want the user to have the additional penalty of having to wait for the HTML file to be written to disk.

I realize that part of this is moot because the crawler removes concern #2, but I am still a bit concerned about having the crawler hit all nodes due to aforementioned performance issues - actually -- it's more of a preference to see the two be able to work independently from one another for flexibility's sake.

mikeytown2’s picture

Status: Fixed » Needs work
FileSize
1.54 KB

Right now the crawler is tied to Boost quite heavily, so it will only hit URLs that it can grab from the boost_cache table. If you enable Boost, hit the URLs you want, then disable the HTML file cache, the crawler will still grab those URLs in the future. Right now Boost could be split off into about 3 different modules, but it's much easier to develop for it when it's all in one codebase. Your request gives me another table to get URLs from, which is cache_page:
http://api.drupal.org/api/function/page_set_cache
It's a temp table, so this wouldn't be that useful.

Here's the first bug... the crawler finished having only done half the URLs; here's an attempted fix. It's amazing how this doesn't work the same on all systems... gotta find code that does.

mikeytown2’s picture

Status: Needs work » Needs review
FileSize
3.27 KB

Figured out what's up. Each thread needs to wait a random amount of time, otherwise things get messy.
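
A sketch of the fix described here (the function name is illustrative): each thread sleeps a random fraction of a second before claiming work, so threads started on the same cron run stop colliding over the same rows.

  function boost_crawler_stagger() {
    // Wait between 0.1 and 1 second, different for every thread.
    usleep(mt_rand(100000, 1000000));
  }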

mikeytown2’s picture

Status: Needs review » Fixed

Committed this.

giorgio79’s picture

Hey Mikey,

Nice stuff.

Would you know the crawl rate? Something like how many pages per second are crawled?
Could I request a feature to set the time between crawl requests, like 4-5 secs between each?

Cheers,
G

mikeytown2’s picture

Status: Fixed » Active

Being able to get the crawl rate would be semi-possible. It would be the average across all threads since the crawler started, and it would be quite inaccurate. Increased accuracy means a much slower crawler, since each thread would have to write to the database after each page was crawled. Having a "throttle" would also be possible using the usleep() function inside the crawler; but why wait 5 seconds between each request? Being able to set the number of threads is also needed.
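
A sketch of both ideas, assuming illustrative variable names: a usleep()-based throttle between requests, and a cheap average rate computed from totals instead of per-page database writes.

  function boost_crawler_fetch($url) {
    $response = drupal_http_request($url);
    // 0 = no throttle; e.g. 5000000 would wait 5 seconds per request.
    usleep(variable_get('boost_crawler_throttle', 0));
    return $response;
  }

  function boost_crawler_rate() {
    // Pages per second, averaged over all threads since the crawl began.
    $pages = variable_get('boost_crawler_pages_done', 0);
    $elapsed = time() - variable_get('boost_crawler_started', time());
    return $elapsed > 0 ? $pages / $elapsed : 0;
  }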

giorgio79’s picture

The 5 sec was just random; being able to set this, that is, the seconds to wait between requests, would be nice.

Just like the number of threads running.

I am happy with 1 thread with 2-5 secs between requests. :)

I have 10,000 nodes, and I am afraid of hammering the server if I turn the feature on as it is at the moment :)

Also not sure about the PHP timeout for this many nodes.
Is this done when cron runs? Or when the submit button is pressed to clear cache?

mikeytown2’s picture

The PHP/CPU timeout is taken care of; it shouldn't ever happen. The crawler starts right after cron runs.

g10tto’s picture

I have a site requiring user authentication for 95% of its content. Since Boost only caches pages viewed by anonymous users, is it possible to still use this crawler (using default Drupal and Views caching) with my site?

mikeytown2’s picture

@g10tto
The crawler hits pages as an anonymous user, so if that page returns a 403, it doesn't really help anyone. Drupal's cache (what you see on the performance page) is for anonymous users as well. At the bottom of the Boost project page are links to other performance/caching methods.

g10tto’s picture

@mikey Thanks for the tip.

Is there another known crawler program for Drupal, or a way to implement a 3rd-party crawler? I'm just curious, because it seems like the crawler functionality would especially come in handy on larger news-related sites that have an archive of nodes, and I would be surprised if there were no other implementations until now.

In addition, I have wondered for some time now why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

mikeytown2’s picture

#363077: Add spider to crawler - Cache entire site with new install.
Here's a crawler I made that's independent of Drupal. It's crawled a site with over 1,000,000 URLs; it works, but it's not easy to set up.

...why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

I'm assuming you're talking about it only caching what anonymous users see. Anonymous is easy; there's only 1 version of each page, and even with that, Boost is still a very complicated module.

mikeytown2’s picture

Title: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache » Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats
mikeytown2’s picture

Status: Active » Fixed
capellic’s picture

Nice! Since I've built my own Pre-Caching module to take care of the 5 to 10 top-level pages on my site, I don't have a use for this at this time. I have a couple of bigger projects coming up and I'll certainly be giving this feature a spin!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.