This thread aims to combine these two issues:
#363077: Add spider to crawler - Cache entire site with new install.
#337391: Setting to grab url's from url_alias table.
Ideally the export module would be the way to go; the only problem is that the Batch API doesn't work from cron (see #229905: Batch API assumes client's support of meta refresh). So I need to combine the above two issues; that is the goal of this thread. The first step is to grab some URLs from the database on cron and cache them.
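As a rough illustration of that first step (Python standing in for the module's PHP, with an illustrative helper and an in-memory database, not Boost's actual code): pull a batch of aliases from the url_alias table on cron, then request each page so the cache gets primed.

```python
# Sketch only: table/column names mirror Drupal's url_alias table, and
# fetch() stands in for the real HTTP GET that warms the cache.
import sqlite3

def grab_aliases(conn, offset=0, limit=5):
    """Fetch the next batch of path aliases, 5 at a time as in patch #1."""
    cur = conn.execute(
        "SELECT dst FROM url_alias ORDER BY pid LIMIT ? OFFSET ?",
        (limit, offset))
    return [row[0] for row in cur.fetchall()]

def warm_cache(conn, fetch, base_url="http://example.com/"):
    """Walk the alias table in batches, requesting each page once."""
    offset, warmed = 0, []
    while True:
        batch = grab_aliases(conn, offset)
        if not batch:
            break
        for alias in batch:
            fetch(base_url + alias)  # real code would do an HTTP request
            warmed.append(alias)
        offset += len(batch)
    return warmed
```
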
Comment | File | Size | Author |
---|---|---|---|
#14 | boost-538460.6.patch | 3.27 KB | mikeytown2 |
#13 | boost-538460.5.patch | 1.54 KB | mikeytown2 |
#9 | boost-538460.3.patch | 15.45 KB | mikeytown2 |
#3 | boost-538460.2.patch | 10.33 KB | mikeytown2 |
#2 | boost-538460.1.patch | 6.02 KB | mikeytown2 |
Comments
Comment #1
mikeytown2 CreditAttribution: mikeytown2 commented
This does 5 URLs at a time.
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
This does 5 at a time and calls itself, so it can crawl an unlimited number of URLs.
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
Fine-grained control over HTML, XML & JSON.
Comment #4
mikeytown2 CreditAttribution: mikeytown2 commented
In order for this to be smart, I need a boost_crawler table. Once the crawler is done it will truncate the table, signaling to the worker thread(s) that it is done. This way I can grab URLs from boost_cache, url_alias & the node table, and I should be able to use multiple threads. I can then also crawl only the pages that are not currently cached.
Index on ID & extension.
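A hedged sketch of the work-table idea described above (illustrative schema, not Boost's actual one): URLs gathered from several sources go into one table, worker threads pull rows until none remain, and an empty table is the "done" signal.

```python
# Illustrative only: boost_crawler here is a minimal queue table; real
# Boost would also track extension and cache status per the index above.
import sqlite3

def seed(conn, urls):
    """Load candidate URLs (from boost_cache, url_alias, nodes, ...)."""
    conn.executemany("INSERT INTO boost_crawler (url) VALUES (?)",
                     [(u,) for u in urls])

def claim_batch(conn, n=5):
    """A worker claims up to n URLs and removes them from the queue."""
    rows = conn.execute(
        "SELECT id, url FROM boost_crawler ORDER BY id LIMIT ?", (n,)).fetchall()
    conn.executemany("DELETE FROM boost_crawler WHERE id = ?",
                     [(r[0],) for r in rows])
    return [r[1] for r in rows]

def crawler_done(conn):
    """An empty table tells the worker threads the crawl is finished."""
    return conn.execute("SELECT COUNT(*) FROM boost_crawler").fetchone()[0] == 0
```
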
Comment #5
capellic
Exciting to see progress on this!
BTW, just wondering if my issue over at the Google Analytics module is cause for concern when pre-caching. Certainly this issue wouldn't be limited only to GA, but would apply to any sort of stat-tracking module.
#538626: Exclude hits when a specific arg string is present
Comment #6
mikeytown2 CreditAttribution: mikeytown2 commented
@capellic
The crawler loads the HTML; it doesn't load the elements inside, like JavaScript or images, so you don't have to worry about third-party stats. It will count toward Drupal's statistics, though, since those count when the HTML is loaded.
Comment #7
mikeytown2 CreditAttribution: mikeytown2 commented
Just tested the above patch on a live server and it works as advertised, which means the core of the crawler code works in Drupal core. That's good: it's like the Batch API, but it doesn't need a browser to call itself. So my two-step PHP hack works; now I need to make the crawler smarter.
Roadmap:
- Ship RC1
- <a href=""></a> html tag

Comment #8
mikeytown2 CreditAttribution: mikeytown2 commented
Wow, that's some slow code... #3 took over twice as long to crawl a site as the standalone crawler does. Some of it has to do with the sleep(1), but that's pretty awful. Doing a bootstrap every 10 crawled pages is another source. I knew this would be a little slower, but I wasn't expecting this. For this to work well, there's a lot of work to do. I'll try refactoring the code (I made sure I wouldn't get a timeout), but if I can't get its performance to improve, I'll be shipping RC1 without the cron crawler.
Comment #9
mikeytown2 CreditAttribution: mikeytown2 commented
Still need to add a stop-crawler button & code. But this should be faster, and it defaults to using 2 threads. Loads 25 URLs per run.
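The approach in #9 can be sketched roughly as follows (Python illustration, not the module's PHP): each run loads a fixed batch of URLs and splits it across a small pool of worker threads. The batch size (25) and thread count (2) match the defaults mentioned above; fetch() is a stand-in for the real HTTP request.

```python
# Sketch: one crawler run = one bounded batch fanned out to N threads.
from concurrent.futures import ThreadPoolExecutor

def crawl_batch(urls, fetch, threads=2, per_run=25):
    """Crawl at most per_run URLs using a pool of worker threads."""
    batch = urls[:per_run]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(fetch, batch))
    return results
```
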
Comment #10
mikeytown2 CreditAttribution: mikeytown2 commented
Committed the above code with a stop button.
Comment #11
mikeytown2 CreditAttribution: mikeytown2 commented

Comment #12
capellic
You're on fire! Just a question about item #1 in the roadmap.
Will this require me to use HTML file caching? Right now I am using my Pre-Cache module only to trigger Drupal core's database cache, and I do not have Boost's HTML file caching turned on. Why not? Because the pages that aren't pre-cached aren't as popular, so 1) they don't need to be cached and 2) I don't want the user to pay the additional penalty of waiting for the HTML file to be written to disk.
I realize that part of this is moot because the crawler removes concern #2, but I am still a bit concerned about having the crawler hit all nodes due to the aforementioned performance issues. Actually, it's more of a preference to see the two work independently of one another, for flexibility's sake.
Comment #13
mikeytown2 CreditAttribution: mikeytown2 commented
Right now the crawler is tied to Boost quite heavily, so it will only hit URLs that it can grab from the boost_cache table. If you enable Boost and hit the URLs you want, then disable the HTML file cache, the crawler will still grab those URLs in the future. Right now Boost could be split into about 3 different modules, but it's much easier to develop when it's all in one codebase. Your request gives me another table to get URLs from, which is cache_page:
http://api.drupal.org/api/function/page_set_cache
It's a temp table, so this wouldn't be that useful.
Here's the first bug... the crawler finished but only did half the URLs. Here's an attempted fix. It's amazing how this doesn't work the same on all systems... gotta find code that does.
Comment #14
mikeytown2 CreditAttribution: mikeytown2 commented
Figured out what's up. Each thread needs to wait a random amount of time, otherwise things get messy.
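The fix in #14 amounts to per-thread startup jitter. A minimal sketch (illustrative, with made-up delay bounds): each worker sleeps a small random amount before grabbing work, so threads started at the same instant don't race for the same rows.

```python
# Sketch of random startup jitter to desynchronize worker threads.
import random
import time

def start_worker(work, jitter_max=0.5):
    """Sleep a random amount, then run the worker's job."""
    time.sleep(random.uniform(0, jitter_max))  # stagger the threads
    return work()
```
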
Comment #15
mikeytown2 CreditAttribution: mikeytown2 commented
Committed this.
Comment #16
giorgio79 CreditAttribution: giorgio79 commented
Hey Mikey,
Nice stuff.
Do you know the crawl rate? Something like how many pages per second are crawled?
Could I request a feature to set time between crawl requests, like 4-5 secs between each?
Cheers,
G
Comment #17
mikeytown2 CreditAttribution: mikeytown2 commented
Getting the crawl rate would be semi-possible. It would be the average across all threads since the crawler started, and it would be quite inaccurate. Increased accuracy means a much slower crawler, since each thread would have to write to the database after each page crawled. Having a "throttle" would also be possible using the usleep() function inside the crawler; but why wait 5 seconds between each request? Being able to set the number of threads is also needed.
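Both ideas in #17 are simple to sketch (Python illustration; names are made up): the rate is a coarse average of pages crawled over elapsed time across all threads, and the throttle is just a sleep between requests, analogous to PHP's usleep().

```python
# Sketch: coarse average crawl rate plus a usleep()-style throttle.
import time

class CrawlStats:
    def __init__(self):
        self.start = time.monotonic()
        self.pages = 0

    def record(self, n=1):
        self.pages += n

    def rate(self):
        """Average pages/second since the crawler started (all threads)."""
        elapsed = time.monotonic() - self.start
        return self.pages / elapsed if elapsed > 0 else 0.0

def throttled_fetch(url, fetch, delay=0.0):
    """Fetch one page, then pause before the next request."""
    result = fetch(url)
    time.sleep(delay)  # the configurable wait between requests
    return result
```
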
Comment #18
giorgio79 CreditAttribution: giorgio79 commented
The 5 seconds was just a random number; being able to set the time to wait between requests would be nice.
Just like the number of threads running.
I am happy with 1 thread with 2-5 secs between requests. :)
I have 10,000 nodes, and I am afraid of hammering the server if I turn the feature on as it is at the moment. :)
Also, I'm not sure about the PHP timeout for this many nodes.
Is this done when cron runs? Or when the submit button is pressed to clear cache?
Comment #19
mikeytown2 CreditAttribution: mikeytown2 commented
The PHP/CPU timeout is taken care of; it shouldn't ever happen. The crawler starts right after cron is run.
Comment #20
g10tto CreditAttribution: g10tto commented
I have a site requiring user authentication for 95% of its content. Since Boost only caches pages viewed by anonymous users, is it possible to still use this crawler (using default Drupal and Views caching) with my site?
Comment #21
mikeytown2 CreditAttribution: mikeytown2 commented
@g10tto
The crawler hits pages as an anonymous user, so if a page returns a 403, crawling it doesn't really help anyone. Drupal's cache (what you see on the performance page) is for anonymous users as well. At the bottom of the Boost project page are links to other performance/caching methods.
Comment #22
g10tto CreditAttribution: g10tto commented
@mikey Thanks for the tip.
Is there another known crawler for Drupal, or a way to plug in a 3rd-party crawler? I'm just curious, because it seems like crawler functionality would come in especially handy on larger news-related sites that have an archive of nodes, and I would be surprised if there were no other implementation until now.
In addition, I have wondered for some time why this kind of functionality is restricted to the Boost module.
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
#363077: Add spider to crawler - Cache entire site with new install.
Here's a crawler I made that's independent of Drupal. It has crawled a site with over 1,000,000 URLs; it works, but it's not easy to set up.
I'm assuming you're talking about it only caching what anonymous users see. Anonymous is easy; there's only 1 version of each page, and even then, Boost is still a very complicated module.
Comment #24
mikeytown2 CreditAttribution: mikeytown2 commented

Comment #25

mikeytown2 CreditAttribution: mikeytown2 commented

Comment #26

capellic
Nice! Since I've built my own Pre-Caching module to take care of the 5 to 10 top-level pages on my site, I don't have a use for this at this time. I have a couple of bigger projects coming up, and I'll certainly be giving this feature a spin!