Status: Active
Project: Boost
Version: 6.x-1.x-dev
Component: Cron Crawler
Priority: Normal
Category: Bug report
Assigned: Unassigned
Reporter:
Created: 22 Feb 2010 at 06:37 UTC
Updated: 22 Jan 2011 at 09:19 UTC
In the following code, boost_crawler_number_of_threads can become negative, and the crawler then keeps starting new threads. I saw the count go down to -480. I worked around it for the moment by adding a check on boost_crawler_number_of_threads in this loop, but I haven't thought much about it yet.
// Spin up threads on demand
while ($threads > 0 && $this_thread == 1) {
  // put some sanity tests here
  db_lock_table('variable');
  $thread_time = _boost_variable_get('boost_crawler_thread_num_' . $threads);
  if (!$thread_time || $thread_time + BOOST_MAX_THREAD_TIME < BOOST_TIME) {
    _boost_variable_set('boost_crawler_thread_num_' . $threads, BOOST_TIME);
    db_unlock_tables();
    boost_async_call_crawler($self, $threads, _boost_variable_get('boost_crawler_number_of_threads'), $expire);
    if (BOOST_VERBOSE >= 5) {
      watchdog('boost', 'Crawler - Thread @num of @total started', array('@num' => $threads, '@total' => _boost_variable_get('boost_crawler_number_of_threads')));
    }
    _boost_set_time_limit(0);
  }
  db_unlock_tables();
  $threads--;
}
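The check itself was not posted, so the following is only a sketch of what such a sanity test might look like. Both function names are hypothetical and do not exist in boost.module; the idea is simply to refuse to spawn a thread when the stored total is not a positive number, or when the slot number falls outside that total.

```php
<?php
// Hypothetical helpers, NOT part of boost.module. They illustrate one
// possible sanity test on the stored thread count.

// Return a usable thread count: non-numeric or non-positive stored
// values fall back to $default.
function example_sane_thread_count($stored, $default = 1) {
  if (!is_numeric($stored) || (int) $stored < 1) {
    return $default;
  }
  return (int) $stored;
}

// Decide whether a given slot may be started. A negative or zero total
// (for example -480) never allows a new thread.
function example_may_spawn($stored_total, $slot) {
  if (!is_numeric($stored_total) || (int) $stored_total < 1) {
    return FALSE;
  }
  return $slot >= 1 && $slot <= (int) $stored_total;
}
```

In the loop above, a guard like example_may_spawn() would presumably sit at the "put some sanity tests here" comment, with the total read via _boost_variable_get('boost_crawler_number_of_threads'); that placement is an assumption.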
Comments
Comment #1
apemantus commented
Just to say I have just seen this for the first time, running 6.x-1.18 - see attached. Running cron has not expired/overwritten any old pages. I haven't tried the fix above yet.
(I'm also not quite sure how to stop it: it is at -1600 now and counting, on a site with a couple of hundred pages...)
Comment #2
mikeytown2 commented
The crawler code is very complex because it's doing something that PHP was never designed to do. I should create a different code path that can take advantage of lock.inc (new in Drupal 6.16). In short, I should rewrite the crawler again (this will be round #4) and see if I can make its logic simpler.
New architecture ideas.
* Define a maximum page generation time. Each worker checks in after 1/2 of the max page generation time, and again at the end of its run. If a worker hasn't checked in after 1.5x the max page generation time, it is considered dead.
* Each worker gets a slot.
* Each worker can start up another worker if a slot is open. Use lock.inc when starting up a worker. Slot naming convention: Thread # - letter (a or b or c ... z). Examples: 1-a, 2-a, 3-a. If 1-a dies, then the new thread for that slot becomes 1-b. If 1-b dies, then the new thread starts up in slot 1-c. If z is reached, go back to a.
* If a worker sees that its own slot has been filled by another worker (because it was considered dead), then it dies.
* Each worker keeps track of its last check-in timestamp and compares it with the one recorded in the database; if they do not match, the worker assumes it has been replaced and will not spawn a new worker.
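The rules in this list can be sketched as three small pure functions. This is only an illustration of the proposal, not code from Boost: every function name is hypothetical, timestamps are assumed to be Unix seconds, and the lock.inc handling and database access are left out.

```php
<?php
// Hypothetical sketch of the proposed worker rules; none of these
// functions exist in boost.module.

// A worker is considered dead once its last check-in is older than
// 1.5x the maximum page generation time.
function example_worker_is_dead($last_checkin, $now, $max_page_time) {
  return ($now - $last_checkin) > 1.5 * $max_page_time;
}

// Name of the replacement worker for a slot: 1-a -> 1-b -> ... -> 1-z,
// then back to 1-a.
function example_next_slot_name($slot) {
  list($thread, $letter) = explode('-', $slot);
  $next = chr((ord($letter) - ord('a') + 1) % 26 + ord('a'));
  return $thread . '-' . $next;
}

// A worker assumes it has been replaced when the check-in timestamp
// stored in the database differs from the one it recorded itself.
function example_worker_was_replaced($own_checkin, $stored_checkin) {
  return (int) $own_checkin !== (int) $stored_checkin;
}
```

For example, with a 30 second maximum page time, a worker that last checked in 60 seconds ago counts as dead, and its slot 1-a would be refilled by a worker named 1-b.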
The main issue is that it's very hard to keep track of all the crawler threads. Right now it does a good job, but the stats are way off. If you can come up with a fix (I didn't see one in the original post), I'll add it in. From my point of view, the only way to fix this is to rewrite the crawler and try a different way of managing the workers.
Comment #3
mikeytown2 commented
Comment #4
szy commented
Changing this to 'normal' priority, as dead POST calls are a waste of resources.
More screens and info in duplicate here:
-> #824562: Crawler - Thread 1 of -89 started
Sorry, I cannot help you with a patch.
Szy.
Comment #5
locomo commented
subscribe
Comment #6
marios88 commented
To fix this, I prevented the crawler from restarting to get the "stubborn URLs".
There are 2 spots where you have to add the "1 == 2" condition; search for 'boost_crawler_number_of_tries' in boost.module to find them.