In the following function, boost_crawler_number_of_threads can become negative, and the crawler then keeps starting new threads. I saw the value go as low as -480. I solved it for the moment by adding a check on boost_crawler_number_of_threads at the indicated place, but I haven't thought much about it yet.

    // Spin up threads on demand
    while ($threads > 0 && $this_thread == 1) {

      // put some sanity tests here

      db_lock_table('variable');
      $thread_time = _boost_variable_get('boost_crawler_thread_num_' . $threads);
      if (!$thread_time || $thread_time + BOOST_MAX_THREAD_TIME < BOOST_TIME) {
        _boost_variable_set('boost_crawler_thread_num_' . $threads, BOOST_TIME);
        db_unlock_tables();
        boost_async_call_crawler($self, $threads, _boost_variable_get('boost_crawler_number_of_threads'), $expire);
        if (BOOST_VERBOSE >= 5) {
          watchdog('boost', 'Crawler - Thread @num of @total started', array('@num' => $threads, '@total' => _boost_variable_get('boost_crawler_number_of_threads')));
        }
        _boost_set_time_limit(0);
      }
      db_unlock_tables();
      $threads--;
    }
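
The original post does not show the check that was added, so the following is only a minimal sketch of what such a sanity test might look like. The function name `example_boost_thread_total_is_sane()` and the upper bound of 8 threads are assumptions made for illustration; neither comes from boost.module.

```php
<?php
// Hypothetical helper, not part of boost.module: decides whether the
// stored thread total is plausible before the spawn loop continues.
function example_boost_thread_total_is_sane($total, $max_threads = 8) {
  // Reject non-numeric, zero, negative, and implausibly large totals.
  // The default upper bound of 8 is an assumption for this sketch.
  return is_numeric($total)
    && (int) $total >= 1
    && (int) $total <= $max_threads;
}

// Assumed usage at the "put some sanity tests here" marker, before the
// table lock is taken, so that leaving the loop does not hold a lock:
//
//   $total = _boost_variable_get('boost_crawler_number_of_threads');
//   if (!example_boost_thread_total_is_sane($total)) {
//     break;
//   }
```

With a guard like this, a stored value such as -480 would stop the loop instead of starting further threads. It treats the symptom only; it does not explain how the counter became negative.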
Comment  File       Size       Author
#1       boost.jpg  455.08 KB  apemantus

Comments

apemantus

Status: new   Size: 455.08 KB

Just to say that I have now seen this for the first time, running 6.x-1.18; see attached. Running cron has not expired or overwritten any old pages. I haven't tried the patch above yet.

(I'm also not quite sure how to stop it: it's at -1600 now and counting, on a site with a couple of hundred pages...)

mikeytown2

Category: bug » feature

The crawler code is very complex because it's doing something that PHP was never designed to do. I should create a different code path that can take advantage of lock.inc (new in Drupal 6.16). In short I should rewrite the crawler again (this will be round #4) & see if I can make the logic for it simpler.

New architecture ideas.
* Define a maximum page generation time. The crawler checks in after 1/2 of the max page time, and again at the end of the crawler run. If a worker hasn't checked in after 1.5x the max page generation time, then that worker is considered dead.
* Each worker gets a slot.
* Each worker can start up another worker if a slot is open. Use lock.inc when starting up a worker. Slot naming convention: Thread # - letter (a or b or c ... z). Examples: 1-a, 2-a, 3-a. If 1-a dies, then the new thread for that slot becomes 1-b. If 1-b dies, then the new thread starts up in slot 1-c. If z is reached, then go back to a.
* If a worker sees that its own slot has been filled by another worker (the worker was considered dead), then it dies.
* Each worker keeps track of its last check-in timestamp and the one recorded in the database; if they do not match, then the worker assumes it's been replaced and it will not spawn a new worker.
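
The rules above can be sketched as a few small helper functions. This is only an illustration of the proposed logic, not code from the module: all function names and the 30-second maximum page time are made up for the example. In a real implementation the slot state would live in the database, and starting a worker would presumably be wrapped in lock_acquire() / lock_release() from lock.inc.

```php
<?php
// Assumed value for illustration only; the thread does not give a number.
define('EXAMPLE_MAX_PAGE_TIME', 30);

// A worker is considered dead once more than 1.5x the maximum page
// generation time has passed since its last check-in.
function example_worker_is_dead($last_checkin, $now, $max_page_time = EXAMPLE_MAX_PAGE_TIME) {
  return ($now - $last_checkin) > 1.5 * $max_page_time;
}

// Name of the replacement slot after the worker in $slot dies:
// "1-a" becomes "1-b", and "1-z" wraps around to "1-a".
function example_next_slot($slot) {
  list($thread, $letter) = explode('-', $slot);
  $next = ($letter === 'z') ? 'a' : chr(ord($letter) + 1);
  return $thread . '-' . $next;
}

// A worker compares its own last check-in timestamp with the one stored
// in the database. A mismatch means another worker took over the slot,
// so this worker should exit and must not spawn a new worker.
function example_worker_was_replaced($own_checkin, $stored_checkin) {
  return $own_checkin !== $stored_checkin;
}
```

Keeping these rules as pure functions would make them testable without any running crawler; only the code that reads and writes the slot records would need the lock.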

The main issue is that it's very hard to keep track of all the crawler threads. Right now it does a good job, but the stats are way off. If you can come up with a fix (I didn't see one in the original post), I'll add it in. The only way to fix this, from my point of view, is to rewrite the crawler and try a different way of managing the workers.

mikeytown2

Category: feature » bug
Priority: Normal » Minor

szy

Priority: Minor » Normal

Changing this to 'normal' priority, as dead POST calls are a waste of resources.

More screenshots and info in the duplicate issue here:

-> #824562: Crawler - Thread 1 of -89 started

Sorry, I cannot help you with a patch.

Szy.

locomo

subscribe

marios88

To fix this, I prevented the crawler from restarting to fetch the "stubborn URLs":

if (   !boost_crawler_threads_alive()
          && _boost_variable_get('boost_crawler_number_of_tries') < 3
          && boost_crawler_verify($expire)
          && 1==2 // always false, so this restart branch never runs
          ) {

There are 2 spots where you have to add the "1 == 2"; search for "boost_crawler_number_of_tries" in boost.module.