Number of total crawler threads can be negative [#721490]

In the following function, boost_crawler_number_of_threads can become negative, and the crawler then keeps starting new threads. I saw the count go as low as -480. I worked around it for the moment by adding a check on boost_crawler_number_of_threads at the place marked in the code, but I haven't thought much about it yet.

    // Spin up threads on demand
    while ($threads > 0 && $this_thread == 1) {
   
      // put some sanity tests here

      db_lock_table('variable');
      $thread_time = _boost_variable_get('boost_crawler_thread_num_' . $threads);
      if (!$thread_time || $thread_time + BOOST_MAX_THREAD_TIME < BOOST_TIME) {
        _boost_variable_set('boost_crawler_thread_num_' . $threads, BOOST_TIME);
        db_unlock_tables();
        boost_async_call_crawler($self, $threads, _boost_variable_get('boost_crawler_number_of_threads'), $expire);
        if (BOOST_VERBOSE >= 5) {
          watchdog('boost', 'Crawler - Thread @num of @total started', array('@num' => $threads, '@total' => _boost_variable_get('boost_crawler_number_of_threads')));
        }
        _boost_set_time_limit(0);
      }
      db_unlock_tables();
      $threads--;
    }
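
The check itself is not shown in the post. One possible form, as a minimal sketch: the helper name `example_boost_sane_thread_total` and its default of 1 are assumptions for illustration and do not exist in boost.module.

```php
<?php
// Hypothetical helper (not part of boost.module): returns a usable thread
// total, falling back to $default when the stored value is missing,
// non-numeric, or less than 1.
function example_boost_sane_thread_total($stored, $default = 1) {
  if (!is_numeric($stored) || (int) $stored < 1) {
    return (int) $default;
  }
  return (int) $stored;
}
```

At the marked spot in the loop, the value read with `_boost_variable_get('boost_crawler_number_of_threads')` could be passed through such a helper before it is used, so that a negative total is never handed to `boost_async_call_crawler()`. This only masks the symptom; it does not explain why the counter goes negative.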

Comment	File	Size	Author
#1	boost.jpg	455.08 KB	apemantus

Comments

Comment #1

apemantus commented 25 March 2010 at 13:10

Status	File	Size
new	boost.jpg	455.08 KB

Just to say that I have just seen this for the first time, running 6.x-1.18 (see attached). Running cron has not expired or overwritten any old pages. I haven't tried the patch above yet.

(I'm also not quite sure how to stop it: it is at -1600 now and counting, on a site with a couple of hundred pages...)

Comment #2

mikeytown2 commented 29 March 2010 at 04:25

Category:	bug	» feature

The crawler code is very complex because it's doing something that PHP was never designed to do. I should create a different code path that can take advantage of lock.inc (new in Drupal 6.16). In short I should rewrite the crawler again (this will be round #4) & see if I can make the logic for it simpler.

New architecture ideas.
* Define a maximum page generation time. Each worker checks in after 1/2 of the max page generation time and again at the end of its crawler run. If a worker hasn't checked in after 1.5x the max page generation time, it is considered dead.
* Each worker gets a slot.
* Each worker can start up another worker if a slot is open. Use lock.inc when starting up a worker. Slot naming convention: thread number, a hyphen, then a letter (a, b, c ... z). Examples: 1-a, 2-a, 3-a. If 1-a dies, the new thread for that slot becomes 1-b. If 1-b dies, the new thread starts up in slot 1-c. If z is reached, go back to a.
* If a worker sees that its own slot has been filled by another worker (because it was considered dead), it dies.
* Each worker keeps track of its last check-in timestamp and compares it with the one recorded in the database; if they do not match, the worker assumes it has been replaced and will not spawn a new worker.
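
The slot rotation and liveness rules in the list above can be sketched as a few pure functions. This is only an illustration under stated assumptions: all three function names are hypothetical, timestamps are plain Unix seconds, and the lock.inc calls that would guard the slot update in Drupal are left out so the sketch runs standalone.

```php
<?php
// Hypothetical helpers illustrating the proposed scheme; none of these
// exist in boost.module.

// Returns the slot name for the worker that replaces a dead one,
// e.g. '1-a' becomes '1-b', and '1-z' wraps around to '1-a'.
function example_boost_next_slot($slot) {
  list($thread, $letter) = explode('-', $slot);
  $next = ($letter === 'z') ? 'a' : chr(ord($letter) + 1);
  return $thread . '-' . $next;
}

// A worker is considered dead when its last check-in is older than
// 1.5x the maximum page generation time.
function example_boost_worker_is_dead($last_checkin, $now, $max_page_time) {
  return ($now - $last_checkin) > 1.5 * $max_page_time;
}

// A worker may spawn another worker only if the check-in timestamp it
// remembers still matches the one recorded in the database.
function example_boost_worker_may_spawn($own_checkin, $db_checkin) {
  return $own_checkin === $db_checkin;
}
```

For example, with a maximum page generation time of 30 seconds, a worker in slot 1-a that last checked in 60 seconds ago counts as dead, and its replacement would take slot 1-b.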

The main issue is that it's very hard to keep track of all the crawler threads. Right now it does a good job, but the stats are way off. If you can come up with a fix (I didn't see one in the original post), I'll add it in. From my point of view, the only way to fix this is to rewrite the crawler and try a different way of managing the workers.

Comment #3

mikeytown2 commented 29 March 2010 at 04:25

Category:	feature	» bug
Priority:	Normal	» Minor

Comment #4

szy commented 16 June 2010 at 09:56

Priority:	Minor	» Normal

Changing this to 'normal' priority, as dead POST calls are a waste of resources.

More screenshots and info in the duplicate issue here:

-> #824562: Crawler - Thread 1 of -89 started

Sorry, I cannot help you with a patch.

Szy.

Comment #5

locomo commented 11 July 2010 at 16:04

Comment #6

marios88 commented 22 January 2011 at 09:19

To fix this, I prevented the crawler from restarting to fetch the "stubborn URLs":

    if (!boost_crawler_threads_alive()
        && _boost_variable_get('boost_crawler_number_of_tries') < 3
        && boost_crawler_verify($expire)
        && 1 == 2
        ) {

There are 2 spots where you have to add the "1 == 2"; search for 'boost_crawler_number_of_tries' in boost.module.
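
Because `1 == 2` is always false, the whole condition can never pass, so the restart branch is switched off. The same effect with a named switch instead of a literal could look like the sketch below; the function `example_boost_should_restart` and its `$retry_enabled` parameter are hypothetical, and the first three arguments stand in for the results of the three calls in the condition above.

```php
<?php
// Hypothetical stand-in for the condition in boost.module. Passing FALSE
// for $retry_enabled has the same effect as appending "&& 1 == 2".
function example_boost_should_restart($threads_alive, $tries, $verified, $retry_enabled) {
  return !$threads_alive && $tries < 3 && $verified && $retry_enabled;
}
```

A named switch keeps the workaround easy to find and to revert in both spots.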
