As discussed in #1419744: Exit processing in httprl_send_request() after specified page request run time, it would be great to have the ability to run a callback/hook after a link has been checked.

As you can see in the patch at http://drupal.org/node/380052#comment-5526984, I'm adding 'linkchecker_link' => $link to the options array so that the $link object is kept inside the httprl response object. Now I need the ability to execute the function _linkchecker_status_handling($link, $response); after a link has been checked.

I'm open to ideas on how we can achieve this. The linked linkchecker patch does not allow me to run 128 simultaneous requests, which is why it's important to get such a feature in.

Comments

mikeytown2’s picture

Title: Run callback after link has been fetched » Run callback after stream has been fetched

There are three options for accomplishing this.
Simple version: Function callbacks happen in this process at the end.
Semi-Complex version: Function callbacks happen in this process in the event loop.
Complex version: Function callbacks happen in another process via a HTTPRL non-blocking request in the event loop.

The options array will have two parameters called callback and callback_mode. callback takes an array where the first value is the module the file lives in, the second is the function name, and any other values are parameters you wish to pass to that function. I'm thinking call_user_func_array() will do most of the dirty work for us. callback_mode is the mode: end, event-loop, or non-blocking-(bootstrap level).

In the simple version, HTTPRL issues the callback with the first parameter being the request. Fixing redirects (having them parsed in the event loop) needs to happen for the complex versions to work.
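A minimal sketch of how such a callback option could be dispatched, assuming the array layout proposed above (module, function name, then extra parameters) and that the request result is passed as the first argument. example_run_callback() and example_status_handler() are hypothetical names, not HTTPRL functions, and the module entry is ignored because plain PHP has no module loader.

```php
<?php
// Hypothetical dispatcher; not part of HTTPRL.
function example_run_callback(array $callback, $result) {
  // First value: the module the file lives in (unused in this standalone sketch).
  $module = array_shift($callback);
  // Second value: the function name.
  $function = array_shift($callback);
  if (!is_string($function) || !function_exists($function)) {
    return FALSE;
  }
  // Remaining values are extra parameters; the result goes first.
  array_unshift($callback, $result);
  return call_user_func_array($function, $callback);
}

// Hypothetical callback that receives the result plus one extra parameter.
function example_status_handler($result, $label) {
  return $label . ': ' . $result->code;
}

$result = new stdClass();
$result->code = 200;
$output = example_run_callback(array('example', 'example_status_handler', 'checked'), $result);
echo $output, "\n";
```

Putting the result first means a callback can be written without knowing how many extra parameters the caller supplies.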

The fully complex version will have a menu item (or PHP file) that requires a key. The key changes on every cron run, and the last 2 keys are accepted. The menu callback will run the requested function. State will need to be passed in and set in the custom callback function.
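The key rotation described above could look roughly like the following sketch. The function names and the in-memory key list are assumptions for illustration (a real module would persist the keys, for example in a Drupal variable), and random_bytes() assumes PHP 7 or later.

```php
<?php
// Hypothetical helpers; names and storage are assumptions for illustration.

// Called once per cron run: prepend a fresh key, keep only the two newest.
function example_rotate_keys(array $keys) {
  array_unshift($keys, bin2hex(random_bytes(16)));
  return array_slice($keys, 0, 2);
}

// A key is accepted if it matches either of the two stored keys.
function example_key_is_valid(array $keys, $candidate) {
  foreach ($keys as $key) {
    // hash_equals() compares in constant time.
    if (hash_equals($key, (string) $candidate)) {
      return TRUE;
    }
  }
  return FALSE;
}

$keys = example_rotate_keys(array());  // First cron run.
$first = $keys[0];
$keys = example_rotate_keys($keys);    // Second cron run.
$second = $keys[0];
$keys = example_rotate_keys($keys);    // Third cron run: $first expires.

var_dump(example_key_is_valid($keys, $second));
var_dump(example_key_is_valid($keys, $first));
```

Accepting the previous key as well as the current one lets a request that was queued just before a cron run still authenticate afterwards.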

Note: The complex versions kinda act like node.js.
Note2: For low level bootstrap operations, passing in the module name needs to happen as well.

hass’s picture

I'm not sure if I understood all this...

The simple version sounds like it won't work, or is at least problematic. Does it mean you run all link checks and only run the callback after the global timeout? This may fire for 1,000 links with status codes and 30,000 without, and run xxx node loads/updates... Not a good idea...

Semi-complex: After a URL has downloaded, the event fires and runs the callback. In this case we could use the 25ms wait time for running the callback... This was at least my first idea... But in some cases the callback will surely take much longer.

Complex: I don't understand the details... But I sometimes have very complex and time-intensive things to do... Let's say I need to unpublish 100 nodes or update a 301 link in 100 nodes. If this is 100% non-blocking it would be the perfect solution.

My main problem is that I'm currently limiting the number of URLs I push into httprl to 8, 16, 32 and so on. After e.g. 8 checks are completed it runs the "callback" and then the next 8 checks. Because of the timeout, 7 links complete and the logic waits for one link to time out before the next 8 links are checked. As this logic is blocking, I have not been able to check more than 188 links. With core I checked ~80, which shows that the gain with httprl is not very large. I also tried chunks of 16 and 32 without any real change in the numbers, and only with 128 did I see a peak of 244 checks. Then I reduced the timeout to 10s and was able to run 680 checks. With a 5s timeout, over 800.

Well, we see the blocking has a major effect here... That's why I'd like to push 2500 links into httprl and let 128 checks run in parallel... Httprl would run them, may have some blockers inside, but is still non-blocking internally: it runs, let's say, 125 new checks and may have 3 waiting for a timeout, but does not block further checks, and it fires the callback after each link check completes... All until the global timeout... This would be perfect...
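To illustrate why blocking chunks hurt, here is a small simulation comparing the two scheduling styles. It is only a sketch: the durations are made up, no HTTP requests are made, and the function names are hypothetical.

```php
<?php
// Blocking chunks: each chunk waits for its slowest request to finish.
function example_chunked_time(array $durations, $size) {
  $total = 0;
  foreach (array_chunk($durations, $size) as $chunk) {
    $total += max($chunk);
  }
  return $total;
}

// Rolling window: a slot that finishes immediately takes the next request.
function example_rolling_time(array $durations, $size) {
  $slots = array_fill(0, $size, 0);
  foreach ($durations as $duration) {
    // Pick the slot that frees up first.
    $index = array_search(min($slots), $slots);
    $slots[$index] += $duration;
  }
  return max($slots);
}

// Seconds per link: one slow link (a 30s timeout) in every group of four.
$durations = array(1, 1, 1, 30, 1, 1, 1, 30, 1, 1, 1, 30);
$chunked = example_chunked_time($durations, 4);
$rolling = example_rolling_time($durations, 4);
echo "chunked: $chunked, rolling: $rolling\n";
```

With these made-up numbers the chunked approach takes 90 seconds, because every chunk waits for its timeout, while the rolling window takes 33 seconds, because the fast links keep flowing through the free slots.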

mikeytown2’s picture

Getting the complex version working is similar to writing something like node.js; it's possible but it's going to take some time. I think we can come up with something that will work in the meantime. The first step is to find a way to isolate linkchecker from cron; luckily we can use HTTPRL to do this.

What if we had a function called httprl_background_run_func() in a sub module called httprl_background? It would allow you to call a single function via an HTTP request. All the hard work would be done in the sub module. Here is an example related to linkchecker:

<?php
function linkchecker_cron() {
  // This function takes less than a second to run in this process.
  httprl_background_run_func(array(), '_linkchecker_cron_multi');
}

function _linkchecker_cron_multi() {
  // Do things that take a long time in here. It will not count against your normal cron run time as it is in a different process.
}
?>

The first parameter for httprl_background_run_func() is an array that controls the callback URL, etc.; an empty array uses the defaults. The second parameter is the function name to run, and any other parameters passed in after that will be parameters for the actual function. I'm thinking of using the locking framework as a way to manage access. httprl_background_run_func() creates a uniquely named lock; the receiving function checks for that lock and uses it if it exists; if the lock doesn't exist, the process doesn't run and a fast 403 is returned.
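A rough sketch of the lock check described above, using an in-memory array as a stand-in for Drupal's locking framework. All function names here are hypothetical, and releasing the lock after one use is an assumption of this sketch rather than something stated in the proposal.

```php
<?php
// In-memory stand-in for a lock backend (an assumption for this sketch).
$example_locks = array();

// Sender side: create a uniquely named lock and return its name.
function example_create_lock(array &$locks) {
  $name = uniqid('example_background_', TRUE);
  $locks[$name] = TRUE;
  return $name;
}

// Receiver side: run the function only if the named lock exists.
function example_handle_request(array &$locks, $name, $function, array $args = array()) {
  if (!isset($locks[$name])) {
    // Unknown lock: refuse quickly without doing any work.
    return array('status' => 403, 'data' => NULL);
  }
  // Release the lock so the same name cannot be replayed.
  unset($locks[$name]);
  return array('status' => 200, 'data' => call_user_func_array($function, $args));
}

$name = example_create_lock($example_locks);
$ok = example_handle_request($example_locks, $name, 'strtoupper', array('done'));
$replay = example_handle_request($example_locks, $name, 'strtoupper', array('done'));
$forged = example_handle_request($example_locks, 'guessed_name', 'strtoupper', array('done'));
echo $ok['status'], ' ', $replay['status'], ' ', $forged['status'], "\n";
```

Because the lock name is generated by the sender and never guessable in advance, the receiving URL can stay public while still rejecting requests that did not originate from the site itself.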

I think this would be a good first step. What are your thoughts?

mikeytown2’s picture

Thinking about this, I think there is a more generic way to do this that will be beneficial for the wider Drupal ecosystem. Instead of calling it a callback, call it a post processor. Have various post processors like JSON, XML, etc. Linkchecker can implement its own.
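One way to picture the post processor idea is a small registry that maps a processor name to a callable. This is only a sketch under assumed names (example_post_processors() and example_post_process() are hypothetical), not how HTTPRL implements it.

```php
<?php
// Hypothetical registry mapping a processor name to a callable.
function example_post_processors() {
  return array(
    'json' => function ($body) {
      return json_decode($body, TRUE);
    },
    'xml' => function ($body) {
      return simplexml_load_string($body);
    },
  );
}

// Apply the named post processor to a response body.
function example_post_process($type, $body) {
  $processors = example_post_processors();
  if (!isset($processors[$type])) {
    // Unknown type: hand back the body untouched.
    return $body;
  }
  return call_user_func($processors[$type], $body);
}

$decoded = example_post_process('json', '{"code": 200, "url": "http://example.com/"}');
echo $decoded['code'], "\n";
```

A module such as linkchecker could then register its own entry alongside the generic ones instead of passing a one-off callback.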

mikeytown2’s picture

Moved redirect and decode logic into the event loop. This has a nice side effect of making the code run slightly faster in most cases (got rid of a couple of foreach loops). Also added in a callback parameter. This is the Semi-Complex version.

This patch has been committed.
6.x http://drupalcode.org/project/httprl.git/commitdiff/34d15686e37beeb845cb...
7.x http://drupalcode.org/project/httprl.git/commitdiff/300c5776dfc58db2adb3...

Complex version would be a callback that hits a specialized url that will then run that function. Use locking framework for temp keys to prevent exploits as well as using the site key. This isn't that hard to code up; simple hook_menu and you're half way there.

Example code using print_r as the callback:

<?php
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

error_reporting(-1);
ini_set('display_errors', TRUE);
header('Content-Type: text/plain');

// Add in the headers and enable blocking mode.
$options = array(
  'blocking' => TRUE,
  'method' => 'HEAD',
  'max_redirects' => 4,
  'callback' => array('print_r'),
);

$urls = array(
  'http://www.apple.com/qtactivex/qtplugin.cab',
  'http://www.fh-muenster.de/FB10/weiterbildung.htm',
);
foreach ($urls as $url) {
  // Queue up the requests.
  httprl_request($url, $options);
}

// Execute requests.
$responses = httprl_send_request();
?>
mikeytown2’s picture

Status: Active » Fixed

Marking as fixed. Getting a complex version working can be a different issue once we iron out any issues with the above patch.

hass’s picture

Thank you very much. I'll give it a try ASAP.

mikeytown2’s picture

Just added this #1524968: Run drupal_alter() in httprl_post_processing()
Passes things in by reference to the callback; which should be a nice feature.
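As a plain PHP illustration of what passing by reference to a callback means, the sketch below uses call_user_func_array() with a reference stored in the argument array. The callback name is hypothetical, and this does not show HTTPRL's internals.

```php
<?php
// Hypothetical callback that modifies the result it receives.
function example_mark_checked(array &$result) {
  $result['checked'] = TRUE;
}

$result = array('url' => 'http://example.com/', 'code' => 200);

// The argument array must hold a reference for the change to propagate.
$args = array(&$result);
call_user_func_array('example_mark_checked', $args);

var_dump($result['checked']);
```

Without the `&` in the argument array, the callback would receive a copy and the caller's $result would stay unchanged.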

mikeytown2’s picture

This patch will run httprl_post_processing() after almost all stream types have been closed. The only cases where it won't happen are when it's following a redirect and when a non-blocking request is made. It's been committed.

mikeytown2’s picture

Background callbacks are now supported: #1529246: Run callback in a different process. The crazy thing is that you can pass variables in by reference and they will change. I would like to make this more generalized, so some things might be changing.

mikeytown2’s picture

The following patch has been committed. It changes how things work: it got rid of a foreach loop at the bottom and made things more consistent between callback and background_callback.

mikeytown2’s picture

The httprl_queue_background_callback function now can pass in options for the httprl_request call. It can also take a URL parameter if you wish to pass this off to a different server.

mikeytown2’s picture

Attached patch: status new, size 5.98 KB

Added documentation to the readme. This patch has been committed.

hass’s picture

Please let me know when you think this code / API is stable and ready for use in other projects. Do we need an API version?

mikeytown2’s picture

Code is ready for other projects. We are now using the threading part of it in some of our code; it's quite useful. The main thing is that the API for background_callback and callback is now consistent. I'm waiting a couple of days before I roll out a new version of httprl. You can check if the "httprl_queue_background_callback" function exists, or if you think we need it we can make an API constant.

mikeytown2’s picture

Attached patch: status new, size 2.85 KB

This patch has been committed. Locks were being released before they were used in the async callback in the new example given in the readme file.

mikeytown2’s picture

Another issue that I've fixed #1543908: Pass by reference not working for httprl_queue_background_callback().

@hass
Let me know if you need some help with #380052: Add support with non-blocking parallel link checking. There are several ways to accomplish your goals.

hass’s picture

The technically best, most reliable and fastest... :-)

mikeytown2’s picture

There are multiple ways to do this. I would split up _linkchecker_status_handling() so we can separate the slow operations from the quick ones. Looking at the code, it looks like handling 301s is the slow part.

The next step is to see what we want to do for the callback. You can use the callback option or you could use hook_httprl_post_processing_alter; in my opinion callback is the right way to do it.

For 301s I would use httprl_queue_background_callback() with no return or printed keys set; it will run that code in the background.
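The split between quick and slow handling could be sketched as follows. This is an assumption-heavy illustration: the in-memory queue stands in for handing the job to a background process, and example_status_callback() is a hypothetical name, not linkchecker or HTTPRL code.

```php
<?php
// In-memory stand-in for a background job queue (an assumption for this sketch).
$example_slow_queue = array();

// Hypothetical callback: quick statuses are handled inline, 301s are deferred.
function example_status_callback(array $link, $code, array &$queue) {
  if ($code == 301) {
    // Slow path: updating every node that contains the link is deferred.
    $queue[] = $link;
    return 'deferred';
  }
  // Quick path: only the status needs recording.
  return $code == 200 ? 'ok' : 'failed';
}

$results = array();
$checked = array(
  array('url' => 'http://example.com/a', 'code' => 200),
  array('url' => 'http://example.com/b', 'code' => 301),
  array('url' => 'http://example.com/c', 'code' => 404),
);
foreach ($checked as $link) {
  $results[$link['url']] = example_status_callback($link, $link['code'], $example_slow_queue);
}
print_r($results);
echo count($example_slow_queue), " link(s) deferred\n";
```

Keeping the inline path cheap means the event loop is only held up for the time it takes to record a status, while the expensive 301 updates run elsewhere.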

One final note is that you can now send a list of URLs into httprl_request(): #1460828: Let $url be a string or an array in httprl_request

mikeytown2’s picture

I've added a new helper function. I'm thinking about creating a 2.x branch to take full advantage of it.
#1555314: Document how to do a SQL query in the background/parallel

Example code below:

<?php
// D6
// Run 2 queries and get their results.
$max = db_result(db_query('SELECT MAX(wid) FROM {watchdog}'));
$min = db_result(db_query('SELECT MIN(wid) FROM {watchdog}'));
echo $max . ' ' . $min;

// Doing the same thing as above but with a set of arrays.
$max = '';
$min = '';
$args = array(
  array(
    'type' => 'function',
    'call' => 'db_query',
    'args' => array('SELECT MAX(wid) FROM {watchdog}'),
  ),
  array(
    'type' => 'function',
    'call' => 'db_result',
    'args' => array('last' => NULL),
    'return' => &$max,
  ),
  array(
    'type' => 'function',
    'call' => 'db_query',
    'args' => array('SELECT MIN(wid) FROM {watchdog}'),
  ),
  array(
    'type' => 'function',
    'call' => 'db_result',
    'args' => array('last' => NULL),
    'return' => &$min,
  ),
);
httprl_run_array($args);
echo $max . ' ' . $min;
?>
<?php
// D7
// Run a query and get its result.
$min = db_select('watchdog', 'w')
  ->fields('w', array('wid'))
  ->orderBy('wid', 'DESC')
  ->range(999, 1)
  ->execute()
  ->fetchField();
echo $min;

// Doing the same thing as above but with a set of arrays.
$min = '';
$args = array(
  array(
    'type' => 'function',
    'call' => 'db_select',
    'args' => array('watchdog', 'w'),
  ),
  array(
    'type' => 'method',
    'call' => 'fields',
    'args' => array('w', array('wid')),
  ),
  array(
    'type' => 'method',
    'call' => 'orderBy',
    'args' => array('wid', 'DESC'),
  ),
  array(
    'type' => 'method',
    'call' => 'range',
    'args' => array(999, 1),
  ),
  array(
    'type' => 'method',
    'call' => 'execute',
    'args' => array(),
  ),
  array(
    'type' => 'method',
    'call' => 'fetchField',
    'args' => array(),
    'return' => &$min,
  ),
);
httprl_run_array($args);
echo $min;
?>
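To make the array format above easier to follow, here is a standalone sketch of how a runner for it might work. It is a hypothetical re-implementation for illustration, not HTTPRL's actual httprl_run_array() code, and it assumes that an argument keyed 'last' means "the previous step's result" and that a 'method' step is called on the object returned by the previous step.

```php
<?php
// Hypothetical runner for illustration; not HTTPRL's actual code.
function example_run_array(array &$steps) {
  $last = NULL;
  foreach ($steps as &$step) {
    $args = isset($step['args']) ? $step['args'] : array();
    // Assumption: an argument keyed 'last' takes the previous step's result.
    if (array_key_exists('last', $args)) {
      $args['last'] = $last;
    }
    // Drop string keys so they are not treated as named arguments.
    $args = array_values($args);
    if ($step['type'] === 'method') {
      // Methods are called on the object returned by the previous step.
      $last = call_user_func_array(array($last, $step['call']), $args);
    }
    else {
      $last = call_user_func_array($step['call'], $args);
    }
    // Write the result back through the caller's reference, if one was given.
    if (array_key_exists('return', $step)) {
      $step['return'] = $last;
    }
  }
  unset($step);
  return $last;
}

$upper = '';
$length = 0;
$year = '';
$steps = array(
  array('type' => 'function', 'call' => 'trim', 'args' => array('  httprl  ')),
  array('type' => 'function', 'call' => 'strtoupper', 'args' => array('last' => NULL), 'return' => &$upper),
  array('type' => 'function', 'call' => 'strlen', 'args' => array('last' => NULL), 'return' => &$length),
  array('type' => 'function', 'call' => 'date_create', 'args' => array('2012-04-01')),
  array('type' => 'method', 'call' => 'format', 'args' => array('Y'), 'return' => &$year),
);
example_run_array($steps);
echo $upper . ' ' . $length . ' ' . $year, "\n";
```

Describing the calls as data rather than code is what makes it possible to serialize them and run them in another process, with the 'return' references filled in once the results come back.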

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.